Re: toJSON function for tuples, bags and strings, PIG-2641

Dmitriy Ryaboy Tue, 10 Apr 2012 09:59:55 -0700

Russ,
I appreciate the passion, but let's drop fiery rhetoric in favor of
technical discussion, yeah? :-)

No one is against accessibility. The problem with making things simple
is that it's really hard to make them simple. Without an appropriate
amount of forethought, you paint yourself into ugly corners like the
ones you are hitting right now with AvroStorage, for example.

Here's the problem with JSON: it's a map, with no schema. Inferring a
schema is guaranteed to be error-prone. The last thing you want is a
loader that reads different records in the same file differently,
because then you can't refer to fields reliably and you get weird
errors. The
first record might not have all the keys. The first 10 records might
not have all the keys. They might have values that look like ints, but
are just ints by accident, and are actually strings in the general
case (or hex). Inferring can lead to serious, unintuitive problems
that are the opposite of simple and accessible.  The existing
JsonLoader can take a schema that you expect to read as an argument,
which is pretty good I think (you should know what you are reading,
yeah?). You can see the discussion here:
https://issues.apache.org/jira/browse/PIG-2332
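
To make the failure mode above concrete, here is a minimal Python sketch
(not from the thread, and not how JsonLoader works). The two log lines,
and the helper names infer_schema and mismatches, are hypothetical and
exist only for illustration. It infers a schema from the first record and
then checks a later record against it:

```python
import json

# Hypothetical log lines from one file: the second record adds a key and
# carries a hex string where the first record had an integer.
LINES = [
    '{"id": 1, "code": 200}',
    '{"id": 2, "code": "0x1F", "user": "alice"}',
]


def infer_schema(record):
    """Naively map each key to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


def mismatches(schema, record):
    """List the ways a record disagrees with a previously inferred schema."""
    problems = []
    for key, value in record.items():
        actual = type(value).__name__
        if key not in schema:
            problems.append(f"unexpected key: {key}")
        elif actual != schema[key]:
            problems.append(f"{key}: expected {schema[key]}, got {actual}")
    return problems


records = [json.loads(line) for line in LINES]
first_schema = infer_schema(records[0])

print(first_schema)
print(mismatches(first_schema, records[1]))
```

The schema inferred from the first record has no "user" field and treats
"code" as an int, so the second record disagrees with it on both counts.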

I can more or less guarantee that for any inference scheme that you
propose, I can find an example of real life production logs that blow
it up. There's a reason we stopped doing json logging at Twitter...

Does it make sense to have something simple that will satisfy 80% of
cases based on some sampling and inference? Sure. But let's not just
hack out the first thing that sounds good and then deal with backwards
compatibility issues for the next 10 releases. And maybe let's make it
clear that it's an 80% solution by putting it into a separate loader
that extends JsonLoader, rather than having questionable behavior
front and center.
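
As a rough illustration of what a sampling-based "80%" inference could
look like, and why it stays an 80% solution, here is a small Python
sketch. It is an assumption-laden toy, not a proposal for the loader: the
function infer_from_sample, the sample data, and the rule of widening
conflicting types to string are all hypothetical.

```python
import json
from itertools import islice

# Hypothetical input: the odd record only shows up on the third line.
LINES = [
    '{"id": 1, "code": 200}',
    '{"id": 2, "code": 404}',
    '{"id": 3, "code": "0x1F", "user": "alice"}',
]


def infer_from_sample(lines, sample_size):
    """Infer key -> type name from the first sample_size lines.

    When two sampled records disagree on a key's type, fall back to "str".
    """
    schema = {}
    for line in islice(lines, sample_size):
        for key, value in json.loads(line).items():
            seen = type(value).__name__
            if key not in schema:
                schema[key] = seen
            elif schema[key] != seen:
                schema[key] = "str"  # conflict: widen to string
    return schema


small_sample = infer_from_sample(LINES, 2)
full_sample = infer_from_sample(LINES, 3)

print(small_sample)
print(full_sample)
```

With a sample of two lines the result misses "user" and types "code" as
int; only a sample that happens to include the third line gets it right.
The answer depends on where the sample stops, which is the argument for
keeping this kind of behavior out of the default loader.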

D

On Tue, Apr 10, 2012 at 8:12 AM, Russell Jurney
<russell.jur...@gmail.com> wrote:
> I forgot about UDFContext providing the schema, and the pig docs are
> out of date. It's no problem now.
>
> About default behavior for json, that would seem to be: tuples ->
> objects, bags -> arrays, integers -> long, decimals -> double, and
> configs for setting low precision to substitute int/float. Maps are
> loaded separately anyway, and can use their own loadfunc.
>
> Easily loading json would be a huge boon to Pig's accessibility. I
> don't see a reason to postpone accessibility.
>
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com
>
> On Apr 10, 2012, at 7:53 AM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
>> first question: you can do this when outputSchema() is called, as it's
>> passed the input schema. IIRC, in trunk you have hooks to pass that
>> info to the backend in a udf.
>>
>> second question: see discussion on JsonLoader jira.. short answer:
>> non-trivial, no clear decision on what the most sensible thing to do
>> is (other than "map", which is unlikely to be what you want). Rather
>> than do something bad and then be stuck with a poor decision, we are
>> letting people provide their own schema for now.
>>
>> D
>>
>> On Tue, Apr 10, 2012 at 1:48 AM, Russell Jurney
>> <russell.jur...@gmail.com> wrote:
>>> Followup question: would it be nice if JsonLoader inferred schemas when
>>> none is present, according to some defaults?
>>>
>>> On Tue, Apr 10, 2012 at 12:48 AM, Russell Jurney
>>> <russell.jur...@gmail.com> wrote:
>>>
>>>> Is there a way to get the field names in an EvalFunc? I am close to done
>>>> but... no cigar :)  I need these to finish.
>>>>
>>>>
>>>> On Mon, Apr 9, 2012 at 11:03 PM, Russell Jurney 
>>>> <russell.jur...@gmail.com> wrote:
>>>>
>>>>> So far this is not easy.
>>>>>
>>>>>
>>>>> On Mon, Apr 9, 2012 at 5:42 PM, Russell Jurney 
>>>>> <russell.jur...@gmail.com> wrote:
>>>>>
>>>>>> I see Jackson being used in the Mozilla stuff.  It looks pretty
>>>>>> straightforward.
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 9, 2012 at 5:38 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>>>>>>
>>>>>>> Jackson is your friend.
>>>>>>>
>>>>>>> On Mon, Apr 9, 2012 at 5:14 PM, Russell Jurney
>>>>>>> <russell.jur...@gmail.com> wrote:
>>>>>>>> I need to be able to JSONize any Pig datatype and return it as a
>>>>>>>> json:chararray, to be able to index complex types in ElasticSearch
>>>>>>>> via Wonderdog.  See: https://issues.apache.org/jira/browse/PIG-2641
>>>>>>>>
>>>>>>>> Does anyone have existing code they can contribute to a toJSON UDF
>>>>>>>> that handles all these types?
>>>>>>>>
>>>>>>>> For instance, Mozilla has this Map to JSON UDF:
>>>>>>>>
>>>>>>>> https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/eval/json/MapToJson.java
>>>>>>>>
>>>>>>>> It is Apache licensed, so I think I can paste it into a general
>>>>>>>> toJSON UDF?
>>>>>>>>
>>>>>>>>
>>>>>>>> Elephant-bird has this code, which turns JSON to Maps:
>>>>>>>>
>>>>>>>> https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
>>>>>>>>
>>>>>>>> ehh... thinking out loud... I'm just gonna do this in JRuby. If that
>>>>>>>> has issues, Python.
>>>>>>>>
>>>>>>>> Solved! :)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>>>>>>>> datasyndrome.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>>>>>> datasyndrome.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>>>>> datasyndrome.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>>>> datasyndrome.com
>>>>
>>>
>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
