[jira] [Commented] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

Woody Anderson (JIRA) Tue, 26 Jul 2011 04:38:42 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071048#comment-13071048
 ]


Woody Anderson commented on PIG-1942:
-------------------------------------

I think your feedback raises some fair questions, but I have some reasons for 
disagreeing:

wrt schema has fewer fields than actual:
 this is a common case for me b/c pig doesn't allow specification of a \*-tuple, 
i.e. all rows of data will have the same number of (int) elements, but it's not 
known how many. This is a weak area of pig in general imho. If there is only 1 
element in the tuple schema, it can be seen as implying some type information 
for the rest of the data (at least i think this is how 'tuple(int)' shows up). 
When there is more than 1 column in a tuple, it's not a *generic* tuple, and i 
can see an error being appropriate. But for 1-tuples i appreciate the 
flexibility of using the schema for the type and writing udfs that accept 
tuples of arbitrary dimension. Even if the args-to-function stuff is too 
simplistic to apply in this scenario, it's easy enough to write useful udfs 
that utilize tuple dimension flexibility.
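
To make the rule above concrete, here is a minimal sketch in plain Python. It 
is not code from the patch: `coerce_tuple` and its `field_types` argument (a 
list of Python callables standing in for Pig field types) are hypothetical 
names used only for illustration.

```python
def coerce_tuple(values, field_types):
    """Hypothetical illustration, not the patch's implementation.

    field_types is a list of callables standing in for the declared
    Pig field types of a tuple schema.
    """
    if len(field_types) == 1:
        # A 1-field schema is treated as a generic tuple: the single
        # declared type is applied to every element, whatever the length.
        return tuple(field_types[0](v) for v in values)
    if len(values) != len(field_types):
        # With more than one declared field the schema is not generic,
        # so a size mismatch is reported as a real error.
        raise ValueError("schema declares %d fields, data has %d"
                         % (len(field_types), len(values)))
    return tuple(t(v) for t, v in zip(field_types, values))
```

With this sketch, `coerce_tuple(["1", "2", "3"], [int])` yields `(1, 2, 3)`, 
while a three-element value against a two-field schema raises an error.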

I am against the idea of returning null and WARN (nearly as a rule). I think a 
reasonable interpretation is always better than NULL (with WARN). I would only 
advocate for an actual error that forces a user to rectify their code. This may 
be where reasonable people disagree, but i think null rather than a tuple 
reflecting the returned data is less expected.

The whole reporting of 'schema != data' could be improved tho. I am not sure of 
the best way to reflect that anything "grey/WARN" is happening. It seems like 
logging 1 line per encountered edge case is major overkill, and prone to 
generate huge log output. We could count each WARN scenario and report that 
information via the log and/or counters to give a more succinct description of 
execution behavior that a simple user can fix, and an advanced user can ignore 
judiciously. Possibly more specific counters and only 1 WARN per type per 
execution.
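
One possible shape for the "count everything, warn once per type" idea is 
sketched below in plain Python. The class name `CoercionWarnings`, the logger 
name, and the warning kinds are all hypothetical; a real implementation would 
presumably report through Pig's own warning and counter facilities rather than 
the `logging` module.

```python
from collections import Counter
import logging

log = logging.getLogger("coercion_sketch")  # hypothetical logger name


class CoercionWarnings(object):
    """Hypothetical helper: count every mismatch, log each kind once."""

    def __init__(self):
        self.counts = Counter()

    def warn(self, kind, detail=""):
        # Only the first occurrence of a kind produces a log line;
        # every occurrence is counted.
        if self.counts[kind] == 0:
            log.warning("%s (first occurrence): %s", kind, detail)
        self.counts[kind] += 1

    def summary(self):
        # Totals per kind, suitable for a single report at the end of a run.
        return dict(self.counts)
```

Calling `warn("EXTRA_FIELDS")` a million times would then emit one log line 
and leave a count of one million in `summary()`.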

Pig schemas are often so imprecise that i think best-effort coercion is 
useful. A fine compromise would be to officially support only a specific set 
of conversions, a subset of this patch, but still perform the others b/c they 
are mostly intuitive and useful, generating a WARN when executed if we think a 
conversion is too esoteric. We may draw lines in slightly different places, 
but i tried to cover a fair number of cases in the test code, which i think is 
a fair survey of expected coercions.


wrt JU.asBag, i think auto-tupling is a must. This is one of the most common 
mistakes for jython udf devs. "why must i wrap tokens inside of tuples" is a 
very common refrain, and just silly 99.9% of the time. Plus it's a bunch of 
extra unnecessary objects that one must create, which makes simple udfs 
execute a bit slower.
I'd have to re-read the code again to examine the edge cases. I do recall the 
disambiguation for embedded bags being a pain to write and describe. 
Documentation being the remaining concern. That said, i think it does something 
reasonable and still executes faster than existing rigid code. Also in the code 
is a decent synopsis of the disambiguations that are intended.
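
As a rough illustration of auto-tupling, the sketch below models a bag as a 
plain Python list of tuples. `as_bag` is a hypothetical stand-in and not the 
`JU.asBag` code from the patch; in particular it ignores the embedded-bag 
disambiguation mentioned above and only shows the simple case.

```python
import re


def as_bag(value):
    """Hypothetical sketch: model a Pig bag as a list of Python tuples."""
    bag = []
    for item in value:
        if isinstance(item, tuple):
            bag.append(item)      # already a tuple: keep as-is
        else:
            bag.append((item,))   # bare element: promote to a 1-tuple
    return bag


def strsplittobag(content, regex):
    # The UDF body can return a plain list of strings; the promotion
    # to tuples happens in one place instead of in every UDF.
    return as_bag(re.compile(regex).split(content))
```

Here `strsplittobag("a b c", r"\s+")` gives `[('a',), ('b',), ('c',)]` without 
the UDF author wrapping each token by hand.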

wrt skipping nulls: can you cite the line number? do you mean skipping null 
bags? or null element/tuples when creating a bag? This might just be me not 
understanding something properly. I thought bags didn't have null tuples, just 
tuples with null elements?

wrt various types:
jython is fully capable of returning any jvm type. so that means anything 
really.
I decided to cover the collections classes, lang classes, base types, and PY* 
classes.
Jython is nice in that many classes implement the collections ifaces, but not 
always as efficiently as using the python classes directly.
this is common in python/jython of course. not in udfs as of yet... b/c it 
wasn't allowed. But i began doing it pretty quickly once it was possible.


> script UDF (jython) should utilize the intended output schema to more 
> directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1942
>                 URL: https://issues.apache.org/jira/browse/PIG-1942
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Woody Anderson
>            Assignee: Woody Anderson
>            Priority: Minor
>              Labels: python, schema, udf
>             Fix For: 0.10
>
>         Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>         return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output 
> schema is known, and it would be quite simple to implicitly promote the 
> string element to a tupled element.
> also, a list/array/tuple/set etc. are all equally convertible to bag, and 
> list/array/tuple are equally convertible to Tuple, this conversion can be 
> done in a much less rigid way with the use of the schema.
> this allows much more facile re-use of existing python code and less memory 
> overhead from creating intermediate objects just to re-convert types.
> I have written the code to do this a while back as part of my version of the 
> jython script framework, i'll isolate that and attach.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
