[ https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071048#comment-13071048 ]
Woody Anderson commented on PIG-1942:
-------------------------------------

I think your feedback raises some fair questions, but I have some reasons for disagreeing:

wrt schema has fewer fields than actual: this is a common case for me b/c pig doesn't allow specification of a \*-tuple, i.e. all rows of data will have the same number of (int) elements, but it's not known how many. This is a weak area of pig in general imho. If there is only 1 element in the tuple it can be seen to infer some type information for the remaining rows (at least i think this is how 'tuple(int)' shows up). When there is more than 1 column in a tuple, it's not a *generic* tuple, and i can see an error being appropriate. But for 1-tuples i appreciate the flexibility of using it for the type and writing udfs that accept tuples of arbitrary dimension. Even if the args-to-function stuff is too simplistic to apply in this scenario, it's easy enough to write useful udfs that utilize tuple dimension flexibility.

I am against the idea of returning null and WARN (nearly as a rule). I think a reasonable interpretation is always better than NULL (with WARN). I would only advocate for an actual error that forces a user to rectify their code. This may be where reasonable people disagree, but i think null rather than a tuple reflecting the returned data is less expected.

The whole reporting of 'schema != data' could be improved tho. I am not sure of the best way to reflect that anything "grey/WARN" is happening. It seems like logging 1 line per encountered edge case is major overkill, and prone to generate huge log output. We could count each WARN scenario and log/counter that information to give a more succinct description of execution behavior that a simple user can fix, and an advanced user can ignore judiciously. Possibly more specific counters and only 1 warn per type per execution.

Pig schemas are often so..
imprecise that i think best-effort coercion is useful. A fine compromise would be to designate a specific set of conversions (a subset of this patch) as supported, and still perform the others b/c they are mostly intuitive and useful, but generate a WARN when one is executed if we think it's too esoteric. We may draw lines in slightly different places, but i tried to cover a fair number of cases in the test code, which i think is a fair survey of expected coercions.

wrt JU.asBag, i think auto-tupling is a must. This is one of the most common mistakes for jython udf devs. "why must i wrap tokens inside of tuples" is a very common refrain, and just silly 99.9% of the time. Plus it's a bunch of extra unnecessary objects that one must create, and it causes somewhat slower execution for simple udfs. I'd have to re-read the code to examine the edge cases. I do recall the disambiguation for embedded bags being a pain to write and describe, with documentation being the remaining concern. That said, i think it does something reasonable and still executes faster than the existing rigid code. Also in the code is a decent synopsis of the disambiguations that are intended.

wrt skipping nulls: can you cite the line number? do you mean skipping null bags? or null elements/tuples when creating a bag? This might just be me not understanding something properly. I thought bags didn't have null tuples, just tuples with null elements?

wrt various types: jython is fully capable of returning any jvm type, so that means anything really. I decided to cover the collections classes, lang classes, base types, and PY* classes. Jython is nice in that many classes implement the collections ifaces, but not always as efficiently as using the python classes directly. Returning such types is common in python/jython of course. not in udfs as of yet... b/c it wasn't allowed. But i began doing it pretty quickly once it was possible.
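
For illustration only, here is a minimal sketch of the auto-tupling and counted-warning ideas discussed above, written in plain Python rather than the patch's actual code. The names `as_bag` and `coercion_counts` are hypothetical and are not part of Pig or of the attached patch; a bag is modelled simply as a list of tuples, and the schema check itself is left out.

```python
import re
from collections import Counter

# Hypothetical tally of coercions, meant to be reported once per run
# instead of logging one line per encountered row.
coercion_counts = Counter()


def as_bag(value):
    """Coerce a Python iterable into a bag, modelled as a list of tuples.

    Elements that are not already tuples are wrapped in 1-tuples
    (auto-tupling); each wrap is counted rather than logged.
    """
    if value is None:
        return None
    bag = []
    for item in value:
        if isinstance(item, tuple):
            bag.append(item)
        else:
            coercion_counts["auto_tupled"] += 1
            bag.append((item,))
    return bag


def strsplittobag(content, regex):
    # Same body as the UDF quoted in the issue description;
    # the list it returns is coerced by the caller.
    return re.compile(regex).split(content)


if __name__ == "__main__":
    print(as_bag(strsplittobag("a b c", r"\s+")))  # [('a',), ('b',), ('c',)]
    print(dict(coercion_counts))
```

The point of the sketch is that the UDF author returns the natural Python value (a list of strings) and the wrapping into tuples happens once, in the conversion layer, with a single counter summarizing how often it occurred.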

> script UDF (jython) should utilize the intended output schema to more
> directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1942
>                 URL: https://issues.apache.org/jira/browse/PIG-1942
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Woody Anderson
>            Assignee: Woody Anderson
>            Priority: Minor
>              Labels: python, schema, udf
>             Fix For: 0.10
>
>         Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>     return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output
> schema is known, and it would be quite simple to implicitly promote the
> string element to a tupled element.
> also, a list/array/tuple/set etc. are all equally convertible to bag, and
> list/array/tuple are equally convertible to Tuple, this conversion can be
> done in a much less rigid way with the use of the schema.
> this allows much more facile re-use of existing python code and less memory
> overhead to create intermediate re-converting of object types.
> I have written the code to do this a while back as part of my version of the
> jython script framework, i'll isolate that and attach.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira