[ https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072187#comment-13072187 ]
Thejas M Nair commented on PIG-1942:
------------------------------------

bq. wrt schema has fewer fields than actual: I think pig schema needs to support the feature of specifying partial schemas, or types for schemas with variable number of fields of certain/unspecified type. But I think it is better to have this feature in the schema, rather than doing conversions based on a schema that is not compatible. Also, I think it is a good thing to check for schema consistency, so that users know when they make a mistake.

bq. I am against the idea of returning null and WARN (nearly as a rule). I think a reasonable interpretation is always better than NULL (with WARN). I would only advocate for an actual error that forces a user to rectify their code.

Returning null+WARN is the convention followed in load funcs like PigStorage and in type conversion code. But I see that the situation in type conversion is different because there is no reasonable interpretation if the type conversion fails. Does anybody else have opinions on this?

Re: logging 1 line per warning. PigLogger.warn(..) can be used to aggregate the warnings.

Regarding auto-tupling, I agree that it is useful when -
1. output schema is a tuple
2. output schema is a bag of tuples with single fields.

But if it is a bag of tuples that have multiple fields, I think it makes sense for the output value to have a list type representing the tuple.
{code}
If output schema is {(int, int)} I think the output value should look like - ((1,2),(3,4)). I don't see a need to convert (1,2) into ((1,2)).
{code}

I am also concerned that with auto-tupling, python udf users will have an incorrect understanding of pig bags of primitive types. They might not realize that the bags always contain a tuple.

How much of a performance difference did you notice while adding tuple wrappers for fields in a bag? I am trying to evaluate the option of providing utility libraries that python udfs can use to convert to pig type.
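To make the utility-library option concrete, here is a minimal sketch of what such a helper might look like on the Python side. The names `to_pig_tuple` and `to_pig_bag` and the `fields_per_tuple` parameter are hypothetical; they are not part of Pig or of the attached patch. The sketch wraps bare values only when the schema calls for single-field tuples, and raises an error rather than guessing when the field count does not match.

```python
def to_pig_tuple(item, fields_per_tuple):
    """Return item as a Python tuple with exactly fields_per_tuple fields.

    Hypothetical helper: lists and tuples are taken as already tuple-shaped;
    any other value is wrapped only when the schema expects a single field.
    """
    if isinstance(item, (list, tuple)):
        if len(item) != fields_per_tuple:
            raise ValueError("expected %d fields, got %d"
                             % (fields_per_tuple, len(item)))
        return tuple(item)
    if fields_per_tuple == 1:
        # Bag of single-field tuples: wrap the bare value, e.g. 'a' -> ('a',)
        return (item,)
    raise ValueError("cannot turn a single value into a tuple of %d fields"
                     % fields_per_tuple)


def to_pig_bag(items, fields_per_tuple):
    """Return a list of tuples, the shape a bag of tuples is built from."""
    return [to_pig_tuple(item, fields_per_tuple) for item in items]
```

With this sketch, `to_pig_bag(["a", "b"], 1)` wraps each string, while `to_pig_bag([(1, 2), (3, 4)], 2)` leaves the two-field tuples as they are, matching the {(int, int)} case above. Because the UDF author calls the helper explicitly, the tuple inside every bag element stays visible in the Python code.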

Regarding skipping nulls in JythonUtils.asBag, it is at line 491. I am not sure whether pig actually works with null tuples in a bag; I need to check that.
{code}
if (it != null) {
    while (first == null && it.hasNext()) {
        first = it.next();
    }
}
{code}

> script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1942
>                 URL: https://issues.apache.org/jira/browse/PIG-1942
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Woody Anderson
>            Assignee: Woody Anderson
>            Priority: Minor
>              Labels: python, schema, udf
>             Fix For: 0.10
>
>         Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>     return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output
> schema is known, and it would be quite simple to implicitly promote the
> string element to a tupled element.
> also, a list/array/tuple/set etc. are all equally convertible to bag, and
> list/array/tuple are equally convertible to Tuple, this conversion can be
> done in a much less rigid way with the use of the schema.
> this allows much more facile re-use of existing python code and less memory
> overhead to create intermediate re-converting of object types.
> I have written the code to do this a while back as part of my version of the
> jython script framework, I'll isolate that and attach.
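
For comparison, the UDF from the issue description can be written so that it already returns bag-shaped data without any schema-driven conversion, by wrapping each string in a one-field tuple by hand. This is only a sketch: `outputSchema` is assumed to be supplied by Pig when the script is registered, and the fallback decorator below is a hypothetical stand-in so the snippet also runs outside Pig.

```python
import re

try:
    outputSchema  # assumed to be provided by Pig when run as a Jython UDF
except NameError:
    def outputSchema(schema):
        """Stand-in for use outside Pig; only records the schema string."""
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("y:bag{t:tuple(word:chararray)}")
def strsplittobag(content, regex):
    # Wrap every string in a one-field tuple so each bag element is a tuple.
    return [(word,) for word in re.compile(regex).split(content)]
```

The extra list of tuples built here is the kind of intermediate re-conversion that the issue proposes to avoid.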