[jira] [Created] (PIG-2160) recent regression wrt FrontendException: ERROR 1000
recent regression wrt FrontendException: ERROR 1000
---
Key: PIG-2160
URL: https://issues.apache.org/jira/browse/PIG-2160
Project: Pig
Issue Type: Bug
Reporter: Woody Anderson

i recently svn up'd http://svn.apache.org/repos/asf/pig/branches/branch-0.9, rebuilt, and tested the Antispam pig loader against the new 0.9.1 jar to ensure everything is fine.
this was working previously, when the build version for the branch was 0.9.0.
it is currently not working at Revision: 1145388, and i'm a bit confused, so hopefully someone can help me out.

contents of ./target/surefire-reports/TEST-com.XTest.xml:

..
<error message="Error during parsing. &lt;line 1, column 113&gt; mismatched input &apos;(&apos; expecting SEMI_COLON" type="org.apache.pig.impl.logicalLayer.FrontendException">org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. &lt;line 1, column 113&gt; mismatched input &apos;(&apos; expecting SEMI_COLON
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1638)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1583)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:583)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:596)
    at com...XTest.testLoadData(XTest.java:74)
..

that test code method looks like this:

    @SuppressWarnings("unchecked")
    @Test
    public void testLoadData() throws Exception {
        ...
        PigServer pigServer = new PigServer(ExecType.LOCAL);
        pigServer.registerQuery("A = load 'file:" + Util.encodeEscape(f.getAbsolutePath())
            + "' using com.Storage(" + "'a, b, c, d, e, f, g, h, i'"
            + ") as (a:chararray, b:long, c:chararray, d:chararray, e:int, f:chararray, g:int, h:int, i:int);");
        ...
    }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-2116) HashPartitioner is not a safe partitioner for non-prime number of reducers, particularly bad for 2^n, which seems to be a common use
HashPartitioner is not a safe partitioner for non-prime number of reducers, particularly bad for 2^n, which seems to be a common use
---
Key: PIG-2116
URL: https://issues.apache.org/jira/browse/PIG-2116
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Woody Anderson

the implementation of hashCode should not be assumed to be good. in particular, the hashCode of String and List (used by Tuple) are very bad for modulus 2^n.
we propose to add an additional perturbation of the int before doing the % reducers bucketing.
HashMap.java uses this to prevent the String.hashCode from causing massive bucket collisions etc., but that perturbation is targeted explicitly for a 2^n number of buckets, which Pig is not doing in general.
we propose possibly using the final mixing step from murmur3.

here is some discussion of this issue for context:

This has some amusing implications: this hash is terrible for 2, 4, 8, 16, 31, and 32 reducers, so even in normal situations that's pretty bad, especially if pig happens to pick 31 reducers because it has 104-106 mappers * 0.3.

31 is congruent to -1 mod 2^k for all 2 <= k <= 5, so in that case the hash is effectively:

    t[0]*(-1)^(n-1) + t[1]*(-1)^(n-2) + ... + t[n-2]*(-1) + t[n-1]
    = (for odd n) t[0] - t[1] + t[2] - t[3] + t[4] - ...

So for example the string "mississippim" hashes to 0 (mod 32), as every even input character is cancelled out by an equal odd input elsewhere.

    H = 0
    for c in "mississippim":
        H = H*31 + ord(c)
        print("%c: H=%d (mod 32)" % (c, H%32))

    m: H=13 (mod 32)
    i: H=28 (mod 32)
    s: H=23 (mod 32)
    s: H=28 (mod 32)
    i: H=13 (mod 32)
    s: H=6 (mod 32)
    s: H=13 (mod 32)
    i: H=28 (mod 32)
    p: H=20 (mod 32)
    p: H=28 (mod 32)
    i: H=13 (mod 32)
    m: H=0 (mod 32)

Similarly with exactly 31 reducers, the hash function cancels out entirely (31 is 0 mod 31, so everything but the last item is multiplied by 0^i) and the result is simply the value of the last item.

A simple fix is to add a post-hash mixing step in which each bit of the state nontrivially affects all other bits of the hash output, ideally flipping each with probability 1/2. That way the modulo doesn't distribute across the whole function back to the input, and the internal state of the hash above whatever modulus has some effect.

    H = 0
    for c in "mississippim":
        H = H*31 + ord(c)
        # these 0x ops are to simulate unsigned 32-bit math in python
        H = H & 0xffffffff
        Hout = (H + (H << 3)) & 0xffffffff
        Hout = Hout ^ (Hout >> 11)
        Hout = (Hout + (Hout << 15)) & 0xffffffff
        print("%c: H=%08x === %d (mod 32)" % (c, Hout, Hout%32))

    m: H=01ea83d5 === 21 (mod 32)
    i: H=3d39fa73 === 19 (mod 32)
    s: H=6c78d8d4 === 20 (mod 32)
    s: H=3c76f555 === 21 (mod 32)
    i: H=0abb25ff === 31 (mod 32)
    s: H=40df81c9 === 9 (mod 32)
    s: H=cfc8a427 === 7 (mod 32)
    i: H=cea62c2b === 11 (mod 32)
    p: H=4594d493 === 19 (mod 32)
    p: H=f14b432a === 10 (mod 32)
    i: H=169be0b0 === 16 (mod 32)
    m: H=7d57b59c === 28 (mod 32)

The mixing step only needs to be done once at the end. The one I inserted was stolen from Bob Jenkins' hash site, which is required reading for anyone who decides to implement their own hashing. Or you could use a real (good, fast, tested) hash function like murmur3.

-Andy

On Thu, Jun 02, 2011 at 03:37:56PM -0700, Woody Anderson wrote:
> This caught me off guard the other day, so i figured i'd pass it along:
>
> the hashCode implementation of Tuple and String have very specific expansions which do not provide a lot of hashCode variance mod 2^k when the elements are all equal.
>
> string: t[0]*31^(n-1) + t[1]*31^(n-2) + ... + t[n-1]
> tuple: ..(((31 + t[0])*31 + t[1])*31 + t[2])*31 + t[3]..
>
> this expansion modulo powers of 2 is degenerate if t[i] are all equal.
> eg. you group by (n0, n1) to do some work, and there are an unusually high number of tuples where n0 == n1, the value of n0/n1 makes no difference. this will equal 1 mod 16.
> the same goes if you're grouping by strings, and have a lot of a, aa, aaa, b, bb, bbb, etc. type data.
>
> this results in all the data ending up in a single reducer/part file, which is either a waste or going to kill your job.
> so, if you use 2^k reducers then that's a terrible group-by. and it's not going to be good (in general) for any non-prime.
>
> under 'normal' circumstances you probably won't notice this being a factor. I didn't notice until i used string.hashCode as part of a group-by to both group by my string and produce a semi-randomized output ordering (sherpa requirement); this completely blew up when simply grouping by the string hadn't.
>
> so, if you have highly varied data elements, this is less of an issue, though a prime will usually generalize better, and you won't suddenly wonder about the bad dispersal you're getting.
>
> -w
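To make the degenerate case concrete, here is a minimal Python sketch, not Pig's or Hadoop's actual partitioner code. It emulates the 31-multiplier string hash as an unsigned 32-bit value (ignoring Java's sign handling), buckets repeated-character keys with `hash % 32`, and then repeats the bucketing after the final mixing step quoted in the thread. The names `java_string_hash`, `final_mix` and `bucket_counts` are made up for this illustration.

```python
from collections import Counter

MASK = 0xFFFFFFFF  # emulate unsigned 32-bit arithmetic


def java_string_hash(s):
    """Emulate the 31-multiplier string hash as an unsigned 32-bit value."""
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & MASK
    return h


def final_mix(h):
    """Post-hash mixing step quoted in the thread (shift by 3, 11, 15)."""
    h = (h + (h << 3)) & MASK
    h ^= h >> 11
    h = (h + (h << 15)) & MASK
    return h


def bucket_counts(keys, reducers, mix=False):
    """Count how many keys land in each bucket for hash % reducers."""
    counts = Counter()
    for k in keys:
        h = java_string_hash(k)
        if mix:
            h = final_mix(h)
        counts[h % reducers] += 1
    return dict(counts)


# Keys of the "aa", "bb", "cc" kind: hash = 31*c + c = 32*c, which is 0 mod 32.
keys = [ch * 2 for ch in "abcdefghijklmnopqrstuvwxyz"]

print(bucket_counts(keys, 32))            # every key in one bucket
print(bucket_counts(keys, 32, mix=True))  # keys spread over several buckets
```

With 32 reducers every one of these keys maps to bucket 0 before mixing. After mixing they no longer all collide, although for inputs this short the spread is still narrow, which is one reason the thread also suggests a stronger, well-tested hash.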
[jira] [Assigned] (PIG-2098) jython - problem with single item tuple in bag
[ https://issues.apache.org/jira/browse/PIG-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson reassigned PIG-2098:
---
Assignee: Woody Anderson

jython - problem with single item tuple in bag
---
Key: PIG-2098
URL: https://issues.apache.org/jira/browse/PIG-2098
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Vivek Padmanabhan
Assignee: Woody Anderson

While using a Python UDF, if I create a tuple with a single field, Pig execution fails with a ClassCastException.

Caused by: java.io.IOException: Error executing function: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Cannot convert jython type to pig datatype java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
    at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:111)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:245)

An example to reproduce the issue:

Pig Script
{code}
register 'mapkeys.py' using jython as mapkeys;
A = load 'mapkeys.data' using PigStorage() as ( aMap: map[] );
C = foreach A generate mapkeys.keys(aMap);
dump C;
{code}

mapkeys.py
{code}
@outputSchema("keys:bag{t:tuple(key:chararray)}")
def keys(map):
    print "mapkeys.py:keys:map:", map
    outBag = []
    for key in map.iterkeys():
        t = (key)   ## doesn't work, causes Pig to crash
        #t = (key,) ## adding empty value works :-/
        outBag.append(t)
    print "mapkeys.py:keys:outBag:", outBag
    return outBag
{code}

Input data 'mapkeys.data'
[name#John,phone#5551212]

In the UDF, because of t = (key), the item inside the bag is treated as a string instead of a tuple, which causes the class cast exception.
If I provide an additional comma, t = (key,), then the script goes through fine.

From the code, what I can see is that for t = (key,), pythonToPig(..) receives the pyObject as [(u'name',), (u'phone',)] from the PyFunction call.
But for t = (key), the return from the PyFunction call is [u'name', u'phone'].
[jira] [Commented] (PIG-2098) jython - problem with single item tuple in bag
[ https://issues.apache.org/jira/browse/PIG-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039803#comment-13039803 ]

Woody Anderson commented on PIG-2098:
---
to be clear on the parens issue, Nicolas Torzec cleared that up:

In Python, a tuple is recognized by the commas that separate its elements, not by its surrounding parentheses, which are just used for grouping expressions...
That’s why both “t = (key, )” and “t = key, ” work, but not “t = (key)”.
Nicolas.
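The comma rule discussed in this issue can be checked in plain Python. The following is a small illustrative sketch; the variable names are invented here and nothing in it touches Pig or Jython internals.

```python
key = "name"

grouped = (key)      # parentheses only group the expression: still a str
one_tuple = (key,)   # the trailing comma is what makes a 1-tuple
bare_tuple = key,    # the parentheses are optional

print(type(grouped).__name__)
print(type(one_tuple).__name__)
print(one_tuple == bare_tuple)

# A bag is a list of tuples, so each element needs the comma form.
bag = [(k,) for k in ["name", "phone"]]
print(bag)
```

This is why appending `(key)` in the UDF hands Pig a list of strings, while `(key,)` hands it a list of 1-tuples that matches the declared bag schema.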
[jira] [Resolved] (PIG-2098) jython - problem with single item tuple in bag
[ https://issues.apache.org/jira/browse/PIG-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson resolved PIG-2098.
---
Resolution: Duplicate
Release Note: dupe of PIG-1942

dupe of PIG-1942
[jira] [Created] (PIG-2093) add comparison order to TOP udf to allow for optional sort order asc/desc
add comparison order to TOP udf to allow for optional sort order asc/desc
---
Key: PIG-2093
URL: https://issues.apache.org/jira/browse/PIG-2093
Project: Pig
Issue Type: Improvement
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor

easy enough to allow the comparison used with the priority queue to be asc/desc with a simple boolean input to the UDF
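As a rough illustration of the idea, and not of the Java TOP UDF itself, the sketch below selects the n best rows by one column using Python's heap-based helpers, with a boolean choosing the direction. The function name `top` and the parameter name `descending` are hypothetical.

```python
import heapq


def top(n, column, rows, descending=True):
    """Return n rows ordered by rows[i][column], largest first by default."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(n, rows, key=lambda row: row[column])


rows = [("a", 3), ("b", 1), ("c", 2)]
print(top(2, 1, rows))                    # largest two by column 1
print(top(2, 1, rows, descending=False))  # smallest two by column 1
```

The only thing the flag changes is which comparison the heap applies, which matches the "simple boolean input" described in the issue.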
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1824:
---
Release Note:
module import state is determined before and after user code is executed. The resolved modules are inspected and added to the pigContext, then they are added to the job jar.

this patch addresses the following import modes:
- import re, which will (if configured) find re on the filesystem in the jython install root
- import foo (which can import bar), this works now provided bar is resolvable via JYTHON_HOME, JYTHONPATH, curdir, etc.
- from pkg import *, which works when the cachedir is writable
- import non.jvm.class, which works when the cachedir is writable
- the directly imported module may use schema decorators, but recursively imported modules cannot until PIG-1943 is addressed

was:
module import state is determined before and after user code is executed. The resolved modules are inspected and added to the pigContext, then they are added to the job jar.

this patch addresses the following import modes:
- import re, which will (if configured) find re on the filesystem in the jython install root
- import foo (which can import bar), this works now provided bar is resolvable via JYTHON_HOME, JYTHONPATH, curdir, etc.
- from pkg import *, which works when the cachedir is writable
- import non.jvm.class, which works when the cachedir is writable

Support import modules in Jython UDF
---
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.10
Attachments: 1824.patch, 1824_final.patch, 1824a.patch, 1824b.patch, 1824c.patch, 1824d.patch, 1824x.patch, TEST-org.apache.pig.test.TestGrunt.txt, TEST-org.apache.pig.test.TestScriptLanguage.txt, TEST-org.apache.pig.test.TestScriptUDF.txt

Currently, Jython UDF script doesn't support Jython import statement as in the following example:
{code}
#!/usr/bin/python
import re

@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}
Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let user explicitly specify the module to ship?
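The "before and after" idea in the release note can be sketched in plain Python. This is only an illustration of the approach under the assumption that comparing `sys.modules` snapshots is enough; the actual patch is Java code working with the Jython interpreter, and the helper name `modules_imported_by` is invented here.

```python
import sys


def modules_imported_by(source):
    """Run `source` and return {module name: file path} for newly imported modules."""
    before = set(sys.modules)
    # Execute the user code in a fresh namespace, as a script would be.
    exec(compile(source, "<udf>", "exec"), {"__name__": "udf"})
    shipped = {}
    for name in sorted(set(sys.modules) - before):
        module = sys.modules[name]
        path = getattr(module, "__file__", None)
        if path:  # built-in modules have no file that could be shipped
            shipped[name] = path
    return shipped
```

Calling `modules_imported_by("import colorsys")` in a fresh interpreter would report `colorsys` together with the file it was loaded from; a module that was already loaded before the call is not reported, which is the reason for taking the snapshot first. Modules pulled in recursively show up too, since they also land in `sys.modules`.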
[jira] [Created] (PIG-2086) grunt parser fails for: load .. as \n (b:bag{});
grunt parser fails for: load .. as \n (b:bag{});
---
Key: PIG-2086
URL: https://issues.apache.org/jira/browse/PIG-2086
Project: Pig
Issue Type: Bug
Components: grunt
Affects Versions: 0.10
Environment: mac 10.5.8
Reporter: Woody Anderson

this snippet fails:
{code}
IN4 = load '$in' using com.zzz.Storage() as
    ( inpt:bag{} );
{code}

this works ('as' on the same line as the semicolon):
{code}
IN4 = load '$in' using com.zzz.Storage() as ( inpt:bag{} );
{code}

this is the grunt error:
2011-05-20 20:19:34,934 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: file loadstore.pig, line 68, column 16 mismatched input ';' expecting RIGHT_PAREN

this only happens in cases where the types of the fields are complex, e.g. bags/tuples.
eg. change the type of _inpt_ to be _chararray_ and it will parse.

this is very strange! i spent hours debugging my schema writing skills and reading QueryParser.g before simply trying as (expr); on the same line.
_all_ of my scripts had been written with the lines split the other way (with lots of ctor args and as-clause elements: hence the line breaks). this is not an issue if i don't load complicated types, but it fails in this particular case.
This is quite unexpected, seems to be undocumented, and is a bug imho.

i don't know enough about antlr (i was a javacc person) to make sense of why this would be an issue for the parser, b/c the grammar looks good assuming newline is basically whitespace.
though i can't figure out how newlines are treated in the grammar; there does not seem to be a newline routine ala https://supportweb.cs.bham.ac.uk/documentation/tutorials/docsystem/build/tutorials/antlr/antlr.html
I'm going to assume the grammar author is much more sophisticated than that tutorial and knows how to fix this.
[jira] [Commented] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036496#comment-13036496 ]

Woody Anderson commented on PIG-1824:
---
cool. can we get this into trunk so i don't have to keep fixing the patches?
[jira] [Commented] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034932#comment-13034932 ]

Woody Anderson commented on PIG-1824:
---
hmm.. i ran each of those tests via:

    ant -noclasspath test -Dtestcase=org.apache.pig.test.TestScriptUDF

etc. and they all passed.

is your environment clean?

    % printenv | grep YTHON

(should be empty)

is there anything else i should be doing to try to mirror your test framework (while not having to run all tests for the 18 hours that that requires)?
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1824:
---
Attachment: 1824_final.patch

ok. my bad! testcase=full.package.path doesn't even run the test, so tho i claimed that the tests were passing, it was in fact simply that junit could run.
Here's a new patch: there was an extra line that i mistakenly didn't delete when creating the re-trunked code.
this patch will pass the tests
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1824:
---
Attachment: 1824x.patch

patch for trunk
[jira] [Updated] (PIG-2051) new LogicalSchema column prune code does not preserve type information for map subfields
[ https://issues.apache.org/jira/browse/PIG-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-2051:
---
Attachment: 2051.patch

this patch propagates type information more correctly (though not recursively/fully) to the pushProjection call.
Mainly, this means putting type information into map types via their subfields. It doesn't fully descend and provide type information for subfields of subfields etc., but the provided fields have the correct type information rather than DataType.BYTEARRAY.

new LogicalSchema column prune code does not preserve type information for map subfields
---
Key: PIG-2051
URL: https://issues.apache.org/jira/browse/PIG-2051
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.10
Reporter: Woody Anderson
Assignee: Woody Anderson
Fix For: 0.10
Attachments: 2051.patch

current impl of ColumnPruneVisitor.visit ignores field type info and passes type BYTEARRAY for all map fields.
the corrected type is pretty easy to fill in, especially since map field info is only attempted 1 level deep.
i came across this b/c i utilize the type information in the pushProjection call, and this previously carried the 'correct' type information; the changeover to LogicalSchema caused a regression.
[jira] [Created] (PIG-2053) PigInputFormat uses class.isAssignableFrom() where instanceof is more appropriate
PigInputFormat uses class.isAssignableFrom() where instanceof is more appropriate
---
Key: PIG-2053
URL: https://issues.apache.org/jira/browse/PIG-2053
Project: Pig
Issue Type: Improvement
Affects Versions: 0.10
Reporter: Woody Anderson
Priority: Minor
Fix For: 0.10
Attachments: 2053.patch

This is a code style/quality improvement.
isAssignableFrom is appropriate when the class is not known at compile time, but assignment needs to be checked.
e.g. foo.getClass().isAssignableFrom(bar.getClass())
but, if the class of foo is known (e.g. X.class), then instanceof is more appropriate and readable.
i also made use of De Morgan's laws to simplify the "is combinable" boolean statement, which is hard to grok as written.
[jira] [Updated] (PIG-2053) PigInputFormat uses class.isAssignableFrom() where instanceof is more appropriate
[ https://issues.apache.org/jira/browse/PIG-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-2053:
---
Attachment: 2053.patch

patch
[jira] [Commented] (PIG-2012) Comments at the begining of the file throws off line numbers in errors
[ https://issues.apache.org/jira/browse/PIG-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031001#comment-13031001 ]

Woody Anderson commented on PIG-2012:
---
thanks for this one! this has been a major pain for me.

Comments at the begining of the file throws off line numbers in errors
---
Key: PIG-2012
URL: https://issues.apache.org/jira/browse/PIG-2012
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Alan Gates
Assignee: Richard Ding
Fix For: 0.9.0
Attachments: PIG-2012_1.patch, PIG-2012_2.patch, macro.pig

The preprocessor does not appear to be handling leading comments properly when calculating line numbers for error messages. In the attached script, the error is reported to be on line 7. It is actually on line 10.
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1824:
---
Attachment: 1824d.patch

patch includes throw new IllegalStateException if the stream is null.
[jira] [Commented] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030116#comment-13030116 ]

Woody Anderson commented on PIG-1824:
---
i'm not sure what's really left to keep this out of the next release, given we've been going back and forth over issues that don't even affect functionality.
but, there are other jython related bugs in the pipe for 0.10 anyway, so perhaps having them all in the same release is a good idea from a feature grouping perspective.
[jira] [Updated] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
[ https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1942:
Attachment: 1942.patch

I wanted to get this started, as it is a bit of a change. It seems that people often misuse the outputSchema annotation such that the output does not match the specified schema. At least, there was a unit test that did this, and it's possible that a few users in the wild have this issue as well.

At any rate, this patch includes code in JythonUtils that will coerce Jython object model output into the schema that the function is annotated with. It's faster than the existing code and has quite a bit more functionality. It can convert arrays and many more types than previously. It also makes it much easier and faster to convert [1,2,3] to a bag, rather than creating [(1), (2), (3)] in Jython.

Given that this changes the functionality of UDFs that use @outputSchema (by coercing schema adherence), we may want to use a different annotation and allow outputSchema to exist in its previous form, in which it doesn't actually convert the output to the schema.

script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
Key: PIG-1942
URL: https://issues.apache.org/jira/browse/PIG-1942
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Priority: Minor
Labels: python, schema, udf
Fix For: 0.10
Attachments: 1942.patch

from https://issues.apache.org/jira/browse/PIG-1824

{code}
import re
@outputSchema("y:bag{t:tuple(word:chararray)}")
def strsplittobag(content,regex):
    return re.compile(regex).split(content)
{code}

does not work because split returns a list of strings. However, the output schema is known, and it would be quite simple to implicitly promote each string element to a tupled element.

Also, a list/array/tuple/set etc. are all equally convertible to a bag, and a list/array/tuple are equally convertible to a Tuple; with the use of the schema, this conversion can be done in a much less rigid way. This allows much easier re-use of existing Python code, with less memory overhead from creating intermediate objects just to re-convert types.

I wrote the code to do this a while back as part of my version of the Jython script framework; I'll isolate that and attach it.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
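For illustration only, the schema-driven coercion described in PIG-1942 can be sketched in plain Python. This is not the code in 1942.patch, and it does not use Pig's Schema or JythonUtils classes. The nested-tuple schema encoding, the SCALARS table, WORD_BAG, and the coerce function are all hypothetical names made up for this sketch.

```python
import re

# Hypothetical, simplified schema encoding (NOT Pig's Schema class):
#   "chararray" / "int" / "long" / "double"  -> scalar field
#   ("tuple", [field_schema, ...])           -> tuple with fixed fields
#   ("bag", tuple_schema)                    -> bag of tuples
SCALARS = {"chararray": str, "int": int, "long": int, "double": float}


def coerce(value, schema):
    """Coerce a plain Python value into the shape the schema describes."""
    if isinstance(schema, str):
        return SCALARS[schema](value)
    kind, inner = schema
    if kind == "tuple":
        # Promote a bare element (e.g. a string) to a one-field tuple.
        if not isinstance(value, (list, tuple)):
            value = (value,)
        if len(value) != len(inner):
            raise ValueError("expected %d fields, got %d" % (len(inner), len(value)))
        return tuple(coerce(v, s) for v, s in zip(value, inner))
    if kind == "bag":
        # Any iterable of elements (list, tuple, set, ...) becomes a bag.
        return [coerce(v, inner) for v in value]
    raise ValueError("unknown schema kind: %r" % (kind,))


# Roughly "y:bag{t:tuple(word:chararray)}" in the encoding above.
WORD_BAG = ("bag", ("tuple", ["chararray"]))


def strsplittobag(content, regex):
    # split() returns a flat list of strings; coerce() wraps each in a tuple.
    return coerce(re.compile(regex).split(content), WORD_BAG)
```

With this sketch, strsplittobag("a,b,c", ",") returns [("a",), ("b",), ("c",)], so the UDF body can return the plain list from split and leave the bag-of-tuples shaping to the schema.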
[jira] [Updated] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
[ https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1942:
Attachment: 1942_with_junit.patch

I forgot to svn add my unit test, which contains a lot of useful tests and comments. It's included in this patch. It has a timing loop at the end that you can enable by adding an annotation, or by running it directly in Eclipse, to show the performance difference between the methods.

script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
Key: PIG-1942
URL: https://issues.apache.org/jira/browse/PIG-1942
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Priority: Minor
Labels: python, schema, udf
Fix For: 0.10
Attachments: 1942.patch, 1942_with_junit.patch

from https://issues.apache.org/jira/browse/PIG-1824

{code}
import re
@outputSchema("y:bag{t:tuple(word:chararray)}")
def strsplittobag(content,regex):
    return re.compile(regex).split(content)
{code}

does not work because split returns a list of strings. However, the output schema is known, and it would be quite simple to implicitly promote each string element to a tupled element.

Also, a list/array/tuple/set etc. are all equally convertible to a bag, and a list/array/tuple are equally convertible to a Tuple; with the use of the schema, this conversion can be done in a much less rigid way. This allows much easier re-use of existing Python code, with less memory overhead from creating intermediate objects just to re-convert types.

I wrote the code to do this a while back as part of my version of the Jython script framework; I'll isolate that and attach it.
[jira] [Assigned] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
[ https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson reassigned PIG-1942:
Assignee: Woody Anderson

script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
Key: PIG-1942
URL: https://issues.apache.org/jira/browse/PIG-1942
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor
Labels: python, schema, udf
Fix For: 0.10
Attachments: 1942.patch, 1942_with_junit.patch

from https://issues.apache.org/jira/browse/PIG-1824

{code}
import re
@outputSchema("y:bag{t:tuple(word:chararray)}")
def strsplittobag(content,regex):
    return re.compile(regex).split(content)
{code}

does not work because split returns a list of strings. However, the output schema is known, and it would be quite simple to implicitly promote each string element to a tupled element.

Also, a list/array/tuple/set etc. are all equally convertible to a bag, and a list/array/tuple are equally convertible to a Tuple; with the use of the schema, this conversion can be done in a much less rigid way. This allows much easier re-use of existing Python code, with less memory overhead from creating intermediate objects just to re-convert types.

I wrote the code to do this a while back as part of my version of the Jython script framework; I'll isolate that and attach it.
[jira] [Commented] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025007#comment-13025007 ]

Woody Anderson commented on PIG-1824:

Agree.

Re: PYTHON_CACHEDIR: the code behaves as you wish, in that it only deletes the dir if it (Pig) created it. Sorry for not being clear in the comments about that, but if you read the code you'll see it. If we can't write, I (Pig) was creating an alternate directory. It may be possible to pre-populate this, and I understand (and had) the desire to have an error instead of a new directory, but I was initially experiencing this error:

{code}
*sys-package-mgr*: can't create package cache dir, '/grid/0/Releases/pig-0.8.0..1103222002-20110401-000/share/pig-0.8.0..1103222002/lib/cachedir/packages'
{code}

which is why I added the 'is writable' check. After reviewing (per your comment), it seems that cachedir is not set on the grid (at least at the point when the static block runs). If left as null, it seems to default to some grid location that is not writable (and thus doesn't work), but if I set it to a writable tmp dir first, it works. So I can safely agree that an error when the dir isn't writable is both desirable and works.

As for getScriptAsStream(): I followed the existing code convention on that one, though I didn't like it either. Again, if you read down a bit you'll see that the impl of getScriptAsStream() is:

{code}
..
if (is == null) {
    throw new IllegalStateException(
        "Could not initialize interpreter (from file system or classpath) with " + scriptPath);
}
return is;
{code}

So the null check is superfluous, but it does quiet the not-null check warnings. I didn't add an additional throw statement in this case because, essentially, my code wouldn't add any _new_ errors that the existing code didn't already exhibit if somehow the impl of getScriptAsStream changed and could return null.

Anyway, I'll upload a new patch to address the writable issue. If you think it's a big deal, we can add an 'else throw' statement around getScriptAsStream.

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.9.0
Attachments: 1824.patch, 1824a.patch, 1824b.patch

Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:

{code}
#!/usr/bin/python
import re
@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}

Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1824:
Attachment: 1824c.patch

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.9.0
Attachments: 1824.patch, 1824a.patch, 1824b.patch, 1824c.patch

Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:

{code}
#!/usr/bin/python
import re
@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}

Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?
[jira] [Commented] (PIG-1973) UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
[ https://issues.apache.org/jira/browse/PIG-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022939#comment-13022939 ]

Woody Anderson commented on PIG-1973:

OK, I agree: it's not a bug. Though I still find the code misleading, in that it doesn't use the easy, concise form, and at least to me it looks wrong on first and second inspection.

UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
Key: PIG-1973
URL: https://issues.apache.org/jira/browse/PIG-1973
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor
Fix For: 0.9.0
Attachments: 1973.patch

This probably isn't manifesting anywhere, but it's an incorrect use of the ThreadLocal pattern.
[jira] [Commented] (PIG-1973) UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
[ https://issues.apache.org/jira/browse/PIG-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022584#comment-13022584 ]

Woody Anderson commented on PIG-1973:

Incorrect. initialValue is invoked when get() is first called. However, in the old code, initialValue returns null because it was not overridden. Thus, if 2 threads call getUDFContext() at the same time they may get 2 different UDFContext objects, because the method does an unprotected comparison/set check:

{code}
public static UDFContext getUDFContext() {
    if (tss.get() == null) {
        UDFContext ctx = new UDFContext();
        tss.set(ctx);
    }
    return tss.get();
}
{code}

This is a CLASSIC race condition.

UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
Key: PIG-1973
URL: https://issues.apache.org/jira/browse/PIG-1973
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor
Fix For: 0.9.0
Attachments: 1973.patch

This probably isn't manifesting anywhere, but it's an incorrect use of the ThreadLocal pattern.
[jira] [Updated] (PIG-2001) DefaultTuple(List) constructor is inefficient, causes List.size() System.arraycopy() calls (though they are 0 byte copies), DefaultTuple(int) constructor is a bit misleadin
[ https://issues.apache.org/jira/browse/PIG-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-2001:
Attachment: 2001.patch

DefaultTuple(List) constructor is inefficient, causes List.size() System.arraycopy() calls (though they are 0 byte copies), DefaultTuple(int) constructor is a bit misleading wrt time complexity
Key: PIG-2001
URL: https://issues.apache.org/jira/browse/PIG-2001
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor
Fix For: 0.10
Attachments: 2001.patch

I was perusing the Tuple created by the default Tuple factory when I wanted it to copy my input list. Here I noticed that the List constructor uses List.add(index, element), which differs from set(index, element) in that it shifts the right side of the list. With ArrayList this causes a no-op System.arraycopy call, which is completely unnecessary. Even though the arraycopy call isn't actually copying any bytes, it's still unnecessary and can easily be avoided.

It's also N iterate/add function calls, which can be avoided by using:

{code}
new ArrayList<Object>(c);
{code}

which is more efficient. For arbitrary collection inputs this is at worst N iterator calls (same as the existing code); when constructing from ArrayLists or Arrays.asList, the construction is accomplished via a single System.arraycopy call, which is an actual improvement.

There do not seem to be DefaultTuple tests.
[jira] [Commented] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017549#comment-13017549 ]

Woody Anderson commented on PIG-1824:

1. I could re-work the initialization into the static block of the inner class Interpreter; it simply needs to be done before the interpreter is allocated. I'm not sure what you mean by not wanting a cache dir when using Python UDFs or control flow. Can you clarify?

2. Separate the logic out of init into what? I think it should, in general, be the contract of any script environment to handle resource inclusion (if possible). Are you imagining some scenario where init(file,..) would not actually parse/internalize the code inside init()? I don't much care where the code is parsed and added to a ScriptEngine, but when it is, it should handle all other evaluated resources that are necessary to succeed. In the current API, a user-provided script file is given to init(), so that's where it must do this. There is really no other place to evaluate resource inclusions, and I think I might not be understanding your suggestion.

As for other ScriptEngines that may not be able to support this concept: are you suggesting a supportsFeature() method that we use to test various SEs to determine whether they can support this (or other) features? I'm not sure what we'd do with this knowledge.

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.9.0
Attachments: 1824.patch, 1824a.patch

Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:

{code}
#!/usr/bin/python
import re
@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}

Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?
[jira] [Created] (PIG-1985) Utils.getSchemaFromString does not use the new parser, and thus fails to parse valid schema
Utils.getSchemaFromString does not use the new parser, and thus fails to parse valid schema
Key: PIG-1985
URL: https://issues.apache.org/jira/browse/PIG-1985
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Woody Anderson
Fix For: 0.9.0

I've been told this is because Utils.getSchemaFromString does not use the new parser to parse the schema, so we should update the impl to use the new parser:

{code}
Utils.getSchemaFromString("f: map[]")
{code}
results in:
(org.apache.pig.impl.logicalLayer.schema.Schema) {f: map[]}

{code}
Utils.getSchemaFromString("f: map[int]")
{code}
results in:
An exception occurred: org.apache.pig.impl.logicalLayer.parser.ParseException
..
org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered map map at line 1, column 4.
Was expecting one of:
int ...
long ...
float ...
double ...
chararray ...
bytearray ...
int ...
long ...
float ...
double ...
chararray ...
bytearray ...
[jira] [Commented] (PIG-1985) Utils.getSchemaFromString does not use the new parser, and thus fails to parse valid schema
[ https://issues.apache.org/jira/browse/PIG-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017727#comment-13017727 ]

Woody Anderson commented on PIG-1985:

This is a bug; why are we targeting 0.10?

Utils.getSchemaFromString does not use the new parser, and thus fails to parse valid schema
Key: PIG-1985
URL: https://issues.apache.org/jira/browse/PIG-1985
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Woody Anderson
Fix For: 0.10

I've been told this is because Utils.getSchemaFromString does not use the new parser to parse the schema, so we should update the impl to use the new parser:

{code}
Utils.getSchemaFromString("f: map[]")
{code}
results in:
(org.apache.pig.impl.logicalLayer.schema.Schema) {f: map[]}

{code}
Utils.getSchemaFromString("f: map[int]")
{code}
results in:
An exception occurred: org.apache.pig.impl.logicalLayer.parser.ParseException
..
org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered map map at line 1, column 4.
Was expecting one of:
int ...
long ...
float ...
double ...
chararray ...
bytearray ...
int ...
long ...
float ...
double ...
chararray ...
bytearray ...
[jira] [Commented] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017789#comment-13017789 ]

Woody Anderson commented on PIG-1824:

OK. I understand your thoughts on static, and mostly I share them, but the PythonInterpreter is a static member of the Interpreter class, and the code I wrote must run BEFORE that interpreter is constructed. Interpreter is a private inner class, so it cannot be caused to load before normal use patterns. So moving the static block into the static block for Interpreter addresses your concerns.

An import will not cause the static block to be executed, by the way; it's the first executed reference to the class that does. However, I take the point that some code could have been:

{code}
Class<?> c = JythonScriptEngine.class;
{code}

or something like that to cause the class to be loaded. Still, as I said, the Interpreter static block addresses this, and the ctor is out because of the static nature of Interpreter.interpreter.

On the second point: I don't see the point of an includeResources() method. If it can be done, it can be done in init(); if not, it won't be done. Why add a new method?

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.9.0
Attachments: 1824.patch, 1824a.patch

Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:

{code}
#!/usr/bin/python
import re
@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}

Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?
[jira] [Updated] (PIG-1973) UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
[ https://issues.apache.org/jira/browse/PIG-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1973:
Attachment: 1973.patch

Use the initialValue method of ThreadLocal, which is the correct way to handle lazy initialization.

UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
Key: PIG-1973
URL: https://issues.apache.org/jira/browse/PIG-1973
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor
Fix For: 0.8.0
Attachments: 1973.patch

This probably isn't manifesting anywhere, but it's an incorrect use of the ThreadLocal pattern.
[jira] [Created] (PIG-1973) UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
UDFContext.getUDFContext has a thread race condition around it's ThreadLocal
Key: PIG-1973
URL: https://issues.apache.org/jira/browse/PIG-1973
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor
Fix For: 0.8.0
Attachments: 1973.patch

This probably isn't manifesting anywhere, but it's an incorrect use of the ThreadLocal pattern.
[jira] [Updated] (PIG-1955) PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
[ https://issues.apache.org/jira/browse/PIG-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1955:
Attachment: 1955-static.patch

Agreed, I prefer the static approach.

PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
Key: PIG-1955
URL: https://issues.apache.org/jira/browse/PIG-1955
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Fix For: 0.8.0, 0.9.0
Attachments: 1955-po.patch, 1955-static.patch, 1955.patch

I found this while trying to write unit tests. Creating a local PigServer to test my LoadFunc caused a serialization of the PhysicalOperator class, which failed due to:

..
Caused by: java.io.NotSerializableException: org.apache.commons.logging.impl.Log4JCategoryLog
..

This is easily fixed by adding the transient keyword to the definition of log, e.g.

on trunk:
private final transient Log log = LogFactory.getLog(getClass());

on the 0.8 tag:
private transient Log log = LogFactory.getLog(getClass());
[jira] [Commented] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014086#comment-13014086 ]

Woody Anderson commented on PIG-1824:

The following may not be immediately self-evident to all developers: import statements that execute from within runtime function calls will not work (unless the dependency has already been satisfied statically), e.g.:

{code}
def resplit(content, regex, index):
    import re
    return re.compile(regex).split(content)[index]
{code}

will not work because the import is not attempted until after the job has been defined, built, and deployed. This import practice is frowned upon and is used very rarely. If you happen to be doing it (I'll assume you have a good reason), then you probably know how to fix it. If you're using someone else's code that is written like this, you can satisfy the dependency by explicitly importing the module up front; this will cause it to be added to the jar, and subsequent uses will succeed.

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.8.0, 0.9.0, 0.10
Attachments: 1824.patch

Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:

{code}
#!/usr/bin/python
import re
@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}

Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?
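To make the timing distinction in the comment above concrete, here is a small sketch that uses the standard ast module (under a current CPython 3) to separate imports that run when a script is evaluated from imports that only run when a function is called. It only illustrates the timing issue; it is not how the 1824 patch discovers modules to ship, and SCRIPT, lazy_split, and imports_by_scope are hypothetical names invented for this example.

```python
import ast

# Hypothetical UDF script: one module-level import, one function-level import.
SCRIPT = '''
import re

def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]

def lazy_split(content, sep):
    import string  # not executed until lazy_split() is called
    return content.split(sep)
'''

IMPORT_NODES = (ast.Import, ast.ImportFrom)
FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def _module_names(node):
    """Return the module names an import statement refers to."""
    if isinstance(node, ast.Import):
        return [alias.name for alias in node.names]
    # ImportFrom: node.module is None for "from . import x"
    return [node.module] if node.module else []


def imports_by_scope(source):
    """Split imports into (run at script load, run only inside a call)."""
    tree = ast.parse(source)
    inside_function = set()
    for node in ast.walk(tree):
        if isinstance(node, FUNCTION_NODES):
            for inner in ast.walk(node):
                if isinstance(inner, IMPORT_NODES):
                    inside_function.add(id(inner))
    eager, deferred = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, IMPORT_NODES):
            bucket = deferred if id(node) in inside_function else eager
            bucket.update(_module_names(node))
    return eager, deferred


eager, deferred = imports_by_scope(SCRIPT)
```

Here eager is {"re"} and deferred is {"string"}. Any mechanism that records dependencies while the script is first evaluated would only see the first set, which is why moving the import to the top of the script satisfies the dependency.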
[jira] [Updated] (PIG-1955) PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
[ https://issues.apache.org/jira/browse/PIG-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1955:
Attachment: 1955.patch

This doesn't have all of the unit test trimmings, but is all that really needed to mark a logger as transient?

PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
Key: PIG-1955
URL: https://issues.apache.org/jira/browse/PIG-1955
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Fix For: 0.8.0, 0.9.0
Attachments: 1955.patch

I found this while trying to write unit tests. Creating a local PigServer to test my LoadFunc caused a serialization of the PhysicalOperator class, which failed due to:

..
Caused by: java.io.NotSerializableException: org.apache.commons.logging.impl.Log4JCategoryLog
..

This is easily fixed by adding the transient keyword to the definition of log, e.g.

on trunk:
private final transient Log log = LogFactory.getLog(getClass());

on the 0.8 tag:
private transient Log log = LogFactory.getLog(getClass());
[jira] [Created] (PIG-1955) PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
Key: PIG-1955
URL: https://issues.apache.org/jira/browse/PIG-1955
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Fix For: 0.9.0, 0.8.0
Attachments: 1955.patch

I found this while trying to write unit tests. Creating a local PigServer to test my LoadFunc caused a serialization of the PhysicalOperator class, which failed due to:

..
Caused by: java.io.NotSerializableException: org.apache.commons.logging.impl.Log4JCategoryLog
..

This is easily fixed by adding the transient keyword to the definition of log, e.g.

on trunk:
private final transient Log log = LogFactory.getLog(getClass());

on the 0.8 tag:
private transient Log log = LogFactory.getLog(getClass());
[jira] [Updated] (PIG-1955) PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
[ https://issues.apache.org/jira/browse/PIG-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1955:
Attachment: 1955-po.patch

OK. Unfortunately, this issue is more pervasive than I originally thought. The 'simple' fix that is attached makes the PO logger transient and protected, and removes the loggers from all subclasses, which are often defined incorrectly (as non-transient members) and inconsistently.

Personally, when I define loggers I always make them private and STATIC so that there is no getClass() call. This makes the class where the log line resides much simpler to find in source code. I dislike loggers that define themselves with getClass() because logging code in A.java will report as class B in output if class B extends class A. I did not change this behavior because perhaps someone has their reasons for doing what they did. I did, however, remove some of the static loggers simply to ensure consistency (the majority were done with member variables). The change to static private is also not such a big deal if people agree we should consistently go that way instead.

PhysicalOperator has a member variable (non-static) Log object that is non-transient, this causes serialization errors
Key: PIG-1955
URL: https://issues.apache.org/jira/browse/PIG-1955
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Fix For: 0.8.0, 0.9.0
Attachments: 1955-po.patch, 1955.patch

I found this while trying to write unit tests. Creating a local PigServer to test my LoadFunc caused a serialization of the PhysicalOperator class, which failed due to:

..
Caused by: java.io.NotSerializableException: org.apache.commons.logging.impl.Log4JCategoryLog
..

This is easily fixed by adding the transient keyword to the definition of log, e.g.

on trunk:
private final transient Log log = LogFactory.getLog(getClass());

on the 0.8 tag:
private transient Log log = LogFactory.getLog(getClass());
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1824:
Attachment: 1824a.patch

This altered patch removes the explicit 'import re' test, as it relies on having a Jython 2.5.0 install on disk, configured as visible to the runtime. The nested test already exercises the mechanism used by 'import re', so removing the explicit test simply makes the patch more portable.

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.8.0, 0.9.0, 0.10
Attachments: 1824.patch, 1824a.patch

Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:

{code}
#!/usr/bin/python
import re
@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}

Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Woody Anderson updated PIG-1824:
Attachment: 1824.patch

Here's the patch file.

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.8.0, 0.9.0, 0.10
Attachments: 1824.patch

Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:

{code}
#!/usr/bin/python
import re
@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}

Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?
[jira] [Created] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
Key: PIG-1942
URL: https://issues.apache.org/jira/browse/PIG-1942
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Priority: Minor
Fix For: 0.9.0

from https://issues.apache.org/jira/browse/PIG-1824

{code}
import re
@outputSchema("y:bag{t:tuple(word:chararray)}")
def strsplittobag(content,regex):
    return re.compile(regex).split(content)
{code}

does not work because split returns a list of strings. However, the output schema is known, and it would be quite simple to implicitly promote each string element to a tupled element.

Also, a list/array/tuple/set etc. are all equally convertible to a bag, and a list/array/tuple are equally convertible to a Tuple; with the use of the schema, this conversion can be done in a much less rigid way. This allows much easier re-use of existing Python code, with less memory overhead from creating intermediate objects just to re-convert types.

I wrote the code to do this a while back as part of my version of the Jython script framework; I'll isolate that and attach it.
[jira] [Updated] (PIG-1824) Support import modules in Jython UDF
[ https://issues.apache.org/jira/browse/PIG-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Woody Anderson updated PIG-1824:
Description:
Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:
{code}
#!/usr/bin/python
import re

@outputSchema("word:chararray")
def resplit(content, regex, index):
    return re.compile(regex).split(content)[index]
{code}
Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?

was:
Currently, a Jython UDF script doesn't support the Jython import statement, as in the following example:
{code}
#!/usr/bin/python
import re

@outputSchema("y:bag{t:tuple(word:chararray)}")
def strsplittobag(content, regex):
    return re.compile(regex).split(content)
{code}
Can Pig automatically locate the Jython module file and ship it to the backend? Or should we add a ship clause to let the user explicitly specify the module to ship?

Support import modules in Jython UDF
Key: PIG-1824
URL: https://issues.apache.org/jira/browse/PIG-1824
Project: Pig
Issue Type: Improvement
Reporter: Richard Ding
Assignee: Woody Anderson
Fix For: 0.10
[jira] [Created] (PIG-1943) jython functions can use the @outputSchema decorator, but only if in the outer script that is imported, we should add a builtin module pigdecorators.py so that developers ca
jython functions can use the @outputSchema decorator, but only if in the outer script that is imported, we should add a builtin module pigdecorators.py so that developers can import and use them in lib scripts
Key: PIG-1943
URL: https://issues.apache.org/jira/browse/PIG-1943
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Woody Anderson
Assignee: Woody Anderson
Priority: Minor
Fix For: 0.9.0

If you have Pig UDF functions in a script and want to re-use them (e.g. import them from another script), the decorators must be defined. They will not be, due to scoping rules, so the decorators should be available via a standard importable module that ships with the Jython framework (we already define the decorators as part of initializing the interpreter).

This simply involves adding an appropriately named pigdecorators.py to the classpath, so a dev can do:
{quote}
from pigdecorators import *

@outputSchema("w:chararray")
def word():
    return 'word'
{quote}
This works today in the primary script, but once https://issues.apache.org/jira/browse/PIG-1824 is completed, that script would not import properly when used from within another script.
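A minimal sketch of what such a pigdecorators.py might contain is shown below. It is an assumption for illustration, not Pig's actual source: in particular, storing the schema string in a function attribute named `outputSchema` is a guess at how the framework could read it back. The decorator and a function using it are kept in one block so the example runs on its own; in practice the library script would obtain the decorator with `from pigdecorators import *`.

```python
def outputSchema(schema):
    """Return a decorator that records a Pig schema string on the function."""
    def decorate(func):
        # Assumed attribute name; the framework would look the schema up here.
        func.outputSchema = schema
        return func
    return decorate

@outputSchema("w:chararray")
def word():
    return 'word'
```

Because the decorator returns the original function unchanged, the same script stays usable both as a Pig UDF script and as an ordinary importable Python module.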