[jira] [Updated] (PIG-4059) Pig on Spark
[ https://issues.apache.org/jira/browse/PIG-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-4059: --- Labels: spork (was: ) > Pig on Spark > > > Key: PIG-4059 > URL: https://issues.apache.org/jira/browse/PIG-4059 > Project: Pig > Issue Type: New Feature >Reporter: Rohini Palaniswamy >Assignee: Praveen Rachabattuni > Labels: spork > Attachments: Pig-on-Spark-Design-Doc.pdf > > >There is a lot of interest in adding Spark as a backend execution engine for > Pig. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067289#comment-14067289 ] Dmitriy V. Ryaboy commented on PIG-3558: Nice. How much does this increase the weight of the pig build, and what packages does it pull in? I assume this won't get pushed to trunk until hive 0.14.0-SNAPSHOT becomes available as a stable version? > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Labels: porc > Fix For: 0.14.0 > > Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch, > PIG-3558-4.patch, PIG-3558-5.patch, PIG-3558-6.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PIG-2620) Customizable Error Handling in Pig
[ https://issues.apache.org/jira/browse/PIG-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019121#comment-14019121 ] Dmitriy V. Ryaboy commented on PIG-2620: Hi Qinghao, When looking at tickets in JIRA, to find out whether and in what version they are closed out, you want to look at "resolution" and "fix version". In this case, resolution is "unresolved", meaning this work has not been completed. If it were "fixed", you'd be able to check if this is in your version by checking "fix version" -- if it's a number equal to or lower than what you are running, you have it. It's extremely unlikely that this will ever go into 0.8.1 since the current version is 0.13 (about to be released, and also doesn't have this feature -- so far this feature is only a design, there's no real code). 0.8.1 is quite old; you really should upgrade. > Customizable Error Handling in Pig > -- > > Key: PIG-2620 > URL: https://issues.apache.org/jira/browse/PIG-2620 > Project: Pig > Issue Type: New Feature >Reporter: Dmitriy V. Ryaboy >Assignee: Lorand Bendig > Attachments: error_flow.png, rewrite_example.txt > > > The current behavior of Pig when handling exceptions thrown by UDFs is to > fail and stop processing. We want to extend this behavior to let users have > finer-grained control over error handling. > Depending on the use-case there are several options users would like to have: > Stop the execution and report an error > Ignore tuples that cause exceptions and log warnings > Ignore tuples that cause exceptions and redirect them to an error relation > (to enable statistics, debugging, ...) > Write their own error handler -- This message was sent by Atlassian JIRA (v6.2#6252)
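The rule of thumb in the comment above (check "resolution", then compare "fix version" with the version you run) can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, versions are assumed to be plain dotted numbers, and a real ticket may list several fix versions on different branches, which this sketch ignores.

```python
def parse_version(text):
    """Turn a dotted version such as '0.8.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def has_fix(resolution, fix_version, running_version):
    """Return True when a ticket's change should be present in the running release.

    Hypothetical helper: 'resolution' and 'fix_version' stand for the JIRA
    fields of the same name; fix_version may be None when the field is empty.
    """
    if resolution.lower() != "fixed" or not fix_version:
        # Unresolved (or otherwise not fixed) tickets are in no release.
        return False
    return parse_version(fix_version) <= parse_version(running_version)


# PIG-2620 as described above: unresolved, so not in 0.8.1 (or anywhere else).
print(has_fix("Unresolved", None, "0.8.1"))
# A made-up fixed ticket, checked against a newer and an older installation.
print(has_fix("Fixed", "0.12.1", "0.13.0"))
print(has_fix("Fixed", "0.13.0", "0.8.1"))
```

Comparing tuples of integers rather than raw strings avoids the trap where "0.8.1" sorts after "0.13" as text.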
[jira] [Updated] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3558: --- Labels: porc (was: ) > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Labels: porc > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch, > PIG-3558-4.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904663#comment-13904663 ] Dmitriy V. Ryaboy commented on PIG-3558: Help me understand this. My understanding is as follows: Compile is the minimum required to compile the main code. Test is the minimum required to compile the main code plus whatever is needed to test (hence, "extends"). Pushing a dependency up to compile means everything, not just test, needs the dependency. Also, the bump from 0.8 to 0.12 is 6 megs worth of code. That's a pretty big version bump. > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
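The point about "compile" versus "test" in the comment above can be illustrated with a toy model. This is not Ivy itself: the configuration table, the dependency names (core-lib, junit, big-lib) and both helper functions are hypothetical, and the only behavior modeled is that a configuration inherits the dependencies of the configurations it extends.

```python
import copy

# Hypothetical configuration table: "test" extends "compile", so it sees
# everything "compile" declares plus its own test-only dependencies.
CONFIGS = {
    "compile": {"extends": [], "deps": {"core-lib"}},
    "test": {"extends": ["compile"], "deps": {"junit", "big-lib"}},
}


def resolve(conf, configs):
    """Collect the dependencies visible in a configuration, including inherited ones."""
    deps = set(configs[conf]["deps"])
    for parent in configs[conf]["extends"]:
        deps |= resolve(parent, configs)
    return deps


def move_dependency(dep, src, dst, configs):
    """Return a copy of the table with one dependency moved from src to dst."""
    updated = copy.deepcopy(configs)
    updated[src]["deps"].discard(dep)
    updated[dst]["deps"].add(dep)
    return updated


after = move_dependency("big-lib", "test", "compile", CONFIGS)
print(sorted(resolve("compile", CONFIGS)))  # before the move
print(sorted(resolve("compile", after)))    # after the move
print(sorted(resolve("test", after)))       # test still sees it through "extends"
```

In this model the test configuration resolves the same set before and after the move; the only thing that changes is that the main build now carries the dependency as well, which is the concern raised in the comment.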
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904627#comment-13904627 ] Dmitriy V. Ryaboy commented on PIG-3558: [~daijy] not quite: {code} - conf="test->master" /> + conf="compile->master" /> {code} > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904590#comment-13904590 ] Dmitriy V. Ryaboy commented on PIG-3558: So that's a -1. I would +1 this if it was going into piggybank. Since this depends on unpublished changes, I'd rather we unlink it from 0.13 release (as that would tie us to Hive's release schedule -- obviously we can't make a release that depends on a snapshot). > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904574#comment-13904574 ] Dmitriy V. Ryaboy edited comment on PIG-3558 at 2/18/14 8:55 PM: - I am pro adding ORC support in Pig, but against introducing massive dependencies. According to http://mvnrepository.com/artifact/org.apache.hive/hive-exec/0.12.0 the hive-exec jar for 0.12 is 9 megs, and hides within it specific versions of jackson, snappy, org.json, chunks of thrift, hadoop.io (?!), avro, commons, protobuf, and guava. If ORC authors are not interested in improving their dependency hygiene, they have to live with the fact that their project is unlikely to get integrated into other projects. This is self-inflicted jar hell. Please don't do this. When ORC cleans up their dependencies, let's revisit. was (Author: dvryaboy): I am pro adding ORC support in Pig, but against introducing massive dependencies. According to http://mvnrepository.com/artifact/org.apache.hive/hive-exec/0.12.0 the hive-exec jar for 0.12 is 9 megs, and hides within it specific versions of jackson, snappy, org.json, chunks of thrift, hadoop.io (?!), avro, commons, protobuf, and guava. If ORC authors are not interested in reducing their dependency hygiene, they have to live with the fact that their project is unlikely to get integrated into other projects. This is self-inflicted jar hell. Please don't do this. When ORC cleans up their dependencies, let's revisit. > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904574#comment-13904574 ] Dmitriy V. Ryaboy commented on PIG-3558: I am pro adding ORC support in Pig, but against introducing massive dependencies. According to http://mvnrepository.com/artifact/org.apache.hive/hive-exec/0.12.0 the hive-exec jar for 0.12 is 9 megs, and hides within it specific versions of jackson, snappy, org.json, chunks of thrift, hadoop.io (?!), avro, commons, protobuf, and guava. If ORC authors are not interested in reducing their dependency hygiene, they have to live with the fact that their project is unlikely to get integrated into other projects. This is self-inflicted jar hell. Please don't do this. When ORC cleans up their dependencies, let's revisit. > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3456) Reduce threadlocal conf access in backend for each record
[ https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896199#comment-13896199 ] Dmitriy V. Ryaboy commented on PIG-3456: Added a couple of minor comments. Good change overall. BTW, not sure if you saw, but PIG-3325 addressed the bag insertion regression you saw as a side effect of PIG-2923 without sacrificing the memory and GC benefits 2923 provides, so if you still have that reverted in your build, consider un-reverting. > Reduce threadlocal conf access in backend for each record > - > > Key: PIG-3456 > URL: https://issues.apache.org/jira/browse/PIG-3456 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.11.1 >Reporter: Rohini Palaniswamy >Assignee: Rohini Palaniswamy > Fix For: 0.13.0 > > Attachments: PIG-3456-1-no-whitespace.patch, PIG-3456-1.patch > > > Noticed a few things while browsing the code > 1) DefaultTuple has a protected boolean isNull = false; which is never used. > Removing this gives ~3-5% improvement for big jobs > 2) Config checking with ThreadLocal conf is repeatedly done for each record. > For example: createDataBag in POCombinerPackage. But it is initialized only the first > time in other places like POPackage, POJoinPackage, etc. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887114#comment-13887114 ] Dmitriy V. Ryaboy commented on PIG-3347: Yikes. [~aniket486] & [~julienledem] this seems like a critical bug to look at. Julien, you investigated this UID situation before, right? > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch > > > The problem is that an intermediate 'store' invocation "changes" the final store > output. It looks like it brings some kind of side effect. We used 'local' > mode to run the script. > Here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect the output to be: > 1 1 > The output is an empty file. > Uncomment the {code}--store b into 'b';{code} line and see the difference. > You would get the expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
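To make the expected result of the script in PIG-3347 concrete, here is a small Python model of what the group / distinct / filter steps are meant to compute. It is a sketch of the intended semantics only, with a made-up function name; it does not reproduce how Pig plans or executes the script and says nothing about the cause of the bug.

```python
from collections import defaultdict


def expected_output(rows):
    """Model: group rows by field 0, take distinct field-0 values, keep groups of size 1."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)            # a_group = group a by $0

    result = []
    for key, members in groups.items():
        a_distinct = sorted({member[0] for member in members})  # distinct a.$0
        if len(a_distinct) == 1:              # filter b by SIZE(a_distinct) == 1
            result.append((key, a_distinct))
    return result


# The two-line input from the report: both rows hold the value 1.
print(expected_output([("1",), ("1",)]))
```

Because the relation is grouped by the same field that is then made distinct, every group in this model holds exactly one distinct value, so the filter should keep every group. An empty output file is therefore inconsistent with the script as written.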
[jira] [Updated] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3347: --- Priority: Critical (was: Major) > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch > > > The problem is that an intermediate 'store' invocation "changes" the final store > output. It looks like it brings some kind of side effect. We used 'local' > mode to run the script. > Here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect the output to be: > 1 1 > The output is an empty file. > Uncomment the {code}--store b into 'b';{code} line and see the difference. > You would get the expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3299) Provide support for LazyOutputFormat to avoid creating empty files
[ https://issues.apache.org/jira/browse/PIG-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887109#comment-13887109 ] Dmitriy V. Ryaboy commented on PIG-3299: [~daijy] shall we commit this? > Provide support for LazyOutputFormat to avoid creating empty files > -- > > Key: PIG-3299 > URL: https://issues.apache.org/jira/browse/PIG-3299 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.11.1 >Reporter: Rohini Palaniswamy >Assignee: Lorand Bendig > Attachments: PIG-3299.patch > > > LazyOutputFormat (HADOOP-4927) in Hadoop is a wrapper that avoids creating part > files if there are no output records. It would be good to add support for that > by having a configuration in Pig which wraps storeFunc.getOutputFormat() with > LazyOutputFormat. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
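The idea behind a lazy output wrapper, as described in PIG-3299, is that the output file is created on the first record rather than when the writer is set up. The Python class below is a minimal sketch of that idea under that assumption. It is not Hadoop's LazyOutputFormat; the class name and file names are invented for the illustration.

```python
import os
import tempfile


class LazyWriter:
    """Hypothetical writer that creates its file only when the first record arrives."""

    def __init__(self, path):
        self.path = path
        self._handle = None  # nothing is created on disk yet

    def write(self, record):
        if self._handle is None:
            # First record: only now does the output file come into existence.
            self._handle = open(self.path, "w")
        self._handle.write(record + "\n")

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None


with tempfile.TemporaryDirectory() as out_dir:
    empty = LazyWriter(os.path.join(out_dir, "part-00000"))
    empty.close()  # no records were written, so no file was created

    busy = LazyWriter(os.path.join(out_dir, "part-00001"))
    busy.write("1")
    busy.close()

    print(sorted(os.listdir(out_dir)))
```

A task that produces no records leaves nothing behind in this sketch, which is the empty-part-file case the ticket wants to avoid.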
[jira] [Updated] (PIG-3672) pig should not hardcode "hdfs://" path in code, should be configurable to other file system implementations
[ https://issues.apache.org/jira/browse/PIG-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3672: --- Status: Open (was: Patch Available) Cancelling "Patch Available" status given Rohini's comments -- please mark the patch available again when a new patch is submitted. > pig should not hardcode "hdfs://" path in code, should be configurable to > other file system implementations > --- > > Key: PIG-3672 > URL: https://issues.apache.org/jira/browse/PIG-3672 > Project: Pig > Issue Type: Bug > Components: data, parser >Affects Versions: 0.11.1, 0.12.0, 0.10.0 >Reporter: Suhas Satish >Assignee: Suhas Satish > Attachments: PIG-3672-1.patch, PIG-3672-2.patch, PIG-3672.patch > > > QueryParserUtils.java has the code - > result.add("hdfs://"+thisHost+":"+uri.getPort()); > I propose to make it generic like - > result.add(uri.getScheme() + "://"+thisHost+":"+uri.getPort()); > Similarly JobControlCompiler.java has - > if (!outputPathString.contains("://") || > outputPathString.startsWith("hdfs://")) { > I have a version of the patch that passes the unit tests. I will upload > it shortly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
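The change proposed in PIG-3672 (take the scheme from the URI instead of hardcoding "hdfs://") can be illustrated outside of Pig. The Python sketch below mirrors the two Java one-liners quoted in the ticket using the standard urllib.parse module; the function names, the "altfs" scheme and the host name are invented for the example.

```python
from urllib.parse import urlparse


def authority_hardcoded(uri_text, this_host):
    """Mirror of the original line: always prefixes the result with hdfs://."""
    uri = urlparse(uri_text)
    return "hdfs://" + this_host + ":" + str(uri.port)


def authority_generic(uri_text, this_host):
    """Mirror of the proposed line: reuses whatever scheme the URI carries."""
    uri = urlparse(uri_text)
    return uri.scheme + "://" + this_host + ":" + str(uri.port)


# "altfs" is a made-up scheme standing in for a non-HDFS file system.
location = "altfs://namenode.example.com:7222/user/data"
host = "namenode.example.com"

print(authority_hardcoded(location, host))
print(authority_generic(location, host))
```

With the hardcoded prefix the non-HDFS location is silently rewritten to an hdfs:// address, while the generic form preserves the original scheme. The sketch assumes the URI carries an explicit port; a URI without one would need separate handling.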
[jira] [Commented] (PIG-3456) Reduce threadlocal conf access in backend for each record
[ https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887098#comment-13887098 ] Dmitriy V. Ryaboy commented on PIG-3456: Could you post a patch without the whitespace changes (for ease of review) and some microbenchmark results? I had some microbenchmark code in PIG-3325 that might help you bootstrap here. > Reduce threadlocal conf access in backend for each record > - > > Key: PIG-3456 > URL: https://issues.apache.org/jira/browse/PIG-3456 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.11.1 >Reporter: Rohini Palaniswamy >Assignee: Rohini Palaniswamy > Fix For: 0.13.0 > > Attachments: PIG-3456-1.patch > > > Noticed a few things while browsing the code > 1) DefaultTuple has a protected boolean isNull = false; which is never used. > Removing this gives ~3-5% improvement for big jobs > 2) Config checking with ThreadLocal conf is repeatedly done for each record. > For example: createDataBag in POCombinerPackage. But it is initialized only the first > time in other places like POPackage, POJoinPackage, etc. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3722) Udf deserialization for registered classes fails in local_mode
[ https://issues.apache.org/jira/browse/PIG-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886991#comment-13886991 ] Dmitriy V. Ryaboy commented on PIG-3722: +1 > Udf deserialization for registered classes fails in local_mode > -- > > Key: PIG-3722 > URL: https://issues.apache.org/jira/browse/PIG-3722 > Project: Pig > Issue Type: Bug >Affects Versions: 0.13.0 >Reporter: Aniket Mokashi >Assignee: Aniket Mokashi > Fix For: 0.13.0 > > Attachments: PIG-3722.patch > > > Similar to https://issues.apache.org/jira/browse/PIG-2532, registered classes > are not available if jobs are converted to local_mode. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883478#comment-13883478 ] Dmitriy V. Ryaboy commented on PIG-2672: [~knoguchi] in the spirit of keeping things moving -- can we commit this? You can feel free to turn the behavior off on your cluster if you are worried about the 1 week boundary. If that's the case, feel free to open another ticket to follow up, or to make sure that YARN-1492 fixes your issue. > Optimize the use of DistributedCache > > > Key: PIG-2672 > URL: https://issues.apache.org/jira/browse/PIG-2672 > Project: Pig > Issue Type: Improvement >Reporter: Rohini Palaniswamy > Fix For: 0.13.0 > > Attachments: PIG-2672-5.patch, PIG-2672.patch > > > Pig currently copies jar files to a temporary location in HDFS and then adds > them to DistributedCache for each job launched. This is inefficient in terms > of >* Space - The jars are distributed to task trackers for every job, taking > up a lot of local temporary space in tasktrackers. >* Performance - The jar distribution impacts the job launch time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880084#comment-13880084 ] Dmitriy V. Ryaboy commented on PIG-2672: Seems like there is a lot of effort being spent here reinventing what is already designed for the general use case in the YARN ticket Aniket linked. Let's not let the best be the enemy of the good, and just get something in that will be decent for most cases, and if people don't like it, they can turn it off. This is an intermediate solution until that YARN patch goes in, at which point all of this becomes moot. > Optimize the use of DistributedCache > > > Key: PIG-2672 > URL: https://issues.apache.org/jira/browse/PIG-2672 > Project: Pig > Issue Type: Improvement >Reporter: Rohini Palaniswamy > Fix For: 0.13.0 > > Attachments: PIG-2672-5.patch, PIG-2672.patch > > > Pig currently copies jar files to a temporary location in HDFS and then adds > them to DistributedCache for each job launched. This is inefficient in terms > of >* Space - The jars are distributed to task trackers for every job, taking > up a lot of local temporary space in tasktrackers. >* Performance - The jar distribution impacts the job launch time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852500#comment-13852500 ] Dmitriy V. Ryaboy commented on PIG-3621: +1 > Python Avro library can't read Avros made with builtin AvroStorage > -- > > Key: PIG-3621 > URL: https://issues.apache.org/jira/browse/PIG-3621 > Project: Pig > Issue Type: Bug > Components: internal-udfs >Affects Versions: 0.12.0 >Reporter: Russell Jurney > Fix For: 0.12.1, 0.13.0 > > Attachments: PIG-3621-3.patch, PIG-3631-2.patch, PIG-3631.patch > > > Using this script: > from avro import schema, datafile, io > import pprint > import sys > import json > field_id = None > # Optional key to print > if (len(sys.argv) > 2): > field_id = sys.argv[2] > # Test reading avros > rec_reader = io.DatumReader() > # Create a 'data file' (avro file) reader > df_reader = datafile.DataFileReader( > open(sys.argv[1]), > rec_reader > ) > the last line fails with: > Traceback (most recent call last): > File "/Users/rjurney/bin/cat_avro", line 22, in > rec_reader > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/datafile.py", > line 247, in __init__ > self.datum_reader.writers_schema = schema.parse(self.get_meta(SCHEMA_KEY)) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 784, in parse > return make_avsc_object(json_data, names) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 740, in make_avsc_object > return RecordSchema(name, namespace, fields, names, type, doc, > other_props) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 653, in __init__ > other_props) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 294, in __init__ > new_name = names.add_name(name, namespace, self) > File > 
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 268, in add_name > raise SchemaParseException(fail_msg) > avro.schema.SchemaParseException: record is a reserved type name. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852441#comment-13852441 ] Dmitriy V. Ryaboy commented on PIG-3621: Sorry, that was a "no" to the assignment. Cheolsoo, does that var get set elsewhere? Why remove the logic for checking empty string, etc, and using a default? > Python Avro library can't read Avros made with builtin AvroStorage > -- > > Key: PIG-3621 > URL: https://issues.apache.org/jira/browse/PIG-3621 > Project: Pig > Issue Type: Bug > Components: internal-udfs >Affects Versions: 0.12.0 >Reporter: Russell Jurney > Fix For: 0.12.1, 0.13.0 > > Attachments: PIG-3631-2.patch, PIG-3631.patch > > > Using this script: > from avro import schema, datafile, io > import pprint > import sys > import json > field_id = None > # Optional key to print > if (len(sys.argv) > 2): > field_id = sys.argv[2] > # Test reading avros > rec_reader = io.DatumReader() > # Create a 'data file' (avro file) reader > df_reader = datafile.DataFileReader( > open(sys.argv[1]), > rec_reader > ) > the last line fails with: > Traceback (most recent call last): > File "/Users/rjurney/bin/cat_avro", line 22, in > rec_reader > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/datafile.py", > line 247, in __init__ > self.datum_reader.writers_schema = schema.parse(self.get_meta(SCHEMA_KEY)) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 784, in parse > return make_avsc_object(json_data, names) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 740, in make_avsc_object > return RecordSchema(name, namespace, fields, names, type, doc, > other_props) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 653, in __init__ > other_props) > File > 
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 294, in __init__ > new_name = names.add_name(name, namespace, self) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 268, in add_name > raise SchemaParseException(fail_msg) > avro.schema.SchemaParseException: record is a reserved type name. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Assigned] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3621: -- Assignee: (was: Dmitriy V. Ryaboy) Uh, no thanks :) > Python Avro library can't read Avros made with builtin AvroStorage > -- > > Key: PIG-3621 > URL: https://issues.apache.org/jira/browse/PIG-3621 > Project: Pig > Issue Type: Bug > Components: internal-udfs >Affects Versions: 0.12.0 >Reporter: Russell Jurney > Fix For: 0.12.1, 0.13.0 > > Attachments: PIG-3631-2.patch, PIG-3631.patch > > > Using this script: > from avro import schema, datafile, io > import pprint > import sys > import json > field_id = None > # Optional key to print > if (len(sys.argv) > 2): > field_id = sys.argv[2] > # Test reading avros > rec_reader = io.DatumReader() > # Create a 'data file' (avro file) reader > df_reader = datafile.DataFileReader( > open(sys.argv[1]), > rec_reader > ) > the last line fails with: > Traceback (most recent call last): > File "/Users/rjurney/bin/cat_avro", line 22, in > rec_reader > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/datafile.py", > line 247, in __init__ > self.datum_reader.writers_schema = schema.parse(self.get_meta(SCHEMA_KEY)) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 784, in parse > return make_avsc_object(json_data, names) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 740, in make_avsc_object > return RecordSchema(name, namespace, fields, names, type, doc, > other_props) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 653, in __init__ > other_props) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 294, in __init__ > new_name = names.add_name(name, namespace, self) > File > 
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", > line 268, in add_name > raise SchemaParseException(fail_msg) > avro.schema.SchemaParseException: record is a reserved type name. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852299#comment-13852299 ] Dmitriy V. Ryaboy commented on PIG-3630: Now that registers are in place, it works in 12 as well: {code} Input(s): Successfully read records from: "/Users/dmitriy/Downloads/trimmed_reviews.avro" Output(s): Successfully stored records in: "file:///Users/dmitriy/src/pig-0.12/tmp/pig_12_ntf_idf_scores" Job DAG: job_local_0001 -> job_local_0003,job_local_0002, job_local_0003 -> job_local_0005, job_local_0002 -> job_local_0004, job_local_0004 -> job_local_0005, job_local_0005 2013-12-18 15:22:02,012 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! {code} Back to you... > Macros that work in Pig 0.11 fail in Pig 0.12 :( > > > Key: PIG-3630 > URL: https://issues.apache.org/jira/browse/PIG-3630 > Project: Pig > Issue Type: Bug > Components: parser >Affects Versions: 0.12.0 >Reporter: Russell Jurney > > http://my.safaribooksonline.com/book/databases/9781449326890/7dot-exploring-data-with-reports/i_sect13_id196600_html > The ntf-idf macro listed there works under 0.11. Under 0.12, it results in > this: > 13/12/16 22:09:19 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL > 2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Apache Pig version > 0.13.0-SNAPSHOT (rUnversioned directory) compiled Dec 09 2013, 14:37:29 > 2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Logging error > messages to: /private/tmp/pig_1387260559120.log > 2013-12-16 22:09:19.268 java[38060:1903] Unable to load realm info from > SCDynamicStore > 2013-12-16 22:09:19,528 [main] INFO > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting > to hadoop file system at: file:/// > 2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1025: > at expanding macro 'tf_idf' (per_business.pig:9) > Invalid field projection. 
> Projected field [tf_idf] does not exist in schema: > business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long. > 2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - > org.apache.pig.impl.plan.PlanValidationException: ERROR 1025: > at expanding macro 'tf_idf' (per_business.pig:9) > Invalid field projection. > Projected field [tf_idf] does not exist in schema: > business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long. > at > org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191) > at > org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174) > at > org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53) > at > org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:215) > at > org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) > at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52) > at > org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:142) > at > org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:128) > at > org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) > at > org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124) > at > org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76) > at > org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) > at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52) > at org.apache.pig.PigServer$Graph.compile(PigServer.java:1694) > at org.apache.pig.PigServer$Graph.compile(PigServer.java:1686) > at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1387) > at org.apache.pig.PigServer.execute(PigServer.java:1302) > at 
org.apache.pig.PigServer.executeBatch(PigServer.java:391) > at org.apache.pig.PigServer.executeBatch(PigServer.java:369) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:133) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:195) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) > at org.apache.pig.Main.run(Main.java:600) > at org.apache.pig.Main.main(Main.java:156) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAcces
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852281#comment-13852281 ]
Dmitriy V. Ryaboy commented on PIG-3630:
Actually that failed in 11 due to missing register statements. It does work in 11 if you work around the Avro stuff. Ok, now we have something to look at...
> Macros that work in Pig 0.11 fail in Pig 0.12 :(
>
> Key: PIG-3630
> URL: https://issues.apache.org/jira/browse/PIG-3630
> Project: Pig
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.12.0
> Reporter: Russell Jurney
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852272#comment-13852272 ]
Dmitriy V. Ryaboy commented on PIG-3630:
That one fails in both 0.11 and 0.12. Do you have something that works in 11 but fails in 12?
> Macros that work in Pig 0.11 fail in Pig 0.12 :(
>
> Key: PIG-3630
> URL: https://issues.apache.org/jira/browse/PIG-3630
> Project: Pig
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.12.0
> Reporter: Russell Jurney
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852221#comment-13852221 ]
Dmitriy V. Ryaboy commented on PIG-3630:
Sure enough. Once I add that, everything works in 0.12 and now I can't reproduce the bug you are reporting.
My pig is
[tw-mbp13-dryaboy-2 pig-0.12]$ ./bin/pig -version
Apache Pig version 0.12.0-SNAPSHOT (r1526044) compiled Dec 18 2013, 12:15:04
same with more recent:
[tw-mbp13-dryaboy-2 pig-0.12]$ ./bin/pig -version
Apache Pig version 0.12.1-SNAPSHOT (r1552124) compiled Dec 18 2013, 14:00:21
Back to you to get a reproducible test case
> Macros that work in Pig 0.11 fail in Pig 0.12 :(
>
> Key: PIG-3630
> URL: https://issues.apache.org/jira/browse/PIG-3630
> Project: Pig
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.12.0
> Reporter: Russell Jurney
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852133#comment-13852133 ]
Dmitriy V. Ryaboy commented on PIG-3630:
Is this an AvroStorage or data issue?
grunt> import '/Users/dmitriy/tmp/tf_idf.macro';
grunt> register build/ivy/lib/Pig/avro-1.7.4.jar
grunt> register build/ivy/lib/Pig/json-simple-1.1.jar
grunt> register contrib/piggybank/java/piggybank.jar
grunt> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
grunt> emails = load '/Users/dmitriy/Downloads/enron.avro';
grunt> describe emails
Schema for emails unknown.
(This is the same in both Pig 0.11 and Pig 0.12.)
Can you provide a simple reproducible use case that doesn't involve Avro, etc.? Can you share what debugging you've done so far?
> Macros that work in Pig 0.11 fail in Pig 0.12 :(
>
> Key: PIG-3630
> URL: https://issues.apache.org/jira/browse/PIG-3630
> Project: Pig
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.12.0
> Reporter: Russell Jurney
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851412#comment-13851412 ]
Dmitriy V. Ryaboy commented on PIG-3630:
That macro does not refer to a field called tf_idf. Could you post a fully reproducible test case?
> Macros that work in Pig 0.11 fail in Pig 0.12 :(
>
> Key: PIG-3630
> URL: https://issues.apache.org/jira/browse/PIG-3630
> Project: Pig
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.12.0
> Reporter: Russell Jurney
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851239#comment-13851239 ]
Dmitriy V. Ryaboy commented on PIG-3630:
Could you link to the code directly, rather than the book? The Safari website is giving me interstitials and other unpleasant things.
Have you investigated the schemas of relations referred to in the error message, and checked if your field references make sense?
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
at expanding macro 'tf_idf' (per_business.pig:9)
Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
> Macros that work in Pig 0.11 fail in Pig 0.12 :(
>
> Key: PIG-3630
> URL: https://issues.apache.org/jira/browse/PIG-3630
> Project: Pig
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.12.0
> Reporter: Russell Jurney
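For readers following along, the check suggested in the comment (compare the projected field names against the schema printed in ERROR 1025) can be sketched in a few lines of Python. This is only an illustration: `parse_flat_schema` and `missing_fields` are hypothetical helper names, not part of Pig, and the parser assumes a flat `name:type` schema string with no nested tuples or bags.

```python
def parse_flat_schema(schema_text):
    """Turn 'name:type,name:type' into a {name: type} dict (flat schemas only)."""
    fields = {}
    # Drop the sentence-ending period that the error message appends.
    for part in schema_text.strip().rstrip(".").split(","):
        name, _, type_name = part.partition(":")
        fields[name.strip()] = type_name.strip()
    return fields


def missing_fields(projected, schema_text):
    """Return the projected field names that the schema does not contain."""
    schema = parse_flat_schema(schema_text)
    return [name for name in projected if name not in schema]


# Schema string copied from the ERROR 1025 message above.
SCHEMA = ("business_id:chararray,token:chararray,"
          "term_freq:double,num_docs_with_token:long.")

print(missing_fields(["token", "term_freq", "tf_idf"], SCHEMA))
```

With the schema from the error message, the only name reported as missing is `tf_idf`, which matches what Pig complains about.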
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813150#comment-13813150 ] Dmitriy V. Ryaboy commented on PIG-3453: Oh I absolutely just meant collaboration on initial contrib to happen in github, for expediency and fast iteration. Of course once this work is in a committable/mergeable state, it should go into Apache. > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813080#comment-13813080 ] Dmitriy V. Ryaboy commented on PIG-3453: Mridul: In our experience at Twitter, Trident introduces pretty high overhead; in Summingbird, we relax the data delivery guarantees to get better throughput, and use Storm directly. Perhaps you want to try putting pig on top of Summingbird? If you did that, we might even be able to help :). In any case, interested in seeing how all of this will turn out. Cheolsoo: No real objections to svn branch. In the past I've found it far easier to cooperate on significant branches on github, rather than maintain an svn branch (you can easily have multiple branches, reviews are easier, etc). That's how Bill Graham and I did the HBaseStorage rewrite a few years back. But really that's up to developers doing the work. > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811726#comment-13811726 ] Dmitriy V. Ryaboy commented on PIG-3453: I don't see why Jacob can't keep working in a github branch... easier to look at what's changing, and he can keep merging the (read-only) git mirror from apache to keep up with changes. Jacob I see you are using Trident. Have you looked at your throughput numbers, vs going directly to storm? > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3549) Print hadoop jobids for failed, killed job
[ https://issues.apache.org/jira/browse/PIG-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808007#comment-13808007 ] Dmitriy V. Ryaboy commented on PIG-3549: OMG. Thanks. +1. > Print hadoop jobids for failed, killed job > -- > > Key: PIG-3549 > URL: https://issues.apache.org/jira/browse/PIG-3549 > Project: Pig > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Aniket Mokashi >Assignee: Aniket Mokashi > Fix For: 0.12.1 > > Attachments: PIG-3549.patch > > > It would be better if we dump the hadoop job ids for failed, killed jobs in > the pig log. Right now, the log looks like the following: > {noformat} > ERROR org.apache.pig.tools.grunt.Grunt: ERROR 6017: Job failed! Error - NA > INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: > Job job_pigexec_1 killed > {noformat} > From that it's hard to say which hadoop job failed if there are multiple jobs > running in parallel. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807458#comment-13807458 ] Dmitriy V. Ryaboy commented on PIG-3453: [~azaroth]: may I suggest https://github.com/twitter/algebird for this and many other approximate counting use cases? :-) Already in use by scalding, summingbird, and spark. > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Reporter: Pradeep Gollakota > Labels: storm > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3082) outputSchema of a UDF allows two usages when describing a Tuple schema
[ https://issues.apache.org/jira/browse/PIG-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784720#comment-13784720 ] Dmitriy V. Ryaboy commented on PIG-3082: So... that's a breaking change, a bunch of UDFs will fail under 0.12. Intended? > outputSchema of a UDF allows two usages when describing a Tuple schema > -- > > Key: PIG-3082 > URL: https://issues.apache.org/jira/browse/PIG-3082 > Project: Pig > Issue Type: Bug >Reporter: Julien Le Dem >Assignee: Jonathan Coveney > Fix For: 0.12.0 > > Attachments: PIG-3082-0.patch, PIG-3082-1.patch > > > When defining an evalfunc that returns a Tuple there are two ways you can > implement outputSchema(). > - The right way: return a schema that contains one Field that contains the > type and schema of the return type of the UDF > - The unreliable way: return a schema that contains more than one field and > it will be understood as a tuple schema even though there is no type (which > is in Field class) to specify that. This is particularly deceptive when the > output schema is derived from the input schema and the outputted Tuple > sometimes contains only one field. In such cases Pig understands the output > schema as a tuple only if there is more than one field. And sometimes it > works, sometimes it does not. > We should at least issue a warning (backward compatibility) if not plain > throw an exception when the output schema contains more than one Field. -- This message was sent by Atlassian JIRA (v6.1#6144)
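The ambiguity described in PIG-3082 can be illustrated with a small, self-contained sketch. The `Field` class and both helper methods below are hypothetical stand-ins written for this example; they are not Pig's `Schema` or `FieldSchema` API, and the sketch only models why a "more than one field means tuple" rule is unreliable while a single typed field is not.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a schema field; this is NOT Pig's FieldSchema class. */
final class Field {
    final String alias;
    final String type;        // e.g. "int" or "tuple"
    final List<Field> inner;  // nested fields, only meaningful when type is "tuple"

    Field(String alias, String type, Field... inner) {
        this.alias = alias;
        this.type = type;
        this.inner = Arrays.asList(inner);
    }
}

final class OutputSchemaDemo {
    /** The unreliable rule: assume a tuple only when the schema has more than one field. */
    static boolean looksLikeTupleByCount(List<Field> schema) {
        return schema.size() > 1;
    }

    /** The reliable rule: exactly one field, whose declared type states what is returned. */
    static boolean isTupleByType(List<Field> schema) {
        return schema.size() == 1 && "tuple".equals(schema.get(0).type);
    }

    public static void main(String[] args) {
        // A UDF whose output tuple mirrors its input columns.
        List<Field> twoColumns = Arrays.asList(new Field("a", "int"), new Field("b", "int"));
        List<Field> oneColumn = Arrays.asList(new Field("a", "int"));

        // The same one-column output, wrapped in a single field typed as a tuple.
        List<Field> wrapped = Arrays.asList(new Field("out", "tuple", new Field("a", "int")));

        System.out.println(looksLikeTupleByCount(twoColumns)); // true
        System.out.println(looksLikeTupleByCount(oneColumn));  // false: the tuple is no longer recognized
        System.out.println(isTupleByType(wrapped));            // true regardless of column count
    }
}
```

The count-based rule changes its answer when the derived schema happens to have a single column, which matches the "sometimes it works, sometimes it does not" behavior in the issue description.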
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784627#comment-13784627 ] Dmitriy V. Ryaboy commented on PIG-3445: That's a great addition, thanks Lorand. The code looks really tidy now. Looks like ParquetUtil is actually general util? Maybe add that functionality to org.apache.pig.impl.util.JarManager or something along those lines? [~julienledem] do we need to publish a new artifact version so fastutil isn't required for dictionary encoding? > Make Parquet format available out of the box in Pig > --- > > Key: PIG-3445 > URL: https://issues.apache.org/jira/browse/PIG-3445 > Project: Pig > Issue Type: Improvement >Reporter: Julien Le Dem > Fix For: 0.12.0 > > Attachments: PIG-3445-2.patch, PIG-3445-3.patch, PIG-3445.patch > > > We would add the Parquet jar in the Pig packages to make it available out of > the box to pig users. > On top of that we could add the parquet.pig package to the list of packages > to search for UDFs. (alternatively, the parquet jar could contain classes > named org.apache.pig.builtin.ParquetLoader and ParquetStorer) > This way users can use Parquet simply by typing: > A = LOAD 'foo' USING ParquetLoader(); > STORE A INTO 'bar' USING ParquetStorer(); -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783614#comment-13783614 ] Dmitriy V. Ryaboy commented on PIG-3445: [~lbendig] might be more succinct to use StoreFuncWrapper ? > Make Parquet format available out of the box in Pig > --- > > Key: PIG-3445 > URL: https://issues.apache.org/jira/browse/PIG-3445 > Project: Pig > Issue Type: Improvement >Reporter: Julien Le Dem > Fix For: 0.12.0 > > Attachments: PIG-3445-2.patch, PIG-3445.patch > > > We would add the Parquet jar in the Pig packages to make it available out of > the box to pig users. > On top of that we could add the parquet.pig package to the list of packages > to search for UDFs. (alternatively, the parquet jar could contain classes > named org.apache.pig.builtin.ParquetLoader and ParquetStorer) > This way users can use Parquet simply by typing: > A = LOAD 'foo' USING ParquetLoader(); > STORE A INTO 'bar' USING ParquetStorer(); -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782112#comment-13782112 ] Dmitriy V. Ryaboy commented on PIG-3480: That is fine with me, let's make sequence file optional. It will let people avoid the bug I am encountering, and also do things like use snappy compression. > TFile-based tmpfile compression crashes in some cases > - > > Key: PIG-3480 > URL: https://issues.apache.org/jira/browse/PIG-3480 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy > Fix For: 0.12.0 > > Attachments: PIG-3480.patch > > > When pig tmpfile compression is on, some jobs fail inside core hadoop > internals. > Suspect TFile is the problem, because an experiment in replacing TFile with > SequenceFile succeeded. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776879#comment-13776879 ] Dmitriy V. Ryaboy commented on PIG-3445: Other loaders like csv, avro, json, xml, etc (even RC, though it's in piggybank due to heavy dependencies and lack of support) are all in already so I don't see this as unfair, but as consistent. Not packaging the pq jars into pig monojar and instead adding them, the way we add guava et al for hbase, sounds like a good idea. [~julienledem] should we do that by providing a simple wrapper in pig builtins, or by messing with the job conf in parquet's own loader/storer? > Make Parquet format available out of the box in Pig > --- > > Key: PIG-3445 > URL: https://issues.apache.org/jira/browse/PIG-3445 > Project: Pig > Issue Type: Improvement >Reporter: Julien Le Dem > Fix For: 0.12.0 > > Attachments: PIG-3445.patch > > > We would add the Parquet jar in the Pig packages to make it available out of > the box to pig users. > On top of that we could add the parquet.pig package to the list of packages > to search for UDFs. (alternatively, the parquet jar could contain classes > named org.apache.pig.builtin.ParquetLoader and ParquetStorer) > This way users can use Parquet simply by typing: > A = LOAD 'foo' USING ParquetLoader(); > STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Resolution: Fixed Release Note: Skewed join internals improved to get 10% or better improvement on reducers by eliminating unnecessary reflection. Status: Resolved (was: Patch Available) Committed to trunk and 0.12 > Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable > deserilization > -- > > Key: PIG-3479 > URL: https://issues.apache.org/jira/browse/PIG-3479 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.12.0 > > Attachments: PIG-3479.patch, PIG-3479.whitespace.patch > > > While working on something unrelated I discovered some serialization errors > with recently added data types, and a heavy use of reflection slowing down > PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Attachment: PIG-3479.whitespace.patch Same patch, but with whitespace changes. Committing this. > Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable > deserilization > -- > > Key: PIG-3479 > URL: https://issues.apache.org/jira/browse/PIG-3479 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.12.0 > > Attachments: PIG-3479.patch, PIG-3479.whitespace.patch > > > While working on something unrelated I discovered some serialization errors > with recently added data types, and a heavy use of reflection slowing down > PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776728#comment-13776728 ] Dmitriy V. Ryaboy commented on PIG-3480: Rohini I suspect this might be something about complex data types, which afaik are pretty rare at Y! and extremely common at Twitter. > TFile-based tmpfile compression crashes in some cases > - > > Key: PIG-3480 > URL: https://issues.apache.org/jira/browse/PIG-3480 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy > Fix For: 0.12 > > Attachments: PIG-3480.patch > > > When pig tmpfile compression is on, some jobs fail inside core hadoop > internals. > Suspect TFile is the problem, because an experiment in replacing TFile with > SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776732#comment-13776732 ] Dmitriy V. Ryaboy commented on PIG-3480: Rohini, do you guys use lzo or gz compression? Maybe it's just lzo that's breaking. I can test gz. That never actually occurred to me, I just assumed this is completely busted because I could never get it to work (since 2010..) > TFile-based tmpfile compression crashes in some cases > - > > Key: PIG-3480 > URL: https://issues.apache.org/jira/browse/PIG-3480 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy > Fix For: 0.12 > > Attachments: PIG-3480.patch > > > When pig tmpfile compression is on, some jobs fail inside core hadoop > internals. > Suspect TFile is the problem, because an experiment in replacing TFile with > SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1377#comment-1377 ] Dmitriy V. Ryaboy commented on PIG-3480: [~knoguchi] yeah, I'm not sure the stack trace is relevant -- it's the only part that's not consistent about this. The problem goes away when I set pig.tmpfilecompression to false, or when I replace TFile with SequenceFile. I've also seen stack traces that were inside TFile, and had to do with some LZO decoding issues.. the actual error is really hard to capture, other than the fact that mappers fail consistently. > TFile-based tmpfile compression crashes in some cases > - > > Key: PIG-3480 > URL: https://issues.apache.org/jira/browse/PIG-3480 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy > Fix For: 0.12 > > Attachments: PIG-3480.patch > > > When pig tmpfile compression is on, some jobs fail inside core hadoop > internals. > Suspect TFile is the problem, because an experiment in replacing TFile with > SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3480: --- Attachment: PIG-3480.patch Attaching a rough patch which replaces use of TFile with SequenceFile. Next steps: - evaluate effect on size of compressed data for TFile vs SeqFile when TFile does work - add tests, make TFile tests pass (in this file they fail, because of course TFile is not being used) - make SeqFile the default method, since it doesn't break - allow TFile use by a switch, since current users may want to keep it. I would prefer to not do that, but might if the first step shows significant differences. Thoughts? Especially from folks using TFile-based compression in production ([~rohini]?) > TFile-based tmpfile compression crashes in some cases > - > > Key: PIG-3480 > URL: https://issues.apache.org/jira/browse/PIG-3480 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy > Fix For: 0.12 > > Attachments: PIG-3480.patch > > > When pig tmpfile compression is on, some jobs fail inside core hadoop > internals. > Suspect TFile is the problem, because an experiment in replacing TFile with > SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776602#comment-13776602 ] Dmitriy V. Ryaboy edited comment on PIG-3480 at 9/24/13 6:36 PM: - For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with "nonzero status 134"). I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. > TFile-based tmpfile compression crashes in some cases > - > > Key: PIG-3480 > URL: https://issues.apache.org/jira/browse/PIG-3480 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy > Fix For: 0.12 > > > When pig tmpfile compression is on, some jobs fail inside core hadoop > internals. > Suspect TFile is the problem, because an experiment in replacing TFile with > SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776602#comment-13776602 ] Dmitriy V. Ryaboy commented on PIG-3480: For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with "nonzero status 134"). I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. > TFile-based tmpfile compression crashes in some cases > - > > Key: PIG-3480 > URL: https://issues.apache.org/jira/browse/PIG-3480 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy > Fix For: 0.12 > > > When pig tmpfile compression is on, some jobs fail inside core hadoop > internals. > Suspect TFile is the problem, because an experiment in replacing TFile with > SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3480) TFile-based tmpfile compression crashes in some cases
Dmitriy V. Ryaboy created PIG-3480: -- Summary: TFile-based tmpfile compression crashes in some cases Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3445: --- Fix Version/s: 0.12 > Make Parquet format available out of the box in Pig > --- > > Key: PIG-3445 > URL: https://issues.apache.org/jira/browse/PIG-3445 > Project: Pig > Issue Type: Improvement >Reporter: Julien Le Dem > Fix For: 0.12 > > Attachments: PIG-3445.patch > > > We would add the Parquet jar in the Pig packages to make it available out of > the box to pig users. > On top of that we could add the parquet.pig package to the list of packages > to search for UDFs. (alternatively, the parquet jar could contain classes > named org.apache.pig.builtin.ParquetLoader and ParquetStorer) > This way users can use Parquet simply by typing: > A = LOAD 'foo' USING ParquetLoader(); > STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776172#comment-13776172 ] Dmitriy V. Ryaboy commented on PIG-3445: The size of the dependency introduced by this is orders of magnitude smaller than the HBase (or Avro) one, since everything comes from a single project (unlike HBase's liberal use of guava, metric, ZK, and everything else under the sun). The total size is less than 1 meg. Can we add parquet.pig to udf import list in the same patch? > Make Parquet format available out of the box in Pig > --- > > Key: PIG-3445 > URL: https://issues.apache.org/jira/browse/PIG-3445 > Project: Pig > Issue Type: Improvement >Reporter: Julien Le Dem > Attachments: PIG-3445.patch > > > We would add the Parquet jar in the Pig packages to make it available out of > the box to pig users. > On top of that we could add the parquet.pig package to the list of packages > to search for UDFs. (alternatively, the parquet jar could contain classes > named org.apache.pig.builtin.ParquetLoader and ParquetStorer) > This way users can use Parquet simply by typing: > A = LOAD 'foo' USING ParquetLoader(); > STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Fix Version/s: 0.12 > Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.12, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Dmitriy V. Ryaboy >Priority: Critical > Fix For: 0.12 > > Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, > PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Fix Version/s: 0.12 > Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable > deserilization > -- > > Key: PIG-3479 > URL: https://issues.apache.org/jira/browse/PIG-3479 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.12 > > Attachments: PIG-3479.patch > > > While working on something unrelated I discovered some serialization errors > with recently added data types, and a heavy use of reflection slowing down > PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Affects Version/s: 0.12 > Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.12, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Dmitriy V. Ryaboy >Priority: Critical > Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, > PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Status: Patch Available (was: Open) > Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable > deserilization > -- > > Key: PIG-3479 > URL: https://issues.apache.org/jira/browse/PIG-3479 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Attachments: PIG-3479.patch > > > While working on something unrelated I discovered some serialization errors > with recently added data types, and a heavy use of reflection slowing down > PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3479: -- Assignee: Dmitriy V. Ryaboy > Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable > deserilization > -- > > Key: PIG-3479 > URL: https://issues.apache.org/jira/browse/PIG-3479 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Attachments: PIG-3479.patch > > > While working on something unrelated I discovered some serialization errors > with recently added data types, and a heavy use of reflection slowing down > PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Attachment: PIG-3479.patch Attaching a patch. I extended an existing test to test the serialization.. it's the only place we test Nullables at all :(. > Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable > deserilization > -- > > Key: PIG-3479 > URL: https://issues.apache.org/jira/browse/PIG-3479 > Project: Pig > Issue Type: Bug >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Attachments: PIG-3479.patch > > > While working on something unrelated I discovered some serialization errors > with recently added data types, and a heavy use of reflection slowing down > PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
Dmitriy V. Ryaboy created PIG-3479: -- Summary: Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
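As a rough illustration of the kind of change PIG-3479 describes (less reflection on a hot deserialization path), here is a minimal sketch. It is not the attached patch: `Holder`, `IntHolder`, `TextHolder`, `HolderFactory`, and the type codes are all hypothetical names invented for this example. It only contrasts creating an object reflectively per record with creating it through a direct `switch` on a type code.

```java
/** Hypothetical value holder; stands in for a writable wrapper type. */
interface Holder {
    String kind();
}

final class IntHolder implements Holder {
    public String kind() { return "int"; }
}

final class TextHolder implements Holder {
    public String kind() { return "text"; }
}

final class HolderFactory {
    /** Reflective path: looks up and invokes a constructor on every call. */
    static Holder viaReflection(byte typeCode) {
        Class<? extends Holder> c = (typeCode == 1) ? IntHolder.class : TextHolder.class;
        try {
            return c.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not instantiate " + c.getName(), e);
        }
    }

    /** Direct path: a plain switch with ordinary constructor calls, no reflection. */
    static Holder direct(byte typeCode) {
        switch (typeCode) {
            case 1: return new IntHolder();
            case 2: return new TextHolder();
            default: throw new IllegalArgumentException("unknown type code " + typeCode);
        }
    }

    public static void main(String[] args) {
        // Both paths produce the same kind of object; only the mechanism differs.
        System.out.println(viaReflection((byte) 1).kind()); // int
        System.out.println(direct((byte) 1).kind());        // int
        System.out.println(direct((byte) 2).kind());        // text
    }
}
```

The direct path also rejects unknown type codes explicitly, whereas the reflective helper here silently falls back to `TextHolder`; whether the real code has a similar difference is not stated in the issue.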
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773636#comment-13773636 ] Dmitriy V. Ryaboy commented on PIG-2672: Aniket, can we prefix the properties with "pig."? That way we won't conflict with potential properties from Hadoop, and it's a little easier to analyze stuff when looking at the jobconf. > Optimize the use of DistributedCache > > > Key: PIG-2672 > URL: https://issues.apache.org/jira/browse/PIG-2672 > Project: Pig > Issue Type: Improvement >Reporter: Rohini Palaniswamy >Assignee: Aniket Mokashi > Fix For: 0.12 > > Attachments: PIG-2672.patch > > > Pig currently copies jar files to a temporary location in hdfs and then adds > them to DistributedCache for each job launched. This is inefficient in terms > of >* Space - The jars are distributed to task trackers for every job taking > up a lot of local temporary space in tasktrackers. >* Performance - The jar distribution impacts the job launch time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768649#comment-13768649 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

+1 to marking the interfaces as evolving.

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Fix For: 0.12
> Attachments: execengine.patch, mapreduce_execengine.patch,
> stats_scriptstate.patch, test_failures.txt, test_suite.patch,
> updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch,
> updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch,
> updated-8-29-2013-exec-engine.patch
>
> In an effort to adapt Pig to work using Apache Tez
> (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for
> a cleaner ExecutionEngine abstraction than existed before. The changes are
> not that major as Pig was already relatively abstracted out between the
> frontend and backend. The changes in the attached commit are essentially the
> barebones changes -- I tried to not change the structure of Pig's different
> components too much. I think it will be interesting to see in the future how
> we can refactor more areas of Pig to really honor this abstraction between
> the frontend and backend.
> Some of the changes were to reinstate an ExecutionEngine interface to tie
> together the frontend and backend, to make the changes in Pig to delegate
> to the EE when necessary, and to create an MRExecutionEngine that implements
> this interface. Other work included changing ExecType to cycle through the
> ExecutionEngines on the classpath and select the appropriate one (this is
> done using Java ServiceLoader, exactly how MapReduce does for choosing the
> framework to use between local and distributed mode). Also I tried to make
> ScriptState, JobStats, and PigStats as abstract as possible in their current
> state. I think in the future some work will need to be done here to perhaps
> re-evaluate the usage of ScriptState and the responsibilities of the
> different statistics classes. I haven't touched the PPNL, but I think more
> abstraction is needed here, perhaps in a separate patch.
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767220#comment-13767220 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

This is not just for Tez. The point is to enable POC work (in branches, forks, etc) and not have each such attempt redo all the work in this ticket. It's the same reason we provide things like pluggable LoadFuncs: to let people load things we didn't think of loading.

We should certainly work to stabilize 0.12 and fix issues like PIG-3457.

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Fix For: 0.12
> Attachments: execengine.patch, mapreduce_execengine.patch,
> stats_scriptstate.patch, test_failures.txt, test_suite.patch,
> updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch,
> updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch,
> updated-8-29-2013-exec-engine.patch
[jira] [Commented] (PIG-2965) RANDOM should allow seed initialization for ease of testing
[ https://issues.apache.org/jira/browse/PIG-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760313#comment-13760313 ]

Dmitriy V. Ryaboy commented on PIG-2965:
----------------------------------------

A UDF essentially has a constructor and an exec method.

"foreach lines udf(foo)" calls the exec method and passes the foo parameter to it.

"define udfinstance udf(foo)" passes foo to the constructor and binds the resulting UDF instance to the name "udfinstance" (so you can have many differently initialized instances of the same UDF in one script).

You can read more about all this in the docs for the "define" keyword and in the UDF author's guide.

> RANDOM should allow seed initialization for ease of testing
> -----------------------------------------------------------
>
> Key: PIG-2965
> URL: https://issues.apache.org/jira/browse/PIG-2965
> Project: Pig
> Issue Type: Bug
> Reporter: Aneesh Sharma
> Assignee: Jonathan Coveney
> Labels: newbie
> Attachments: PIG-2965-0.patch
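The constructor/exec split described above can be modeled in a few lines. This is a Python stand-in, not Pig's RANDOM implementation: the class name SeededRandom and its seed handling are assumptions made only to show why a value passed through "define" reaches the constructor once, while values in "foreach" reach exec on every tuple.

```python
import random


class SeededRandom:
    """Stand-in for a UDF: a constructor plus an exec method (hypothetical)."""

    def __init__(self, seed=None):
        # Models "define rand RANDOM('12345')": the argument arrives here, once.
        self._rng = random.Random(int(seed)) if seed is not None else random.Random()

    def exec(self, *args):
        # Models "foreach lines generate rand()": called once per input tuple.
        return self._rng.random()


# Two instances initialized with the same constructor argument.
rand_a = SeededRandom("12345")
rand_b = SeededRandom("12345")

first = [rand_a.exec() for _ in range(3)]
second = [rand_b.exec() for _ in range(3)]

# Same seed, same sequence: this is what makes seeded runs repeatable in tests.
print(first == second)
```

Because the seed is constructor state, two differently seeded instances can coexist in one script under different names, which is the point of binding an instance with "define".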
[jira] [Commented] (PIG-2965) RANDOM should allow seed initialization for ease of testing
[ https://issues.apache.org/jira/browse/PIG-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759609#comment-13759609 ]

Dmitriy V. Ryaboy commented on PIG-2965:
----------------------------------------

[~sdeneefe] are you sure you are using it right? I just tested and it works. Here's a test script you can run a few times:

{code}
define rand RANDOM('12345');
lines = load 'random.pig';
r = foreach lines generate rand();
dump r;
{code}

run using `pig -x local random.pig 2>/dev/null`

> RANDOM should allow seed initialization for ease of testing
> -----------------------------------------------------------
>
> Key: PIG-2965
> URL: https://issues.apache.org/jira/browse/PIG-2965
> Project: Pig
> Issue Type: Bug
> Reporter: Aneesh Sharma
> Assignee: Jonathan Coveney
> Labels: newbie
> Attachments: PIG-2965-0.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754235#comment-13754235 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

[~billgraham] looping you in for Ambrose.

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch,
> stats_scriptstate.patch, test_failures.txt, test_suite.patch,
> updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch,
> updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch,
> updated-8-29-2013-exec-engine.patch
[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration
[ https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753942#comment-13753942 ]

Dmitriy V. Ryaboy commented on PIG-3048:
----------------------------------------

No objections. After all, usage of the config info is purely optional.

We've run into trouble before with information of this sort becoming very big and triggering "JobConf too large" errors. Might want to look at compression at some point.

> Add mapreduce workflow information to job configuration
> -------------------------------------------------------
>
> Key: PIG-3048
> URL: https://issues.apache.org/jira/browse/PIG-3048
> Project: Pig
> Issue Type: Improvement
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Fix For: 0.11.2
> Attachments: PIG-3048.patch, PIG-3048.patch, PIG-3048.patch
>
> Adding workflow properties to the job configuration would enable logging and
> analysis of workflows in addition to individual MapReduce jobs. Suggested
> properties include a workflow ID, workflow name, adjacency list connecting
> nodes in the workflow, and the name of the current node in the workflow.
> mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with
> the application name
> e.g. pig_
> mapreduce.workflow.name - a name for the workflow, to distinguish this
> workflow from other workflows and to group different runs of the same workflow
> e.g. pig command line
> mapreduce.workflow.adjacency - an adjacency list for the workflow graph,
> encoded as mapreduce.workflow.adjacency. = of target nodes>
> mapreduce.workflow.node.name - the name of the node corresponding to this
> MapReduce job in the workflow adjacency list
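To make the proposed properties concrete, here is a minimal sketch of how they might be assembled for one job, together with a rough size check of the kind the JobConf-size concern calls for. The four property names come from the issue description. The adjacency encoding (one key per source node, comma-separated targets), the workflow ID "pig_example", the node names, and both function names are assumptions for illustration, not the PIG-3048 patch.

```python
def workflow_properties(workflow_id, workflow_name, edges, current_node):
    """Build the workflow entries for one job's configuration (illustrative)."""
    props = {
        "mapreduce.workflow.id": workflow_id,
        "mapreduce.workflow.name": workflow_name,
        "mapreduce.workflow.node.name": current_node,
    }
    # Assumption: one adjacency key per source node, targets comma-separated.
    for source, targets in edges.items():
        props["mapreduce.workflow.adjacency." + source] = ",".join(targets)
    return props


def serialized_size(props):
    """Rough size in bytes of the entries written as key=value lines (UTF-8)."""
    return sum(len(f"{k}={v}\n".encode("utf-8")) for k, v in props.items())


# A made-up three-job workflow: job1 feeds job2 and job3, job2 feeds job3.
props = workflow_properties(
    "pig_example",
    "example.pig",
    {"job1": ["job2", "job3"], "job2": ["job3"], "job3": []},
    "job2",
)

print(props["mapreduce.workflow.adjacency.job1"])
print(serialized_size(props), "bytes of workflow metadata")
```

The adjacency entries grow with the number of edges in the script's job graph, so a size check like serialized_size is one way to notice when this metadata starts to approach a configuration size limit.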
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753863#comment-13753863 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

Cheolsoo thanks so much for helping with this work! I think #1 and #3 are the issues (#3 will affect Ambrose and probably Lipstick). We can take care of updating Ambrose if we need to.

[~julienledem] do you think this is an important enough semantic change to force advanced clients such as Ambrose to rewrite / recompile? Or should we roll that part back?

[~bikassaha] thanks for the heads up, we'll need to update the pig-on-tez branch. Fortunately it doesn't affect this patch, since it's framework-independent and has no TEZ references.

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch,
> stats_scriptstate.patch, test_failures.txt, test_suite.patch,
> updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch,
> updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749190#comment-13749190 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

Olga, first commit to the spork branch is from *2012*.
https://github.com/dvryaboy/pig (the default branch on my github is "spork").

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch,
> stats_scriptstate.patch, test_failures.txt, test_suite.patch,
> updated-8-22-2013-exec-engine.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749014#comment-13749014 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

Rohini,

I want to reiterate that this patch has NO tez dependencies (if it does, that's a bug). The intention is not to make Tez possible. It's to make pluggable execution engines possible; and I do not want that functionality to be tied to a tez branch that will be unstable and in heavy development for the foreseeable future. This work will be immediately useful for the Spork (pig on spark) branch, for example. Also, it allows people to work with new runtimes *without modifying Pig*. So Pig-on-Tez doesn't even have to be done as a branch of this project; someone can go and experiment completely independently. For these reasons, I would like it in trunk.

You make a great point about the danger of changing exceptions, public methods, etc. I believe that most of these are project-public, and annotated as such. Do you have specific methods you are concerned about? Ideally we would change as little as possible for the end user.

Dmitriy

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, mapreduce_execengine.patch,
> stats_scriptstate.patch, test_failures.txt, test_suite.patch,
> updated-8-22-2013-exec-engine.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747065#comment-13747065 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

I'd like this patch in trunk since it's not Tez-specific, and allows people to experiment with other runtimes (for example, Spark or Drill).

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Assignee: Achal Soni
> Priority: Minor
> Attachments: execengine.patch, finalpatch.patch,
> mapreduce_execengine.patch, stats_scriptstate.patch, test_suite.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738288#comment-13738288 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

oh 3 more things :)

- I thought you found your way around the -y argument? I still see that in there.
- Don't comment out blocks of code, just delete them.
- Add some documentation about creating new Exec Engines to the xml-based docs, or at least post it here. Just having it in javadocs is not sufficient.

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Priority: Minor
> Attachments: pluggable_execengine.patch
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738285#comment-13738285 ]

Dmitriy V. Ryaboy commented on PIG-3419:
----------------------------------------

Hi Achal,

That's a large patch. Can you give us a roadmap for reading it -- what are the changes, at a high level? It looks like you had to change a bunch of stuff that's not (at first glance) directly related to exec mode.

Procedurally:
- please generate the patch using 'git diff --no-prefix' since the apache pig master is on svn
- please post the complete patch to Review Board, for ease of commenting
- please make sure that all new files have the apache license headers at the top

Thanks
-D

> Pluggable Execution Engine
> --------------------------
>
> Key: PIG-3419
> URL: https://issues.apache.org/jira/browse/PIG-3419
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.12
> Reporter: Achal Soni
> Priority: Minor
> Attachments: pluggable_execengine.patch
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723065#comment-13723065 ]

Dmitriy V. Ryaboy commented on PIG-3325:
----------------------------------------

Urgh, you are right of course. I can move the .next() call into the for loop... but I wonder if that will slow us down again. Will check.

> Adding a tuple to a bag is slow
> -------------------------------
>
> Key: PIG-3325
> URL: https://issues.apache.org/jira/browse/PIG-3325
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11, 0.11.1, 0.11.2
> Reporter: Mark Wagner
> Assignee: Dmitriy V. Ryaboy
> Priority: Critical
> Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch,
> PIG-3325.optimize.1.patch
>
> The time it takes to add a tuple to a bag has increased significantly,
> causing some jobs to take about 50x longer compared to 0.10.1. I've tracked
> this down to PIG-2923, which has made adding a tuple heavier weight (it now
> includes some memory estimation).
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-3325:
-----------------------------------

Assignee: Dmitriy V. Ryaboy (was: Mark Wagner)
Status: Patch Available (was: Open)

Marking as patch available. Please review.

> Adding a tuple to a bag is slow
> -------------------------------
>
> Key: PIG-3325
> URL: https://issues.apache.org/jira/browse/PIG-3325
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11.1, 0.11, 0.11.2
> Reporter: Mark Wagner
> Assignee: Dmitriy V. Ryaboy
> Priority: Critical
> Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch,
> PIG-3325.optimize.1.patch
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-3325:
-----------------------------------

Attachment: PIG-3325.3.patch

Slight update -- resetting all counters on clear(), and getting rid of an unnecessarily long 10K tuple test.

> Adding a tuple to a bag is slow
> -------------------------------
>
> Key: PIG-3325
> URL: https://issues.apache.org/jira/browse/PIG-3325
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11, 0.11.1, 0.11.2
> Reporter: Mark Wagner
> Assignee: Mark Wagner
> Priority: Critical
> Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch,
> PIG-3325.optimize.1.patch
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Attachment: PIG-3325.2.patch Updating with a patch. Results:
||Num Tuples in Bag || Trunk avg || Patch 1 avg || Patch 2 avg ||
| 1 | round: 0.00 | round: 0.00 | round: 0.00 |
| 20 | round: 0.01 | round: 0.00 | round: 0.00 |
| 100 | round: 0.13 | round: 0.00 | round: 0.00 |
| 1000 | round: 0.19 | round: 1.20 | round: 0.03 |
I also ran Mark's bench test in a loop 10 times (again, to account for jit effects). Results are as follows:
My Patch, Mark's test
7050 ns 450 ns 440 ns 550 ns 440 ns 440 ns 440 ns 440 ns 440 ns 540 ns 410 ns 440 ns 440 ns 430 ns 460 ns
Trunk, Mark's test
243240 ns 156640 ns 25440 ns 23470 ns 18930 ns 20710 ns 16890 ns 20210 ns 17630 ns 17900 ns 21420 ns 22550 ns 22900 ns 19800 ns 16770 ns
Mark's patch, Mark's Test
8480 ns 2750 ns 2690 ns 2760 ns 3270 ns 3590 ns 6530 ns 5900 ns 6340 ns 5410 ns 5400 ns 5420 ns 5670 ns 5410 ns 5420 ns
> Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Critical > Attachments: PIG-3325.2.patch, PIG-3325.demo.patch, > PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696209#comment-13696209 ] Dmitriy V. Ryaboy commented on PIG-3325: Ok I started looking at this, will update with a patch shortly. In the meantime -- my benchmark shows Mark's patch improves perf on small bags of 20-100 elements, but causes extremely poor performance for large bags. I created a benchmark that does 100 rounds of creating a bag of N elements, for values of N in [1,20,100,1000]. These sets of 100 rounds are run 15 times each, performance of the first 5 is thrown out to account for system warmup / jit optimizations. Results:
||Num Tuples in Bag || Trunk avg || Patch 1 avg ||
| 1 | round: 0.00 | round: 0.00 |
| 20 | round: 0.01 | round: 0.00 |
| 100 | round: 0.13 | round: 0.00 |
| 1000 | round: 0.19 | round: 1.20 |
> Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Critical > Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
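The benchmark procedure described in this comment (runs of 100 rounds, 15 runs per bag size, first 5 runs discarded as warmup) could be sketched roughly as follows. This is not the benchmark attached to the ticket: the class name WarmupBenchmark, its methods, and the ArrayList stand-in for a Pig bag are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical harness; names and the workload are illustrative only. */
final class WarmupBenchmark {
    static final int ROUNDS_PER_RUN = 100; // rounds of building one bag
    static final int RUNS = 15;            // timed runs per bag size
    static final int WARMUP_RUNS = 5;      // leading runs that are discarded

    /** Times one run: ROUNDS_PER_RUN calls of the workload for bag size n. */
    static long timeRun(IntConsumer work, int n) {
        long start = System.nanoTime();
        for (int r = 0; r < ROUNDS_PER_RUN; r++) {
            work.accept(n);
        }
        return System.nanoTime() - start;
    }

    /** Averages the samples after dropping the first warmup entries. */
    static double averageAfterWarmup(long[] samples, int warmup) {
        if (samples.length <= warmup) {
            throw new IllegalArgumentException("need more samples than warmup runs");
        }
        long sum = 0;
        for (int i = warmup; i < samples.length; i++) {
            sum += samples[i];
        }
        return (double) sum / (samples.length - warmup);
    }

    /** Runs all RUNS runs and returns the post-warmup average, in nanoseconds. */
    static double benchmark(IntConsumer work, int n) {
        long[] samples = new long[RUNS];
        for (int i = 0; i < RUNS; i++) {
            samples[i] = timeRun(work, n);
        }
        return averageAfterWarmup(samples, WARMUP_RUNS);
    }

    /** Stand-in workload: fills a plain list with n elements (not a Pig DataBag). */
    static void buildList(int n) {
        List<Integer> bag = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            bag.add(i);
        }
    }
}
```

Calling WarmupBenchmark.benchmark(WarmupBenchmark::buildList, n) for n in 1, 20, 100 and 1000 would give one averaged figure per bag size, in the same shape as the table above; a real run would substitute bag creation for buildList.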
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695279#comment-13695279 ] Dmitriy V. Ryaboy commented on PIG-3015: +1 if we find more stuff, we can open other jiras. Let's get this into trunk. > Rewrite of AvroStorage > -- > > Key: PIG-3015 > URL: https://issues.apache.org/jira/browse/PIG-3015 > Project: Pig > Issue Type: Improvement > Components: piggybank >Reporter: Joseph Adler >Assignee: Joseph Adler > Attachments: bad.avro, good.avro, PIG-3015-10.patch, > PIG-3015-11.patch, PIG-3015-12.patch, PIG-3015-20May2013.diff, > PIG-3015-22June2013.diff, PIG-3015-2.patch, PIG-3015-3.patch, > PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, > PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, > Test.java, with_dates.pig > > > The current AvroStorage implementation has a lot of issues: it requires old > versions of Avro, it copies data much more than needed, and it's verbose and > complicated. (One pet peeve of mine is that old versions of Avro don't > support Snappy compression.) > I rewrote AvroStorage from scratch to fix these issues. In early tests, the > new implementation is significantly faster, and the code is a lot simpler. > Rewriting AvroStorage also enabled me to implement support for Trevni (as > TrevniStorage). > I'm opening this ticket to facilitate discussion while I figure out the best > way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689465#comment-13689465 ] Dmitriy V. Ryaboy commented on PIG-3325: What if instead of figuring out size based on the first 100 elements, we sampled first, 11th, 21st, etc until we get 100 samples? Would help with small bags (where accuracy of estimate doesn't matter as much). > Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Critical > Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
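A minimal sketch of the strided sampling idea floated in this comment: measure the 1st, 11th, 21st, ... element until 100 samples are collected, then extrapolate. StridedSizeSampler and its methods are hypothetical names, not Pig classes, and the size callback stands in for whatever per-tuple size estimate the bag would use.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper, not part of Pig: samples every STRIDE-th added element. */
final class StridedSizeSampler {
    static final int STRIDE = 10;       // sample the 1st, 11th, 21st, ... element
    static final int MAX_SAMPLES = 100; // stop measuring after this many samples

    private long added = 0;        // total elements added so far
    private long sampledBytes = 0; // sum of the sizes that were measured
    private int samples = 0;       // number of elements measured

    /**
     * Call once per added element. The supplier is only invoked when the
     * element is actually sampled, so unsampled adds skip the size computation.
     */
    void onAdd(LongSupplier sizeOfElement) {
        if (samples < MAX_SAMPLES && added % STRIDE == 0) {
            sampledBytes += sizeOfElement.getAsLong();
            samples++;
        }
        added++;
    }

    /** Extrapolates the sampled average size to all added elements. */
    long estimatedBytes() {
        return samples == 0 ? 0 : sampledBytes * added / samples;
    }

    int samplesTaken() {
        return samples;
    }
}
```

With this stride, a bag of 25 elements triggers only three size computations, which is where the benefit for small bags would come from; the trade-off is a coarser estimate until more samples accumulate.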
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688911#comment-13688911 ] Dmitriy V. Ryaboy commented on PIG-3325: [~mwagner] I was loading complex thrift structures that had bags in them. With old code (all bags register with SMM) this led to tons of weak references that needed to be cleaned out by the SMM; new code fixed that, but apparently created this other problem (which in practice on our workloads is not significant.. but your workloads may be different). Looking forward to Rohini's patch. > Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Critical > Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686100#comment-13686100 ] Dmitriy V. Ryaboy commented on PIG-3325: The previous behavior (having SMM check all bags) was pretty bad, it caused significant sudden delays if the data you were loading had bags in it. We observed pretty good speed gains for those use cases once we got rid of mandatory bag registration. Also got rid of a few memory leaks while we were in there, and the linked list maintenance overhead in SMM. > Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Critical > Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679862#comment-13679862 ] Dmitriy V. Ryaboy commented on PIG-3325: [~mwagner] thanks for catching this perf regression. I only had time for a cursory look today -- why is the existing code O(n)? Seems like it sampled up to 100 elements and no more, so it's constant (once n>=100). Seems to me like all that materially changed was that you added the sampling bit to add(). Unfortunately, a number of Bags override add() (see my notes in PIG-2923), which makes doing this in the default add() of the abstract function unreliable. Seems to me like a better approach would be to tackle the fact that for every time that getMemorySize() is called while there are fewer than 100 elements, we iterate over the whole bag (which is what you mean by O(n)?). We can do this by jumping directly to the mLastContentsSize'th element in the Bag, if we know the structure, or at least iterate to it without calling getMemorySize(), and then add to our running avg, rather than recomputing it. So, no resetting aggSampleTupleSize in your version, or avgTupleSize in mine, to 0 when sampling, just ignoring the first mLastContentsSize in the iterator. Thoughts? > Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Critical > Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
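The approach proposed in this comment (keep the running total, skip the elements that were already measured, and only measure the new ones) might look roughly like the sketch below. It is an illustration under assumptions, not the patch on the ticket: IncrementalSizeEstimator is a hypothetical class, sampledCount plays the role the comment gives to mLastContentsSize, and the size function stands in for the per-tuple getMemorySize() call.

```java
import java.util.Iterator;
import java.util.function.ToLongFunction;

/** Hypothetical estimator, not Pig code: never re-measures an element. */
final class IncrementalSizeEstimator<T> {
    static final int SAMPLE_LIMIT = 100; // measure at most the first 100 elements

    private final ToLongFunction<T> sizeOf;
    private long sampledBytes = 0; // running sum, never reset while sampling
    private int sampledCount = 0;  // how many leading elements are already measured

    IncrementalSizeEstimator(ToLongFunction<T> sizeOf) {
        this.sizeOf = sizeOf;
    }

    /** Estimates total size; measures only elements not seen by earlier calls. */
    long estimate(Iterable<T> contents, long totalCount) {
        if (sampledCount < SAMPLE_LIMIT && sampledCount < totalCount) {
            Iterator<T> it = contents.iterator();
            // Walk past the already-measured prefix without computing sizes.
            for (int i = 0; i < sampledCount && it.hasNext(); i++) {
                it.next();
            }
            // Add the new elements to the running sum.
            while (sampledCount < SAMPLE_LIMIT && it.hasNext()) {
                sampledBytes += sizeOf.applyAsLong(it.next());
                sampledCount++;
            }
        }
        return sampledCount == 0 ? 0 : sampledBytes * totalCount / sampledCount;
    }

    /** Resets the counters, for use when the underlying bag is cleared. */
    void clear() {
        sampledBytes = 0;
        sampledCount = 0;
    }
}
```

Each element's size is computed at most once across all calls, so repeated estimates on a growing bag of fewer than 100 elements no longer redo the whole scan. The skip loop still iterates over the prefix; jumping straight to the right index would need knowledge of the bag's backing structure, as the comment notes.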
[jira] [Comment Edited] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679862#comment-13679862 ] Dmitriy V. Ryaboy edited comment on PIG-3325 at 6/10/13 8:23 PM: - [~mwagner] thanks for catching this perf regression. I only had time for a cursory look today -- why is the existing code O(N)? Seems like it sampled up to 100 elements and no more, so it's constant (once n>=100). Seems to me like all that materially changed was that you added the sampling bit to add(). Unfortunately, a number of Bags override add() (see my notes in PIG-2923), which makes doing this in the default add() of the abstract function unreliable. Seems to me like a better approach would be to tackle the fact that for every time that getMemorySize() is called while there are fewer than 100 elements, we iterate over the whole bag (which is what you mean by O(N)?). We can do this by jumping directly to the mLastContentsSize'th element in the Bag, if we know the structure, or at least iterate to it without calling getMemorySize(), and then add to our running avg, rather than recomputing it. So, no resetting aggSampleTupleSize in your version, or avgTupleSize in mine, to 0 when sampling, just ignoring the first mLastContentsSize in the iterator. Thoughts? > Adding a tuple to a bag is slow > --- > > Key: PIG-3325 > URL: https://issues.apache.org/jira/browse/PIG-3325 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11, 0.11.1, 0.11.2 >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Critical > Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch > > > The time it takes to add a tuple to a bag has increased significantly, > causing some jobs to take about 50x longer compared to 0.10.1. I've tracked > this down to PIG-2923, which has made adding a tuple heavier weight (it now > includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673780#comment-13673780 ] Dmitriy V. Ryaboy commented on PIG-3341: I don't think we are completely consistent, but turning invalid into null has been pretty standard. My personal preference is also to increment a counter for # of such conversions, and to log the first N occurrences (when N errors are encountered, log something to the effect of "not logging this error any more because there's so much of it.") > Improving performance of loading datetime values > > > Key: PIG-3341 > URL: https://issues.apache.org/jira/browse/PIG-3341 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.11.1 >Reporter: pat chan >Priority: Minor > Fix For: 0.12, 0.11.2 > > > The performance of loading datetime values can be improved by about 25% by > moving a single line in ToDate.java: > public static DateTimeZone extractDateTimeZone(String dtStr) { > Pattern pattern = > Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");; > should become: > static Pattern pattern = > Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$"); > public static DateTimeZone extractDateTimeZone(String dtStr) { > There is no need to recompile the regular expression for every value. I'm not > sure if this function is ever called concurrently, but Pattern objects are > thread-safe anyways. > As a test, I created a file of 10M timestamps: > for i in 0..1000 > puts '2000-01-01T00:00:00+23' > end > I then ran this script: > grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B; > Before the change it took 160s. > After the change, the script took 120s. > > Another performance improvement can be made for invalid datetime values. If a > datetime value is invalid, an exception is created and thrown, which is a > costly way to fail a validity check. 
To test the performance impact, I > created 10M invalid datetime values: > for i in 0..1000 > puts '2000-99-01T00:00:00+23' > end > In this test, the regex pattern was always recompiled. I then ran this script: > grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump > B; > The script took 190s. > I understand this could be considered an edge case and might not be worth > changing. However, if there are use cases where invalid dates are part of > normal processing, then you might consider fixing this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
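Two ideas from this thread can be sketched together: compiling the regular expression once instead of per value, and the comment's suggestion to count invalid values while logging only the first N. The regex is copied from the ticket; everything else (the ZoneSuffix and InvalidValueCounter classes, returning the matched suffix as a String rather than a DateTimeZone, and a list standing in for a logger) is an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the time-zone extraction in ToDate.java. */
final class ZoneSuffix {
    // Compiled once at class load rather than on every call.
    // Pattern instances are immutable and safe to share between threads.
    private static final Pattern ZONE =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    /** Returns the trailing zone designator ("Z", "+23", ...) or null if absent. */
    static String find(String dtStr) {
        Matcher m = ZONE.matcher(dtStr); // Matcher is created per call, not shared
        return m.find() ? m.group() : null;
    }
}

/** Hypothetical counter: counts every invalid value, logs only the first N. */
final class InvalidValueCounter {
    private final int maxLogged;
    private long count = 0;
    final List<String> log = new ArrayList<>(); // stand-in for a real logger

    InvalidValueCounter(int maxLogged) {
        this.maxLogged = maxLogged;
    }

    void record(String badValue) {
        count++;
        if (count <= maxLogged) {
            log.add("invalid datetime, converting to null: " + badValue);
        }
        if (count == maxLogged) {
            log.add("not logging this error any more (limit of " + maxLogged + " reached)");
        }
    }

    long count() {
        return count;
    }
}
```

A loader built along these lines would call record() wherever it turns an invalid value into null, and could report count() as a job counter at the end. Avoiding the exception on the invalid-value path, the second improvement raised in the ticket, is a separate change that this sketch does not attempt.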
[jira] [Assigned] (PIG-3284) Document PIG-3198 and PIG-2643
[ https://issues.apache.org/jira/browse/PIG-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3284: -- Assignee: Jonathan Coveney :-) > Document PIG-3198 and PIG-2643 > -- > > Key: PIG-3284 > URL: https://issues.apache.org/jira/browse/PIG-3284 > Project: Pig > Issue Type: Task >Reporter: Jonathan Coveney >Assignee: Jonathan Coveney > > These improvements are quite useful, but only if people know that they exist. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3198) Let users use any function from PigType -> PigType as if it were builtlin
[ https://issues.apache.org/jira/browse/PIG-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636681#comment-13636681 ] Dmitriy V. Ryaboy commented on PIG-3198: Please add docs! > Let users use any function from PigType -> PigType as if it were builtlin > - > > Key: PIG-3198 > URL: https://issues.apache.org/jira/browse/PIG-3198 > Project: Pig > Issue Type: Bug >Reporter: Jonathan Coveney >Assignee: Jonathan Coveney > Fix For: 0.12 > > Attachments: PIG-3198-0.patch, PIG-3198-1.patch, > PIG-3198-apache_header.patch > > > This idea is an extension of PIG-2643. Ideally, someone should be able to > call any function currently registered in Pig as if it were builtin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3267) HCatStorer fail in limit query
[ https://issues.apache.org/jira/browse/PIG-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629603#comment-13629603 ] Dmitriy V. Ryaboy commented on PIG-3267: (+1) > HCatStorer fail in limit query > -- > > Key: PIG-3267 > URL: https://issues.apache.org/jira/browse/PIG-3267 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.2, 0.10.1, 0.11.1 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.12 > > Attachments: PIG-3267-1.patch > > > The following query fail: > {code} > data = LOAD 'student.txt' as (name:chararray, age:int, gpa:double); > data_limited = limit data 10; > samples = foreach data_limited generate age as number; > store samples into 'samples' using > org.apache.hcatalog.pig.HCatStorer('part_dt=20130101T01T36'); > {code} > Error happens before launching the second job. Error message: > {code} > Message: org.apache.hadoop.mapred.FileAlreadyExistsException: Output > directory > hdfs://localhost:8020/user/hive/warehouse/samples/part_dt=20130101T01T36 > already exists > at > org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) > at > org.apache.hcatalog.mapreduce.FileOutputFormatContainer.checkOutputSpecs(FileOutputFormatContainer.java:135) > at > org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:72) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecsHelper(PigOutputFormat.java:207) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:188) > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887) > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) > at > 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) > at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) > at > org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157) > at > org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134) > at java.lang.Thread.run(Thread.java:680) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:257) > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3267) HCatStorer fail in limit query
[ https://issues.apache.org/jira/browse/PIG-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629601#comment-13629601 ] Dmitriy V. Ryaboy commented on PIG-3267: Should we apply this to 0.11 too? > HCatStorer fail in limit query > -- > > Key: PIG-3267 > URL: https://issues.apache.org/jira/browse/PIG-3267 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.2, 0.10.1, 0.11.1 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.12 > > Attachments: PIG-3267-1.patch > > > The following query fail: > {code} > data = LOAD 'student.txt' as (name:chararray, age:int, gpa:double); > data_limited = limit data 10; > samples = foreach data_limited generate age as number; > store samples into 'samples' using > org.apache.hcatalog.pig.HCatStorer('part_dt=20130101T01T36'); > {code} > Error happens before launching the second job. Error message: > {code} > Message: org.apache.hadoop.mapred.FileAlreadyExistsException: Output > directory > hdfs://localhost:8020/user/hive/warehouse/samples/part_dt=20130101T01T36 > already exists > at > org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) > at > org.apache.hcatalog.mapreduce.FileOutputFormatContainer.checkOutputSpecs(FileOutputFormatContainer.java:135) > at > org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:72) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecsHelper(PigOutputFormat.java:207) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:188) > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887) > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) > at > 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) > at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) > at > org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157) > at > org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134) > at java.lang.Thread.run(Thread.java:680) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:257) > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3264) mvn signanddeploy target broken for pigunit, pigsmoke and piggybank
[ https://issues.apache.org/jira/browse/PIG-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3264: --- Fix Version/s: 0.11.2 > mvn signanddeploy target broken for pigunit, pigsmoke and piggybank > --- > > Key: PIG-3264 > URL: https://issues.apache.org/jira/browse/PIG-3264 > Project: Pig > Issue Type: Bug >Reporter: Bill Graham >Assignee: Bill Graham > Fix For: 0.11.2 > > Attachments: PIG_3264.1.patch, PIG_3264_branch11.1.patch > > > Build fails with: > {noformat} > [artifact:deploy] Invalid reference: 'pigunit' > {noformat} > Patch on the way. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622756#comment-13622756 ] Dmitriy V. Ryaboy commented on PIG-2769: in 0.11 branch now. > a simple logic causes very long compiling time on pig 0.10.0 > > > Key: PIG-2769 > URL: https://issues.apache.org/jira/browse/PIG-2769 > Project: Pig > Issue Type: Bug > Components: build >Affects Versions: 0.10.0 > Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) >Reporter: Dan Li >Assignee: Nick White > Fix For: 0.12, 0.11.2 > > Attachments: case1.tar, PIG-2769.0.patch, PIG-2769.1.patch, > PIG-2769.2.patch, > TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt > > > We found the following simple logic will cause very long compiling time for > pig 0.10.0, while using pig 0.8.1, everything is fine. > A = load 'A.txt' using PigStorage() AS (m: int); > B = FOREACH A { > days_str = (chararray) > (m == 1 ? 31: > (m == 2 ? 28: > (m == 3 ? 31: > (m == 4 ? 30: > (m == 5 ? 31: > (m == 6 ? 30: > (m == 7 ? 31: > (m == 8 ? 31: > (m == 9 ? 30: > (m == 10 ? 31: > (m == 11 ? 30:31))))))))))); > GENERATE >days_str as days_str; > } > store B into 'B'; > and here's a simple input file example: A.txt > 1 > 2 > 3 > The pig version we used in the test > Apache Pig version 0.10.0-SNAPSHOT (rexported) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-2769: --- Fix Version/s: 0.11.2 > a simple logic causes very long compiling time on pig 0.10.0 > > > Key: PIG-2769 > URL: https://issues.apache.org/jira/browse/PIG-2769 > Project: Pig > Issue Type: Bug > Components: build >Affects Versions: 0.10.0 > Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) >Reporter: Dan Li >Assignee: Nick White > Fix For: 0.12, 0.11.2 > > Attachments: case1.tar, PIG-2769.0.patch, PIG-2769.1.patch, > PIG-2769.2.patch, > TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt > > > We found the following simple logic will cause very long compiling time for > pig 0.10.0, while using pig 0.8.1, everything is fine. > A = load 'A.txt' using PigStorage() AS (m: int); > B = FOREACH A { > days_str = (chararray) > (m == 1 ? 31: > (m == 2 ? 28: > (m == 3 ? 31: > (m == 4 ? 30: > (m == 5 ? 31: > (m == 6 ? 30: > (m == 7 ? 31: > (m == 8 ? 31: > (m == 9 ? 30: > (m == 10 ? 31: > (m == 11 ? 30:31))))))))))); > GENERATE >days_str as days_str; > } > store B into 'B'; > and here's a simple input file example: A.txt > 1 > 2 > 3 > The pig version we used in the test > Apache Pig version 0.10.0-SNAPSHOT (rexported) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-3151) No documentation for Pig 0.10.1
[ https://issues.apache.org/jira/browse/PIG-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy resolved PIG-3151. Resolution: Won't Fix Release Note: we're past this now.. resolving so I can "release" the release in jira > No documentation for Pig 0.10.1 > --- > > Key: PIG-3151 > URL: https://issues.apache.org/jira/browse/PIG-3151 > Project: Pig > Issue Type: Bug > Components: documentation >Affects Versions: 0.10.1 >Reporter: Russell Jurney >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.10.1 > > > http://pig.apache.org/docs/r0.10.1/start.html is missing! > http://pig.apache.org/docs/r0.10.0/start.html is there. > Are there no docs for 0.10.1? Arg! :) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621897#comment-13621897 ] Dmitriy V. Ryaboy commented on PIG-2769: Didn't see earlier that this only went into trunk (thanks [~knoguchi] for pointing this out!). We should put this into 0.11 branch, maybe there will be an 0.11.2 before 12 comes out. > a simple logic causes very long compiling time on pig 0.10.0 > > > Key: PIG-2769 > URL: https://issues.apache.org/jira/browse/PIG-2769 > Project: Pig > Issue Type: Bug > Components: build >Affects Versions: 0.10.0 > Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) >Reporter: Dan Li >Assignee: Nick White > Fix For: 0.12 > > Attachments: case1.tar, PIG-2769.0.patch, PIG-2769.1.patch, > PIG-2769.2.patch, > TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt > > > We found the following simple logic will cause very long compiling time for > pig 0.10.0, while using pig 0.8.1, everything is fine. > A = load 'A.txt' using PigStorage() AS (m: int); > B = FOREACH A { > days_str = (chararray) > (m == 1 ? 31: > (m == 2 ? 28: > (m == 3 ? 31: > (m == 4 ? 30: > (m == 5 ? 31: > (m == 6 ? 30: > (m == 7 ? 31: > (m == 8 ? 31: > (m == 9 ? 30: > (m == 10 ? 31: > (m == 11 ? 30:31))))))))))); > GENERATE >days_str as days_str; > } > store B into 'B'; > and here's a simple input file example: A.txt > 1 > 2 > 3 > The pig version we used in the test > Apache Pig version 0.10.0-SNAPSHOT (rexported) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3222) New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer
[ https://issues.apache.org/jira/browse/PIG-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616999#comment-13616999 ] Dmitriy V. Ryaboy commented on PIG-3222: This is pretty confusing. Any ideas on how to fix this? Can we get away from the whole instantiation thing, and maybe keep an object registry? > New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer > --- > > Key: PIG-3222 > URL: https://issues.apache.org/jira/browse/PIG-3222 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.11 >Reporter: Feng Peng > Labels: hcatalog > Attachments: PigStorerDemo.java > > > Pig 0.11 assigns a different UDFContextSignature to different invocations of > the same load/store statement. This change breaks HCatStorer, which > assumes that all front-end and back-end invocations of the same store statement > have the same UDFContextSignature so that it can read the previously stored > information correctly. > The related HCatalog code is in > https://svn.apache.org/repos/asf/incubator/hcatalog/branches/branch-0.5/hcatalog-pig-adapter/src/main/java/org/apache/hcatalog/pig/HCatStorer.java > (the setStoreLocation() function).
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611490#comment-13611490 ] Dmitriy V. Ryaboy commented on PIG-2586: Hm, I guess we can add the logical plan if we want -- we just need to feed it to the PPNL (PigProgressNotificationListener) somehow. Ambrose is pretty separate from Pig specifics; if you give it a DAG, it'll draw it. Do people use the logical plan to diagnose issues? I don't think I have had to do that yet. > A better plan/data flow visualizer > -- > > Key: PIG-2586 > URL: https://issues.apache.org/jira/browse/PIG-2586 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai > Labels: gsoc2013 > > Pig supports a dot graph style plan to visualize the > logical/physical/mapreduce plan (explain with -dot option, see > http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). > However, the dot graph takes an extra step to generate the plan graph and the > quality of the output is not good. It would be better to implement a better > visualizer for Pig. It should: > 1. show operator type and alias > 2. turn on/off output schema > 3. dive into foreach inner plan on demand > 4. provide a way to show operator source code, e.g., tooltip of an operator > (plans don't currently have this information, but you can assume this is in > place) > 5. besides visualizing the logical/physical/mapreduce plan, visualizing the script > itself is also useful > 6. may rely on some java graphic library such as Swing > This is a candidate project for Google Summer of Code 2013. More information > about the program can be found at > https://cwiki.apache.org/confluence/display/PIG/GSoc2013
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611478#comment-13611478 ] Dmitriy V. Ryaboy commented on PIG-2586: It does with the linked patch (it also visualizes the MR plan, without details of what's happening inside the map or reduce stage, without the patch). > A better plan/data flow visualizer > -- > > Key: PIG-2586 > URL: https://issues.apache.org/jira/browse/PIG-2586 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai > Labels: gsoc2013
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611472#comment-13611472 ] Dmitriy V. Ryaboy commented on PIG-2586: Do we need this given Ambrose (and from what I hear, Ambari)? What is the difference between what this proposes and what Ambrose does? https://github.com/twitter/ambrose There is an Ambrose patch to add inner plans, too: https://github.com/twitter/ambrose/issues/62 > A better plan/data flow visualizer > -- > > Key: PIG-2586 > URL: https://issues.apache.org/jira/browse/PIG-2586 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai > Labels: gsoc2013
[jira] [Commented] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree
[ https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611467#comment-13611467 ] Dmitriy V. Ryaboy commented on PIG-3258: Please generate the patch against the project root. > Patch to allow MultiStorage to use more than one index to generate output tree > -- > > Key: PIG-3258 > URL: https://issues.apache.org/jira/browse/PIG-3258 > Project: Pig > Issue Type: Improvement > Components: piggybank >Reporter: Joel Fouse >Priority: Minor > Labels: piggybank > > I have made a patch to enable MultiStorage to handle multiple tuple indexes, > rather than only one, for generating the output directory structure. Before > I submit it, though, I need to know if I should generate the patch from > /contrib/piggybank/java where I've been compiling and unit testing, or back > at the project root.
[jira] [Commented] (PIG-3254) Fail a failed Pig script quicker
[ https://issues.apache.org/jira/browse/PIG-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608070#comment-13608070 ] Dmitriy V. Ryaboy commented on PIG-3254: Can I add a request for whoever will work on this ticket? Right now we die with "MR Job Failed" but don't say which job. When multiple jobs are launched and one of them fails, the other ones are killed, and users find it hard to figure out which job was the cause of all badness. It would be nice to print out the job id of the failed job. > Fail a failed Pig script quicker > > > Key: PIG-3254 > URL: https://issues.apache.org/jira/browse/PIG-3254 > Project: Pig > Issue Type: Improvement >Reporter: Daniel Dai > Fix For: 0.12 > > > Credit to [~asitecn]. Currently Pig can launch several mapreduce jobs > simultaneously. When one mapreduce job fails, we need to wait for the simultaneous > mapreduce jobs to finish. In addition, we could potentially launch additional > jobs which are doomed to fail. However, this is unnecessary in some cases: > * If "stop.on.failure==true", we can kill parallel jobs, and fail the whole > script > * If "stop.on.failure==false", and no "store" could succeed, we can also kill > parallel jobs, and fail the whole script > Considering that simultaneous jobs may take a long time to finish, this could > significantly improve the turnaround in some cases.
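The two kill conditions in the description, plus the request to name the failed job, can be sketched roughly as follows. This is only an illustration and not Pig's launcher code: the class FailFastPolicy, its methods, and the count of stores that can still succeed are hypothetical names introduced for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the fail-fast rule from PIG-3254; not Pig's actual code. */
final class FailFastPolicy {

    /**
     * Decides, after one job has failed, whether the jobs still running in
     * parallel should be killed and the whole script failed.
     *
     * @param stopOnFailure       value of the "stop.on.failure" setting
     * @param storesStillPossible assumed count of "store" statements that can still succeed
     */
    static boolean shouldKillParallelJobs(boolean stopOnFailure, int storesStillPossible) {
        if (stopOnFailure) {
            return true;                 // first case in the description
        }
        return storesStillPossible == 0; // second case: no store can succeed anymore
    }

    /** Builds an error message that names the failed job instead of a bare "MR Job Failed". */
    static String failureMessage(String failedJobId, List<String> killedJobIds) {
        return "MR Job Failed: " + failedJobId + " (killed parallel jobs: " + killedJobIds + ")";
    }

    public static void main(String[] args) {
        // Made-up job ids, used only to show the message shape.
        if (shouldKillParallelJobs(false, 0)) {
            System.out.println(failureMessage("job_A", Arrays.asList("job_B", "job_C")));
        }
    }
}
```

With stop.on.failure set to false and at least one store still able to succeed, the sketch returns false and the remaining jobs keep running, which matches the current behavior described above.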
[jira] [Commented] (PIG-2388) Make shim for Hadoop 0.20 and 0.23 support dynamic
[ https://issues.apache.org/jira/browse/PIG-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605849#comment-13605849 ] Dmitriy V. Ryaboy commented on PIG-2388: Hive does this, and back in the day there was a patch that did this for Pig and hadoop 18 vs hadoop 20. Should be doable, though it'll take work. > Make shim for Hadoop 0.20 and 0.23 support dynamic > -- > > Key: PIG-2388 > URL: https://issues.apache.org/jira/browse/PIG-2388 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.9.2, 0.10.0 >Reporter: Thomas Weise > Fix For: 0.9.2, 0.10.0 > > Attachments: PIG-2388_branch-0.9.patch > > > We need a single Pig installation that works with both Hadoop versions. The > current shim implementation assumes different builds for each version. We can > solve this statically through an internal build/installation system or by making > the shim dynamic so that pig.jar will work on both versions with runtime > detection. The attached patch converts the static shims into a shim > interface with 2 implementations, each of which will be compiled against the > respective Hadoop version and included in a single pig.jar (similar to what > Hive does). > The default build behavior remains unchanged, only the shim for > ${hadoopversion} will be compiled. Both shims can be built via: ant > -Dbuild-all-shims=true
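As a rough illustration of the "shim interface with 2 implementations" plus runtime detection described above, here is a minimal sketch. It is not the attached patch: HadoopShim, Hadoop20Shim, Hadoop23Shim and ShimLoader are hypothetical names, and the version string is passed in by hand here. In a real build it would presumably come from Hadoop itself, for example from org.apache.hadoop.util.VersionInfo.getVersion().

```java
/** Hypothetical shim contract; the real one would wrap version-specific Hadoop calls. */
interface HadoopShim {
    String name();
}

/** Stand-in for the implementation compiled against Hadoop 0.20. */
final class Hadoop20Shim implements HadoopShim {
    public String name() { return "hadoop20"; }
}

/** Stand-in for the implementation compiled against Hadoop 0.23. */
final class Hadoop23Shim implements HadoopShim {
    public String name() { return "hadoop23"; }
}

/** Picks a shim at runtime from the detected Hadoop version string. */
final class ShimLoader {
    static HadoopShim forVersion(String hadoopVersion) {
        if (hadoopVersion.startsWith("0.20.")) {
            return new Hadoop20Shim();
        }
        if (hadoopVersion.startsWith("0.23.")) {
            return new Hadoop23Shim();
        }
        throw new IllegalArgumentException("Unsupported Hadoop version: " + hadoopVersion);
    }

    public static void main(String[] args) {
        // The version is hard-coded for the example; see the note above on detection.
        System.out.println(ShimLoader.forVersion("0.20.2").name());
    }
}
```

One design note, stated as an assumption: since each real implementation links against a different Hadoop release, a loader would likely instantiate the chosen class by name through reflection rather than with "new", so that the other implementation is never loaded on the wrong cluster.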
[jira] [Commented] (PIG-3208) [zebra] TFile should not set io.compression.codec.lzo.buffersize
[ https://issues.apache.org/jira/browse/PIG-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605765#comment-13605765 ] Dmitriy V. Ryaboy commented on PIG-3208: [~daijy] why wouldn't we commit fixes provided by the community? > [zebra] TFile should not set io.compression.codec.lzo.buffersize > > > Key: PIG-3208 > URL: https://issues.apache.org/jira/browse/PIG-3208 > Project: Pig > Issue Type: Bug >Reporter: Eugene Koontz >Assignee: Eugene Koontz > Attachments: PIG-3208.patch > > > In contrib/zebra/src/java/org/apache/hadoop/zebra/tfile/Compression.java, the > following occurs: > {code} > conf.setInt("io.compression.codec.lzo.buffersize", 64 * 1024); > {code} > This can cause the LZO decompressor, if called within the context of reading > TFiles, to return with an error code when trying to uncompress LZO-compressed > data, if the data's compressed size is too large to fit in 64 * 1024 bytes. > For example, the Hadoop-LZO code uses a different default value (256 * 1024): > https://github.com/twitter/hadoop-lzo/blob/master/src/java/com/hadoop/compression/lzo/LzoCodec.java#L185 > This can lead to a case where, if data is compressed on a cluster that uses the > default {{io.compression.codec.lzo.buffersize}} = 256*1024 and is then read > through Pig's zebra, the Mapper exits with > code 134 because the LZO decompressor returns -4 (which encodes the LZO C > library error LZO_E_INPUT_OVERRUN) when trying to uncompress the data.
The > stack trace of such a case is shown below: > {code} > 2013-02-17 14:47:50,709 INFO com.hadoop.compression.lzo.LzoCodec: Creating > stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with > bufferSize: 262144 > 2013-02-17 14:47:50,849 INFO org.apache.hadoop.io.compress.CodecPool: Paying > back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 > 2013-02-17 14:47:50,849 INFO org.apache.hadoop.mapred.MapTask: Finished spill > 3 > 2013-02-17 14:47:50,857 INFO org.apache.hadoop.io.compress.CodecPool: > Borrowing codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 > 2013-02-17 14:47:50,866 INFO com.hadoop.compression.lzo.LzoCodec: Creating > stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with > bufferSize: 262144 > 2013-02-17 14:47:50,879 INFO org.apache.hadoop.io.compress.CodecPool: Paying > back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 > 2013-02-17 14:47:50,879 INFO org.apache.hadoop.mapred.MapTask: Finished spill > 4 > 2013-02-17 14:47:50,887 INFO org.apache.hadoop.mapred.Merger: Merging 5 > sorted segments > 2013-02-17 14:47:50,890 INFO org.apache.hadoop.io.compress.CodecPool: > Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@66a23610 > 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: > calling decompressBytesDirect with buffer with: position: 0 and limit: 262144 > 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: > read: 245688 bytes from decompressor. 
> 2013-02-17 14:47:50,891 INFO org.apache.hadoop.io.compress.CodecPool: > Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@43684706 > 2013-02-17 14:47:50,892 INFO com.hadoop.compression.lzo.LzoDecompressor: > calling decompressBytesDirect with buffer with: position: 0 and limit: 65536 > 2013-02-17 14:47:50,895 INFO org.apache.hadoop.mapred.TaskLogsTruncater: > Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 > 2013-02-17 14:47:50,897 FATAL org.apache.hadoop.mapred.Child: Error running > child : java.lang.InternalError: lzo1x_decompress returned: -4 > at > com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native > Method) > at > com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:307) > at > org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) > at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:341) > at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:371) > at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355) > at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:387) > at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) > at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) > at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) > at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) > at > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) > at > org.ap
[jira] [Commented] (PIG-3132) NPE when illustrating a relation with HCatLoader
[ https://issues.apache.org/jira/browse/PIG-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604929#comment-13604929 ] Dmitriy V. Ryaboy commented on PIG-3132: +1 > NPE when illustrating a relation with HCatLoader > - > > Key: PIG-3132 > URL: https://issues.apache.org/jira/browse/PIG-3132 > Project: Pig > Issue Type: Bug >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.12 > > Attachments: PIG-3132-1.patch > > > Get NPE exception when illustrate a relation with HCatLoader: > {code} > A = LOAD 'studenttab10k' USING org.apache.hcatalog.pig.HCatLoader(); > illustrate A; > {code} > Exception: > {code} > java.lang.NullPointerException > at > org.apache.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:274) > at > org.apache.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:238) > at > org.apache.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:61) > at > org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:210) > at > org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:190) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:129) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) > at > org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:194) > at > org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257) > at > org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:222) > at > org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:154) > at 
org.apache.pig.PigServer.getExamples(PigServer.java:1245) > at > org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:698) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164) > at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:67) > {code} > HCatalog side is tracked with HCATALOG-163.
[jira] [Updated] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
[ https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3194: --- Resolution: Fixed Fix Version/s: 0.12 Status: Resolved (was: Patch Available) Committed to 0.11.1 and trunk. Thanks Kai for reporting and Prashant for fixing! > Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2 > --- > > Key: PIG-3194 > URL: https://issues.apache.org/jira/browse/PIG-3194 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11 >Reporter: Kai Londenberg >Assignee: Prashant Kommireddi > Fix For: 0.12, 0.11.1 > > Attachments: PIG-3194_2.patch, PIG-3194.patch > > > The changes to ObjectSerializer.java in the following commit > http://svn.apache.org/viewvc?view=revision&revision=1403934 break > compatibility with Hadoop 0.20.2 clusters. > The reason is that the code uses methods from Apache Commons Codec 1.4 > which are not available in Apache Commons Codec 1.3, which ships with > Hadoop 0.20.2. > The offending methods are Base64.decodeBase64(String) and > Base64.encodeBase64URLSafeString(byte[]) > If I revert these changes, Pig 0.11.0 candidate 2 works well with our Hadoop > 0.20.2 clusters.
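For anyone who wants to check whether a given classpath has the two Base64 methods named in this report, a small reflection probe is one option. This is only a sketch and not part of the committed fix: CodecCompatCheck and hasMethod are hypothetical names, and the probe simply reports false when the class or method is absent.

```java
/** Hypothetical probe for methods that exist in one library version but not another. */
final class CodecCompatCheck {

    /**
     * Returns true if the named class can be loaded and has a public method
     * with the given name and parameter types; false otherwise.
     */
    static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className);
            cls.getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base64 = "org.apache.commons.codec.binary.Base64";
        // The two methods the report names as missing from Commons Codec 1.3.
        System.out.println("decodeBase64(String): "
                + hasMethod(base64, "decodeBase64", String.class));
        System.out.println("encodeBase64URLSafeString(byte[]): "
                + hasMethod(base64, "encodeBase64URLSafeString", byte[].class));
    }
}
```

Run with the cluster's Hadoop jars on the classpath, both lines would be expected to print false on a setup that ships Commons Codec 1.3, going by the description above.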