[jira] [Updated] (PIG-2788) improved string interpolation of variables

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2788:
--

Fix Version/s: 0.12
Status: Patch Available (was: Open)

> improved string interpolation of variables
> --
>
> Key: PIG-2788
> URL: https://issues.apache.org/jira/browse/PIG-2788
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.10.0, 0.9.2
> Reporter: Jeff Hodges
> Assignee: Jonathan Coveney
> Fix For: 0.12
>
> Attachments: PIG-2788-0.patch
>
>
> The simplest example of the failure of the current string interpolation is 
> {code}
> store my_rel into '$OUTPUT_';
> {code}
> This will raise an error saying that OUTPUT_ is not a variable passed in. 
> Similar errors happen with a variety of other trailing characters.
> It would be nice if '${OUTPUT}_', or something similar, worked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2788) improved string interpolation of variables

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2788:
--

Attachment: PIG-2788-0.patch

I bet nobody thought this would ever get some love :) but this is something 
that has long annoyed me, and I wanted to familiarize myself a bit more with 
that path of the code.

This has the syntax Jeff proposed. Nothing has changed, except now you can 
optionally write ${stuff} to allay ambiguity, which allows ${tmp}_ and other 
such things.

It's a pretty easy change, too.
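
As a rough illustration of the ambiguity (an analogy only, not Pig's parameter substitution code and not the attached patch): Python's `string.Template` happens to use the same `$name` / `${name}` convention. The parameter value `/data/out` below is made up for the example.

```python
from string import Template

params = {"OUTPUT": "/data/out"}  # hypothetical parameter value

# Unbraced: the identifier greedily absorbs the trailing underscore,
# so the lookup is for "OUTPUT_", which was never defined.
try:
    Template("store my_rel into '$OUTPUT_';").substitute(params)
    unbraced_error = None
except KeyError as exc:
    unbraced_error = exc.args[0]

# Braced: the name ends at the closing brace, so "_" is literal text.
braced = Template("store my_rel into '${OUTPUT}_';").substitute(params)

print(unbraced_error)  # the name that could not be resolved
print(braced)          # the statement after substitution
```

The braces change nothing for names that were already unambiguous, which matches the "nothing has changed" note above.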

> improved string interpolation of variables
> --
>
> Key: PIG-2788
> URL: https://issues.apache.org/jira/browse/PIG-2788
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.2, 0.10.0
> Reporter: Jeff Hodges
> Assignee: Jonathan Coveney
> Attachments: PIG-2788-0.patch



[jira] [Assigned] (PIG-2788) improved string interpolation of variables

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney reassigned PIG-2788:
-

Assignee: Jonathan Coveney

> improved string interpolation of variables
> --
>
> Key: PIG-2788
> URL: https://issues.apache.org/jira/browse/PIG-2788
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.2, 0.10.0
> Reporter: Jeff Hodges
> Assignee: Jonathan Coveney



[jira] [Assigned] (PIG-3082) outputSchema of a UDF allows two usages when describing a Tuple schema

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney reassigned PIG-3082:
-

Assignee: Jonathan Coveney

> outputSchema of a UDF allows two usages when describing a Tuple schema
> --
>
> Key: PIG-3082
> URL: https://issues.apache.org/jira/browse/PIG-3082
> Project: Pig
> Issue Type: Bug
> Reporter: Julien Le Dem
> Assignee: Jonathan Coveney
> Attachments: PIG-3082-0.patch
>
>
> When defining an EvalFunc that returns a Tuple, there are two ways you can 
> implement outputSchema().
> - The right way: return a schema that contains one Field that contains the 
> type and schema of the return type of the UDF.
> - The unreliable way: return a schema that contains more than one field, and 
> it will be understood as a tuple schema even though there is no type (which 
> is in the Field class) to specify that. This is particularly deceptive when 
> the output schema is derived from the input schema and the output Tuple 
> sometimes contains only one field. In such cases Pig understands the output 
> schema as a tuple only if there is more than one field, so sometimes it 
> works and sometimes it does not.
> We should at least issue a warning (for backward compatibility), if not simply 
> throw an exception when the output schema contains more than one Field.
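
The check proposed above can be sketched roughly as follows. This is a toy model under stated assumptions: `Field`, `Schema`, and `check_output_schema` are hypothetical stand-ins written for this illustration, not Pig's `Schema` / `FieldSchema` classes or any existing Pig API.

```python
import warnings
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a schema field: a name, a type, a nested schema."""
    name: str
    type: str
    schema: Optional["Schema"] = None


@dataclass
class Schema:
    """Hypothetical stand-in for a schema: an ordered list of fields."""
    fields: List[Field] = field(default_factory=list)


def check_output_schema(schema: Schema) -> Schema:
    """Warn when a UDF output schema has more than one top-level field."""
    if len(schema.fields) > 1:
        warnings.warn(
            "outputSchema returned more than one field; "
            "wrap them in a single tuple-typed field instead"
        )
    return schema


# The reliable form: one field whose type explicitly says "tuple".
explicit = Schema([Field("t", "tuple", Schema([Field("a", "int"), Field("b", "int")]))])

# The unreliable form: bare fields that only look like a tuple when there are several.
implicit = Schema([Field("a", "int"), Field("b", "int")])
```

Only the second form would trigger the warning; a one-field variant of it would pass silently, which is exactly the case the description calls unreliable.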



[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3020:
--

Fix Version/s: 0.12, 0.11
Status: Patch Available (was: In Progress)

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11
> Reporter: Julien Le Dem
> Assignee: Jonathan Coveney
> Fix For: 0.11, 0.12
>
> Attachments: PIG-3020-2.patch, PIG-3020-2_ws.patch, 
> PIG-3020_branch-0.11_1.patch, PIG-3020.patch, PIG-3093-testcase.patch
>
>
> The following script validates OK with Pig 0.9 but fails with the error below 
> in 0.11 (and, I suspect, 0.10):
> pig -c debug2.pig
> Script: debug2.pig
> {noformat}
> A = LOAD 'foo' AS (group:tuple(uid, dst_id), uids_with_recs:bag{}, uids_with_flock:bag{});
> edges_both = FILTER A BY NOT IsEmpty(uids_with_recs) AND NOT IsEmpty(uids_with_flock);
> edges_both = FOREACH edges_both GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> both_counts = GROUP edges_both BY src_id;
> both_counts = FOREACH both_counts GENERATE
>     group AS src_id, SIZE(edges_both) AS size_both;
> edges_bq = FILTER A BY NOT IsEmpty(uids_with_recs);
> edges_bq = FOREACH edges_bq GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> bq_counts = GROUP edges_bq BY src_id;
> bq_counts = FOREACH bq_counts GENERATE
>     group AS src_id, SIZE(edges_bq) AS size_bq;
> per_user_set_sizes = JOIN bq_counts BY src_id LEFT OUTER, both_counts BY src_id;
> store per_user_set_sizes into 'foo';
> {noformat}
> Error:
> {noformat}
> ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to explain alias null
>   at org.apache.pig.PigServer.explain(PigServer.java:999)
>   at org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:398)
>   at org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:330)
>   at org.apache.pig.tools.grunt.Grunt.checkScript(Grunt.java:98)
>   at org.apache.pig.Main.run(Main.java:600)
>   at org.apache.pig.Main.main(Main.java:154)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
>   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
>   at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:277)
>   at org.apache.pig.PigServer.compilePp(PigServer.java:1322)
>   at org.apache.pig.PigServer.explain(PigServer.java:984)
>   ... 10 more
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
>   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:232)
>   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105)
>   at org.apache.pig.newplan.logical.relational.LOJoin.accept(LOJoin.java:171)
>   at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>   ... 13 more
> {noformat}
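
To see which fields collide, the schema string from the ERROR 2270 line can be picked apart with a few lines of Python. This is only a reading aid: the `alias#uid:type` layout is inferred from the error message above, and `duplicate_uids` is a hypothetical helper, not part of Pig.

```python
from collections import defaultdict

# Schema string copied from the ERROR 2270 message.
schema_line = (
    "bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,"
    "both_counts::src_id#417:bytearray,both_counts::size_both#480:long"
)


def duplicate_uids(schema: str) -> dict:
    """Map each uid that appears more than once to the aliases carrying it."""
    by_uid = defaultdict(list)
    for entry in schema.split(","):
        alias, rest = entry.split("#", 1)   # "bq_counts::src_id", "417:bytearray"
        uid = int(rest.split(":", 1)[0])    # 417
        by_uid[uid].append(alias)
    return {uid: aliases for uid, aliases in by_uid.items() if len(aliases) > 1}


print(duplicate_uids(schema_line))
```

Both sides of the join carry `src_id` with uid 417, since both branches trace back to the same field of the same LOAD.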



[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3020:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11
> Reporter: Julien Le Dem
> Assignee: Jonathan Coveney
> Fix For: 0.11, 0.12
>
> Attachments: PIG-3020-2.patch, PIG-3020-2_ws.patch, 
> PIG-3020_branch-0.11_1.patch, PIG-3020.patch, PIG-3093-testcase.patch



[jira] [Assigned] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney reassigned PIG-3020:
-

Assignee: Jonathan Coveney  (was: Julien Le Dem)

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11
> Reporter: Julien Le Dem
> Assignee: Jonathan Coveney
> Attachments: PIG-3020-2.patch, PIG-3020-2_ws.patch, 
> PIG-3020_branch-0.11_1.patch, PIG-3020.patch, PIG-3093-testcase.patch



[jira] [Work started] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-3020 started by Jonathan Coveney.

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11
> Reporter: Julien Le Dem
> Assignee: Jonathan Coveney
> Attachments: PIG-3020-2.patch, PIG-3020-2_ws.patch, 
> PIG-3020_branch-0.11_1.patch, PIG-3020.patch, PIG-3093-testcase.patch



[jira] [Created] (PIG-3098) Add another test for the self join case

2012-12-17 Thread Jonathan Coveney (JIRA)
Jonathan Coveney created PIG-3098:
-

Summary: Add another test for the self join case
Key: PIG-3098
URL: https://issues.apache.org/jira/browse/PIG-3098
Project: Pig
Issue Type: Bug
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
Fix For: 0.12
Attachments: PIG-3098-0.patch

This adds a test to TestJoin that doesn't just make sure that self joins work 
semantically in the parser, but also that it pulls the right data through. 
Thought it'd be easier to just make a new JIRA than to reopen PIG-3020.
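
What "the right data" means for a join like this can be sketched in plain Python. This is not the TestJoin test from the patch: the input rows are invented, and the code only mimics the data flow of the PIG-3020 script (two branches filtered from one relation, grouped, counted, then left-outer joined).

```python
from collections import Counter

# Hypothetical input rows: (uid, dst_id, uids_with_recs, uids_with_flock)
rows = [
    (1, 10, ["a"], ["b"]),
    (1, 11, ["a"], []),
    (2, 20, ["a"], []),
    (3, 30, [], ["b"]),
]


def count_by_src(edges):
    """Mimic GROUP edges BY src_id followed by SIZE() of each group."""
    return dict(Counter(src for src, _dst in edges))


# Two branches derived from the same relation.
edges_bq = [(u, d) for u, d, recs, _flock in rows if recs]
edges_both = [(u, d) for u, d, recs, flock in rows if recs and flock]
bq_counts = count_by_src(edges_bq)
both_counts = count_by_src(edges_both)

# Left outer join on src_id; an unmatched right side becomes None.
per_user_set_sizes = [
    (src, size_bq, src if src in both_counts else None, both_counts.get(src))
    for src, size_bq in sorted(bq_counts.items())
]
print(per_user_set_sizes)
```

A data-level test would compare the joined rows against values like these rather than only checking that the plan validates.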



[jira] [Updated] (PIG-3098) Add another test for the self join case

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3098:
--

Attachment: PIG-3098-0.patch

> Add another test for the self join case
> ---
>
> Key: PIG-3098
> URL: https://issues.apache.org/jira/browse/PIG-3098
> Project: Pig
> Issue Type: Bug
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.12
>
> Attachments: PIG-3098-0.patch



[jira] [Updated] (PIG-3098) Add another test for the self join case

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3098:
--

Status: Patch Available  (was: Open)

> Add another test for the self join case
> ---
>
> Key: PIG-3098
> URL: https://issues.apache.org/jira/browse/PIG-3098
> Project: Pig
> Issue Type: Bug
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.12
>
> Attachments: PIG-3098-0.patch



[jira] [Commented] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534537#comment-13534537
 ] 

Jonathan Coveney commented on PIG-3020:
---

I am inclined to agree. Will commit to 0.11

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11
> Reporter: Julien Le Dem
> Assignee: Julien Le Dem
> Attachments: PIG-3020-2.patch, PIG-3020-2_ws.patch, 
> PIG-3020_branch-0.11_1.patch, PIG-3020.patch, PIG-3093-testcase.patch



[jira] [Commented] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534495#comment-13534495
 ] 

Dmitriy V. Ryaboy commented on PIG-3020:


Existing scripts that work on Pig 0.9 don't work on 0.11 without this, so I 
think it needs to be in 0.11 (to prevent breaking changes).

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
> Key: PIG-3020
> URL: https://issues.apache.org/jira/browse/PIG-3020
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11
> Reporter: Julien Le Dem
> Assignee: Julien Le Dem
> Attachments: PIG-3020-2.patch, PIG-3020-2_ws.patch, 
> PIG-3020_branch-0.11_1.patch, PIG-3020.patch, PIG-3093-testcase.patch
>
>
> The following validates OK with pig 0.9 and fails with the following error in 
> 0.11 (and I suspect 0.10)
> pig -c debug2.pig
> Script: debug2.pig
> {noformat}
> A = LOAD 'foo' AS (group:tuple(uid, dst_id), uids_with_recs:bag{} , uids_with_flock:bag{});
> edges_both = FILTER A BY NOT IsEmpty(uids_with_recs) AND NOT IsEmpty(uids_with_flock);
> edges_both = FOREACH edges_both GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> both_counts = GROUP edges_both BY src_id;
> both_counts = FOREACH both_counts GENERATE
>     group AS src_id, SIZE(edges_both) AS size_both;
> edges_bq = FILTER A BY NOT IsEmpty(uids_with_recs);
> edges_bq = FOREACH edges_bq GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> bq_counts = GROUP edges_bq BY src_id;
> bq_counts = FOREACH bq_counts GENERATE
>     group AS src_id, SIZE(edges_bq) AS size_bq;
> per_user_set_sizes = JOIN bq_counts BY src_id LEFT OUTER, both_counts BY src_id;
> store per_user_set_sizes into  'foo';
> {noformat}
> Error:
> {noformat}
> ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to explain alias null
>   at org.apache.pig.PigServer.explain(PigServer.java:999)
>   at org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:398)
>   at org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:330)
>   at org.apache.pig.tools.grunt.Grunt.checkScript(Grunt.java:98)
>   at org.apache.pig.Main.run(Main.java:600)
>   at org.apache.pig.Main.main(Main.java:154)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
>   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
>   at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:277)
>   at org.apache.pig.PigServer.compilePp(PigServer.java:1322)
>   at org.apache.pig.PigServer.explain(PigServer.java:984)
>   ... 10 more
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
>   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:232)
>   at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105)
>   at org.apache.pig.newplan.logical.relational.LOJoin.accept(LOJoin.java:171)
>   at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>   at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>   ... 13 more
> {noformat}
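
For readers decoding the ERROR 2270 message quoted above: each schema entry appears to have the form alias#uid:type, and the complaint is that uid 417 is carried by both bq_counts::src_id and both_counts::src_id. The sketch below is only an illustration of that check in Python. It is not Pig's SchemaResetter.validate code, and the function names parse_fields and duplicate_uids are made up for this note.

```python
import re
from collections import defaultdict

# Schema string copied from the ERROR 2270 message in the report.
SCHEMA = (
    "bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,"
    "both_counts::src_id#417:bytearray,both_counts::size_both#480:long"
)


def parse_fields(schema):
    """Split 'alias#uid:type' entries into (alias, uid, type) tuples.

    The entry layout is an assumption read off the error message.
    """
    fields = []
    for entry in schema.split(","):
        match = re.fullmatch(r"(.+)#(\d+):(\w+)", entry.strip())
        if match is None:
            raise ValueError("unrecognized field entry: %r" % entry)
        alias, uid, type_name = match.groups()
        fields.append((alias, int(uid), type_name))
    return fields


def duplicate_uids(schema):
    """Return {uid: [aliases]} for every uid shared by more than one field."""
    by_uid = defaultdict(list)
    for alias, uid, _ in parse_fields(schema):
        by_uid[uid].append(alias)
    return {uid: names for uid, names in by_uid.items() if len(names) > 1}


print(duplicate_uids(SCHEMA))
# -> {417: ['bq_counts::src_id', 'both_counts::src_id']}
```

Both src_id columns trace back to group.uid in the single LOAD of 'foo', which is consistent with the issue title: two relations derived from the same load statement end up sharing a uid after the join.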



[jira] Subscription: PIG patch available

2012-12-17 Thread jira
Issue Subscription
Filter: PIG patch available (37 issues)

Subscriber: pigdaily

Key         Summary
PIG-3096    Make PigUnit thread safe
            https://issues.apache.org/jira/browse/PIG-3096
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3086    Allow A Prefix To Be Added To URIs In PigUnit Tests
            https://issues.apache.org/jira/browse/PIG-3086
PIG-3078    Make a UDF that, given a string, returns just the columns prefixed by that string
            https://issues.apache.org/jira/browse/PIG-3078
PIG-3073    POUserFunc creating log spam for large scripts
            https://issues.apache.org/jira/browse/PIG-3073
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3067    HBaseStorage should be split up to become more managable
            https://issues.apache.org/jira/browse/PIG-3067
PIG-3066    Fix TestPigRunner in trunk
            https://issues.apache.org/jira/browse/PIG-3066
PIG-3057    make readField protected to be able to override it if we extend PigStorage
            https://issues.apache.org/jira/browse/PIG-3057
PIG-3051    java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
            https://issues.apache.org/jira/browse/PIG-3051
PIG-3050    Fix FindBugs multithreading warnings
            https://issues.apache.org/jira/browse/PIG-3050
PIG-3029    TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
            https://issues.apache.org/jira/browse/PIG-3029
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3025    TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
            https://issues.apache.org/jira/browse/PIG-3025
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2957    TetsScriptUDF fail due to volume prefix in jar
            https://issues.apache.org/jira/browse/PIG-2957
PIG-2956    Invalid cache specification for some streaming statement
            https://issues.apache.org/jira/browse/PIG-2956
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2878    Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitive.
            https://issues.apache.org/jira/browse/PIG-2878
PIG-2873    Converting bin/pig shell script to python
            https://issues.apache.org/jira/browse/PIG-2873
PIG-2834    MultiStorage requires unused constructor argument
            https://issues.apache.org/jira/browse/PIG-2834
PIG-2824    Pushing checking number of fields into LoadFunc
            https://issues.apache.org/jira/browse/PIG-2824
PIG-2661    Pig uses an extra job for loading data in Pigmix L9
            https://issues.apache.org/jira/browse/PIG-2661
PIG-2645    PigSplit does not handle the case where SerializationFactory returns null
            https://issues.apache.org/jira/browse/PIG-2645
PIG-2614    AvroStorage crashes on LOADING a single bad error
            https://issues.apache.org/jira/browse/PIG-2614
PIG-2507    Semicolon in paramenters for UDF results in parsing error
            https://issues.apache.org/jira/browse/PIG-2507
PIG-2433    Jython import module not working if module path is in classpath
            https://issues.apache.org/jira/browse/PIG-2433
PIG-2417    Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
            https://issues.apache.org/jira/browse/PIG-2417
PIG-2362    Rework Ant build.xml to use macrodef instead of antcall
            https://issues.apache.org/jira/browse/PIG-2362
PIG-2312    NPE when relation and column share the same name and used in Nested Foreach
            https://issues.apache.org/jira/browse/PIG-2312
PIG-1942    script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
            https://issues.apache.org/jira/browse/PIG-1942
PIG-1237    Piggyb

[jira] [Commented] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534454#comment-13534454
 ] 

Jonathan Coveney commented on PIG-3020:
---

This is in trunk. Not sure if it meets the criteria to be in pig-11?

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement


[jira] [Resolved] (PIG-3093) Self join + realias results in schema errors

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney resolved PIG-3093.
---

Resolution: Duplicate

> Self join + realias results in schema errors
> 
>
> Key: PIG-3093
> URL: https://issues.apache.org/jira/browse/PIG-3093
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11, 0.12
>Reporter: Jonathan Coveney
>Assignee: Jonathan Coveney
>Priority: Critical
> Fix For: 0.12
>
>
> So this one took a while to isolate, but is pretty crazy.
> {code}
> A = load 'a' as (field1:chararray);
> B = foreach A generate *;
> C = join A by field1, B by field1;
> D = foreach C generate A::field1 as field2, B::field1;
> describe D;
> /*
> D: {
>     field2: chararray,
>     B::field1: chararray
> }
> */
> E = foreach D generate field2, field1;
> describe E;
> /*
> E: {
>     B::field1: chararray,
>     B::field1: chararray
> }
> */
> F = foreach E generate field2;
> store F into 'fail';
> -- Invalid field projection. Projected field [field2] does not exist in schema: B::field1:chararray,B::field1:chararray.
> {code}
> If you take a look at that code snippet, that is pretty nuts! Since the 2 
> fields come from the same original table, renaming one causes issues with 
> both. WUT. The even weirder part is not that they both get renamed, but that 
> they both become the unrenamed value.
> Interestingly, flipping the order of the projection changes the output, so 
> it looks like both fields take the name of whichever reference comes last, i.e.
> {code}
> A = load 'a' as (field1:chararray);
> B = foreach A generate *;
> C = join A by field1, B by field1;
> D = foreach C generate B::field1, A::field1 as field2;
> describe D;
> E = foreach D generate field2, field1;
> describe E;
> F = foreach E generate field2;
> store F into 'fail';
> {code}
> results in
> {code}
> D: {
>     B::field1: chararray,
>     field2: chararray
> }
> E: {
>     field2: chararray,
>     field2: chararray
> }
> 2012-12-13 00:13:10,045 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: Invalid field projection. Projected field [field2] does not exist in schema: field2:chararray,field2:chararray.
> {code}
> This seems to imply the solution: make copies of the Schema. I added a test 
> and will hopefully have a patch soon.
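
A note on the "make copies of the Schema" remark above: the symptom described (renaming one field changes both) is what one would expect if both join inputs hold a reference to the same field object. The Python sketch below illustrates only that general hazard. FieldSchema and project are hypothetical names invented here, not Pig's classes, and the sketch does not try to reproduce Pig's exact "last reference wins" behavior.

```python
import copy


class FieldSchema:
    """Hypothetical stand-in for a schema field; not Pig's actual class."""

    def __init__(self, alias, type_name):
        self.alias = alias
        self.type_name = type_name


def project(fields, renames, make_copy):
    """Build a projection's output fields, optionally copying each one.

    `renames` maps a position in `fields` to its new alias.
    """
    out = []
    for index, field in enumerate(fields):
        if make_copy:
            field = copy.copy(field)  # each output field gets its own object
        if index in renames:
            field.alias = renames[index]
        out.append(field)
    return out


def aliases(fields):
    return [f.alias for f in fields]


# Both join inputs derive from the same load, so they share one object.
shared = FieldSchema("field1", "chararray")
print(aliases(project([shared, shared], {0: "field2"}, make_copy=False)))
# -> ['field2', 'field2']  (renaming one position renamed both)

shared = FieldSchema("field1", "chararray")
print(aliases(project([shared, shared], {0: "field2"}, make_copy=True)))
# -> ['field2', 'field1']  (copies keep the rename local)
```

Copying before mutating is the simplest way to keep a rename local to one column; whether the attached patch does this at projection time or elsewhere is not stated in this message.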



Re: Our release process

2012-12-17 Thread Olga Natkovich
Hi Jonathan,

I thought I answered your email last week but I just noticed that the answer 
did not come through.

We tell users that it is coming in the next release. Now that Pig is quite 
mature and stable, we don't see much of this. Having more frequent releases 
definitely helps in this respect.

Olga






From: Jonathan Coveney
To: "dev@pig.apache.org"; Olga Natkovich
Sent: Thursday, December 13, 2012 1:14 PM
Subject: Re: Our release process

Olga,

A related but separate question: what do y'all do when there is a feature
that is finished, but for an upcoming release? ie a feature in trunk, but
not in 0.11 (which, let us assume, is stable).

Jon


2012/12/13 Olga Natkovich 

> Hi Julien,
>
> I think for us at Yahoo to be able to run our releases directly from the
> branch we would need the guarantees that I proposed in my initial email and
> that we agreed to last year. The only changes that go in are fixes for:
>
> - Failures without reasonable workarounds
> - Silent failures.
>
> My main concern with the proposal is that I do not believe that our
> current testing infra is robust/inclusive enough to catch errors. That's
> why I am hesitant to widen the scope.
>
> I am fine with whatever outcome the majority of people agree with. I
> am just saying that Yahoo will likely need a private branch if our rules
> are too relaxed.
>
> Olga
>
>
>
> - Original Message -
> From: Julien Le Dem 
> To: "dev@pig.apache.org" ; Olga Natkovich <
> onatkov...@yahoo.com>
> Cc:
> Sent: Wednesday, December 12, 2012 4:54 PM
> Subject: Re: Our release process
>
> Agreed. The priority of a change is subjective as well.
> My definition for inclusion on the release branch:
> - Only bug fixes.
> - Only if they have fairly understood repercussions (up to the committers
> who +/-1 as usual).
> - If we thought it would not break things but still does (CI or externally
> reported failure) we revert it.
> What do you want to add/change? Please reformulate those rules the way you
> like and let's see how we can converge.
> (Also, let's keep it short for clarity)
>
> Julien
>
> On Wed, Dec 12, 2012 at 11:08 AM, Olga Natkovich wrote:
>
> > Hi Julien,
> >
> > I understand what you are trying to do and I can see that being able to
> > make more fixes post release has value for some use cases. My concern is
> > that "things that do not destabilize the branch" is fairly subjective and
> > also not always easy to ascertain beyond trivial changes. The only way I
> > know to keep code stable is to limit the updates. Also, we need to clearly
> > state what the constraints are for post-release commits so that every user
> > can decide whether it works for them.
> >
> > Olga
> >
> >
> > 
> > From: Julien Le Dem 
> > To: "dev@pig.apache.org" 
> > Sent: Wednesday, December 12, 2012 10:26 AM
> > Subject: Re: Our release process
> >
> > I think we all agree here, let's not jump to conclusions.
> > Everything in this branch I am talking about is in Apache Pig. Everything
> > we do in Pig is contributed.
> > We have a branch for 0.11 where we keep merging the official 0.11 branch
> > plus a few patches (and it will stay small) that are only in Apache TRUNK.
> > The goal here is to help keep the release branch stable by not adding
> > patches that are only useful to us.
> > Having this branch allows us to fix anything quickly and redeploy to
> > production. It is also what allows us to use the pig 0.11 branch in
> > production before it is even released.
> > This definitely benefits the community and helps make 0.11 stable.
> > This is a very reasonable way to keep using a recent version of Pig in
> > production.
> >
> > Olga: My goal is to decrease the scope of what is going in the release
> > branch and to make sure we add only bug fixes that are not making it
> > unstable. I also think having a short definition of this helps, which is
> > why I have been chiming in.
> > Let us know how you want to decrease the scope. I'm just trying to
> > simplify here.
> >
> > Julien
> >
> >
> >
> > On Tue, Dec 11, 2012 at 8:54 AM, Prashant Kommireddi <prash1...@gmail.com> wrote:
> >
> > > Share the same concern as Russell here. Not great for the project for
> > > everyone to go "private branch" approach.
> > >
> > > On Tue, Dec 11, 2012 at 8:33 AM, Russell Jurney <russell.jur...@gmail.com> wrote:
> > >
> > > > Wait. Ack. Do we want everyone to do this? This sounds like
> > > > fragmentation. :(
> > > >
> > > > Russell Jurney twitter.com/rjurney
> > > >
> > > >
> > > > On Dec 10, 2012, at 3:24 PM, Olga Natkovich wrote:
> > > >
> > > > > If everybody is using a private branch then
> > > > >
> > > > > (1) We are not serving a significant part of our community
> > > > > (2) There is no motivation to contribute those patches to branches
> > > > > (only to trunk).
> > > > >
> > > > > Yahoo has been trying hard to work of the Apach

[jira] [Commented] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534401#comment-13534401
 ] 

Julien Le Dem commented on PIG-3020:


looks good to me
+1

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement


[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-17 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3020:
--

Attachment: PIG-3020-2.patch
PIG-3020-2_ws.patch

I've attached a fix, with and without whitespace changes (would like to attach 
_ws, but easier to review without). This includes and also fixes PIG-3093.

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement


Re: Review Request: PIG-3015 Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/
---

(Updated Dec. 17, 2012, 7:36 p.m.)


Review request for pig and Cheolsoo Park.


Changes
---

Added test cases for Trevni (and made sure all the test cases pass)


Description
---

The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni.

This is the latest version of the patch, complete with test cases and 
TrevniStorage. (Test cases for TrevniStorage are still missing).


This addresses bug PIG-3015.
https://issues.apache.org/jira/browse/PIG-3015


Diffs (updated)
-

  .eclipse.templates/.classpath aa9bfd5 
  build.xml 1f21839 
  ivy.xml 70e8d50 
  ivy/libraries.properties bfbbbc0 
  src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
  src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroArrayReader.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroBagWrapper.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroMapWrapper.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroRecordReader.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroRecordWriter.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroStorageDataConversionUtilities.java 
PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroStorageSchemaConversionUtilities.java 
PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java PRE-CREATION 
  test/commit-tests 5081fbc 
  test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_blank_first_args.pig 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/namesWithDoubleColons.pig 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_avro.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_trevni.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/createtests.py PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/arrays.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/arraysAsOutputByPig.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsAsOutputByPig.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsOfArrays.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsOfArraysOfRecords.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsSubSchema.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsSubSchemaNullable.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithDoubleUnderscores.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithEnums.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithFixed.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithMaps.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithMapsOfRecords.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithNullableUnions.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/simpleRecordsTrevni.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/arrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/arraysAsOutputByPig.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc PRE-CREATION 
  test

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534212#comment-13534212
 ] 

Joseph Adler commented on PIG-3015:
---

My apologies; I forgot to add those to the patch. I have replaced the patch with an updated version.

> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni (as 
> TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015.patch

> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni (as 
> TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)

> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni (as 
> TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3050) Fix FindBugs multithreading warnings

2012-12-17 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3050:
---

Status: Patch Available  (was: Open)

> Fix FindBugs multithreading warnings
> 
>
> Key: PIG-3050
> URL: https://issues.apache.org/jira/browse/PIG-3050
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3050.patch
>
>
> There was a race condition reported when running Pig in local mode on the 
> user mailing list. This motivated me to fix potential multithreading bugs 
> that can be identified by FindBugs.
> FindBugs identifies the following potential bugs:
> # Mutable static field
> # Inconsistent synchronization
> # Incorrect lazy initialization of static field
> # Incorrect lazy initialization and update of static field
> # Unsynchronized get method, synchronized set method
> In total, FindBugs reports 1153 warnings, but they're outside the scope of 
> this jira.



Review Request: PIG-3050 Fix FindBugs multithreading warnings

2012-12-17 Thread Cheolsoo Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8649/
---

Review request for pig and Santhosh Srinivasan.


Description
---

Please see https://issues.apache.org/jira/browse/PIG-3050


This addresses bug PIG-3050.
https://issues.apache.org/jira/browse/PIG-3050


Diffs
-

  
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigHadoopLogger.java 9b8223d 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java ee4d52a 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POProject.java 5195dee 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserComparisonFunc.java fcaf9b0 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java df1af28 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POFRJoin.java 58a8892 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java 0a69ef2 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POJoinPackage.java d1283b8 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 6bbe5e0 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackageLite.java 8ab351d 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java e3379c8 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POUnion.java b29c481 
  src/org/apache/pig/data/DefaultAbstractBag.java 816143f 
  src/org/apache/pig/data/NonSpillableDataBag.java 6b59c8f 
  src/org/apache/pig/data/SchemaTupleBackend.java 6f0ad3b 
  src/org/apache/pig/impl/util/SpillableMemoryManager.java 403d774 

Diff: https://reviews.apache.org/r/8649/diff/


Testing
---

Verified that both the unit tests and the e2e tests pass.


Thanks,

Cheolsoo Park



[jira] [Updated] (PIG-3050) Fix FindBugs multithreading warnings

2012-12-17 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3050:
---

Attachment: PIG-3050.patch

Attached is a patch that fixes the following issues:
- Mutable static field
{code:title=PhysicalOperator.java}
public static PigProgressable reporter;
{code}
There was a reported race condition due to this static field (for details, see 
[here|http://search-hadoop.com/m/2OdLNRMwXa2/Intermittent+NullPointerException&subj=Intermittent+NullPointerException]).
 Since {{reporter}} should be local to each thread, I converted it to a ThreadLocal.
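To illustrate the idea, here is a minimal sketch of a per-thread reporter slot. It is not the code from the patch: {{ProgressReporter}}, {{ReporterHolder}} and {{readFromNewThread()}} are hypothetical names standing in for {{PigProgressable}} and the static field in {{PhysicalOperator}}.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for Pig's PigProgressable, so the sketch compiles on its own.
interface ProgressReporter {
    void progress(String msg);
}

class ReporterHolder {
    // Each thread gets its own slot; a set() in one thread is invisible to the others.
    private static final ThreadLocal<ProgressReporter> REPORTER = new ThreadLocal<>();

    static void setReporter(ProgressReporter r) { REPORTER.set(r); }

    static ProgressReporter getReporter() { return REPORTER.get(); }

    // Reads the reporter from a freshly started thread and returns what that thread saw.
    static ProgressReporter readFromNewThread() {
        AtomicReference<ProgressReporter> seen = new AtomicReference<>();
        Thread t = new Thread(() -> seen.set(getReporter()));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }
}
```

With a plain static field, the second thread would see whatever reporter the first thread stored; with the ThreadLocal it sees its own (initially empty) slot, so one thread can no longer overwrite or null out another thread's reporter.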
- Inconsistent synchronization
{code:title=POStream.java}
public Result getNext(Tuple t) throws ExecException {
    ...
    if(initialized) {
        ...
    }
    ...
}
...
public Result getNextHelper(Tuple t) throws ExecException {
    ...
    synchronized(this) {
        ...
        if(!initialized) {
            ...
        }
        ...
        initialized = true;
        ...
    }
}
{code}
Synchronized access to {{initialized}} is performed inside {{getNextHelper()}}, 
but unsynchronized access was performed inside {{getNext()}}. I added a 
synchronized getter method and used that method inside {{getNext()}}.
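A minimal sketch of that pattern, assuming a simplified class rather than the real {{POStream}}: {{StreamState}}, {{isInitialized()}} and {{initializeOnce()}} are hypothetical names, and the elided work from the snippet above is left out.

```java
class StreamState {
    private boolean initialized = false;

    // Reads take the same lock (this) that guards the write in initializeOnce().
    synchronized boolean isInitialized() {
        return initialized;
    }

    // Performs the one-time setup; returns true only for the call that did it.
    boolean initializeOnce() {
        synchronized (this) {
            if (!initialized) {
                initialized = true;
                return true;
            }
            return false;
        }
    }
}
```

Because every read and write of the flag now happens while holding the same monitor, a thread that observes {{isInitialized() == true}} is also guaranteed to observe the writes made before the flag was set.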
- Incorrect lazy initialization of static field
{code:title=SpillableMemoryManager.java}
public static SpillableMemoryManager getInstance() {
    if (manager == null) {
        manager = new SpillableMemoryManager();
    }
    return manager;
}
{code}
FindBugs says, "Because the compiler may reorder instructions, threads are not 
guaranteed to see a completely initialized object if the method can be called 
by multiple threads." So I declared {{manager}} as volatile.
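For comparison, here is a sketch of a related but different technique, double-checked locking on a volatile field. It is not what the patch does: the patch only adds {{volatile}}, which addresses the visibility problem FindBugs describes but still lets two threads that both read {{null}} each construct an instance. {{LazyManager}} is a hypothetical stand-in for {{SpillableMemoryManager}}.

```java
class LazyManager {
    // volatile ensures other threads never observe a partially constructed instance.
    private static volatile LazyManager manager;

    private LazyManager() {}

    static LazyManager getInstance() {
        LazyManager local = manager;           // one volatile read on the fast path
        if (local == null) {
            synchronized (LazyManager.class) {
                local = manager;               // re-check while holding the lock
                if (local == null) {
                    local = new LazyManager();
                    manager = local;           // publish the finished object
                }
            }
        }
        return local;
    }
}
```

Whether the extra locking is worth it depends on whether creating a second, short-lived manager is actually harmful, which the source does not say.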
- Incorrect lazy initialization and update of static field
{code:title=SchemaTupleBackend.java}
public static void initialize(Configuration jConf, PigContext pigContext, 
        boolean isLocal) throws IOException {
    if (stb != null) {
        LOG.warn("SchemaTupleBackend has already been initialized");
    } else {
        SchemaTupleFrontend.lazyReset(pigContext);
        SchemaTupleFrontend.reset();
        stb = new SchemaTupleBackend(jConf, isLocal);
        stb.copyAndResolve();
    }
}
{code}
FindBugs says, "After the field is set, the object stored into that location is 
further updated. The setting of the field is visible to other threads as soon 
as it is set. If further accesses in the method that set the field serve to 
initialize the object, then you have a very serious multithreading bug." So I 
moved the assignment to the end of the method after all initialization is done.
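The reordering can be sketched as follows, assuming a simplified class: {{Backend}} is a hypothetical stand-in for {{SchemaTupleBackend}}, and making the field {{volatile}} and the method {{synchronized}} are assumptions of this sketch, not something the source confirms about the patch.

```java
class Backend {
    // Hypothetical stand-in for the static stb field.
    private static volatile Backend instance;

    private boolean resolved = false;

    private void copyAndResolve() { resolved = true; }

    boolean isResolved() { return resolved; }

    static synchronized void initialize() {
        if (instance != null) {
            return;                    // already initialized; nothing to do
        }
        Backend tmp = new Backend();   // build the object in a local variable
        tmp.copyAndResolve();          // finish all initialization first
        instance = tmp;                // assign the shared field last
    }

    static Backend get() { return instance; }
}
```

Any thread that sees a non-null {{instance}} therefore sees an object on which {{copyAndResolve()}} has already completed.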
- Unsynchronized get method, synchronized set method
{code:title=PigHadoopLogger.java}
public synchronized void setReporter(PigStatusReporter rep) {
    this.reporter = rep;
}
public boolean getAggregate() {
    return aggregate;
}
{code}
I made {{getAggregate()}} synchronized.

> Fix FindBugs multithreading warnings
> 
>
> Key: PIG-3050
> URL: https://issues.apache.org/jira/browse/PIG-3050
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3050.patch
>
>
> There was a race condition reported when running Pig in local mode on the 
> user mailing list. This motivated me to fix potential multithreading bugs 
> that can be identified by FindBugs.
> FindBugs identifies the following potential bugs:
> # Mutable static field
> # Inconsistent synchronization
> # Incorrect lazy initialization of static field
> # Incorrect lazy initialization and update of static field
> # Unsynchronized get method, synchronized set method
> In total, FindBugs reports 1153 warnings, but they're outside the scope of 
> this jira.



[jira] [Commented] (PIG-3051) java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning

2012-12-17 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534028#comment-13534028
 ] 

Rohini Palaniswamy commented on PIG-3051:
-

  Resetting the attached LOSort operator of the ProjectExpression to 
newSort is good, but I found an issue: the copy does not set the label, type, 
and Uid.
{code}
@Test
public void testPIG3051() throws Exception {
    String[] input = {
        "1,2,3,4",
        "2,3,4,1",
        "3,4,1,2",
        "4,1,2,3"
    };

    Util.createLocalInputFile("a.txt", input);
    // The load statement is split into two literals so that no string
    // literal spans a line break.
    String query = "A = load 'a.txt' using PigStorage(',') as (a1:chararray, " +
        "a2:chararray, a3:chararray, a4:chararray);" +
        "B = foreach A generate a2,a3,a4;" +
        "G = order B by a4;" +
        "U1 = limit G 3;" +
        "U2 = foreach U1 generate a4;" +
        "store G into 'g' using PigStorage();" +
        "store U2 into 'u2' using PigStorage(); ";
    try {
        PigServer pigServer = new PigServer(ExecType.LOCAL);
        pigServer.registerQuery(query);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
{code}
{code}
sort.mSortColPlans - a4:(Name: Project Type: chararray Uid: 4 Input: 0 Column: 2)
newSort.mSortColPlans - (Name: Project Type: null Uid: null Input: 0 Column: 2)
{code}

> java.lang.IndexOutOfBoundsException  failure with LimitOptimizer + 
> ColumnPruning
> 
>
> Key: PIG-3051
> URL: https://issues.apache.org/jira/browse/PIG-3051
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10.0, 0.11
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
> Fix For: 0.11
>
> Attachments: pig-3051-v1.1-withe2etest.txt, 
> pig-3051-v1-withouttest.txt
>
>
> Had a user hitting 
> "Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1" error 
> when he had multiple stores and limit in his code.
> I couldn't reproduce this with short pig code (due to ColumnPruning somehow 
> not happening when shortened), but here's a snippet. 
> {noformat}
> ...
> G3 = FOREACH G2 GENERATE sortCol, FLATTEN(group) as label, (long)COUNT(G1) as 
> cnt;
> G4 = ORDER G3 BY cnt DESC PARALLEL 25;
> ONEROW = LIMIT G4 1;
> U1 = FOREACH ONEROW GENERATE 3 as sortcol, 'somelabel' as label, cnt;
> store U1 into 'u1' using PigStorage();
> store G4 into 'g4' using PigStorage();
> {noformat}
> With '-t ColumnMapKeyPrune', job didn't hit the error.
