[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-13 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3020:
--

Attachment: PIG-3093-testcase.patch

Julien,

I've included a test that I think you should add to this patch (it may turn 
out that PIG-3093 is a duplicate of this issue).

Either way, my test fails on trunk, but it fails with a different error on your 
branch. Looks like when you change the uid you whack the alias.

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
>                 Key: PIG-3020
>                 URL: https://issues.apache.org/jira/browse/PIG-3020
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11
>            Reporter: Julien Le Dem
>            Assignee: Julien Le Dem
>         Attachments: PIG-3020_branch-0.11_1.patch, PIG-3020.patch, PIG-3093-testcase.patch
>
>
> The following validates OK with pig 0.9 and fails with the following error in 
> 0.11 (and I suspect 0.10)
> pig -c debug2.pig
> Script: debug2.pig
> {noformat}
> A = LOAD 'foo' AS (group:tuple(uid, dst_id), uids_with_recs:bag{}, uids_with_flock:bag{});
> edges_both = FILTER A BY NOT IsEmpty(uids_with_recs) AND NOT IsEmpty(uids_with_flock);
> edges_both = FOREACH edges_both GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> both_counts = GROUP edges_both BY src_id;
> both_counts = FOREACH both_counts GENERATE
>     group AS src_id, SIZE(edges_both) AS size_both;
> edges_bq = FILTER A BY NOT IsEmpty(uids_with_recs);
> edges_bq = FOREACH edges_bq GENERATE
>     group.uid AS src_id,
>     group.dst_id AS dst_id;
> bq_counts = GROUP edges_bq BY src_id;
> bq_counts = FOREACH bq_counts GENERATE
>     group AS src_id, SIZE(edges_bq) AS size_bq;
> per_user_set_sizes = JOIN bq_counts BY src_id LEFT OUTER, both_counts BY src_id;
> store per_user_set_sizes into 'foo';
> {noformat}
> Error:
> {noformat}
> ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to explain alias null
>     at org.apache.pig.PigServer.explain(PigServer.java:999)
>     at org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:398)
>     at org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:330)
>     at org.apache.pig.tools.grunt.Grunt.checkScript(Grunt.java:98)
>     at org.apache.pig.Main.run(Main.java:600)
>     at org.apache.pig.Main.main(Main.java:154)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
>     at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:277)
>     at org.apache.pig.PigServer.compilePp(PigServer.java:1322)
>     at org.apache.pig.PigServer.explain(PigServer.java:984)
>     ... 10 more
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
>     at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:232)
>     at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105)
>     at org.apache.pig.newplan.logical.relational.LOJoin.accept(LOJoin.java:171)
>     at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>     at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>     at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>     at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>     ... 13 more
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Subscription: PIG patch available

2012-12-13 Thread jira
Issue Subscription
Filter: PIG patch available (37 issues)

Subscriber: pigdaily

Key         Summary
PIG-3095    "which" is called many, many times for each Pig STREAM statement
https://issues.apache.org/jira/browse/PIG-3095
PIG-3088    Add a builtin udf which removes prefixes
https://issues.apache.org/jira/browse/PIG-3088
PIG-3086    Allow A Prefix To Be Added To URIs In PigUnit Tests
https://issues.apache.org/jira/browse/PIG-3086
PIG-3085    Errors and lacks in document "Built In Functions"
https://issues.apache.org/jira/browse/PIG-3085
PIG-3078    Make a UDF that, given a string, returns just the columns prefixed by that string
https://issues.apache.org/jira/browse/PIG-3078
PIG-3073    POUserFunc creating log spam for large scripts
https://issues.apache.org/jira/browse/PIG-3073
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
https://issues.apache.org/jira/browse/PIG-3069
PIG-3067    HBaseStorage should be split up to become more managable
https://issues.apache.org/jira/browse/PIG-3067
PIG-3066    Fix TestPigRunner in trunk
https://issues.apache.org/jira/browse/PIG-3066
PIG-3057    make readField protected to be able to override it if we extend PigStorage
https://issues.apache.org/jira/browse/PIG-3057
PIG-3051    java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
https://issues.apache.org/jira/browse/PIG-3051
PIG-3029    TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
https://issues.apache.org/jira/browse/PIG-3029
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
https://issues.apache.org/jira/browse/PIG-3026
PIG-3025    TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
https://issues.apache.org/jira/browse/PIG-3025
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
https://issues.apache.org/jira/browse/PIG-2959
PIG-2957    TetsScriptUDF fail due to volume prefix in jar
https://issues.apache.org/jira/browse/PIG-2957
PIG-2956    Invalid cache specification for some streaming statement
https://issues.apache.org/jira/browse/PIG-2956
PIG-2955    Fix bunch of Pig e2e tests on Windows
https://issues.apache.org/jira/browse/PIG-2955
PIG-2878    Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitive.
https://issues.apache.org/jira/browse/PIG-2878
PIG-2873    Converting bin/pig shell script to python
https://issues.apache.org/jira/browse/PIG-2873
PIG-2834    MultiStorage requires unused constructor argument
https://issues.apache.org/jira/browse/PIG-2834
PIG-2824    Pushing checking number of fields into LoadFunc
https://issues.apache.org/jira/browse/PIG-2824
PIG-2661    Pig uses an extra job for loading data in Pigmix L9
https://issues.apache.org/jira/browse/PIG-2661
PIG-2645    PigSplit does not handle the case where SerializationFactory returns null
https://issues.apache.org/jira/browse/PIG-2645
PIG-2614    AvroStorage crashes on LOADING a single bad error
https://issues.apache.org/jira/browse/PIG-2614
PIG-2507    Semicolon in paramenters for UDF results in parsing error
https://issues.apache.org/jira/browse/PIG-2507
PIG-2433    Jython import module not working if module path is in classpath
https://issues.apache.org/jira/browse/PIG-2433
PIG-2417    Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
https://issues.apache.org/jira/browse/PIG-2417
PIG-2362    Rework Ant build.xml to use macrodef instead of antcall
https://issues.apache.org/jira/browse/PIG-2362
PIG-2312    NPE when relation and column share the same name and used in Nested Foreach
https://issues.apache.org/jira/browse/PIG-2312
PIG-1942    script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
https://issu

[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-13 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3020:
---

Attachment: PIG-3020_branch-0.11_1.patch

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
>                 Key: PIG-3020
>                 URL: https://issues.apache.org/jira/browse/PIG-3020
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11
>            Reporter: Julien Le Dem
>            Assignee: Julien Le Dem
>         Attachments: PIG-3020_branch-0.11_1.patch, PIG-3020.patch



[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-13 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3020:
---

Description: 
The following validates OK with pig 0.9 and fails with the following error in 
0.11 (and I suspect 0.10)

pig -c debug2.pig

Script: debug2.pig
{noformat}
A = LOAD 'foo' AS (group:tuple(uid, dst_id), uids_with_recs:bag{}, uids_with_flock:bag{});
edges_both = FILTER A BY NOT IsEmpty(uids_with_recs) AND NOT IsEmpty(uids_with_flock);
edges_both = FOREACH edges_both GENERATE
    group.uid AS src_id,
    group.dst_id AS dst_id;
both_counts = GROUP edges_both BY src_id;
both_counts = FOREACH both_counts GENERATE
    group AS src_id, SIZE(edges_both) AS size_both;

edges_bq = FILTER A BY NOT IsEmpty(uids_with_recs);
edges_bq = FOREACH edges_bq GENERATE
    group.uid AS src_id,
    group.dst_id AS dst_id;
bq_counts = GROUP edges_bq BY src_id;
bq_counts = FOREACH bq_counts GENERATE
    group AS src_id, SIZE(edges_bq) AS size_bq;

per_user_set_sizes = JOIN bq_counts BY src_id LEFT OUTER, both_counts BY src_id;
store per_user_set_sizes into 'foo';
{noformat}

Error:
{noformat}
ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to explain alias null
    at org.apache.pig.PigServer.explain(PigServer.java:999)
    at org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:398)
    at org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:330)
    at org.apache.pig.tools.grunt.Grunt.checkScript(Grunt.java:98)
    at org.apache.pig.Main.run(Main.java:600)
    at org.apache.pig.Main.main(Main.java:154)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:277)
    at org.apache.pig.PigServer.compilePp(PigServer.java:1322)
    at org.apache.pig.PigServer.explain(PigServer.java:984)
    ... 10 more
Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long
    at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:232)
    at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:105)
    at org.apache.pig.newplan.logical.relational.LOJoin.accept(LOJoin.java:171)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
    ... 13 more
{noformat}

  was:
The following vali=dates OK with pig 0.9 and fails with the following error in 
0.11 (and I suspect 0.10)

pig -c debug2.pig

Script: debug2.pig
{noformat}
A = LOAD 'foo' AS (group:tuple(uid, dst_id), uids_with_recs:bag{} , 
uids_with_flock:bag{});
edges_both = FILTER A BY NOT IsEmpty(uids_with_recs) AND NOT 
IsEmpty(uids_with_flock);
edges_both = FOREACH edges_both GENERATE
group.uid AS src_id,
group.dst_id AS dst_id;
both_counts = GROUP edges_both BY src_id;
both_counts = FOREACH both_counts GENERATE
group AS src_id, SIZE(edges_both) AS size_both;

edges_bq = FILTER A BY NOT IsEmpty(uids_with_recs);
edges_bq = FOREACH edges_bq GENERATE
group.uid AS src_id,
group.dst_id AS dst_id;
bq_counts = GROUP edges_bq BY src_id;
bq_counts = FOREACH bq_counts GENERATE
group AS src_id, SIZE(edges_bq) AS size_bq;

per_user_set_sizes = JOIN bq_counts BY src_id LEFT OUTER, both_counts BY src_id;
store per_user_set_sizes into  'foo';
{noformat}

Error:
{noformat}
ERROR 2270: Logical plan invalid state: duplicate uid in schema : 
bq_counts::src_id#417:bytearray,bq_counts::size_bq#468:long,both_counts::src_id#417:bytearray,both_counts::size_both#480:long

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to 
explain alias null
at org.apache.pig.PigSer

[jira] [Updated] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-13 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3020:
---

Patch Info: Patch Available

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
>                 Key: PIG-3020
>                 URL: https://issues.apache.org/jira/browse/PIG-3020
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11
>            Reporter: Julien Le Dem
>            Assignee: Julien Le Dem
>         Attachments: PIG-3020.patch



[jira] [Assigned] (PIG-3020) "Duplicate uid in schema" error when joining two relations derived from the same load statement

2012-12-13 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PIG-3020:
--

Assignee: Julien Le Dem

> "Duplicate uid in schema" error when joining two relations derived from the 
> same load statement
> ---
>
>                 Key: PIG-3020
>                 URL: https://issues.apache.org/jira/browse/PIG-3020
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11
>            Reporter: Julien Le Dem
>            Assignee: Julien Le Dem
>         Attachments: PIG-3020.patch



Can someone explain the purpose UID serves in the logical plan?

2012-12-13 Thread Jonathan Coveney
Howdy y'all,

I'm trying to fix the issue in this JIRA:
https://issues.apache.org/jira/browse/PIG-3093

I got the plan at one point, and saw this:

#---
# New Logical Plan:
#---
D: (Name: LOForEach Schema: B::field1#4:chararray,field2#4:chararray)
|   |
|   (Name: LOGenerate[false,false] Schema: B::field1#4:chararray,field2#4:chararray)
|   |   |
|   |   B::field1:(Name: Project Type: chararray Uid: 4 Input: 0 Column: (*))
|   |   |
|   |   A::field1:(Name: Project Type: chararray Uid: 4 Input: 1 Column: (*))
|   |
|   |---(Name: LOInnerLoad[B::field1] Schema: B::field1#4:chararray)
|   |
|   |---(Name: LOInnerLoad[A::field1] Schema: A::field1#4:chararray)
|
|---C: (Name: LOJoin(HASH) Schema: A::field1#4:chararray,B::field1#4:chararray)
    |   |
    |   field1:(Name: Project Type: chararray Uid: 4 Input: 0 Column: field1)
    |   |
    |   field1:(Name: Project Type: chararray Uid: 4 Input: 1 Column: field1)
    |
    |---A: (Name: LOLoad Schema: field1#4:chararray)RequiredFields:null
    |
    |---B: (Name: LOForEach Schema: field1#4:chararray)
        |   |
        |   (Name: LOGenerate[false] Schema: field1#4:chararray)
        |   |   |
        |   |   field1:(Name: Project Type: chararray Uid: 4 Input: 0 Column: (*))
        |   |
        |   |---(Name: LOInnerLoad[0] Schema: field1#4:chararray)
        |
        |---A: (Name: LOLoad Schema: field1#4:chararray)RequiredFields:null


Note that the Uid is repeated (because the two fields are derived from the
same field). I'm not sure whether this is the source of the error, but since
I don't yet know what the error is, I thought I would ask about it: I don't
understand the role of the uid well, and it comes up a lot in the
LogicalPlan.


Thank you!

Jon
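
For readers following along, here is a minimal Python sketch of the kind of
check that the "duplicate uid in schema" message in PIG-3020 suggests. It is
not Pig's code: the function names parse_schema and duplicate_uids are
hypothetical, and the assumption that the validator rejects any schema in
which two fields share a uid is inferred only from the error text and the
schema strings quoted in this thread.

```python
from collections import defaultdict


def parse_schema(schema):
    """Split a printed schema such as 'A::field1#4:chararray,...'
    into (alias, uid, type) triples. Hypothetical helper."""
    fields = []
    for entry in schema.split(","):
        # The alias may itself contain '::', so split on '#' first.
        alias, _, rest = entry.partition("#")
        uid, _, type_name = rest.partition(":")
        fields.append((alias, int(uid), type_name))
    return fields


def duplicate_uids(schema):
    """Return {uid: [aliases]} for every uid used by more than one field.
    Assumption: this mirrors what the validator objects to."""
    by_uid = defaultdict(list)
    for alias, uid, _ in parse_schema(schema):
        by_uid[uid].append(alias)
    return {uid: aliases for uid, aliases in by_uid.items() if len(aliases) > 1}


# Schema of the LOJoin (alias C) in the plan above.
join_schema = "A::field1#4:chararray,B::field1#4:chararray"
print(duplicate_uids(join_schema))  # {4: ['A::field1', 'B::field1']}
```

Run against the join schema from the plan, the sketch flags uid 4 as shared
by A::field1 and B::field1, which is the repetition being asked about.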


Re: Our release process

2012-12-13 Thread Jonathan Coveney
Olga,

A related but separate question: what do y'all do when a feature is finished
but intended for an upcoming release? I.e., a feature that is in trunk but
not in 0.11 (which, let us assume, is stable).

Jon


2012/12/13 Olga Natkovich 

> Hi Julien,
>
> I think for us at Yahoo to be able to run our releases directly from the
> branch we would need the guarantees that I proposed in my initial email and
> something that we agreed to last year. The only changes that go in are
>
> - Failures without reasonable workarounds
> - Silent failures.
>
> My main concern with the proposal is that I do not believe that our
> current testing infra is robust/inclusive enough to catch errors. That's
> why I am hesitant to widen the scope.
>
> I am fine with whatever outcome the majority of people agree on. I
> am just saying that Yahoo will likely need a private branch if our rules
> are too relaxed.
>
> Olga
>
>
>
> - Original Message -
> From: Julien Le Dem 
> To: "dev@pig.apache.org" ; Olga Natkovich <
> onatkov...@yahoo.com>
> Cc:
> Sent: Wednesday, December 12, 2012 4:54 PM
> Subject: Re: Our release process
>
> Agreed. The priority of a change is subjective as well.
> My definition for inclusion on the release branch:
> - Only bug fixes.
> - Only if they have fairly understood repercussions (up to the committers
> who +/-1 as usual).
> - If we thought it would not break things but still does (CI or externally
> reported failure) we revert it.
> What do you want to add/change? Please reformulate those rules the way you
> like and let's see how we can converge.
> (Also, let's keep it short for clarity)
>
> Julien
>
> On Wed, Dec 12, 2012 at 11:08 AM, Olga Natkovich wrote:
>
> > Hi Julien,
> >
> > I understand what you are trying to do and I can see that being able to
> > make more fixes post release has value for some use cases. My concern is
> > that "things that do not destabilize the branch" is fairly subjective and
> > also not always easy to ascertain beyond trivial changes. The only way I
> > know to keep code stable is to limit the updates. Also, we need to clearly
> > state what the constraints are for post-release commits so that every user
> > can decide whether it works for them.
> >
> > Olga
> >
> >
> > 
> > From: Julien Le Dem 
> > To: "dev@pig.apache.org" 
> > Sent: Wednesday, December 12, 2012 10:26 AM
> > Subject: Re: Our release process
> >
> > I think we all agree here, let's not jump to conclusions.
> > Everything in this branch I am talking about is in Apache Pig. Everything
> > we do in Pig is contributed.
> > We have a branch for 0.11 where we keep merging the official 0.11 branch
> > plus a few patches (and it will stay small) that are only in Apache TRUNK.
> > The goal here is to help keep the release branch stable by not adding
> > patches that are only useful to us.
> > Having this branch allows us to fix anything quickly and redeploy to
> > production. It is also what allows us to use the pig 0.11 branch in
> > production before it is even released.
> > This definitely benefits the community and helps make 0.11 stable.
> > This is a very reasonable way to keep using a recent version of Pig in
> > production.
> >
> > Olga: My goal is to decrease the scope of what is going in the release
> > branch and to make sure we add only bug fixes that are not making it
> > unstable. I also think having a short definition of this helps, which is why
> > I have been chiming in.
> > Let us know how you want to decrease the scope. I'm just trying to simplify
> > here.
> >
> > Julien
> >
> >
> >
> > On Tue, Dec 11, 2012 at 8:54 AM, Prashant Kommireddi <prash1...@gmail.com> wrote:
> >
> > > Share the same concern as Russell here. Not great for the project for
> > > everyone to go "private branch" approach.
> > >
> > > On Tue, Dec 11, 2012 at 8:33 AM, Russell Jurney <russell.jur...@gmail.com> wrote:
> > >
> > > > Wait. Ack. Do we want everyone to do this? This sounds like fragmentation.
> > > > :(
> > > >
> > > > Russell Jurney twitter.com/rjurney
> > > >
> > > >
> > > > On Dec 10, 2012, at 3:24 PM, Olga Natkovich wrote:
> > > >
> > > > > If everybody is using a private branch then
> > > > >
> > > > > (1) We are not serving a significant part of our community
> > > > > (2) There is no motivation to contribute those patches to branches (only
> > > > > to trunk).
> > > > >
> > > > > Yahoo has been trying hard to work off the Apache branches, but if we
> > > > > increase the scope of what is going into branches, we will go with the
> > > > > private branch approach as well.
> > > > >
> > > > > Olga
> > > > >
> > > > >
> > > > > 
> > > > > From: Julien Le Dem 
> > > > > To: Olga Natkovich 
> > > > > Cc: "dev@pig.apache.org" ; Santhosh M S <santhosh_mut...@yahoo.com>;
> > > > > "billgra...@gmail.com" <billgra...@gmail.com>
> > > > > Sent: Friday, December 7, 2012 3:54 PM
> > 

Re: Our release process

2012-12-13 Thread Olga Natkovich
Hi Julien,

I think for us at Yahoo to be able to run our releases directly from the branch 
we would need the guarantees that I proposed in my initial email and something 
that we agreed to last year. The only changes that go in are

- Failures without reasonable workarounds
- Silent failures.

My main concern with the proposal is that I do not believe our current 
testing infra is robust/inclusive enough to catch errors. That's why I 
am hesitant to widen the scope.

I am fine with whatever outcome the majority of people agree on. I am 
just saying that Yahoo will likely need a private branch if our rules are too 
relaxed.

Olga



- Original Message -
From: Julien Le Dem 
To: "dev@pig.apache.org" ; Olga Natkovich 

Cc: 
Sent: Wednesday, December 12, 2012 4:54 PM
Subject: Re: Our release process

Agreed. The priority of a change is subjective as well.
My definition for inclusion on the release branch:
- Only bug fixes.
- Only if they have fairly understood repercussions (up to the committers
who +/-1 as usual).
- If we thought it would not break things but still does (CI or externally
reported failure) we revert it.
What do you want to add/change? Please reformulate those rules the way you
like and let's see how we can converge.
(Also, let's keep it short for clarity)

Julien

On Wed, Dec 12, 2012 at 11:08 AM, Olga Natkovich wrote:

> Hi Julien,
>
> I understand what you are trying to do and I can see that being able to
> make more fixes post release has value for some use cases. My concern is
> that "things that do not destabilize the branch" is fairly subjective and
> also not always easy to ascertain beyond trivial changes. The only way I
> know to keep code stable is to limit the updates. Also we need to clearly
> state what the constraints are for post-release commits so that every user
> can decide whether it works for them.
>
> Olga
>
>
> 
> From: Julien Le Dem 
> To: "dev@pig.apache.org" 
> Sent: Wednesday, December 12, 2012 10:26 AM
> Subject: Re: Our release process
>
> I think we all agree here, let's not jump to conclusions.
> Everything in this branch I am talking about is in Apache Pig. Everything
> we do in Pig is contributed.
> We have a branch for 0.11 where we keep merging the official 0.11 branch
> plus a few patches (and it will stay small) that are only in Apache TRUNK.
> The goal here is to help keeping the release branch stable by not adding
> patches that are only useful to us.
> Having this branch allows us to fix anything quickly and redeploy to
> production. It is also what allows us to use the pig 0.11 branch in
> production before it is even released.
> This definitely benefits the community and helps making 0.11 stable.
> This is a very reasonable way to keep using a recent version of Pig in
> production.
>
> Olga: My goal is to decrease the scope of what is going in the release
> branch and to make sure we add only bug fixes that are not making it
> unstable. I also think having a short definition of this helps which is why
> I have been chiming in.
> Let us know how you want to decrease the scope. I'm just trying to simplify
> here.
>
> Julien
>
>
>
> On Tue, Dec 11, 2012 at 8:54 AM, Prashant Kommireddi wrote:
>
> > Share the same concern as Russell here. Not great for the project for
> > everyone to go "private branch" approach.
> >
> > On Tue, Dec 11, 2012 at 8:33 AM, Russell Jurney <russell.jur...@gmail.com> wrote:
> >
> > > Wait. Ack. Do we want everyone to do this? This sounds like fragmentation.
> > > :(
> > >
> > > Russell Jurney twitter.com/rjurney
> > >
> > >
> > > On Dec 10, 2012, at 3:24 PM, Olga Natkovich wrote:
> > >
> > > > If everybody is using a private branch then
> > > >
> > > > (1) We are not serving a significant part of our community
> > > > (2) There is no motivation to contribute those patches to branches (only
> > > > to trunk).
> > > >
> > > > Yahoo has been trying hard to work off the Apache branches, but if we
> > > > increase the scope of what is going into branches, we will go with the
> > > > private branch approach as well.
> > > >
> > > > Olga
> > > >
> > > >
> > > > 
> > > > From: Julien Le Dem 
> > > > To: Olga Natkovich 
> > > > Cc: "dev@pig.apache.org" ; Santhosh M S <santhosh_mut...@yahoo.com>;
> > > > "billgra...@gmail.com" <billgra...@gmail.com>
> > > > Sent: Friday, December 7, 2012 3:54 PM
> > > > Subject: Re: Our release process
> > > >
> > > > Here are my criteria for inclusion in a release branch:
> > > > - No new features. Only bug fixes.
> > > > - The criteria are more about stability than priority. The person/group
> > > > asking for it has a good reason for wanting it in the branch. If committers
> > > > think the patch is reasonable and won't make the branch unstable, then we
> > > > should check it in. If it breaks something anyway, we revert it.
> > > >
> > > > For what it's worth we (at Twitter) maintain a

[jira] [Commented] (PIG-2553) Pig shouldn't allow attempts to write multiple relations into same directory

2012-12-13 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531449#comment-13531449
 ] 

Cheolsoo Park commented on PIG-2553:


Hi Prashant,

Thanks for your responses:
# Agreed.
# Thanks.
# On second thought, how about simplifying it even further?
{code}
if ("true".equals(pigContext.getProperties().getProperty(PIG_LOCATION_CHECK_STRICT))) {
    checkDuplicateStoreLoc(storeOps);
}
...
/**
 * This method checks whether multiple sinks (STORE) use the same
 * "file-based" location. If so, it throws a runtime exception.
 *
 * @param storeOps the STORE operators in the plan
 */
private void checkDuplicateStoreLoc(Set<LOStore> storeOps) {
    Set<String> uniqueStoreLoc = new HashSet<String>();
    for (LOStore store : storeOps) {
        String filename = store.getFileSpec().getFileName();
        // Set.add returns false if the location was already seen.
        if (!uniqueStoreLoc.add(filename) && UriUtil.isHDFSFileOrLocalOrS3N(filename)) {
            throw new RuntimeException("Script contains 2 or more STORE statements writing to same location : " + filename);
        }
    }
}
{code}
# Sure. That sounds reasonable. But can you add the new property to 
{{pig.properties}} as well? I like to have a single place where all properties 
are listed. As far as I know, {{pig.properties}} is the only such place as of now.
# I can't build {{admin.xml}}. I get the following error when running {{ant 
docs}}:
{code}
 [exec] 
/home/cheolsoo/workspace/pig/src/docs/src/documentation/content/xdocs/admin.xml:33:66:
 Element type "b" must be declared.
 [exec] 
/home/cheolsoo/workspace/pig/src/docs/src/documentation/content/xdocs/admin.xml:33:194:
 The content of element type "p" must match 
"(strong|em|code|sub|sup|br|img|icon|acronym|map|xi:include|a)"
{code}
Replacing {{<b>}} with {{<strong>}} works for me. Also, it would 
be nice if you could avoid using tabs for indentation. :-)
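
For readers without the Pig source at hand, the duplicate-location check discussed in this comment can be sketched as a stand-alone program. This is only an illustration under assumptions: the class name StoreLocationCheck and the method firstDuplicate are hypothetical, locations are plain strings rather than LOStore operators, and the UriUtil.isHDFSFileOrLocalOrS3N filter from the snippet above is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; this is not the code from the attached patch.
final class StoreLocationCheck {

    /**
     * Returns the first location that occurs more than once,
     * or null if every location is unique.
     */
    static String firstDuplicate(List<String> storeLocations) {
        Set<String> seen = new HashSet<String>();
        for (String location : storeLocations) {
            // Set.add returns false when the element was already present.
            if (!seen.add(location)) {
                return location;
            }
        }
        return null;
    }

    /** Throws a RuntimeException if two STORE targets share a location. */
    static void checkDuplicateStoreLoc(List<String> storeLocations) {
        String duplicate = firstDuplicate(storeLocations);
        if (duplicate != null) {
            throw new RuntimeException(
                "Script contains 2 or more STORE statements writing to same location : "
                + duplicate);
        }
    }

    public static void main(String[] args) {
        // Two distinct targets: the check passes silently.
        checkDuplicateStoreLoc(Arrays.asList("out/a", "out/b"));
        try {
            // "out/a" is used twice, so the check fails before any job would run.
            checkDuplicateStoreLoc(Arrays.asList("out/a", "out/b", "out/a"));
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Relying on the return value of Set.add keeps the check to a single pass over the STORE targets.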

> Pig shouldn't allow attempts to write multiple relations into same directory
> 
>
> Key: PIG-2553
> URL: https://issues.apache.org/jira/browse/PIG-2553
> Project: Pig
>  Issue Type: Improvement
>Reporter: Dmitriy V. Ryaboy
>Assignee: Prashant Kommireddi
> Attachments: PIG-2553_1.patch, PIG-2553.patch
>
>
> We've seen multiple occasions where users accidentally try to store 2 or more 
> different relations to the same destination directory. Currently, this passes 
> the Pig planner and fails on MR side due to concurrent attempts to create the 
> same part file on the reducer. This is extremely confusing to the user, and 
> hard to debug.
> We should instead fail their scripts before they are even submitted, since we 
> can identify the erroneous condition from the beginning.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2857) Add a -tagPath option to PigStorage

2012-12-13 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2857:
---

Fix Version/s: 0.12

> Add a -tagPath option to PigStorage
> ---
>
> Key: PIG-2857
> URL: https://issues.apache.org/jira/browse/PIG-2857
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
>Assignee: Prashant Kommireddi
> Fix For: 0.12
>
> Attachments: PIG-2857_1.patch, PIG-2857_2.patch, PIG-2857_3.patch, 
> PIG-2857.patch
>
>
> We recently added a "-tagSource" option to PigStorage, which allows us to add 
> filenames from which records come to the returned tuples.
> Often, users want the whole path, not just the source file. I propose we add 
> a "-tagPath" option to do this.
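
To make the difference between the two options concrete, here is a rough sketch of tagging a record with either the source file name or the whole path. It is not PigStorage code: the class SourceTagSketch, its methods, the example path, and the choice to put the tag in the first field are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and field position are assumptions.
final class SourceTagSketch {

    /** Returns the last path segment, i.e. the bare file name. */
    static String fileNameOf(String sourcePath) {
        int slash = sourcePath.lastIndexOf('/');
        return slash < 0 ? sourcePath : sourcePath.substring(slash + 1);
    }

    /**
     * Returns a copy of the record with a source tag added as the first field.
     * wholePath = true tags with the full path, false with the file name only.
     */
    static List<Object> tag(List<Object> fields, String sourcePath, boolean wholePath) {
        List<Object> tagged = new ArrayList<Object>(fields.size() + 1);
        tagged.add(wholePath ? sourcePath : fileNameOf(sourcePath));
        tagged.addAll(fields);
        return tagged;
    }

    public static void main(String[] args) {
        List<Object> fields = Arrays.<Object>asList("alice", 42);
        String source = "hdfs://namenode/logs/2012/part-00000"; // made-up path
        System.out.println(tag(fields, source, false)); // file name only
        System.out.println(tag(fields, source, true));  // whole path
    }
}
```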



[jira] [Resolved] (PIG-2857) Add a -tagPath option to PigStorage

2012-12-13 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park resolved PIG-2857.


Resolution: Fixed

+1.

Committed to trunk. Thanks Prashant!

> Add a -tagPath option to PigStorage
> ---
>
> Key: PIG-2857
> URL: https://issues.apache.org/jira/browse/PIG-2857
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
>Assignee: Prashant Kommireddi
> Attachments: PIG-2857_1.patch, PIG-2857_2.patch, PIG-2857_3.patch, 
> PIG-2857.patch
>
>
> We recently added a "-tagSource" option to PigStorage, which allows us to add 
> filenames from which records come to the returned tuples.
> Often, users want the whole path, not just the source file. I propose we add 
> a "-tagPath" option to do this.



[jira] [Commented] (PIG-2553) Pig shouldn't allow attempts to write multiple relations into same directory

2012-12-13 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531401#comment-13531401
 ] 

Prashant Kommireddi commented on PIG-2553:
--

Hi Cheolsoo, please see my comments below

1. Should we hold off on making this variable public until it is needed? One 
could always modify the scope in the future.
2. Good point. Will do.
3. Returning String makes sense.
4. I feel like the new section would be a useful place for admins to go to, and 
we could keep adding properties that admins could/should be aware of. If it's 
overkill, I am fine with documenting pig.properties only. Let me know.

Again, thanks for reviewing.

> Pig shouldn't allow attempts to write multiple relations into same directory
> 
>
> Key: PIG-2553
> URL: https://issues.apache.org/jira/browse/PIG-2553
> Project: Pig
>  Issue Type: Improvement
>Reporter: Dmitriy V. Ryaboy
>Assignee: Prashant Kommireddi
> Attachments: PIG-2553_1.patch, PIG-2553.patch
>
>
> We've seen multiple occasions where users accidentally try to store 2 or more 
> different relations to the same destination directory. Currently, this passes 
> the Pig planner and fails on MR side due to concurrent attempts to create the 
> same part file on the reducer. This is extremely confusing to the user, and 
> hard to debug.
> We should instead fail their scripts before they are even submitted, since we 
> can identify the erroneous condition from the beginning.



[jira] [Commented] (PIG-2857) Add a -tagPath option to PigStorage

2012-12-13 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531381#comment-13531381
 ] 

Prashant Kommireddi commented on PIG-2857:
--

Go ahead. Thanks Cheolsoo

> Add a -tagPath option to PigStorage
> ---
>
> Key: PIG-2857
> URL: https://issues.apache.org/jira/browse/PIG-2857
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
>Assignee: Prashant Kommireddi
> Attachments: PIG-2857_1.patch, PIG-2857_2.patch, PIG-2857_3.patch, 
> PIG-2857.patch
>
>
> We recently added a "-tagSource" option to PigStorage, which allows us to add 
> filenames from which records come to the returned tuples.
> Often, users want the whole path, not just the source file. I propose we add 
> a "-tagPath" option to do this.



[jira] [Updated] (PIG-2857) Add a -tagPath option to PigStorage

2012-12-13 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2857:
---

Attachment: PIG-2857_3.patch

Hi Prashant,

Thank you very much. I verified:
- The doc builds fine.
- ant test-commit passes.
- ant test -Dtestcase=TestPigStorage passes.

I am attaching a new patch where I removed tabs. I also made a minor change to 
the PigStorage option parsing code as follows:

from
{code}
+    if (configuredOptions.hasOption("tagsource")) {
+        mLog.warn("'-tagsource' is deprecated. Use '-tagFile' instead.");
+    }
     isSchemaOn = configuredOptions.hasOption("schema");
     dontLoadSchema = configuredOptions.hasOption("noschema");
-    tagSource = configuredOptions.hasOption(TAG_SOURCE_PATH);
+    // Remove -tagsource in 0.13. For backward compatibility we need
+    // tagsource to be supported until at least 0.12
+    tagFile = configuredOptions.hasOption(TAG_SOURCE_FILE) || configuredOptions.hasOption("tagsource");
+    tagPath = configuredOptions.hasOption(TAG_SOURCE_PATH);
{code}
to
{code}
-    tagSource = configuredOptions.hasOption(TAG_SOURCE_PATH);
+    tagFile = configuredOptions.hasOption(TAG_SOURCE_FILE);
+    tagPath = configuredOptions.hasOption(TAG_SOURCE_PATH);
+    // TODO: Remove -tagsource in 0.13. For backward compatibility, we
+    // need tagsource to be supported until at least 0.12
+    if (configuredOptions.hasOption("tagsource")) {
+        mLog.warn("'-tagsource' is deprecated. Use '-tagFile' instead.");
+        tagFile = true;
+    }
{code}
If you're fine with the change, I will go ahead and commit it.
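
The intent of the reordered version (one place that both warns about the deprecated option and maps it onto the new flag) can be shown in a self-contained sketch. Nothing here is the real PigStorage class: TagOptionSketch is a hypothetical name, a plain set of strings stands in for the parsed command line, and the option strings "tagFile" and "tagPath" are assumed values for TAG_SOURCE_FILE and TAG_SOURCE_PATH.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the deprecation handling; not the actual PigStorage code.
final class TagOptionSketch {
    final boolean tagFile;
    final boolean tagPath;
    final boolean deprecatedOptionUsed;

    TagOptionSketch(String... options) {
        // Stand-in for the parsed options object used in the patch.
        Set<String> configured = new HashSet<String>(Arrays.asList(options));

        boolean file = configured.contains("tagFile"); // assumed value of TAG_SOURCE_FILE
        tagPath = configured.contains("tagPath");      // assumed value of TAG_SOURCE_PATH

        // The deprecated spelling is still accepted: warn, then treat it as tagFile.
        deprecatedOptionUsed = configured.contains("tagsource");
        if (deprecatedOptionUsed) {
            System.err.println("'-tagsource' is deprecated. Use '-tagFile' instead.");
            file = true;
        }
        tagFile = file;
    }

    public static void main(String[] args) {
        TagOptionSketch legacy = new TagOptionSketch("tagsource");
        System.out.println("tagFile=" + legacy.tagFile + ", tagPath=" + legacy.tagPath);
    }
}
```

Keeping the warning and the fallback in the same branch means the backward-compatibility code can later be deleted in one piece.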

> Add a -tagPath option to PigStorage
> ---
>
> Key: PIG-2857
> URL: https://issues.apache.org/jira/browse/PIG-2857
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
>Assignee: Prashant Kommireddi
> Attachments: PIG-2857_1.patch, PIG-2857_2.patch, PIG-2857_3.patch, 
> PIG-2857.patch
>
>
> We recently added a "-tagSource" option to PigStorage, which allows us to add 
> filenames from which records come to the returned tuples.
> Often, users want the whole path, not just the source file. I propose we add 
> a "-tagPath" option to do this.



[jira] [Commented] (PIG-3089) Implicit relation names

2012-12-13 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531342#comment-13531342
 ] 

Jonathan Coveney commented on PIG-3089:
---

Thejas: I implemented your suggestion here: 
https://issues.apache.org/jira/browse/PIG-3090

> Implicit relation names
> ---
>
> Key: PIG-3089
> URL: https://issues.apache.org/jira/browse/PIG-3089
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt, parser
>Reporter: Russell Jurney
>Assignee: Jonathan Coveney
>
> A = load foo;
> B = load bar;
> filter A by id > 5;
> join A_1 by id, B by id;
> // or A_filter
> foreach A_1_B generate id;
> store into foobar; // A_1_B_1 or A_filter_B_generate
> Or some such routine?
> We don't have to be explicit no more!
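
One possible reading of the numbered naming scheme in the description above (A_1, A_1_B, A_1_B_1) is sketched below. The description leaves the exact rule open, so this is purely hypothetical: the class ImplicitAliasSketch and its methods are invented for illustration, and it assumes a single-input operator appends a per-input counter while a join concatenates its two input names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical naming helper; it does not reflect any committed Pig behaviour.
final class ImplicitAliasSketch {
    // How many implicit aliases have been derived from each input alias so far.
    private final Map<String, Integer> counters = new HashMap<String, Integer>();

    /** Alias for the result of a single-input operator (filter, foreach): input_N. */
    String unary(String input) {
        Integer previous = counters.get(input);
        int next = (previous == null) ? 1 : previous + 1;
        counters.put(input, next);
        return input + "_" + next;
    }

    /** Alias for the result of a two-input operator (join): left_right. */
    String binary(String left, String right) {
        return left + "_" + right;
    }

    public static void main(String[] args) {
        ImplicitAliasSketch names = new ImplicitAliasSketch();
        String filtered = names.unary("A");            // filter A by id > 5;
        String joined = names.binary(filtered, "B");   // join A_1 by id, B by id;
        String projected = names.unary(joined);        // foreach A_1_B generate id;
        System.out.println(filtered + " " + joined + " " + projected);
    }
}
```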



[jira] [Commented] (PIG-3089) Implicit relation names

2012-12-13 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531339#comment-13531339
 ] 

Russell Jurney commented on PIG-3089:
-

I sit there for minutes trying to name my relations. That's what I want to fix.

I like Thejas' suggestion better.

> Implicit relation names
> ---
>
> Key: PIG-3089
> URL: https://issues.apache.org/jira/browse/PIG-3089
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt, parser
>Reporter: Russell Jurney
>Assignee: Jonathan Coveney
>
> A = load foo;
> B = load bar;
> filter A by id > 5;
> join A_1 by id, B by id;
> // or A_filter
> foreach A_1_B generate id;
> store into foobar; // A_1_B_1 or A_filter_B_generate
> Or some such routine?
> We don't have to be explicit no more!



[jira] [Updated] (PIG-2857) Add a -tagPath option to PigStorage

2012-12-13 Thread Prashant Kommireddi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Kommireddi updated PIG-2857:
-

Attachment: PIG-2857_2.patch

Thanks Cheolsoo. I have updated the patch with your feedback incorporated, 
except that I wasn't sure about where the tabs were present.

> Add a -tagPath option to PigStorage
> ---
>
> Key: PIG-2857
> URL: https://issues.apache.org/jira/browse/PIG-2857
> Project: Pig
>  Issue Type: New Feature
>Reporter: Dmitriy V. Ryaboy
>Assignee: Prashant Kommireddi
> Attachments: PIG-2857_1.patch, PIG-2857_2.patch, PIG-2857.patch
>
>
> We recently added a "-tagSource" option to PigStorage, which allows us to add 
> filenames from which records come to the returned tuples.
> Often, users want the whole path, not just the source file. I propose we add 
> a "-tagPath" option to do this.



[jira] [Updated] (PIG-2341) Need better documentation on Pig/HBase integration

2012-12-13 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-2341:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed, thanks Jayesh! This documentation is way overdue, so huge props for 
jumping on it.

> Need better documentation on Pig/HBase integration
> --
>
> Key: PIG-2341
> URL: https://issues.apache.org/jira/browse/PIG-2341
> Project: Pig
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: 0.9.0, 0.10.0
>Reporter: Mikael Sitruk
>Assignee: Jayesh Thakrar
>  Labels: documentation, hbase
> Fix For: 0.11
>
> Attachments: PIG-2341.2.patch, PIG-2341.3.patch, PIG-2341.4.patch, 
> PIG-2341.5.patch, PIG-2341.patch
>
>
> One of the nice things about Pig and HBase is that they can be integrated, 
> thanks to the recently committed patch (PIG-1250).
> The documentation is not well updated yet (currently it relates almost only to 
> the patch itself). It would be nice to document this feature in detail in the 
> Pig documentation page (e.g., here: 
> http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions).



[jira] [Resolved] (PIG-3092) HBaseStorage javadoc cleanup

2012-12-13 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham resolved PIG-3092.
--

   Resolution: Duplicate
Fix Version/s: 0.11
 Assignee: Bill Graham

Including this in PIG-2341. Marking as duplicate.

> HBaseStorage javadoc cleanup
> 
>
> Key: PIG-3092
> URL: https://issues.apache.org/jira/browse/PIG-3092
> Project: Pig
>  Issue Type: Bug
>Reporter: Bill Graham
>Assignee: Bill Graham
>  Labels: docuentation, hbase, noob, simple
> Fix For: 0.11
>
>
> This JavaDoc is incorrect, since there's no {{AS}} in {{STORE}}:
> {noformat}
>  * copy = STORE raw INTO 'hbase://SampleTableCopy'
>  *   USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>  *   'info:first_name info:last_name friends:* info:*')
>  *   AS (info:first_name info:last_name buddies:* info:*);
> {noformat}



[jira] [Updated] (PIG-2341) Need better documentation on Pig/HBase integration

2012-12-13 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-2341:
-

Attachment: PIG-2341.5.patch

Thanks Jayesh for the merge! I think we're all set. Attaching patch 5 which 
contains some minor tweaks and two main changes:

- Rebasing the patch onto the base of the Pig repo. You will generally want to 
submit patches so they can be applied from the base dir.
- Rolling javadoc bug PIG-3092 into this one.

> Need better documentation on Pig/HBase integration
> --
>
> Key: PIG-2341
> URL: https://issues.apache.org/jira/browse/PIG-2341
> Project: Pig
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: 0.9.0, 0.10.0
>Reporter: Mikael Sitruk
>Assignee: Jayesh Thakrar
>  Labels: documentation, hbase
> Fix For: 0.11
>
> Attachments: PIG-2341.2.patch, PIG-2341.3.patch, PIG-2341.4.patch, 
> PIG-2341.5.patch, PIG-2341.patch
>
>
> One of the nice things about Pig and HBase is that they can be integrated, 
> thanks to the recently committed patch (PIG-1250).
> The documentation is not well updated yet (currently it relates almost only to 
> the patch itself). It would be nice to document this feature in detail in the 
> Pig documentation page (e.g., here: 
> http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions).
