[jira] [Created] (PIG-2062) Script silently ended

2011-05-11 Thread Daniel Dai (JIRA)
Script silently ended
-

 Key: PIG-2062
 URL: https://issues.apache.org/jira/browse/PIG-2062
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.0
Reporter: Daniel Dai
 Fix For: 0.9.0


The following script ends silently without being executed.

{code}
a = load '1.txt' as (a0, a1);
b = load '2.txt' as (b0, b1);
all = join a by a0, b by b0;
store all into '';
{code}

If the alias "all" is renamed, the script runs. We need to throw an exception 
saying that "all" is a keyword.
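
A minimal sketch of the kind of check being asked for, written in Python purely for illustration. Pig's parser is Java; the function name check_alias and the keyword subset below are assumptions, not Pig's actual implementation. The idea is to reject a reserved word up front instead of ending silently:

```python
# Illustrative subset of Pig Latin keywords; the real grammar defines the full list.
RESERVED_WORDS = {
    "all", "as", "by", "filter", "foreach", "generate",
    "group", "into", "join", "load", "store",
}


def check_alias(alias):
    """Return the alias unchanged, or raise if it collides with a keyword."""
    if alias.lower() in RESERVED_WORDS:  # Pig Latin keywords are case-insensitive
        raise ValueError('"%s" is a keyword and cannot be used as an alias' % alias)
    return alias


check_alias("joined")  # accepted; check_alias("all") would raise ValueError
```

With a check like this, the script above would fail with a clear message on its third line rather than producing no output.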

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2056) Jython error messages should show script name

2011-05-11 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032152#comment-13032152
 ] 

Thejas M Nair commented on PIG-2056:


+1

> Jython error messages should show script name
> -
>
> Key: PIG-2056
> URL: https://issues.apache.org/jira/browse/PIG-2056
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Richard Ding
>Assignee: Richard Ding
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: PIG-2056.patch
>
>
> Instead of messages like
> {code}
> Traceback (most recent call last):
>   File "", line 12, in 
> {code}
> It should display the script file name:
> {code}
> Traceback (most recent call last):
>   File "test.py", line 12, in 
> {code}
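
One way to get a script name into a traceback, sketched in plain Python. This is an assumption about the general mechanism, not a description of what PIG-2056.patch does: the filename argument of the built-in compile() is what the interpreter prints in the File "..." line. The helper name run_script is hypothetical.

```python
import traceback


def run_script(source, filename):
    """Run `source`, compiled under `filename`, and return any traceback text."""
    code = compile(source, filename, "exec")  # filename ends up in the traceback
    try:
        exec(code, {"__name__": "__main__"})
    except Exception:
        return traceback.format_exc()
    return ""


report = run_script("x = 1\nraise ValueError('boom')\n", "test.py")
```

Here report contains a File "test.py", line 2 entry, because the code object carries the name it was compiled with.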



[jira] [Commented] (PIG-2059) PIG doesn't validate incomplete query in batch mode even if -c option is given

2011-05-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032087#comment-13032087
 ] 

Daniel Dai commented on PIG-2059:
-

It might be better to rename the methods: validateQuery -> parseAndValidate and 
compile -> validate. The rest looks good.

> PIG doesn't validate incomplete query in batch mode even if -c option is given
> --
>
> Key: PIG-2059
> URL: https://issues.apache.org/jira/browse/PIG-2059
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.9.0
>
> Attachments: PIG-2059.patch
>
>
> Given the following script in a file, Pig doesn't report any error, even if 
> the -c option is given:
> A = load 'x' as (u, v);
> B = foreach A generate $3;
> It's questionable whether to validate the query in batch mode, as it doesn't 
> contain any store/dump statement. However, if the -c option is given, 
> validation should nevertheless be performed.
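
The error in the script above is that $3 refers to a fourth field while the schema (u, v) has only two. A rough Python sketch of that kind of positional check follows; it is not Pig's validator, and the function name check_positions is hypothetical:

```python
import re


def check_positions(schema, expression):
    """Return the positional references ($n) that fall outside the schema."""
    positions = (int(n) for n in re.findall(r"\$(\d+)", expression))
    return ["$%d" % n for n in positions if n >= len(schema)]


errors = check_positions(["u", "v"], "$3")  # positions are zero-based
```

A syntax-check run would be expected to report every entry in errors, whether or not the script contains a store or dump statement.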



[jira] [Assigned] (PIG-2044) Patten match bug in org.apache.pig.newplan.optimizer.Rule

2011-05-11 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai reassigned PIG-2044:
---

Assignee: Koji Noguchi  (was: Daniel Dai)

> Patten match bug in org.apache.pig.newplan.optimizer.Rule
> -
>
> Key: PIG-2044
> URL: https://issues.apache.org/jira/browse/PIG-2044
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Koji Noguchi
> Fix For: 0.10
>
>
> Koji found a bug in org.apache.pig.newplan.optimizer.Rule. The "break" on 
> line 179 seems to be wrong. This multiple-branch matching is not currently 
> used in Pig, but it could become a problem in the future. 



[jira] [Created] (PIG-2061) NewPlan match() is sensitive to ordering

2011-05-11 Thread Koji Noguchi (JIRA)
NewPlan match() is sensitive to ordering


 Key: PIG-2061
 URL: https://issues.apache.org/jira/browse/PIG-2061
 Project: Pig
  Issue Type: Bug
Reporter: Koji Noguchi
Priority: Minor


No current Rule is affected by this, but consider the following test inside 
TestNewPlanRule.java:

{noformat}
155 public void testMultiNode() throws Exception {
...
175  pattern.connect(op1, op3);
176  pattern.connect(op2, op3);
...
178  Rule r = new SillyRule("basic", pattern);
179  List l = r.match(plan);
180  assertEquals(1, l.size());
{noformat}

This test fails when lines 175 and 176 are swapped, even though the two 
orderings are structurally equivalent.
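
A toy Python illustration of why connect() order can leak into a comparison. This is not Pig's Rule.match(); the Pattern class and both helper functions are hypothetical. If predecessors are kept as lists in call order, two structurally identical patterns compare unequal; comparing them as sets removes the dependency:

```python
from collections import defaultdict


class Pattern:
    """Toy pattern graph that records predecessors in connect() call order."""

    def __init__(self):
        self.predecessors = defaultdict(list)

    def connect(self, source, target):
        self.predecessors[target].append(source)


def equal_ordered(p, q):
    # Compares predecessor lists as-is, so connect() order affects the result.
    return dict(p.predecessors) == dict(q.predecessors)


def predecessor_sets(pattern):
    return {node: set(preds) for node, preds in pattern.predecessors.items()}


def equal_unordered(p, q):
    # Compares predecessor sets, so connect() order is irrelevant.
    return predecessor_sets(p) == predecessor_sets(q)


first, second = Pattern(), Pattern()
first.connect("op1", "op3")
first.connect("op2", "op3")
second.connect("op2", "op3")  # same edges, swapped order
second.connect("op1", "op3")
```

In this sketch equal_ordered(first, second) is false while equal_unordered(first, second) is true, which mirrors the behavior the swapped test lines expose.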




[jira] [Updated] (PIG-1825) ability to turn off the write ahead log for pig's HBaseStorage

2011-05-11 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-1825:
-

Attachment: PIG-1825_3.patch

Here's patch #3 with {{testStoreToHBase_2_no_WAL}} removed. I agree we should 
remove it if HBase doesn't even deal with it in the unit test mode.

I think using the {{-noWAL}} option makes the most sense, since it's very clear 
what it does. I've added comments in the Javadocs to make sure the risks are 
clear.

If someone uses an obscurely named flag (i.e., -noWAL) without understanding 
what it does by reading either the Pig javadocs or the HBase documentation, 
then they're really flying blind.

> ability to turn off the write ahead log for pig's HBaseStorage
> --
>
> Key: PIG-1825
> URL: https://issues.apache.org/jira/browse/PIG-1825
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Corbin Hoenes
>Assignee: Bill Graham
>Priority: Minor
> Attachments: HBaseStorage_noWAL.patch, PIG-1825_1.patch, 
> PIG-1825_2.patch, PIG-1825_3.patch
>
>
> Added an option to allow a caller of HBaseStorage to turn off the 
> WriteAheadLog feature while doing bulk loads into hbase.
> From the performance tuning wikipage: 
> http://wiki.apache.org/hadoop/PerformanceTuning
> "To speed up the inserts in a non critical job (like an import job), you can 
> use Put.writeToWAL(false) to bypass writing to the write ahead log."
> We've tested this on HBase 0.20.6 and it helps dramatically.  
> The -noWAL option is passed in just like other options for HBase storage:
> STORE myalias INTO 'MyTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycolumnfamily:field1 
> mycolumnfamily:field2','-noWAL');
> This would be my first patch so please educate me with any steps I need to 
> do.  
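
As a rough illustration of how an option string such as '-noWAL' might be turned into flags, here is a small Python sketch. It is not HBaseStorage's actual option parser; the function name parse_storage_options and the decision to reject unknown tokens are assumptions:

```python
import shlex


def parse_storage_options(option_string, known=("noWAL",)):
    """Map each known flag to True if present in the string, else False."""
    flags = {name: False for name in known}
    for token in shlex.split(option_string):
        name = token.lstrip("-")
        if not token.startswith("-") or name not in flags:
            raise ValueError("unknown option: %s" % token)
        flags[name] = True
    return flags


options = parse_storage_options("-noWAL")
```

Defaulting every flag to False keeps the write-ahead log on unless the caller explicitly opts out, which matches the risk described in the issue.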



[jira] [Commented] (PIG-2056) Jython error messages should show script name

2011-05-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031911#comment-13031911
 ] 

Richard Ding commented on PIG-2056:
---

Result of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
{code}



[jira] [Commented] (PIG-2059) PIG doesn't validate incomplete query in batch mode even if -c option is given

2011-05-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031905#comment-13031905
 ] 

Xuefu Zhang commented on PIG-2059:
--

Test-patch run:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.




[jira] [Updated] (PIG-2059) PIG doesn't validate incomplete query in batch mode even if -c option is given

2011-05-11 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-2059:
-

Attachment: PIG-2059.patch



[jira] [Commented] (PIG-1683) New logical plan: Nested foreach plan fail if one inner alias is refered more than once

2011-05-11 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031873#comment-13031873
 ] 

Daniel Dai commented on PIG-1683:
-

I tried it with 0.8.1. Even with the old logical plan 
(-Dpig.usenewlogicalplan=false), the issue is the same. However, 0.9 fixes it. 
It seems to be a problem in the old parser that the new parser in 0.9 fixes.

> New logical plan: Nested foreach plan fail if one inner alias is refered more 
> than once
> ---
>
> Key: PIG-1683
> URL: https://issues.apache.org/jira/browse/PIG-1683
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1683-1.patch
>
>
> The following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2);
> b = load '2.txt' as (b0, b1);
> c = join a by a0, b by b0;
> d = foreach c {
> d0 = a::a0;
> d1 = a::a1;
> generate ((d0 is not null)? d0 : d1);
> }
> explain d;
> {code}
> Stack:
> ERROR 2015: Invalid physical operators in the physical plan
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to 
> explain alias d
> at org.apache.pig.PigServer.explain(PigServer.java:957)
> at 
> org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:353)
> at 
> org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:285)
> at 
> org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:248)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.Explain(PigScriptParser.java:605)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:327)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
> at org.apache.pig.Main.run(Main.java:498)
> at org.apache.pig.Main.main(Main.java:107)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2042: 
> Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:308)
> at org.apache.pig.PigServer.compilePp(PigServer.java:1350)
> at org.apache.pig.PigServer.explain(PigServer.java:926)
> ... 10 more
> Caused by: 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
>  ERROR 2015: Invalid physical operators in the physical plan
> at 
> org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:474)
> at 
> org.apache.pig.newplan.logical.expression.BinCondExpression.accept(BinCondExpression.java:82)
> at 
> org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70)
> at 
> org.apache.pig.newplan.logical.relational.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:519)
> at 
> org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
> at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:295)
> ... 12 more
> Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give 
> operator of type 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
>  multiple outputs.  This operator does not support multiple outputs.
> at 
> org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:180)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:133)
> at 
> org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:470)
> ... 19 more



[jira] [Resolved] (PIG-2058) Macro missing returns clause doesn't give a good error message

2011-05-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-2058.
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]

Patch committed to trunk and 0.9 branch.

> Macro missing returns clause doesn't give a good error message
> --
>
> Key: PIG-2058
> URL: https://issues.apache.org/jira/browse/PIG-2058
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Xuefu Zhang
>Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-2058.patch
>
>
> For the following query:
> define test( out1,out2 ){
>A  = load 'x' as (u:int, v:int);
>$B  = filter A by u < 3 and v <  20;
> }
> Pig gives the following error message: Syntax error,unexpected symbol at or 
> near '{'
> Previously, it gave: mismatched input '{' expecting RETURNS
> The previous message is more meaningful.



[jira] [Commented] (PIG-2058) Macro missing returns clause doesn't give a good error message

2011-05-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031833#comment-13031833
 ] 

Xuefu Zhang commented on PIG-2058:
--

+1



[jira] [Updated] (PIG-2014) SAMPLE shouldn't be pushed up

2011-05-11 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-2014:
---

Release Note: 
A new annotation, @Nondeterministic, is introduced to allow UDF authors to mark 
their UDFs as such. 

A non-deterministic UDF is one that can produce different results when invoked 
on the same input. Examples of non-deterministic UDFs include getCurrentTime() 
and RANDOM.

Certain Pig optimizations depend on UDFs being deterministic. It is therefore 
very important for correctness that non-deterministic UDFs be annotated as 
such. 

  Status: Patch Available  (was: Open)
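
To make the release note concrete, here is a Python stand-in for the idea. The real annotation is the Java @Nondeterministic named above; the decorator, the two sample UDFs, and can_push_filter are hypothetical and only sketch how an optimizer could consult such a marker before moving a filter:

```python
import random


def nondeterministic(func):
    """Mark a UDF as non-deterministic (Python stand-in for the Java annotation)."""
    func.nondeterministic = True
    return func


@nondeterministic
def random_udf(_row):
    return random.random()


def upper_udf(row):
    return row.upper()


def can_push_filter(udfs):
    """A filter may be moved only if none of its UDFs is marked non-deterministic."""
    return not any(getattr(udf, "nondeterministic", False) for udf in udfs)
```

An unmarked UDF is treated as deterministic in this sketch, which is why the note stresses that authors must annotate non-deterministic UDFs themselves.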

> SAMPLE shouldn't be pushed up
> -
>
> Key: PIG-2014
> URL: https://issues.apache.org/jira/browse/PIG-2014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0, 0.10
>Reporter: Jacob Perkins
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.9.0
>
> Attachments: PIG-2014.2.patch, PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, 
> weight:double);
> grouped   = GROUP tfidf_all BY doc_id;
> vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, 
> weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 
> 1,428,280 records. The reduce output records should be exactly the number of 
> documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce 
> output records should be much closer to 22 or 23 (i.e., 0.0012 * 18,863).
> Evidently, Pig rewrites SAMPLE into a filter, and then pushes that filter in 
> front of the group. It shouldn't push that filter, 
> since the UDF is non-deterministic.  
> Quick fix: If you add "-t PushUpFilter" to your command line when invoking 
> Pig, this won't happen.
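
The arithmetic behind the report can be checked with a short Python calculation. The second estimate assumes, purely for illustration, that every document has the same number of records, which the issue does not state. Under that assumption, sampling before the GROUP keeps any document that has at least one surviving record:

```python
records = 1_428_280
documents = 18_863
rate = 0.0012

# SAMPLE applied after the GROUP: each document is kept with probability `rate`.
expected_after_group = rate * documents

# SAMPLE pushed in front of the GROUP: each record is kept with probability
# `rate`, and a document survives if at least one of its records does.
records_per_document = records / documents  # assumed equal for all documents
p_document_survives = 1 - (1 - rate) ** records_per_document
expected_if_pushed = documents * p_document_survives
```

Under the equal-size assumption the pushed-up estimate comes out on the order of 1,600 documents, the same order of magnitude as the 1,513 observed and far from the 22 or 23 expected.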



[jira] [Updated] (PIG-2014) SAMPLE shouldn't be pushed up

2011-05-11 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-2014:
---

Attachment: PIG-2014.2.patch

This addresses PushUpFilter and FilterAboveForeach, and fixes the SAMPLE issue.

I didn't tackle PushDownForeachFlatten -- there's a lot going on there and I'm 
not sure I understand it all. We should open a separate ticket for making sure 
that optimization does not break on nondeterministic operations.



[jira] [Updated] (PIG-2014) SAMPLE shouldn't be pushed up

2011-05-11 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-2014:
---

Status: Open  (was: Patch Available)



DOAP for Pig

2011-05-11 Thread Craig L Russell

Hi,

I noticed that Pig is not among the Apache projects at 
http://projects.apache.org/indexes/alpha.html#P

Thought you'd like to know...

Craig

Craig L Russell
Secretary, Apache Software Foundation
Chair, OpenJPA PMC
c...@apache.org http://db.apache.org/jdo


[jira] [Commented] (PIG-1683) New logical plan: Nested foreach plan fail if one inner alias is refered more than once

2011-05-11 Thread Thomas Kappler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031769#comment-13031769
 ] 

Thomas Kappler commented on PIG-1683:
-

Sorry, I forgot: this is with Pig 0.8.1 and the included Hadoop.



[jira] [Commented] (PIG-1683) New logical plan: Nested foreach plan fail if one inner alias is refered more than once

2011-05-11 Thread Thomas Kappler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031767#comment-13031767
 ] 

Thomas Kappler commented on PIG-1683:
-

I found a strange problem that looks like a special case of this issue. 
Apologies if it isn't.

I wanted to use REGEX_EXTRACT in a nested generate block where I clean up some 
strings. Pig accepts or rejects the block depending on the order of the "is 
null" condition. The simplest example I could come up with that shows the 
problem is this:

{noformat} 
a = load '1.txt' using PigStorage(',') as (a0:chararray, a1:chararray);
b = foreach a {
    b0 = TRIM(a0);
    b1 = REGEX_EXTRACT(b0, '^\\((.+)\\)$', 1);
    generate ((b1 is null) ? b0 : b1) as cleaned_name; -- FAILS
    -- generate ((b1 is not null) ? b1 : b0) as cleaned_name; -- SUCCEEDS
    -- generate ((b1 is null) ? b0 : b1); -- FAILS
}
store b into 'out';
{noformat}

1.txt is

{noformat}
foo1,bar1
 (foo2),bar2
{noformat}

The "b1 is null" variant fails with the original error message of this issue: 
"Attempt to give operator of type 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 multiple outputs. This operator does not support multiple outputs."

The inverted, logically equivalent "b1 is not null" variant succeeds.

If I replace the REGEX_EXTRACT call with a simple expression like "b1 = a0", it 
works. But the way I read the Pig Latin reference, REGEX_EXTRACT should be 
allowed at this point, since it's not a relational operator?
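
For reference, the intended result of the nested block can be sketched in Python. This is only an approximation of the Pig semantics: str.strip() stands in for TRIM, a failed match stands in for REGEX_EXTRACT returning null, and the function name clean_name is hypothetical:

```python
import re


def clean_name(a0):
    """Strip surrounding whitespace, then unwrap one pair of enclosing parentheses."""
    b0 = a0.strip()                        # TRIM(a0)
    match = re.search(r"^\((.+)\)$", b0)   # REGEX_EXTRACT(b0, '^\\((.+)\\)$', 1)
    b1 = match.group(1) if match else None
    return b0 if b1 is None else b1        # (b1 is null) ? b0 : b1


cleaned = [clean_name(value) for value in ["foo1", " (foo2)"]]
```

Both orderings of the null test describe this same computation, so both would be expected to give the same output for the sample file.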


[jira] [Commented] (PIG-2014) SAMPLE shouldn't be pushed up

2011-05-11 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031730#comment-13031730
 ] 

Dmitriy V. Ryaboy commented on PIG-2014:


Daniel,
So this is interesting. I took my fix out, left the test in, and the test still 
passed -- because, as you correctly pointed out, TestNewPlanFilterAboveForeach 
only invokes a few of the rules. If I add PushUpFilter to MyPlanOptimizer 
within that test, my new test starts failing if the fix is not present, and 
passes if the fix is present. So the PushUpFilter is definitely at least part 
of what's causing the movement of Filter in this case.

So I need to fix up PushDownForEachFlatten and FilterAboveForeach, *and* I need 
to fix my test :).

> SAMPLE shouldn't be pushed up
> -
>
> Key: PIG-2014
> URL: https://issues.apache.org/jira/browse/PIG-2014
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.0, 0.10
>Reporter: Jacob Perkins
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.9.0
>
> Attachments: PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, 
> weight:double);
> grouped   = GROUP tfidf_all BY doc_id;
> vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, 
> weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 
> 1,428,280 records. The number of reduce output records should be exactly the 
> number of documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce 
> output records should be much closer to 22 or 23 (e.g. 0.0012*18863).
> Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in 
> front of the group. It shouldn't push that filter, since the UDF is 
> non-deterministic.
> Quick fix: If you add "-t PushUpFilter" to your command line when invoking 
> Pig, this won't happen.
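
The arithmetic in the description can be checked with a small sketch. This is not Pig code and not the optimizer's logic: the helper names and the synthetic data set below are assumptions made purely for illustration. It shows that sampling whole groups keeps about 0.0012*18863 (roughly 22.6) of them, while sampling the 1,428,280 input records first keeps about 1,714 records, which is an upper bound on the number of groups that survive and is in line with the 1,513 reported above.

```python
import random


def expected_rows(n_rows: int, fraction: float) -> float:
    """Expected number of rows kept when each row survives with probability `fraction`."""
    return n_rows * fraction


def groups_if_sampled_before_group(records, fraction, rng):
    """Sample individual records first, then count distinct doc_ids (the unintended plan)."""
    return len({doc_id for doc_id, _ in records if rng.random() < fraction})


def groups_if_sampled_after_group(records, fraction, rng):
    """Group by doc_id first, then sample whole groups (what the script asks for)."""
    doc_ids = sorted({doc_id for doc_id, _ in records})
    return sum(1 for _ in doc_ids if rng.random() < fraction)


# Figures taken from the issue description.
print(expected_rows(18863, 0.0012))    # groups expected when SAMPLE runs after GROUP
print(expected_rows(1428280, 0.0012))  # records expected when SAMPLE runs before GROUP

# Hypothetical synthetic data: 2000 documents with 50 tokens each.
records = [(doc, tok) for doc in range(2000) for tok in range(50)]
rng = random.Random(0)
before = groups_if_sampled_before_group(records, 0.05, rng)
after = groups_if_sampled_after_group(records, 0.05, rng)
print(before, after)
```

On the synthetic data, sampling before grouping leaves far more groups than sampling after grouping, because a document survives as long as any one of its records is kept.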



[jira] [Created] (PIG-2060) Fix errors in pig grammars reported by ANTLRWorks

2011-05-11 Thread Gianmarco De Francisci Morales (JIRA)
Fix errors in pig grammars reported by ANTLRWorks
-

 Key: PIG-2060
 URL: https://issues.apache.org/jira/browse/PIG-2060
 Project: Pig
  Issue Type: Bug
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
Priority: Minor


There are various errors in Pig's grammar files highlighted by ANTLRWorks,
in particular on the tokens MATCHES, ANY and EVAL.
The first one should be removed, as there is already STR_OP_MATCHES;
the second one is an imaginary token that should be defined in the appropriate 
section.
I am not sure about the third one.
I have been told it is from the old parsers, but it is not used anywhere. Is 
that correct?
Is it reserved for future use? Does it have anything to do with FUNC_EVAL?
