[jira] Commented: (PIG-1661) Add alternative search-provider to Pig site

2010-10-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917133#action_12917133
 ] 

Daniel Dai commented on PIG-1661:
-

The site looks good. I would vote yes. 

> Add alternative search-provider to Pig site
> ---
>
> Key: PIG-1661
> URL: https://issues.apache.org/jira/browse/PIG-1661
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Alex Baranau
>Priority: Minor
> Attachments: PIG-1661.patch
>
>
> Use the search-hadoop.com service to make search available across Pig
> sources, mailing lists, wiki, etc.
> This was initially proposed on the user mailing list. The search service was
> already added to the site's skin (common to all Hadoop-related projects) via
> AVRO-626, so this issue is about enabling it for Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1542) log level not propagated to MR task loggers

2010-10-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917079#action_12917079
 ] 

Daniel Dai commented on PIG-1542:
-

Yes, -d xxx should be treated as -Ddebug=xxx. System properties already have
higher priority in the current code. (In my mind, we should deprecate -d in
favor of -Ddebug.)
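The mapping described in the comment can be sketched as a small option-normalization step, with the -Ddebug system property taking precedence over the legacy -d flag. This is an illustrative sketch only; names and structure are hypothetical, not Pig's actual CLI parser:

```python
def resolve_debug_level(cli_args, system_props):
    """Normalize the legacy -d flag into the debug property.

    A -Ddebug=... system property wins over -d, mirroring the comment
    that system properties already have higher priority.
    (Hypothetical sketch; not Pig's actual option parsing code.)
    """
    level = None
    args = list(cli_args)
    for i, arg in enumerate(args):
        if arg == "-d" and i + 1 < len(args):
            level = args[i + 1]          # legacy form: -d DEBUG
    if "debug" in system_props:          # -Ddebug=... takes precedence
        level = system_props["debug"]
    return level
```

With this precedence, a user passing both forms gets the -Ddebug value, matching the behavior the comment describes.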

> log level not propagated to MR task loggers
> ---
>
> Key: PIG-1542
> URL: https://issues.apache.org/jira/browse/PIG-1542
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: PIG-1542.patch, PIG-1542_1.patch, PIG-1542_2.patch
>
>
> Specifying "-d DEBUG" does not affect the logging of the MR tasks.
> This was fixed earlier in PIG-882.




[jira] Updated: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY

2010-10-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1659:


Attachment: PIG-1659-1.patch

> sortinfo is not set for store if there is a filter after ORDER BY
> -
>
> Key: PIG-1659
> URL: https://issues.apache.org/jira/browse/PIG-1659
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1659-1.patch
>
>
> This has caused 6 (of 7) failures in the Zebra test 
> TestOrderPreserveVariableTable.




[jira] Commented: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY

2010-10-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916998#action_12916998
 ] 

Daniel Dai commented on PIG-1659:
-

We should set sortInfo after optimization, so SetSortInfo should be invoked
after the optimization of the new logical plan. That code is currently missing.
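The pass itself has to walk back from each STORE through sort-preserving operators to find the ORDER BY, which is why a filter between them (the case in this issue's title) must not break the annotation. A toy model of that walk, with illustrative operator names rather than Pig's plan classes:

```python
# Operators that keep their input's sort order (illustrative set).
SORT_PRESERVING = {"filter", "limit"}

def set_sort_info(plan):
    """Annotate each 'store' in a linear plan (a list of operator names)
    with 'sorted' if, walking back through sort-preserving operators,
    its input comes from an 'order'. Toy model of a SetSortInfo-style
    pass; not Pig's implementation."""
    annotated = []
    for i, op in enumerate(plan):
        if op != "store":
            annotated.append((op, None))
            continue
        j = i - 1
        while j >= 0 and plan[j] in SORT_PRESERVING:
            j -= 1
        sorted_input = j >= 0 and plan[j] == "order"
        annotated.append(("store", "sorted" if sorted_input else None))
    return annotated
```

Under this model, a filter after ORDER BY no longer hides the sort from the store annotation.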

> sortinfo is not set for store if there is a filter after ORDER BY
> -
>
> Key: PIG-1659
> URL: https://issues.apache.org/jira/browse/PIG-1659
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> This has caused 6 (of 7) failures in the Zebra test 
> TestOrderPreserveVariableTable.




[jira] Updated: (PIG-1638) sh output gets mixed up with the grunt prompt

2010-09-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1638:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

> sh output gets mixed up with the grunt prompt
> -
>
> Key: PIG-1638
> URL: https://issues.apache.org/jira/browse/PIG-1638
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.8.0
>Reporter: niraj rai
>Assignee: niraj rai
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1638_0.patch
>
>
> Many times, the grunt prompt gets mixed up with the sh output, e.g.:
> grunt> sh ls
> 000
> autocomplete
> bin
> build
> build.xml
> grunt> CHANGES.txt
> conf
> contrib
> In the above case, grunt> is mixed up with the output.
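The interleaving happens when the shell prompt is printed before the child process has finished writing its output. A minimal fix is to block on the subprocess and flush its output before returning control to the prompt loop. Grunt is Java; this Python sketch only illustrates the ordering fix:

```python
import subprocess
import sys

def run_sh(cmd):
    """Run a shell command and return only after the child has exited
    and its output has been written, so the caller can safely print the
    next prompt. (Sketch of the idea, not grunt's actual code.)"""
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()   # blocks until the child exits
    sys.stdout.write(out.decode())
    sys.stdout.flush()            # output fully emitted before the prompt
    return proc.returncode
```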




[jira] Commented: (PIG-1638) sh output gets mixed up with the grunt prompt

2010-09-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916725#action_12916725
 ] 

Daniel Dai commented on PIG-1638:
-

+1

> sh output gets mixed up with the grunt prompt
> -
>
> Key: PIG-1638
> URL: https://issues.apache.org/jira/browse/PIG-1638
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.8.0
>Reporter: niraj rai
>Assignee: niraj rai
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1638_0.patch
>
>
> Many times, the grunt prompt gets mixed up with the sh output, e.g.:
> grunt> sh ls
> 000
> autocomplete
> bin
> build
> build.xml
> grunt> CHANGES.txt
> conf
> contrib
> In the above case, grunt> is mixed up with the output.




[jira] Resolved: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-28 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1637.
-

Hadoop Flags: [Reviewed]
  Resolution: Fixed

All tests pass except for TestSortedTableUnion / TestSortedTableUnionMergeJoin
for zebra, which already fail and will be addressed by
[PIG-1649|https://issues.apache.org/jira/browse/PIG-1649].

Patch committed to both trunk and 0.8 branch.

> Combiner not used because optimizer inserts a foreach between group and
> algebraic function
> 
>
> Key: PIG-1637
> URL: https://issues.apache.org/jira/browse/PIG-1637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1637-1.patch, PIG-1637-2.patch
>
>
> The following script does not use the combiner after the new optimization change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This is because, after the group, the optimizer detects that the group key
> is not used afterward, so it adds a foreach statement after C. This is how
> it looks after optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> That cancels the combiner optimization for D.
> The way to solve the issue is to merge the inserted C1 with D. Currently,
> we do not merge these two foreach statements. The reason is that one output
> of the first foreach (B) is referred to twice in D, and the current rule
> assumes that after the merge we would need to calculate B twice in D.
> Actually, C1 only does projection, with no calculation of B, so merging C1
> and D will not result in calculating B twice. Therefore C1 and D should be
> merged.
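The merge condition described in the issue can be sketched as a rule that permits merging two adjacent foreach operators when the first one only projects. This is a toy model of the reasoning; the real MergeForEach rule operates on Pig's logical plan:

```python
def can_merge(first_foreach, second_foreach_refs):
    """Decide whether two adjacent foreach operators may be merged.

    Merging is allowed when either (a) no output of the first foreach is
    referenced more than once by the second, or (b) the first foreach
    only projects columns and computes nothing, so duplicating its
    expressions after the merge is free. (Toy model of the issue's
    reasoning; field names are hypothetical.)
    """
    multiply_referenced = any(second_foreach_refs.count(o) > 1
                              for o in first_foreach["outputs"])
    if not multiply_referenced:
        return True
    return first_foreach["projection_only"]
```

In the script above, C1 = foreach C generate B is projection-only, so even though D references B twice, the merge is safe.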




[jira] Commented: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915959#action_12915959
 ] 

Daniel Dai commented on PIG-1651:
-

+1

> PIG class loading mishandled
> 
>
> Key: PIG-1651
> URL: https://issues.apache.org/jira/browse/PIG-1651
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1651.patch
>
>
> If zebra.jar is only registered in a Pig script but not in the CLASSPATH,
> a query using zebra fails: multiple copies of the class are loaded into the
> JVM, so a static variable set previously is not seen after an instance of
> the class is created through reflection. (After zebra.jar is specified in
> the CLASSPATH, it works fine.) The exception stack is as follows:
> Backend error message during job submission
> ---
> org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
> create input splits for: hdfs://hostname/pathto/zebra_dir :: null
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
> at 
> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
> at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
> at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
> at 
> org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
> at 
> org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
> at 
> org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
> at 
> org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
> at 
> org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
> at 
> org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
> at 
> org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
> at 
> org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
> at 
> org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
> at 
> org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
> ... 7 more
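The underlying failure mode, a class loaded twice by different class loaders so that statics set on one copy are invisible to the other, has a close Python analogue: executing the same source into two fresh module objects yields two independent sets of module-level state. This is an analogy only; the original issue concerns JVM classloaders and zebra.jar:

```python
import types

# Module-level state plays the role of a Java class's static fields.
SOURCE = (
    "counter = 0\n"
    "def bump():\n"
    "    global counter\n"
    "    counter += 1\n"
)

def load_copy(name):
    """Execute the same source into a fresh module object, analogous to
    a second classloader defining an independent copy of a class with
    its own statics."""
    mod = types.ModuleType(name)
    exec(SOURCE, mod.__dict__)
    return mod
```

Updating state through one copy leaves the other copy's state untouched, which is exactly the "static variable set previously not seen" symptom.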




[jira] Commented: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915950#action_12915950
 ] 

Daniel Dai commented on PIG-1637:
-

Yes, it could be improved per Xuefu's suggestion. Anyway, the current patch
solves the "combiner not used" issue, so I will commit this part first and
open another Jira for the improvement. Also, MergeForEach is a good example
for exercising the cloning framework
[PIG-1587|https://issues.apache.org/jira/browse/PIG-1587], so it is better to
improve it once PIG-1587 is available.

> Combiner not used because optimizer inserts a foreach between group and
> algebraic function
> 
>
> Key: PIG-1637
> URL: https://issues.apache.org/jira/browse/PIG-1637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1637-1.patch, PIG-1637-2.patch
>
>
> The following script does not use the combiner after the new optimization change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This is because, after the group, the optimizer detects that the group key
> is not used afterward, so it adds a foreach statement after C. This is how
> it looks after optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> That cancels the combiner optimization for D.
> The way to solve the issue is to merge the inserted C1 with D. Currently,
> we do not merge these two foreach statements. The reason is that one output
> of the first foreach (B) is referred to twice in D, and the current rule
> assumes that after the merge we would need to calculate B twice in D.
> Actually, C1 only does projection, with no calculation of B, so merging C1
> and D will not result in calculating B twice. Therefore C1 and D should be
> merged.




[jira] Commented: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915941#action_12915941
 ] 

Daniel Dai commented on PIG-1579:
-

Rolled back the change and ran the tests many times; all pass. It seems some
change between r990721 and now (r1002348) fixed this issue. Will roll back
the change and close the Jira.

> Intermittent unit test failure for 
> TestScriptUDF.testPythonScriptUDFNullInputOutput
> ---
>
> Key: PIG-1579
> URL: https://issues.apache.org/jira/browse/PIG-1579
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1579-1.patch
>
>
> Error message:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error 
> executing function: Traceback (most recent call last):
>   File "", line 5, in multStr
> TypeError: can't multiply sequence by non-int of type 'NoneType'
> at 
> org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
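The Jython traceback corresponds to multiplying a string by None. A null-safe version of a multStr-style UDF would guard its inputs; this is a hypothetical reconstruction for illustration, as the test's actual script is not shown here:

```python
def mult_str(s, n):
    """Repeat s n times, propagating null (None) inputs instead of
    raising "TypeError: can't multiply sequence by non-int of type
    'NoneType'". (Hypothetical sketch of the multStr UDF named in the
    traceback above.)"""
    if s is None or n is None:
        return None
    return s * n
```

Without the guard, `"ab" * None` raises exactly the TypeError seen in the error message.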




[jira] Commented: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915880#action_12915880
 ] 

Daniel Dai commented on PIG-1637:
-

test-patch result for PIG-1637-2.patch:

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.


> Combiner not used because optimizer inserts a foreach between group and
> algebraic function
> 
>
> Key: PIG-1637
> URL: https://issues.apache.org/jira/browse/PIG-1637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1637-1.patch, PIG-1637-2.patch
>
>
> The following script does not use the combiner after the new optimization change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This is because, after the group, the optimizer detects that the group key
> is not used afterward, so it adds a foreach statement after C. This is how
> it looks after optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> That cancels the combiner optimization for D.
> The way to solve the issue is to merge the inserted C1 with D. Currently,
> we do not merge these two foreach statements. The reason is that one output
> of the first foreach (B) is referred to twice in D, and the current rule
> assumes that after the merge we would need to calculate B twice in D.
> Actually, C1 only does projection, with no calculation of B, so merging C1
> and D will not result in calculating B twice. Therefore C1 and D should be
> merged.




[jira] Created: (PIG-1653) Scripting UDF fails if the path to script is an absolute path

2010-09-28 Thread Daniel Dai (JIRA)
Scripting UDF fails if the path to script is an absolute path
-

 Key: PIG-1653
 URL: https://issues.apache.org/jira/browse/PIG-1653
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0


The following script fails:
{code}
register '/homes/jianyong/pig/aaa/scriptingudf.py' using jython as myfuncs;
a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as 
(name, age, gpa:double);
b = foreach a generate myfuncs.square(gpa);
dump b;
{code}

If we change the register statement to use a relative path (such as
"aaa/scriptingudf.py"), it succeeds.
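One way the symptom can arise is if the UDF namespace lookup is derived from the raw registered path, so absolute and relative registrations resolve differently. A hedged sketch of a normalization that treats both forms identically; this is hypothetical, not Pig's actual resolution logic:

```python
import os

def script_namespace(path):
    """Derive a UDF namespace key from a registered script path so that
    absolute and relative registrations resolve to the same key.
    (Hypothetical sketch of a fix; Pig's actual scripting-UDF
    resolution differs.)"""
    return os.path.splitext(os.path.basename(path))[0]
```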




[jira] Updated: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-28 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1637:


Attachment: PIG-1637-2.patch

Fixed a bug caught by Xuefu. Reattaching the patch.

> Combiner not used because optimizer inserts a foreach between group and
> algebraic function
> 
>
> Key: PIG-1637
> URL: https://issues.apache.org/jira/browse/PIG-1637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1637-1.patch, PIG-1637-2.patch
>
>
> The following script does not use the combiner after the new optimization change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This is because, after the group, the optimizer detects that the group key
> is not used afterward, so it adds a foreach statement after C. This is how
> it looks after optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> That cancels the combiner optimization for D.
> The way to solve the issue is to merge the inserted C1 with D. Currently,
> we do not merge these two foreach statements. The reason is that one output
> of the first foreach (B) is referred to twice in D, and the current rule
> assumes that after the merge we would need to calculate B twice in D.
> Actually, C1 only does projection, with no calculation of B, so merging C1
> and D will not result in calculating B twice. Therefore C1 and D should be
> merged.




[jira] Created: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug

2010-09-28 Thread Daniel Dai (JIRA)
TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to 
estimateNumberOfReducers bug


 Key: PIG-1652
 URL: https://issues.apache.org/jira/browse/PIG-1652
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0


TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to the 
input size estimation. Here is the stack of TestSortedTableUnionMergeJoin:

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store 
alias records3
at org.apache.pig.PigServer.storeEx(PigServer.java:877)
at org.apache.pig.PigServer.store(PigServer.java:815)
at org.apache.pig.PigServer.openIterator(PigServer.java:727)
at 
org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: 
Unexpected error during execution.
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
at org.apache.pig.PigServer.storeEx(PigServer.java:873)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
Illegal character in scheme name at index 69: 
org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:
at org.apache.hadoop.fs.Path.initialize(Path.java:140)
at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:126)
at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:50)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at 
org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41)
at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
at 
org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at 
index 69: 
org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:
at java.net.URI$Parser.fail(URI.java:2809)
at java.net.URI$Parser.checkChars(URI.java:2982)
at java.net.URI$Parser.parse(URI.java:3009)
at java.net.URI.&lt;init&gt;(URI.java:736)
at org.apache.hadoop.fs.Path.initialize(Path.java:137)

The reason is we are trying to do globStatus 
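The "Illegal character in scheme name" failure points at a comma-joined list of input locations being handed to globStatus as a single path, so the comma-separated text is parsed as one bogus URI. A sketch of splitting the specification into individual paths before any per-path URI parsing; the path values here are hypothetical, and the actual fix would live in JobControlCompiler's input-size estimation:

```python
def split_input_spec(spec):
    """Split a comma-joined input specification into individual paths
    before globbing, so 'name,file:...' is never treated as one URI.
    (Toy illustration of the failure mode; not Pig's code.)"""
    return [p for p in spec.split(",") if p]
```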

[jira] Updated: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-27 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1637:


Attachment: PIG-1637-1.patch

> Combiner not used because optimizer inserts a foreach between group and
> algebraic function
> 
>
> Key: PIG-1637
> URL: https://issues.apache.org/jira/browse/PIG-1637
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1637-1.patch
>
>
> The following script does not use the combiner after the new optimization change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This is because, after the group, the optimizer detects that the group key
> is not used afterward, so it adds a foreach statement after C. This is how
> it looks after optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> That cancels the combiner optimization for D.
> The way to solve the issue is to merge the inserted C1 with D. Currently,
> we do not merge these two foreach statements. The reason is that one output
> of the first foreach (B) is referred to twice in D, and the current rule
> assumes that after the merge we would need to calculate B twice in D.
> Actually, C1 only does projection, with no calculation of B, so merging C1
> and D will not result in calculating B twice. Therefore C1 and D should be
> merged.




[jira] Commented: (PIG-1647) Logical simplifier throws an NPE

2010-09-27 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915365#action_12915365
 ] 

Daniel Dai commented on PIG-1647:
-

+1. Please commit.

> Logical simplifier throws an NPE
> ---
>
> Key: PIG-1647
> URL: https://issues.apache.org/jira/browse/PIG-1647
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1647.patch, PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw an NPE.




[jira] Resolved: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-26 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1644.
-

Hadoop Flags: [Reviewed]
  Resolution: Fixed

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All tests pass. 

Patch committed to both trunk and 0.8 branch.

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch, PIG-1644-2.patch, PIG-1644-3.patch, 
> PIG-1644-4.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we should save the position of 
> the edge at its origin and destination, and use that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}




[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-26 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1644:


Attachment: PIG-1644-4.patch

PIG-1644-4.patch fixes findbugs warnings and additional unit test failures.

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch, PIG-1644-2.patch, PIG-1644-3.patch, 
> PIG-1644-4.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we should save the position of 
> the edge at its origin and destination, and use that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}




[jira] Resolved: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-26 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1643.
-

Release Note: PIG-1643.4.patch committed to both trunk and 0.8 branch.
  Resolution: Fixed

> join fails for a query with input having 'load using pigstorage without 
> schema' + 'foreach'
> ---
>
> Key: PIG-1643
> URL: https://issues.apache.org/jira/browse/PIG-1643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch, 
> PIG-1643.4.patch
>
>
> {code}
> l1 = load 'std.txt';
> l2 = load 'std.txt'; 
> f1 = foreach l1 generate $0 as abc, $1 as  def;
> -- j =  join f1 by $0, l2 by $0 using 'replicated';
> -- j =  join l2 by $0, f1 by $0 using 'replicated';
> j =  join l2 by $0, f1 by $0 ;
> dump j;
> {code}
> the error -
> {code}
> 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2044: The type null cannot be collected as a Key type
> {code}
> The MR plan from explain  -
> {code}
> #--
> # Map Reduce Plan  
> #--
> MapReduce node scope-21
> Map Plan
> Union[tuple] - scope-22
> |
> |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
> |   |   |
> |   |   Project[bytearray][0] - scope-12
> |   |
> |   |---l2: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-0
> |
> |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
> |   |
> |   Project[NULL][0] - scope-14
> |
> |---f1: New For Each(false,false)[bag] - scope-6
> |   |
> |   Project[bytearray][0] - scope-2
> |   |
> |   Project[bytearray][1] - scope-4
> |
> |---l1: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-1
> Reduce Plan
> j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
> |
> |---POJoinPackage(true,true)[tuple] - scope-23
> Global sort: false
> 
> {code}




[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915037#action_12915037
 ] 

Daniel Dai commented on PIG-1643:
-

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All tests pass.

> join fails for a query with input having 'load using pigstorage without 
> schema' + 'foreach'
> ---
>
> Key: PIG-1643
> URL: https://issues.apache.org/jira/browse/PIG-1643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch, 
> PIG-1643.4.patch
>
>
> {code}
> l1 = load 'std.txt';
> l2 = load 'std.txt'; 
> f1 = foreach l1 generate $0 as abc, $1 as  def;
> -- j =  join f1 by $0, l2 by $0 using 'replicated';
> -- j =  join l2 by $0, f1 by $0 using 'replicated';
> j =  join l2 by $0, f1 by $0 ;
> dump j;
> {code}
> the error -
> {code}
> 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2044: The type null cannot be collected as a Key type
> {code}
> The MR plan from explain  -
> {code}
> #--
> # Map Reduce Plan  
> #--
> MapReduce node scope-21
> Map Plan
> Union[tuple] - scope-22
> |
> |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
> |   |   |
> |   |   Project[bytearray][0] - scope-12
> |   |
> |   |---l2: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-0
> |
> |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
> |   |
> |   Project[NULL][0] - scope-14
> |
> |---f1: New For Each(false,false)[bag] - scope-6
> |   |
> |   Project[bytearray][0] - scope-2
> |   |
> |   Project[bytearray][1] - scope-4
> |
> |---l1: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-1
> Reduce Plan
> j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
> |
> |---POJoinPackage(true,true)[tuple] - scope-23
> Global sort: false
> 
> {code}




[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1643:


Attachment: PIG-1643.3.patch

PIG-1643.3.patch is more general than PIG-1643.2.patch. It solves this null 
schema issue for all expressions.

> join fails for a query with input having 'load using pigstorage without 
> schema' + 'foreach'
> ---
>
> Key: PIG-1643
> URL: https://issues.apache.org/jira/browse/PIG-1643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch
>
>
> {code}
> l1 = load 'std.txt';
> l2 = load 'std.txt'; 
> f1 = foreach l1 generate $0 as abc, $1 as  def;
> -- j =  join f1 by $0, l2 by $0 using 'replicated';
> -- j =  join l2 by $0, f1 by $0 using 'replicated';
> j =  join l2 by $0, f1 by $0 ;
> dump j;
> {code}
> the error -
> {code}
> 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2044: The type null cannot be collected as a Key type
> {code}
> The MR plan from explain  -
> {code}
> #--
> # Map Reduce Plan  
> #--
> MapReduce node scope-21
> Map Plan
> Union[tuple] - scope-22
> |
> |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
> |   |   |
> |   |   Project[bytearray][0] - scope-12
> |   |
> |   |---l2: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-0
> |
> |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
> |   |
> |   Project[NULL][0] - scope-14
> |
> |---f1: New For Each(false,false)[bag] - scope-6
> |   |
> |   Project[bytearray][0] - scope-2
> |   |
> |   Project[bytearray][1] - scope-4
> |
> |---l1: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-1
> Reduce Plan
> j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
> |
> |---POJoinPackage(true,true)[tuple] - scope-23
> Global sort: false
> 
> {code}




[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF

2010-09-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1639:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

> New logical plan: PushUpFilter should not push before group/cogroup if filter 
> condition contains UDF
> 
>
> Key: PIG-1639
> URL: https://issues.apache.org/jira/browse/PIG-1639
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1639-1.patch
>
>
> The following script fail:
> {code}
> a = load 'file' AS (f1, f2, f3);
> b = group a by f1;
> c = filter b by COUNT(a) > 1;
> dump c;
> {code}




[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1643:


Attachment: PIG-1643.2.patch

Attached a fix.

> join fails for a query with input having 'load using pigstorage without 
> schema' + 'foreach'
> ---
>
> Key: PIG-1643
> URL: https://issues.apache.org/jira/browse/PIG-1643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1643.1.patch, PIG-1643.2.patch
>
>
> {code}
> l1 = load 'std.txt';
> l2 = load 'std.txt'; 
> f1 = foreach l1 generate $0 as abc, $1 as  def;
> -- j =  join f1 by $0, l2 by $0 using 'replicated';
> -- j =  join l2 by $0, f1 by $0 using 'replicated';
> j =  join l2 by $0, f1 by $0 ;
> dump j;
> {code}
> the error -
> {code}
> 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2044: The type null cannot be collected as a Key type
> {code}
> The MR plan from explain  -
> {code}
> #--
> # Map Reduce Plan  
> #--
> MapReduce node scope-21
> Map Plan
> Union[tuple] - scope-22
> |
> |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
> |   |   |
> |   |   Project[bytearray][0] - scope-12
> |   |
> |   |---l2: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-0
> |
> |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
> |   |
> |   Project[NULL][0] - scope-14
> |
> |---f1: New For Each(false,false)[bag] - scope-6
> |   |
> |   Project[bytearray][0] - scope-2
> |   |
> |   Project[bytearray][1] - scope-4
> |
> |---l1: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-1
> Reduce Plan
> j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
> |
> |---POJoinPackage(true,true)[tuple] - scope-23
> Global sort: false
> 
> {code}




[jira] Reopened: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai reopened PIG-1643:
-


The following script does not produce the right result after the patch:
{code}
a = load '/grid/2/dev/pigqa/in/singlefile/studenttab10k';
b = foreach a generate *;
store b into '/grid/2/dev/pigqa/out/log/hadoopqa.1285338379/Foreach_2.out';
{code}

> join fails for a query with input having 'load using pigstorage without 
> schema' + 'foreach'
> ---
>
> Key: PIG-1643
> URL: https://issues.apache.org/jira/browse/PIG-1643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1643.1.patch, PIG-1643.2.patch
>
>
> {code}
> l1 = load 'std.txt';
> l2 = load 'std.txt'; 
> f1 = foreach l1 generate $0 as abc, $1 as  def;
> -- j =  join f1 by $0, l2 by $0 using 'replicated';
> -- j =  join l2 by $0, f1 by $0 using 'replicated';
> j =  join l2 by $0, f1 by $0 ;
> dump j;
> {code}
> the error -
> {code}
> 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2044: The type null cannot be collected as a Key type
> {code}
> The MR plan from explain  -
> {code}
> #--
> # Map Reduce Plan  
> #--
> MapReduce node scope-21
> Map Plan
> Union[tuple] - scope-22
> |
> |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
> |   |   |
> |   |   Project[bytearray][0] - scope-12
> |   |
> |   |---l2: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-0
> |
> |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
> |   |
> |   Project[NULL][0] - scope-14
> |
> |---f1: New For Each(false,false)[bag] - scope-6
> |   |
> |   Project[bytearray][0] - scope-2
> |   |
> |   Project[bytearray][1] - scope-4
> |
> |---l1: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-1
> Reduce Plan
> j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
> |
> |---POJoinPackage(true,true)[tuple] - scope-23
> Global sort: false
> 
> {code}




[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed

2010-09-24 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914675#action_12914675
 ] 

Daniel Dai commented on PIG-1635:
-

+1 for commit.

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplificaion the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding ordering change, an example is that 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even without possible simplification, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Even though the ordering change in this case, and probably in most other 
> cases, does not make any difference, users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR; and the 
> ordering may be intended by the application designer to maximize the chances 
> of short-circuiting the composite boolean evaluation. 
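The intended folding can be illustrated with a small, self-contained sketch. This is not Pig's real simplifier; the `Expr`/`Const`/`Leaf`/`And`/`Or` classes below are made up for illustration. Note that eliminating a branch (as in `false AND x`) is only safe when the dropped operand has no side effects, which is exactly why stateful UDFs make ordering and elimination delicate:

```java
import java.util.*;

public class SimplifierSketch {
    interface Expr {}
    static class Const implements Expr { final boolean v; Const(boolean v){this.v=v;}
        public String toString(){ return Boolean.toString(v); } }
    static class Leaf implements Expr { final String name; Leaf(String n){name=n;}
        public String toString(){ return name; } }
    static class And implements Expr { final Expr l, r; And(Expr l, Expr r){this.l=l;this.r=r;}
        public String toString(){ return "(" + l + " AND " + r + ")"; } }
    static class Or implements Expr { final Expr l, r; Or(Expr l, Expr r){this.l=l;this.r=r;}
        public String toString(){ return "(" + l + " OR " + r + ")"; } }

    // Fold boolean constants away under AND/OR; when both operands survive,
    // they are rebuilt in their original order (no reordering).
    static Expr simplify(Expr e) {
        if (e instanceof And) {
            Expr l = simplify(((And) e).l), r = simplify(((And) e).r);
            if (l instanceof Const) return ((Const) l).v ? r : l; // true AND x -> x; false AND x -> false
            if (r instanceof Const) return ((Const) r).v ? l : r;
            return new And(l, r);
        }
        if (e instanceof Or) {
            Expr l = simplify(((Or) e).l), r = simplify(((Or) e).r);
            if (l instanceof Const) return ((Const) l).v ? l : r; // false OR x -> x; true OR x -> true
            if (r instanceof Const) return ((Const) r).v ? r : l;
            return new Or(l, r);
        }
        return e;
    }

    public static void main(String[] args) {
        // (f1 > 1) AND (1 == 1)  should reduce to  f1 > 1
        System.out.println(simplify(new And(new Leaf("f1 > 1"), new Const(true))));
        // (f1 > 1) OR (1 == 0)   should reduce to  f1 > 1
        System.out.println(simplify(new Or(new Leaf("f1 > 1"), new Const(false))));
    }
}
```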




[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-24 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1644:


Attachment: PIG-1644-3.patch

Find one bug introduced by refactory. Attach PIG-1644-3.patch with the fix, and 
running the tests again.

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch, PIG-1644-2.patch, PIG-1644-3.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we should save the position of 
> the edge at its origin and destination, and use that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}




[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed

2010-09-24 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914662#action_12914662
 ] 

Daniel Dai commented on PIG-1635:
-

+1, the patch looks good. Also, can you review all connect/disconnect usage in 
ExpressionSimplifer, according to 
[PIG-1644|https://issues.apache.org/jira/browse/PIG-1644]? I see lots of misuse 
in other rules.

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplificaion the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding ordering change, an example is that 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even without possible simplification, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Even though the ordering change in this case, and probably in most other 
> cases, does not make any difference, users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR; and the 
> ordering may be intended by the application designer to maximize the chances 
> of short-circuiting the composite boolean evaluation. 




[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1644:


Attachment: PIG-1644-2.patch

Attached the patch with new methods and a refactoring of the existing code.

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch, PIG-1644-2.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we should save the position of 
> the edge at its origin and destination, and use that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}




[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914317#action_12914317
 ] 

Daniel Dai commented on PIG-1644:
-

After looking into the existing code, it seems insertBetween is a more useful 
method. So I want to drop insertBefore/insertAfter and add insertBetween:
{code}
insertBetween(Operator pred, Operator operatorToInsert, Operator succ)
{code}
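For illustration, the proposed insertBetween can be expressed in terms of the positional disconnect/connect pattern described in the issue. The classes below (`Operator`, `Pair`, and the plan methods) are a toy stand-in, not Pig's actual OperatorPlan API:

```java
import java.util.*;

public class InsertBetweenSketch {
    // Toy operator: ordered predecessor/successor lists stand in for the plan's edges.
    static class Operator {
        final String name;
        final List<Operator> succs = new ArrayList<>();
        final List<Operator> preds = new ArrayList<>();
        Operator(String name) { this.name = name; }
        public String toString() { return name; }
    }
    static class Pair { final int first, second; Pair(int f, int s) { first = f; second = s; } }

    // Remove the edge pred->succ and return its position at both endpoints.
    static Pair disconnect(Operator pred, Operator succ) {
        int fromPos = pred.succs.indexOf(succ);
        int toPos = succ.preds.indexOf(pred);
        pred.succs.remove(fromPos);
        succ.preds.remove(toPos);
        return new Pair(fromPos, toPos);
    }

    // Create an edge at the given position on each endpoint.
    static void connect(Operator from, int fromPos, Operator to, int toPos) {
        from.succs.add(fromPos, to);
        to.preds.add(toPos, from);
    }

    // The proposed helper: splice operatorToInsert into the pred->succ edge,
    // reusing the saved positions so the ordering of other edges is preserved.
    static void insertBetween(Operator pred, Operator operatorToInsert, Operator succ) {
        Pair pos = disconnect(pred, succ);
        connect(pred, pos.first, operatorToInsert, 0);
        connect(operatorToInsert, 0, succ, pos.second);
    }

    public static void main(String[] args) {
        Operator a = new Operator("A"), b = new Operator("B"), c = new Operator("C");
        connect(a, 0, b, 0);
        insertBetween(a, c, b);
        System.out.println(a.succs); // [C]
        System.out.println(c.succs); // [B]
        System.out.println(b.preds); // [C]
    }
}
```

Bundling the position bookkeeping inside one helper is what removes the class of bugs the issue describes: callers can no longer pass the wrong saved position to connect.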

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we should save the position of 
> the edge at its origin and destination, and use that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}




[jira] Commented: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914154#action_12914154
 ] 

Daniel Dai commented on PIG-1639:
-

+1 if all tests pass.

> New logical plan: PushUpFilter should not push before group/cogroup if filter 
> condition contains UDF
> 
>
> Key: PIG-1639
> URL: https://issues.apache.org/jira/browse/PIG-1639
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1639-1.patch
>
>
> The following script fail:
> {code}
> a = load 'file' AS (f1, f2, f3);
> b = group a by f1;
> c = filter b by COUNT(a) > 1;
> dump c;
> {code}




[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF

2010-09-23 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1639:


Summary: New logical plan: PushUpFilter should not push before 
group/cogroup if filter condition contains UDF  (was: New logical plan: 
PushUpFilter should not optimize if filter condition contains UDF)

> New logical plan: PushUpFilter should not push before group/cogroup if filter 
> condition contains UDF
> 
>
> Key: PIG-1639
> URL: https://issues.apache.org/jira/browse/PIG-1639
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1639-1.patch
>
>
> The following script fail:
> {code}
> a = load 'file' AS (f1, f2, f3);
> b = group a by f1;
> c = filter b by COUNT(a) > 1;
> dump c;
> {code}
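The fix amounts to a guard: before pushing a filter above a group/cogroup, check whether its condition mentions a UDF (COUNT operates on the grouped bag, so pushing the filter below the group would break it). A minimal sketch with hypothetical toy classes, not Pig's real optimizer API:

```java
import java.util.*;

public class PushUpFilterGuard {
    // Toy expression tree node: either a UDF call or a plain operator/operand.
    static class Node {
        final String op; final boolean isUdf; final List<Node> kids;
        Node(String op, boolean isUdf, Node... kids) {
            this.op = op; this.isUdf = isUdf; this.kids = Arrays.asList(kids);
        }
    }

    // PushUpFilter should only push a filter above group/cogroup when the
    // filter condition contains no UDF anywhere in its expression tree.
    static boolean containsUdf(Node e) {
        if (e.isUdf) return true;
        for (Node k : e.kids) if (containsUdf(k)) return true;
        return false;
    }

    public static void main(String[] args) {
        // COUNT(a) > 1 contains a UDF, so the filter must stay after the group.
        Node cond = new Node(">", false,
                new Node("COUNT", true), new Node("1", false));
        System.out.println(containsUdf(cond) ? "do not push" : "safe to push");
    }
}
```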




[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914147#action_12914147
 ] 

Daniel Dai commented on PIG-1644:
-

Yes, I think we can do replace/remove/insert. They should be simple and clear 
enough to use. Here are the new methods to add to OperatorPlan:
{code}
replace(Operator oldOperator, Operator newOperator)
remove(Operator operatorToRemove) // Connect all its successors to its 
predecessor / connect all its predecessors to its successor
insertBefore(Operator operatorToInsert, Operator pos) // Insert 
operatorToInsert before pos, connect all pos's predecessors to operatorToInsert
insertAfter(Operator operatorToInsert, Operator pos) // Insert operatorToInsert 
after pos, connect operatorToInsert to all pos's successors
{code}

How does that sound?

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we should save the position of 
> the edge at its origin and destination, and use that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914126#action_12914126
 ] 

Daniel Dai commented on PIG-1643:
-

+1 if tests pass.

> join fails for a query with input having 'load using pigstorage without 
> schema' + 'foreach'
> ---
>
> Key: PIG-1643
> URL: https://issues.apache.org/jira/browse/PIG-1643
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1643.1.patch
>
>
> {code}
> l1 = load 'std.txt';
> l2 = load 'std.txt'; 
> f1 = foreach l1 generate $0 as abc, $1 as  def;
> -- j =  join f1 by $0, l2 by $0 using 'replicated';
> -- j =  join l2 by $0, f1 by $0 using 'replicated';
> j =  join l2 by $0, f1 by $0 ;
> dump j;
> {code}
> the error -
> {code}
> 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2044: The type null cannot be collected as a Key type
> {code}
> The MR plan from explain  -
> {code}
> #--
> # Map Reduce Plan  
> #--
> MapReduce node scope-21
> Map Plan
> Union[tuple] - scope-22
> |
> |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
> |   |   |
> |   |   Project[bytearray][0] - scope-12
> |   |
> |   |---l2: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-0
> |
> |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
> |   |
> |   Project[NULL][0] - scope-14
> |
> |---f1: New For Each(false,false)[bag] - scope-6
> |   |
> |   Project[bytearray][0] - scope-2
> |   |
> |   Project[bytearray][1] - scope-4
> |
> |---l1: 
> Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage)
>  - scope-1
> Reduce Plan
> j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
> |
> |---POJoinPackage(true,true)[tuple] - scope-23
> Global sort: false
> 
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-22 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1644:


Attachment: PIG-1644-1.patch

Attaching a patch to address all such places in the new logical plan, except 
for ExpressionSimplifier. There is work underway for ExpressionSimplifier 
([PIG-1635|https://issues.apache.org/jira/browse/PIG-1635]) that includes some 
of these changes, and I don't want to conflict with that patch. After PIG-1635, 
we can also review the connect/disconnect usage in ExpressionSimplifier.

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we must save the position of the 
> edge at its origin and destination, and reuse that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-22 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1644:


Attachment: PIG-1644-1.patch

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we must save the position of the 
> edge at its origin and destination, and reuse that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-22 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1644:


Attachment: (was: PIG-1644-1.patch)

> New logical plan: Plan.connect with position is misused in some places
> --
>
> Key: PIG-1644
> URL: https://issues.apache.org/jira/browse/PIG-1644
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1644-1.patch
>
>
> When we replace/remove/insert a node, we use the disconnect/connect methods 
> of OperatorPlan. When we disconnect an edge, we must save the position of the 
> edge at its origin and destination, and reuse that position when connecting 
> to the new predecessor/successor. Some of the patterns are:
> Insert a new node:
> {code}
> Pair pos = plan.disconnect(pred, succ);
> plan.connect(pred, pos.first, newnode, 0);
> plan.connect(newnode, 0, succ, pos.second);
> {code}
> Remove a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToRemove);
> Pair pos2 = plan.disconnect(nodeToRemove, succ);
> plan.connect(pred, pos1.first, succ, pos2.second);
> {code}
> Replace a node:
> {code}
> Pair pos1 = plan.disconnect(pred, nodeToReplace);
> Pair pos2 = plan.disconnect(nodeToReplace, succ);
> plan.connect(pred, pos1.first, newNode, pos1.second);
> plan.connect(newNode, pos2.first, succ, pos2.second);
> {code}
> There are a couple of places where we do not follow this pattern, which 
> results in errors. For example, the following script fails:
> {code}
> a = load '1.txt' as (a0, a1, a2, a3);
> b = foreach a generate a0, a1, a2;
> store b into 'aaa';
> c = order b by a2;
> d = foreach c generate a2;
> store d into 'bbb';
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-22 Thread Daniel Dai (JIRA)
New logical plan: Plan.connect with position is misused in some places
--

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


When we replace/remove/insert a node, we use the disconnect/connect methods of 
OperatorPlan. When we disconnect an edge, we must save the position of the edge 
at its origin and destination, and reuse that position when connecting to the 
new predecessor/successor. Some of the patterns are:

Insert a new node:
{code}
Pair pos = plan.disconnect(pred, succ);
plan.connect(pred, pos.first, newnode, 0);
plan.connect(newnode, 0, succ, pos.second);
{code}

Remove a node:
{code}
Pair pos1 = plan.disconnect(pred, nodeToRemove);
Pair pos2 = plan.disconnect(nodeToRemove, succ);
plan.connect(pred, pos1.first, succ, pos2.second);
{code}

Replace a node:
{code}
Pair pos1 = plan.disconnect(pred, nodeToReplace);
Pair pos2 = plan.disconnect(nodeToReplace, succ);
plan.connect(pred, pos1.first, newNode, pos1.second);
plan.connect(newNode, pos2.first, succ, pos2.second);
{code}

There are a couple of places where we do not follow this pattern, which results 
in errors. For example, the following script fails:
{code}
a = load '1.txt' as (a0, a1, a2, a3);
b = foreach a generate a0, a1, a2;
store b into 'aaa';
c = order b by a2;
d = foreach c generate a2;
store d into 'bbb';
{code}
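The remove pattern above can be sketched on a toy positional plan. PosPlan, its String nodes, and the int[] returned by disconnect are hypothetical illustrations of the idea, not Pig's actual OperatorPlan API:

```java
import java.util.*;

// Minimal positional plan: each node keeps ordered output/input lists, and
// disconnect() reports the position the edge occupied on each side.
class PosPlan {
    private final Map<String, List<String>> outs = new LinkedHashMap<>();
    private final Map<String, List<String>> ins  = new LinkedHashMap<>();

    void add(String op) {
        outs.putIfAbsent(op, new ArrayList<>());
        ins.putIfAbsent(op, new ArrayList<>());
    }

    void connect(String from, int fromPos, String to, int toPos) {
        outs.get(from).add(fromPos, to);
        ins.get(to).add(toPos, from);
    }

    // Returns {position in from's outputs, position in to's inputs}.
    int[] disconnect(String from, String to) {
        int fromPos = outs.get(from).indexOf(to);
        int toPos   = ins.get(to).indexOf(from);
        outs.get(from).remove(fromPos);
        ins.get(to).remove(toPos);
        return new int[]{fromPos, toPos};
    }

    List<String> outputs(String op) { return outs.get(op); }
    List<String> inputs(String op)  { return ins.get(op); }

    // The "remove a node" pattern from the issue: splice node out while
    // reusing the saved positions, so sibling edges keep their order.
    void remove(String pred, String node, String succ) {
        int[] pos1 = disconnect(pred, node);
        int[] pos2 = disconnect(node, succ);
        connect(pred, pos1[0], succ, pos2[1]);
    }
}
```

If pred had outputs [node, other] and succ had inputs [x, node], removing node leaves pred with [succ, other] and succ with [x, pred]: the surviving edges keep their slots, which is exactly what breaks when the positions are not saved.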

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1636) Scalar fail if the scalar variable is generated by limit

2010-09-22 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1636.
-

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

> Scalar fail if the scalar variable is generated by limit
> 
>
> Key: PIG-1636
> URL: https://issues.apache.org/jira/browse/PIG-1636
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1636-1.patch
>
>
> The following script fails:
> {code}
> a = load 'studenttab10k' as (name: chararray, age: int, gpa: float);
> b = group a all;
> c = foreach b generate SUM(a.age) as total;
> c1= limit c 1;
> d = foreach a generate name, age/(double)c1.total as d_sum;
> store d into '111';
> {code}
> The problem is that d holds a reference to c1. In the optimizer, we push the 
> limit before the foreach; d still references the limit, and we get the wrong 
> schema for the scalar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1636) Scalar fail if the scalar variable is generated by limit

2010-09-22 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913714#action_12913714
 ] 

Daniel Dai commented on PIG-1636:
-

test-patch result:
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All tests pass.

> Scalar fail if the scalar variable is generated by limit
> 
>
> Key: PIG-1636
> URL: https://issues.apache.org/jira/browse/PIG-1636
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1636-1.patch
>
>
> The following script fails:
> {code}
> a = load 'studenttab10k' as (name: chararray, age: int, gpa: float);
> b = group a all;
> c = foreach b generate SUM(a.age) as total;
> c1= limit c 1;
> d = foreach a generate name, age/(double)c1.total as d_sum;
> store d into '111';
> {code}
> The problem is that d holds a reference to c1. In the optimizer, we push the 
> limit before the foreach; d still references the limit, and we get the wrong 
> schema for the scalar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1605.
-

Hadoop Flags: [Reviewed]
  Resolution: Fixed

The release audit warning is due to jdiff; no new files were added. Patch 
committed to both trunk and 0.8 branch.

> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1605-1.patch, PIG-1605-2.patch
>
>
> In the scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
> problem by adding an LOScalar operator. Here is a different approach: we add 
> a soft link to the plan, and the soft link is visible only to the walkers. 
> By doing this, we make sure we visit the LOStore that generates the scalar 
> first, and then the LOForEach that uses it. No other part of the logical 
> plan knows about the soft link. The benefits are:
> 1. The logical plan does not need to deal with LOScalar, which makes it 
> cleaner.
> 2. Conceptually, a scalar dependency is different. A regular link represents 
> data flow in the pipeline; a scalar dependency means an operator depends on 
> a file generated by another operator. It is a different type of data 
> dependency.
> 3. Soft links can solve other dependency problems in the future. If we 
> introduce another UDF that depends on a file generated by another operator, 
> we can use the same mechanism.
> 4. With soft links, we can use scalars coming from different sources in the 
> same statement, which in my mind is not a rare use case (e.g. D = foreach C 
> generate c0/A.total, c1/B.count;).
> Currently, there are two cases where we can use a soft link:
> 1. scalar dependency, where the ReadScalar UDF uses a file generated by an 
> LOStore
> 2. store-load dependency, where we load a file that is generated by a store 
> in the same script. This happens in the multi-store case. Currently we solve 
> it with a regular link; it would be better to use a soft link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1605:


Attachment: PIG-1605-2.patch

PIG-1605-2.patch fixes findbugs warnings.

test-patch result:
 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 455 release 
audit warnings (more than the trunk's current 453 warnings).

> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1605-1.patch, PIG-1605-2.patch
>
>
> In the scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
> problem by adding an LOScalar operator. Here is a different approach: we add 
> a soft link to the plan, and the soft link is visible only to the walkers. 
> By doing this, we make sure we visit the LOStore that generates the scalar 
> first, and then the LOForEach that uses it. No other part of the logical 
> plan knows about the soft link. The benefits are:
> 1. The logical plan does not need to deal with LOScalar, which makes it 
> cleaner.
> 2. Conceptually, a scalar dependency is different. A regular link represents 
> data flow in the pipeline; a scalar dependency means an operator depends on 
> a file generated by another operator. It is a different type of data 
> dependency.
> 3. Soft links can solve other dependency problems in the future. If we 
> introduce another UDF that depends on a file generated by another operator, 
> we can use the same mechanism.
> 4. With soft links, we can use scalars coming from different sources in the 
> same statement, which in my mind is not a rare use case (e.g. D = foreach C 
> generate c0/A.total, c1/B.count;).
> Currently, there are two cases where we can use a soft link:
> 1. scalar dependency, where the ReadScalar UDF uses a file generated by an 
> LOStore
> 2. store-load dependency, where we load a file that is generated by a store 
> in the same script. This happens in the multi-store case. Currently we solve 
> it with a regular link; it would be better to use a soft link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1598) Pig gobbles up error messages - Part 2

2010-09-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1598:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch looks good. Committed to both trunk and 0.8 branch.

> Pig gobbles up error messages - Part 2
> --
>
> Key: PIG-1598
> URL: https://issues.apache.org/jira/browse/PIG-1598
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: PIG-1598_0.patch
>
>
> Another case of PIG-1531 .

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF

2010-09-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1639:


Description: 
The following script fails:
{code}
a = load 'file' AS (f1, f2, f3);
b = group a by f1;
c = filter b by COUNT(a) > 1;
dump c;
{code}

> New logical plan: PushUpFilter should not optimize if filter condition 
> contains UDF
> ---
>
> Key: PIG-1639
> URL: https://issues.apache.org/jira/browse/PIG-1639
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> The following script fails:
> {code}
> a = load 'file' AS (f1, f2, f3);
> b = group a by f1;
> c = filter b by COUNT(a) > 1;
> dump c;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF

2010-09-21 Thread Daniel Dai (JIRA)
New logical plan: PushUpFilter should not optimize if filter condition contains 
UDF
---

 Key: PIG-1639
 URL: https://issues.apache.org/jira/browse/PIG-1639
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1636) Scalar fail if the scalar variable is generated by limit

2010-09-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1636:


Attachment: PIG-1636-1.patch

This patch depends on PIG-1605.

> Scalar fail if the scalar variable is generated by limit
> 
>
> Key: PIG-1636
> URL: https://issues.apache.org/jira/browse/PIG-1636
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1636-1.patch
>
>
> The following script fails:
> {code}
> a = load 'studenttab10k' as (name: chararray, age: int, gpa: float);
> b = group a all;
> c = foreach b generate SUM(a.age) as total;
> c1= limit c 1;
> d = foreach a generate name, age/(double)c1.total as d_sum;
> store d into '111';
> {code}
> The problem is that d holds a reference to c1. In the optimizer, we push the 
> limit before the foreach; d still references the limit, and we get the wrong 
> schema for the scalar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-21 Thread Daniel Dai (JIRA)
Combiner not used because optimizer inserts a foreach between group and 
algebraic function


 Key: PIG-1637
 URL: https://issues.apache.org/jira/browse/PIG-1637
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


The following script does not use the combiner after the new optimization 
changes.

{code}
A = load ':INPATH:/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, 
(double)estimated_revenue as estimated_revenue;
C = group B all; 
D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into ':OUTPATH:';
{code}

This is because, after the group, the optimizer detects that the group key is 
not used afterward and adds a foreach statement after C. This is how the 
script looks after optimization:
{code}
A = load ':INPATH:/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, 
(double)estimated_revenue as estimated_revenue;
C = group B all; 
C1 = foreach C generate B;
D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into ':OUTPATH:';
{code}

That cancels the combiner optimization for D. 

The way to solve the issue is to merge the inserted C1 with D. Currently, we do 
not merge these two foreach statements, because one output of the first foreach 
(B) is referenced twice in D, and the rule assumes that after the merge we 
would need to calculate B twice in D. In fact, C1 only does projection and 
performs no calculation on B, so merging C1 and D will not cause B to be 
calculated twice. C1 and D should therefore be merged.
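The merge-safety reasoning above can be sketched as a small check. ForeachMergeCheck and its inputs are hypothetical simplifications for illustration, not the actual optimizer rule:

```java
import java.util.*;

// Sketch of the merge-safety idea: an inserted foreach (like C1) that only
// projects columns can be merged into the following foreach even when one
// of its outputs (B) is referenced more than once, because no computation
// would be duplicated by the merge.
class ForeachMergeCheck {
    // innerOutputs maps each output of the first foreach to whether it is
    // a pure projection (true) or a computed expression (false).
    // outerRefs lists every reference the second foreach makes.
    static boolean canMerge(Map<String, Boolean> innerOutputs,
                            List<String> outerRefs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String r : outerRefs) counts.merge(r, 1, Integer::sum);
        // Merging is unsafe only when a multiply-referenced output is a
        // computation, since the merge would evaluate it once per reference.
        for (Map.Entry<String, Boolean> e : innerOutputs.entrySet()) {
            boolean isProjection = e.getValue();
            if (counts.getOrDefault(e.getKey(), 0) > 1 && !isProjection) {
                return false;
            }
        }
        return true;
    }
}
```

With this check, the C1/D case above (B is a bare projection referenced twice) is mergeable, while a computed output referenced twice would still block the merge.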

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1636) Scalar fail if the scalar variable is generated by limit

2010-09-21 Thread Daniel Dai (JIRA)
Scalar fail if the scalar variable is generated by limit


 Key: PIG-1636
 URL: https://issues.apache.org/jira/browse/PIG-1636
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


The following script fails:
{code}
a = load 'studenttab10k' as (name: chararray, age: int, gpa: float);
b = group a all;
c = foreach b generate SUM(a.age) as total;
c1= limit c 1;
d = foreach a generate name, age/(double)c1.total as d_sum;
store d into '111';
{code}

The problem is that d holds a reference to c1. In the optimizer, we push the 
limit before the foreach; d still references the limit, and we get the wrong 
schema for the scalar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1605:


Attachment: PIG-1605-1.patch

> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1605-1.patch
>
>
> In the scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
> problem by adding an LOScalar operator. Here is a different approach: we add 
> a soft link to the plan, and the soft link is visible only to the walkers. 
> By doing this, we make sure we visit the LOStore that generates the scalar 
> first, and then the LOForEach that uses it. No other part of the logical 
> plan knows about the soft link. The benefits are:
> 1. The logical plan does not need to deal with LOScalar, which makes it 
> cleaner.
> 2. Conceptually, a scalar dependency is different. A regular link represents 
> data flow in the pipeline; a scalar dependency means an operator depends on 
> a file generated by another operator. It is a different type of data 
> dependency.
> 3. Soft links can solve other dependency problems in the future. If we 
> introduce another UDF that depends on a file generated by another operator, 
> we can use the same mechanism.
> 4. With soft links, we can use scalars coming from different sources in the 
> same statement, which in my mind is not a rare use case (e.g. D = foreach C 
> generate c0/A.total, c1/B.count;).
> Currently, there are two cases where we can use a soft link:
> 1. scalar dependency, where the ReadScalar UDF uses a file generated by an 
> LOStore
> 2. store-load dependency, where we load a file that is generated by a store 
> in the same script. This happens in the multi-store case. Currently we solve 
> it with a regular link; it would be better to use a soft link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1614) javacc.jar pulled twice from maven repository

2010-09-15 Thread Daniel Dai (JIRA)
javacc.jar pulled twice from maven repository
-

 Key: PIG-1614
 URL: https://issues.apache.org/jira/browse/PIG-1614
 Project: Pig
  Issue Type: Bug
  Components: build
Reporter: Daniel Dai
Priority: Trivial


ant pulls javacc.jar twice from maven: once as javacc.jar and once as 
javacc-4.2.jar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar

2010-09-15 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1608:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to trunk. Thanks Niraj!

> pig should always include pig-default.properties and pig.properties in the 
> pig.jar
> --
>
> Key: PIG-1608
> URL: https://issues.apache.org/jira/browse/PIG-1608
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: niraj rai
>Assignee: niraj rai
> Fix For: 0.9.0
>
> Attachments: PIG-1608_0.patch, PIG-1608_1.patch
>
>
> pig should always include pig-default.properties and pig.properties as a part 
> of the pig.jar file

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar

2010-09-15 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1608:


Fix Version/s: 0.9.0
Affects Version/s: 0.8.0

> pig should always include pig-default.properties and pig.properties in the 
> pig.jar
> --
>
> Key: PIG-1608
> URL: https://issues.apache.org/jira/browse/PIG-1608
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: niraj rai
>Assignee: niraj rai
> Fix For: 0.9.0
>
> Attachments: PIG-1608_0.patch, PIG-1608_1.patch
>
>
> pig should always include pig-default.properties and pig.properties as a part 
> of the pig.jar file

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar

2010-09-15 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909821#action_12909821
 ] 

Daniel Dai commented on PIG-1608:
-

Two comments:
1. Target "buildJar-withouthadoop" should also include this change.
2. Formatting comment: use spaces instead of tabs.

Targets "jar" and "package" look good.

> pig should always include pig-default.properties and pig.properties in the 
> pig.jar
> --
>
> Key: PIG-1608
> URL: https://issues.apache.org/jira/browse/PIG-1608
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Attachments: PIG-1608_0.patch
>
>
> pig should always include pig-default.properties and pig.properties as a part 
> of the pig.jar file

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-13 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909007#action_12909007
 ] 

Daniel Dai commented on PIG-1605:
-

Changes are reasonably small. Here is a summary:
1. Add the following methods to the plan (both old and new):
{code}
public void createSoftLink(E from, E to)
public List<E> getSoftLinkPredecessors(E op)
public List<E> getSoftLinkSuccessors(E op)
{code}

2. All walkers need to change. When a walker gets predecessors/successors, it 
needs to get both the soft and the regular link neighbors. The changes are 
straightforward, eg from:
{code}
Collection<E> newSuccessors = mPlan.getSuccessors(suc);
{code}
to:
{code}
Collection<E> newSuccessors = mPlan.getSuccessors(suc);
newSuccessors.addAll(mPlan.getSoftLinkSuccessors(suc));
{code}

3. Change plan utility functions, such as replace, replaceAndAddSucessors, 
replaceAndAddPredecessors, etc.
In the new logical plan there is no change, since we only have minimal utility 
functions. In the old logical plan there should be some changes to make those 
utility functions aware of soft links; but if we decide not to support the old 
logical plan going forward, no change is needed, and we only need to note that 
those utility functions do not deal with soft links.

4. Change scalar to use a soft link.
This includes creating the soft link and maintaining it during transformations 
(migrating to the new plan, translating to the physical plan).

5. Change store-load to use a soft link.
This is an optional step. Currently we use a regular link; conceptually we 
should use a soft link. It is OK if we don't do this for now.

Also note that in most cases there is no soft link and the plan behaves just 
like before, so this change should be safe enough.
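The plan-side additions in point 1 and the walker change in point 2 can be
sketched as a thin layer over separate adjacency maps. This is a hypothetical
minimal model for illustration only (the class name SketchPlan and method
walkerSuccessors are not part of Pig's actual OperatorPlan):

```java
import java.util.*;

// Minimal model of a plan that keeps soft links in adjacency maps separate
// from the regular edges, so existing traversal code is unaffected unless a
// walker explicitly asks for soft neighbors.
class SketchPlan<E> {
    private final Map<E, List<E>> successors = new HashMap<>();
    private final Map<E, List<E>> softSuccessors = new HashMap<>();
    private final Map<E, List<E>> softPredecessors = new HashMap<>();

    public void connect(E from, E to) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Soft links live beside regular links; only walkers consult them.
    public void createSoftLink(E from, E to) {
        softSuccessors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        softPredecessors.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    public List<E> getSuccessors(E op) {
        return successors.getOrDefault(op, Collections.emptyList());
    }

    public List<E> getSoftLinkSuccessors(E op) {
        return softSuccessors.getOrDefault(op, Collections.emptyList());
    }

    public List<E> getSoftLinkPredecessors(E op) {
        return softPredecessors.getOrDefault(op, Collections.emptyList());
    }

    // What a walker would do per point 2: merge regular and soft successors.
    public List<E> walkerSuccessors(E op) {
        List<E> all = new ArrayList<>(getSuccessors(op));
        all.addAll(getSoftLinkSuccessors(op));
        return all;
    }
}
```

A walker that previously called getSuccessors(suc) alone would call
walkerSuccessors(suc) (or append getSoftLinkSuccessors itself), while
plan-mutating utility functions can keep ignoring the soft maps.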

> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> In scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve 
> the problem by adding a LOScalar operator. Here is a different approach. We 
> will add a soft link to the plan, and soft link is only visible to the 
> walkers. By doing this, we can make sure we visit LOStore which generate 
> scalar first, and then LOForEach which use the scalar. All other part of the 
> logical plan does not know the existence of the soft link. The benefits are:
> 1. Logical plan do not need to deal with LOScalar, this makes logical plan 
> cleaner
> 2. Conceptually scalar dependency is different. Regular link represent a data 
> flow in pipeline. In scalar, the dependency means an operator depends on a 
> file generated by the other operator. It's different type of data dependency.
> 3. Soft link can solve other dependency problem in the future. If we 
> introduce another UDF dependent on a file generated by another operator, we 
> can use this mechanism to solve it. 
> 4. With soft link, we can use scalar come from different sources in the same 
> statement, which in my mind is not a rare use case. (eg: D = foreach C 
> generate c0/A.total, c1/B.count; )
> Currently, there are two cases we can use soft link:
> 1. scalar dependency, where ReadScalar UDF will use a file generate by a 
> LOStore
> 2. store-load dependency, where we will load a file which is generated by a 
> store in the same script. This happens in multi-store case. Currently we 
> solve it by regular link. It is better to use a soft link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-13 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909008#action_12909008
 ] 

Daniel Dai commented on PIG-1605:
-

Yes, Thejas is right. The first 3 are the main reasons for the change.

> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> In scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve 
> the problem by adding a LOScalar operator. Here is a different approach. We 
> will add a soft link to the plan, and soft link is only visible to the 
> walkers. By doing this, we can make sure we visit LOStore which generate 
> scalar first, and then LOForEach which use the scalar. All other part of the 
> logical plan does not know the existence of the soft link. The benefits are:
> 1. Logical plan do not need to deal with LOScalar, this makes logical plan 
> cleaner
> 2. Conceptually scalar dependency is different. Regular link represent a data 
> flow in pipeline. In scalar, the dependency means an operator depends on a 
> file generated by the other operator. It's different type of data dependency.
> 3. Soft link can solve other dependency problem in the future. If we 
> introduce another UDF dependent on a file generated by another operator, we 
> can use this mechanism to solve it. 
> 4. With soft link, we can use scalar come from different sources in the same 
> statement, which in my mind is not a rare use case. (eg: D = foreach C 
> generate c0/A.total, c1/B.count; )
> Currently, there are two cases we can use soft link:
> 1. scalar dependency, where ReadScalar UDF will use a file generate by a 
> LOStore
> 2. store-load dependency, where we will load a file which is generated by a 
> store in the same script. This happens in multi-store case. Currently we 
> solve it by regular link. It is better to use a soft link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-13 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1605:


Description: 
In the scalar implementation, we need to deal with implicit dependencies. 
[PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
problem by adding a LOScalar operator. Here is a different approach: we add a 
soft link to the plan, and the soft link is visible only to the walkers. By 
doing this, we can make sure we visit the LOStore which generates the scalar 
first, and then the LOForEach which uses the scalar. The rest of the logical 
plan does not know the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which keeps the 
logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a 
data flow in the pipeline. For a scalar, the dependency means an operator 
depends on a file generated by another operator; it is a different type of 
data dependency.
3. Soft links can solve other dependency problems in the future. If we 
introduce another UDF that depends on a file generated by another operator, we 
can use the same mechanism.
4. With soft links, we can use scalars coming from different sources in the 
same statement, which in my mind is not a rare use case. (eg: D = foreach C 
generate c0/A.total, c1/B.count;)

Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a 
LOStore
2. store-load dependency, where we load a file that is generated by a store in 
the same script. This happens in the multi-store case. Currently we solve it 
with a regular link; it is better to use a soft link.

  was:
In the scalar implementation, we need to deal with implicit dependencies. 
[PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
problem by adding a LOScalar operator. Here is a different approach: we add a 
soft link to the plan, and the soft link is visible only to the walkers. By 
doing this, we can make sure we visit the LOStore which generates the scalar 
first, and then the LOForEach which uses the scalar. The rest of the logical 
plan does not know the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which keeps the 
logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a 
data flow in the pipeline. For a scalar, the dependency means an operator 
depends on a file generated by another operator; it is a different type of 
data dependency.
3. Soft links can solve other dependency problems in the future. If we 
introduce another UDF that depends on a file generated by another operator, we 
can use the same mechanism.
4. With soft links, we can use scalars coming from different sources in the 
same statement, which in my mind is not a rare use case. (eg: D = foreach C 
generate c0/A.total, c1/B.count;)

Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a 
LOStore
2. store-load dependency, where we load a file that is generated by a store in 
the same script. This happens in the multi-store case. Currently we solve it 
with a regular link; it is better to use a soft link.


> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> In scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve 
> the problem by adding a LOScalar operator. Here is a different approach. We 
> will add a soft link to the plan, and soft link is only visible to the 
> walkers. By doing this, we can make sure we visit LOStore which generate 
> scalar first, and then LOForEach which use the scalar. All other part of the 
> logical plan does not know the existence of the soft link. The benefits are:
> 1. Logical plan do not need to deal with LOScalar, this makes logical plan 
> cleaner
> 2. Conceptually scalar dependency is different. Regular link represent a data 
> flow in pipeline. In scalar, the dependency means an operator depends on a 
> file generated by the other operator. It's different type of data dependency.
> 3. Soft link can solve other dependency problem in the future. If we 
> introduce another UDF dependent on a file generated by another operator, we 
> can use this mechanism to solve it. 
> 4. With soft link, we can use scalar come from different sources in the same 
> statement, which in my mind is not a rare use case. (eg: D = foreach C 
> generate c0/A.total, c1

[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-13 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1605:


Description: 
In the scalar implementation, we need to deal with implicit dependencies. 
[PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
problem by adding a LOScalar operator. Here is a different approach: we add a 
soft link to the plan, and the soft link is visible only to the walkers. By 
doing this, we can make sure we visit the LOStore which generates the scalar 
first, and then the LOForEach which uses the scalar. The rest of the logical 
plan does not know the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which keeps the 
logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a 
data flow in the pipeline. For a scalar, the dependency means an operator 
depends on a file generated by another operator; it is a different type of 
data dependency.
3. Soft links can solve other dependency problems in the future. If we 
introduce another UDF that depends on a file generated by another operator, we 
can use the same mechanism.
4. With soft links, we can use scalars coming from different sources in the 
same statement, which in my mind is not a rare use case. (eg: D = foreach C 
generate c0/A.total, c1/B.count;)

Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a 
LOStore
2. store-load dependency, where we load a file that is generated by a store in 
the same script. This happens in the multi-store case. Currently we solve it 
with a regular link; it is better to use a soft link.

  was:
In the scalar implementation, we need to deal with implicit dependencies. 
[PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
problem by adding a LOScalar operator. Here is a different approach: we add a 
soft link to the plan, and the soft link is visible only to the walkers. By 
doing this, we can make sure we visit the LOStore which generates the scalar 
first, and then the LOForEach which uses the scalar. The rest of the logical 
plan does not know the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which keeps the 
logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a 
data flow in the pipeline. For a scalar, the dependency means an operator 
depends on a file generated by another operator; it is a different type of 
data dependency.
3. Soft links can solve other dependency problems in the future. If we 
introduce another UDF that depends on a file generated by another operator, we 
can use the same mechanism.

Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a 
LOStore
2. store-load dependency, where we load a file that is generated by a store in 
the same script. This happens in the multi-store case. Currently we solve it 
with a regular link; it is better to use a soft link.


> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> In scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve 
> the problem by adding a LOScalar operator. Here is a different approach. We 
> will add a soft link to the plan, and soft link is only visible to the 
> walkers. By doing this, we can make sure we visit LOStore which generate 
> scalar first, and then LOForEach which use the scalar. All other part of the 
> logical plan does not know the existence of the soft link. The benefits are:
> 1. Logical plan do not need to deal with LOScalar, this makes logical plan 
> cleaner
> 2. Conceptually scalar dependency is different. Regular link represent a data 
> flow in pipeline. In scalar, the dependency means an operator depends on a 
> file generated by the other operator. It's different type of data dependency.
> 3. Soft link can solve other dependency problem in the future. If we 
> introduce another UDF dependent on a file generated by another operator, we 
> can use this mechanism to solve it. 
> 4. With soft link, we can use scalar come from different sources in the same 
> statement, which in my mind is not a rare use case. (eg: D = foreach C 
> generate c0/A.total, c1/B.count;)
> Currently, there are two cases we can use soft link:
> 1. scalar dependency, where ReadScalar UDF will use a file generate by a 
> LOStore
> 2. store-load dependency, where

[jira] Commented: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar

2010-09-13 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908886#action_12908886
 ] 

Daniel Dai commented on PIG-1608:
-

Pig should include pig-default.properties in pig.jar, but not pig.properties, 
just as Hadoop does with core-default.xml and core-site.xml.
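The layering described here (bundled defaults overridden by a site file) is
what java.util.Properties supports directly through its defaults constructor.
A hedged sketch of the pattern, with the classpath/conf-directory loading step
elided and the property names and values purely illustrative:

```java
import java.util.Properties;

// Sketch of default/site property layering: pig-default.properties would be
// read from the jar's classpath and pig.properties from the conf directory;
// here the two layers are populated inline for illustration.
public class PropertyLayering {
    public static Properties layered() {
        Properties defaults = new Properties();      // stand-in for pig-default.properties
        defaults.setProperty("pig.exec.nocombiner", "false");
        defaults.setProperty("debug", "INFO");

        Properties site = new Properties(defaults);  // stand-in for pig.properties
        site.setProperty("debug", "DEBUG");          // site value shadows the default
        return site;
    }
}
```

getProperty falls back to the defaults layer only for keys the site layer does
not set, which is exactly the core-default.xml/core-site.xml behavior.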

> pig should always include pig-default.properties and pig.properties in the 
> pig.jar
> --
>
> Key: PIG-1608
> URL: https://issues.apache.org/jira/browse/PIG-1608
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
>
> pig should always include pig-default.properties and pig.properties as a part 
> of the pig.jar file

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1605:


Description: 
In the scalar implementation, we need to deal with implicit dependencies. 
[PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
problem by adding a LOScalar operator. Here is a different approach: we add a 
soft link to the plan, and the soft link is visible only to the walkers. By 
doing this, we can make sure we visit the LOStore which generates the scalar 
first, and then the LOForEach which uses the scalar. The rest of the logical 
plan does not know the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which keeps the 
logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a 
data flow in the pipeline. For a scalar, the dependency means an operator 
depends on a file generated by another operator; it is a different type of 
data dependency.
3. Soft links can solve other dependency problems in the future. If we 
introduce another UDF that depends on a file generated by another operator, we 
can use the same mechanism.

Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a 
LOStore
2. store-load dependency, where we load a file that is generated by a store in 
the same script. This happens in the multi-store case. Currently we solve it 
with a regular link; it is better to use a soft link.

  was:
In the scalar implementation, we need to deal with implicit dependencies. 
[PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
problem by adding a LOScalar operator. Here is a different approach: we add a 
soft link to the plan, and the soft link is visible only to the walkers. The 
rest of the logical plan does not know the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which keeps the 
logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a 
data flow in the pipeline. For a scalar, the dependency means an operator 
depends on a file generated by another operator; it is a different type of 
data dependency.
3. Soft links can solve other dependency problems in the future. If we 
introduce another UDF that depends on a file generated by another operator, we 
can use the same mechanism.

Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a 
LOStore
2. store-load dependency, where we load a file that is generated by a store in 
the same script. This happens in the multi-store case. Currently we solve it 
with a regular link; it is better to use a soft link.


> Adding soft link to plan to solve input file dependency
> ---
>
> Key: PIG-1605
> URL: https://issues.apache.org/jira/browse/PIG-1605
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> In scalar implementation, we need to deal with implicit dependencies. 
> [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve 
> the problem by adding a LOScalar operator. Here is a different approach. We 
> will add a soft link to the plan, and soft link is only visible to the 
> walkers. By doing this, we can make sure we visit LOStore which generate 
> scalar first, and then LOForEach which use the scalar. All other part of the 
> logical plan does not know the existence of the soft link. The benefits are:
> 1. Logical plan do not need to deal with LOScalar, this makes logical plan 
> cleaner
> 2. Conceptually scalar dependency is different. Regular link represent a data 
> flow in pipeline. In scalar, the dependency means an operator depends on a 
> file generated by the other operator. It's different type of data dependency.
> 3. Soft link can solve other dependency problem in the future. If we 
> introduce another UDF dependent on a file generated by another operator, we 
> can use this mechanism to solve it. 
> Currently, there are two cases we can use soft link:
> 1. scalar dependency, where ReadScalar UDF will use a file generate by a 
> LOStore
> 2. store-load dependency, where we will load a file which is generated by a 
> store in the same script. This happens in multi-store case. Currently we 
> solve it by regular link. It is better to use a soft link.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1605) Adding soft link to plan to solve input file dependency

2010-09-10 Thread Daniel Dai (JIRA)
Adding soft link to plan to solve input file dependency
---

 Key: PIG-1605
 URL: https://issues.apache.org/jira/browse/PIG-1605
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


In the scalar implementation, we need to deal with implicit dependencies. 
[PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the 
problem by adding a LOScalar operator. Here is a different approach: we add a 
soft link to the plan, and the soft link is visible only to the walkers. The 
rest of the logical plan does not know the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which keeps the 
logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a 
data flow in the pipeline. For a scalar, the dependency means an operator 
depends on a file generated by another operator; it is a different type of 
data dependency.
3. Soft links can solve other dependency problems in the future. If we 
introduce another UDF that depends on a file generated by another operator, we 
can use the same mechanism.

Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a 
LOStore
2. store-load dependency, where we load a file that is generated by a store in 
the same script. This happens in the multi-store case. Currently we solve it 
with a regular link; it is better to use a soft link.
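The visit-order guarantee (the store that produces a file is visited before
the operator that consumes it) falls out of any traversal that treats soft and
regular edges alike. A sketch under that assumption, using Kahn's topological
sort over the union of both edge kinds (operator names and the class name are
illustrative, not Pig's walker API):

```java
import java.util.*;

// Kahn's algorithm over the union of regular and soft edges: an operator is
// emitted only after every operator it depends on, by either edge kind.
public class SoftOrder {
    public static List<String> order(Map<String, List<String>> regular,
                                     Map<String, List<String>> soft) {
        Map<String, List<String>> all = new HashMap<>();
        Map<String, Integer> indegree = new HashMap<>();
        for (Map<String, List<String>> edges : List.of(regular, soft)) {
            for (Map.Entry<String, List<String>> e : edges.entrySet()) {
                indegree.putIfAbsent(e.getKey(), 0);
                for (String to : e.getValue()) {
                    all.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(to);
                    indegree.merge(to, 1, Integer::sum);
                }
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((op, d) -> { if (d == 0) ready.add(op); });
        List<String> visit = new ArrayList<>();
        while (!ready.isEmpty()) {
            String op = ready.remove();
            visit.add(op);
            for (String to : all.getOrDefault(op, List.of()))
                if (indegree.merge(to, -1, Integer::sum) == 0) ready.add(to);
        }
        return visit;
    }
}
```

With a soft edge LOStore -> LOForEach added to the regular edges, the store is
always ordered before the foreach that reads the scalar file, while code that
looks only at the regular edges never sees the extra dependency.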

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1604) 'relation as scalar' does not work with complex types

2010-09-10 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908096#action_12908096
 ] 

Daniel Dai commented on PIG-1604:
-

+1, patch looks good.

> 'relation as scalar' does not work with complex types 
> --
>
> Key: PIG-1604
> URL: https://issues.apache.org/jira/browse/PIG-1604
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1604.1.patch
>
>
> Statement such as 
> sclr = limit b 1;
> d = foreach a generate name, age/(double)sclr.mapcol#'it' as some_sum;
> Results in the following parse error:
>  ERROR 1000: Error during parsing. Non-atomic field expected but found atomic 
> field

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

2010-09-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1437:


 Assignee: Xuefu Zhang
Fix Version/s: 0.9.0

> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -
>
> Key: PIG-1437
> URL: https://issues.apache.org/jira/browse/PIG-1437
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Xuefu Zhang
>Priority: Minor
> Fix For: 0.9.0
>
>
> Its possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code} 
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This can only be done if no columns within the bags are referenced 
> subsequently in the script. Since in the Pig-Hadoop world DISTINCT is 
> executed more efficiently than GROUP BY, this will be a huge win. 
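The equivalence this rewrite relies on can be checked outside Pig: when
records are grouped on all of their columns, the set of group keys is exactly
the set of distinct records. A small Java illustration of that equivalence
(not Pig code; the records and helpers are made up for the example):

```java
import java.util.*;
import java.util.stream.*;

// Grouping rows on the whole record and keeping only the keys yields exactly
// the distinct rows -- the reason GroupBy-Foreach-flatten(group) can be
// rewritten to DISTINCT when nothing else touches the grouped bags.
public class GroupVsDistinct {
    // B = group A by (name,age); C = foreach B generate flatten(group);
    static Set<List<String>> viaGroupFlatten(List<List<String>> rows) {
        return rows.stream().collect(Collectors.groupingBy(r -> r)).keySet();
    }

    // B = distinct A;
    static Set<List<String>> viaDistinct(List<List<String>> rows) {
        return rows.stream().distinct().collect(Collectors.toSet());
    }
}
```

Both paths produce the same set, but DISTINCT avoids materializing the bags
that the group-by plan would build and then throw away.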

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1322) Logical Optimizer: change outer join into regular join

2010-09-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1322:


 Assignee: Xuefu Zhang  (was: Daniel Dai)
Fix Version/s: 0.9.0

> Logical Optimizer: change outer join into regular join
> --
>
> Key: PIG-1322
> URL: https://issues.apache.org/jira/browse/PIG-1322
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Xuefu Zhang
> Fix For: 0.9.0
>
>
> In some cases, we can change an outer join into a regular join. The benefit 
> is that a regular join is easier to handle in subsequent optimizations. 
> Example:
> C = join A by a0 LEFT OUTER, B by b0;
> D = filter C by b0 > 0;
> => 
> C = join A by a0, B by b0;
> D = filter C by b0 > 0;
> With this change, the PushUpFilter rule can then push the filter in front of 
> the regular join, which it otherwise could not.
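The rewrite is safe because the filter on b0 rejects exactly the null-padded
rows that distinguish the left outer join from the inner join. A hedged check
of that equivalence on toy data (single-column int schemas and a sentinel for
null are simplifications made up for this sketch):

```java
import java.util.*;

// Nested-loop join on a0 == b0. With outer=true, a left row with no match is
// emitted padded with a null stand-in for b0; the filter "b0 > 0" then drops
// exactly those padded rows, so LEFT OUTER + filter == inner join + filter.
public class OuterJoinRewrite {
    static List<int[]> joinFilter(int[] as, int[] bs, boolean outer) {
        List<int[]> out = new ArrayList<>();
        for (int a0 : as) {
            boolean matched = false;
            for (int b0 : bs)
                if (a0 == b0) { out.add(new int[]{a0, b0}); matched = true; }
            if (outer && !matched)
                out.add(new int[]{a0, Integer.MIN_VALUE}); // null stand-in
        }
        // D = filter C by b0 > 0;  (the null stand-in never passes)
        out.removeIf(row -> row[1] <= 0);
        return out;
    }
}
```

The argument only holds because the filter is null-rejecting on the inner
side's column; a filter on a0 alone would not license the rewrite.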

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-09-07 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1178:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, 
> PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, 
> PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, 
> pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, 
> pig_1178_3.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven to not be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause for these issues is that a number of 
> design decisions that were made as part of the 0.2 rewrite of the front end 
> have now proven to be sub-optimal. The heart of this proposal is to revisit a 
> number of those proposals and rebuild the logical plan with a simpler design 
> that will make it much easier to maintain the logical plan as well as extend 
> the logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-09-07 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907061#action_12907061
 ] 

Daniel Dai commented on PIG-1178:
-

PIG-1178-11.patch committed to both trunk and 0.8 branch. 

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, 
> PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, 
> PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, 
> pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, 
> pig_1178_3.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven to not be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause for these issues is that a number of 
> design decisions that were made as part of the 0.2 rewrite of the front end 
> have now proven to be sub-optimal. The heart of this proposal is to revisit a 
> number of those proposals and rebuild the logical plan with a simpler design 
> that will make it much easier to maintain the logical plan as well as extend 
> the logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-09-07 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1178:


Attachment: PIG-1178-11.patch

PIG-1178-11.patch changes the layout of explain output, error codes, comments, 
etc. No real functional changes.

test-patch result:
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 11 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, 
> PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, 
> PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, 
> pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, 
> pig_1178_3.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven not to be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause of these issues is that a number of 
> design decisions made as part of the 0.2 rewrite of the front end have since 
> proven sub-optimal. The heart of this proposal is to revisit a number of 
> those decisions and rebuild the logical plan with a simpler design that will 
> make it much easier to maintain the logical plan as well as extend the 
> logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders

2010-09-07 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906932#action_12906932
 ] 

Daniel Dai commented on PIG-1595:
-

+1 for the test failure fix.

> casting relation to scalar- problem with handling of data from non PigStorage 
> loaders
> -
>
> Key: PIG-1595
> URL: https://issues.apache.org/jira/browse/PIG-1595
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1595.1.patch, PIG-1595.2.patch
>
>
> If load functions that don't follow the same bytearray format as PigStorage 
> for the other supported datatypes, or that don't implement the LoadCaster 
> interface, are used in 'casting relation to scalar' (PIG-1434), the query 
> can fail or produce incorrect results.
> The root cause of the problem is that there is a real dependency between the 
> ReadScalars udf that returns the scalar value and the LogicalOperator that 
> acts as its input, but the logical plan does not capture this dependency. So 
> in the SchemaResetter visitor used by the optimizer, the order in which the 
> schema is reset and evaluated does not take this into account. If the schema 
> of the input LogicalOperator is not evaluated before the ReadScalars udf, 
> the result type of the ReadScalars udf becomes bytearray. POUserFunc will 
> then convert the input to bytearray using 'new DataByteArray(inp.toString().getBytes())'. 
> But this bytearray encoding of the other supported types might not be the 
> same for the load function associated with the column, and that can result in problems.
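The encoding mismatch described above can be sketched in plain Java (a simplified illustration, not Pig's actual classes: the `textEncoding`, `binaryEncoding`, and `binaryDecode` helpers below are hypothetical stand-ins for the text-based default conversion and a loader's own bytearray format):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteArrayMismatch {
    // Roughly what the default path does when the result type degrades to
    // bytearray: render the value as text and take its bytes.
    static byte[] textEncoding(int value) {
        return String.valueOf(value).getBytes(StandardCharsets.UTF_8);
    }

    // A hypothetical loader that stores ints as 4-byte big-endian binary.
    static byte[] binaryEncoding(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // That loader's caster, decoding what it assumes is its own format.
    static int binaryDecode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        int scalar = 42;
        byte[] asText = textEncoding(scalar);     // "42" -> 2 bytes
        byte[] asBinary = binaryEncoding(scalar); // 4 bytes

        // The two encodings of the same value do not even agree in length;
        // handing the text bytes to the binary decoder would throw a
        // BufferUnderflowException rather than return 42.
        System.out.println(asText.length);            // 2
        System.out.println(binaryDecode(asBinary));   // 42
    }
}
```

This is why the bytes produced by `inp.toString().getBytes()` cannot safely be handed to an arbitrary loader's caster.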

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1601) Make scalar work for secure hadoop

2010-09-07 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1601.
-

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

> Make scalar work for secure hadoop
> --
>
> Key: PIG-1601
> URL: https://issues.apache.org/jira/browse/PIG-1601
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1601-1.patch
>
>
> Error message:
> open file
> 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error 
> =
> java.io.IOException: Delegation Token can be issued only with kerberos or web
> authentication at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432)
> at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597) at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at
> org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at
> org.apache.hadoop.mapred.Child$4.run(Child.java:217) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:211) 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1594) NullPointerException in new logical planner

2010-09-06 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1594.
-

Resolution: Fixed

This issue is fixed by PIG-1178-10.patch.

> NullPointerException in new logical planner
> ---
>
> Key: PIG-1594
> URL: https://issues.apache.org/jira/browse/PIG-1594
> Project: Pig
>  Issue Type: Bug
>Reporter: Andrew Hitchcock
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> I've been testing the trunk version of Pig on Elastic MapReduce against our 
> log processing sample application(1). When I try to run the query it throws a 
> NullPointerException and suggests I disable the new logical plan. Disabling 
> it works and the script succeeds. Here is the query I'm trying to run:
> {code}
> register file:/home/hadoop/lib/pig/piggybank.jar
>   DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
>   RAW_LOGS = LOAD '$INPUT' USING TextLoader as (line:chararray);
>   LOGS_BASE= foreach RAW_LOGS generate FLATTEN(EXTRACT(line, '^(\\S+) (\\S+) 
> (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" 
> "([^"]*)"')) as (remoteAddr:chararray, remoteLogname:chararray, 
> user:chararray, time:chararray, request:chararray, status:int, 
> bytes_string:chararray, referrer:chararray, browser:chararray);
>   REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
>   FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer 
> matches '.*google.*';
>   SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, 
> '.*[&\\?]q=([^&]+).*')) as terms:chararray;
>   SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL;
>   SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE 
> $0, COUNT($1) as num;
>   SEARCH_TERMS_COUNT_SORTED = LIMIT(ORDER SEARCH_TERMS_COUNT BY num DESC) 50;
>   STORE SEARCH_TERMS_COUNT_SORTED into '$OUTPUT';
> {code}
> And here is the stack trace that results:
> {code}
> ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2042: Error in 
> new logical plan. Try -Dpig.usenewlogicalplan=false.
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:285)
> at org.apache.pig.PigServer.compilePp(PigServer.java:1301)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1154)
> at org.apache.pig.PigServer.execute(PigServer.java:1148)
> at org.apache.pig.PigServer.access$100(PigServer.java:123)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:1464)
> at org.apache.pig.PigServer.executeBatchEx(PigServer.java:350)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:111)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
> at org.apache.pig.Main.run(Main.java:491)
> at org.apache.pig.Main.main(Main.java:107)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.lang.NullPointerException
> at org.apache.pig.EvalFunc.getSchemaName(EvalFunc.java:76)
> at 
> org.apache.pig.piggybank.impl.ErrorCatchingBase.outputSchema(ErrorCatchingBase.java:76)
> at 
> org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:111)
> at 
> org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:175)
> at 
> org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
> at 
> org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:55)
> at 
> org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:69)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
> at 
> org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:87)
> at 
> org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:149)
> at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74)
> at 
> org.apache.pig.newpl

[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-09-06 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906592#action_12906592
 ] 

Daniel Dai commented on PIG-1178:
-

Patch PIG-1178-10.patch committed.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG-1178-10.patch, PIG-1178-4.patch, PIG-1178-5.patch, 
> PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, 
> pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, 
> pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven not to be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause of these issues is that a number of 
> design decisions made as part of the 0.2 rewrite of the front end have since 
> proven sub-optimal. The heart of this proposal is to revisit a number of 
> those decisions and rebuild the logical plan with a simpler design that will 
> make it much easier to maintain the logical plan as well as extend the 
> logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-09-06 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1178:


Attachment: PIG-1178-10.patch

Patch PIG-1178-10.patch addresses user-defined schemas in foreach.

test-patch result:
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All tests pass.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG-1178-10.patch, PIG-1178-4.patch, PIG-1178-5.patch, 
> PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, 
> pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, 
> pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven not to be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause of these issues is that a number of 
> design decisions made as part of the 0.2 rewrite of the front end have since 
> proven sub-optimal. The heart of this proposal is to revisit a number of 
> those decisions and rebuild the logical plan with a simpler design that will 
> make it much easier to maintain the logical plan as well as extend the 
> logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases

2010-09-05 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1575:


Attachment: jira-1575-5.patch

Patch looks good. Attaching the final patch. 

test-patch result:
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All tests pass.

Patch committed to both trunk and 0.8 branch.

> Complete the migration of optimization rule PushUpFilter including missing 
> test cases
> -
>
> Key: PIG-1575
> URL: https://issues.apache.org/jira/browse/PIG-1575
> Project: Pig
>  Issue Type: Bug
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1575-1.patch, jira-1575-2.patch, jira-1575-3.patch, 
> jira-1575-4.patch, jira-1575-5.patch
>
>
> The optimization rule under the new logical plan, PushUpFilter, only covers a 
> subset of the optimization scenarios handled by the same rule under the old 
> logical plan. For instance, it only considers a filter after a join, whereas 
> the old optimization also considers other operators such as CoGroup, Union, 
> Cross, etc. The migration of the rule should be completed.
> Also, the test cases created for the old PushUpFilter weren't migrated to the 
> new logical plan code base. They should also be migrated. (A few have been 
> migrated in PIG-1574.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases

2010-09-05 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1575:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> Complete the migration of optimization rule PushUpFilter including missing 
> test cases
> -
>
> Key: PIG-1575
> URL: https://issues.apache.org/jira/browse/PIG-1575
> Project: Pig
>  Issue Type: Bug
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1575-1.patch, jira-1575-2.patch, jira-1575-3.patch, 
> jira-1575-4.patch, jira-1575-5.patch
>
>
> The optimization rule under the new logical plan, PushUpFilter, only covers a 
> subset of the optimization scenarios handled by the same rule under the old 
> logical plan. For instance, it only considers a filter after a join, whereas 
> the old optimization also considers other operators such as CoGroup, Union, 
> Cross, etc. The migration of the rule should be completed.
> Also, the test cases created for the old PushUpFilter weren't migrated to the 
> new logical plan code base. They should also be migrated. (A few have been 
> migrated in PIG-1574.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders

2010-09-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906322#action_12906322
 ] 

Daniel Dai commented on PIG-1595:
-

The patch breaks TestScalarAliases.testScalarErrMultipleRowsInInput. Commenting 
it out temporarily.

> casting relation to scalar- problem with handling of data from non PigStorage 
> loaders
> -
>
> Key: PIG-1595
> URL: https://issues.apache.org/jira/browse/PIG-1595
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1595.1.patch
>
>
> If load functions that don't follow the same bytearray format as PigStorage 
> for the other supported datatypes, or that don't implement the LoadCaster 
> interface, are used in 'casting relation to scalar' (PIG-1434), the query 
> can fail or produce incorrect results.
> The root cause of the problem is that there is a real dependency between the 
> ReadScalars udf that returns the scalar value and the LogicalOperator that 
> acts as its input, but the logical plan does not capture this dependency. So 
> in the SchemaResetter visitor used by the optimizer, the order in which the 
> schema is reset and evaluated does not take this into account. If the schema 
> of the input LogicalOperator is not evaluated before the ReadScalars udf, 
> the result type of the ReadScalars udf becomes bytearray. POUserFunc will 
> then convert the input to bytearray using 'new DataByteArray(inp.toString().getBytes())'. 
> But this bytearray encoding of the other supported types might not be the 
> same for the load function associated with the column, and that can result in problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906321#action_12906321
 ] 

Daniel Dai commented on PIG-1548:
-

The patch breaks TestFRJoin2.testConcatenateJobForScalar3. Commenting 
it out temporarily.

> Optimize scalar to consolidate the part file
> 
>
> Key: PIG-1548
> URL: https://issues.apache.org/jira/browse/PIG-1548
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1548.patch, PIG-1548_1.patch
>
>
> The current scalar implementation writes a scalar file onto DFS. When Pig 
> needs the scalar, it opens the DFS file directly. Each scalar file consists 
> of more than one part file even though it contains only one record. This 
> puts a huge load on the namenode. We should consolidate the part files 
> before opening them. Another optional step is to put the consolidated file 
> into the distributed cache, which further brings down the load on the namenode.
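The consolidation step can be sketched with plain `java.nio` against a local directory (a simplified stand-in: the real implementation would use Hadoop's `FileSystem` API against HDFS paths, and the `part-*` file names and `scalar-consolidated` output name here are hypothetical):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConsolidateParts {
    /** Concatenate every part-* file in scalarDir into a single file and
     *  return its path, so a reader opens one file instead of touching
     *  every part (fewer namenode round trips in the real HDFS setting). */
    static Path consolidate(Path scalarDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds =
                 Files.newDirectoryStream(scalarDir, "part-*")) {
            for (Path p : ds) parts.add(p);
        }
        Collections.sort(parts); // deterministic part order
        Path merged = scalarDir.resolve("scalar-consolidated");
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                Files.copy(part, out); // append each part's bytes
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("scalar");
        Files.write(dir.resolve("part-00000"), new byte[0]); // empty part
        Files.write(dir.resolve("part-00001"), "42\n".getBytes());
        Path merged = consolidate(dir);
        System.out.print(new String(Files.readAllBytes(merged))); // 42
    }
}
```

Empty part files (common when only one reducer emits the scalar record) simply contribute nothing to the merged file.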

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1601) Make scalar work for secure hadoop

2010-09-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1601:


Attachment: PIG-1601-1.patch

> Make scalar work for secure hadoop
> --
>
> Key: PIG-1601
> URL: https://issues.apache.org/jira/browse/PIG-1601
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1601-1.patch
>
>
> Error message:
> open file
> 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error 
> =
> java.io.IOException: Delegation Token can be issued only with kerberos or web
> authentication at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432)
> at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597) at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at
> org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at
> org.apache.hadoop.mapred.Child$4.run(Child.java:217) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:211) 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1601) Make scalar work for secure hadoop

2010-09-03 Thread Daniel Dai (JIRA)
Make scalar work for secure hadoop
--

 Key: PIG-1601
 URL: https://issues.apache.org/jira/browse/PIG-1601
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0
 Attachments: PIG-1601-1.patch

Error message:
open file
'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error =
java.io.IOException: Delegation Token can be issued only with kerberos or web
authentication at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597) at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:396) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at
org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:396) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.Child.main(Child.java:211) 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1543:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1543-1.patch
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2. The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_a' as (b1:int);
> C =COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
> COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script is done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect the stdout (the 2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns the correct value in limit_empty.output/d/*.
> The difference is that "LIMIT" was applied before using IsEmpty() in the d1 
> case.
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> Besides, there is an error that says:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple




[jira] Resolved: (PIG-1591) pig does not create a log file, if the MR job succeeds but front end fails.

2010-09-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1591.
-

 Hadoop Flags: [Reviewed]
Fix Version/s: 0.8.0
   Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

> pig does not create a log file, if the MR job succeeds but front end fails.
> ---
>
> Key: PIG-1591
> URL: https://issues.apache.org/jira/browse/PIG-1591
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig_1591.patch
>
>
> When I run this script:
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C =COGROUP A by a1, B by b1;
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> dump D1;
> The MR job succeeds but the pig job fails with the following error:
> 2010-08-31 13:33:09,960 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics 
> - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - 
> already initialized
> 2010-08-31 13:33:09,962 [main] INFO  org.apache.pig.impl.io.InterStorage - 
> Pig Internal storage in use
> 2010-08-31 13:33:09,963 [main] INFO  org.apache.pig.impl.io.InterStorage - 
> Pig Internal storage in use
> 2010-08-31 13:33:09,963 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Success!
> 2010-08-31 13:33:09,964 [main] INFO  org.apache.pig.impl.io.InterStorage - 
> Pig Internal storage in use
> 2010-08-31 13:33:09,965 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics 
> - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - 
> already initialized
> 2010-08-31 13:33:09,969 [main] INFO  
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
> process : 1
> 2010-08-31 13:33:09,969 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
> paths to process : 1
> 2010-08-31 13:33:09,973 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple
> Since the MR job succeeded, Pig does not create any log file, but it should 
> still create one, giving the cause of the front-end failure.




[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders

2010-09-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906116#action_12906116
 ] 

Daniel Dai commented on PIG-1595:
-

Patch looks good. This patch addresses the problem that we cannot get the 
output schema of the scalar UDF at compile time. Another approach is to write 
ReadScalars.outputSchema() and use the input schema to figure out the output 
schema. But again, we would need to address the dependency to make sure the 
input schema is correctly set before calling outputSchema(). So both 
approaches should be equivalent.
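The ordering concern can be sketched with a toy schema resolver in plain Java. The names and the "int"/"bytearray" strings here are illustrative stand-ins, not Pig's actual SchemaResetter or type system; the point is only that computing a node's schema before its input's schema degrades the result to the bytearray default, while resolving dependencies first propagates the real type.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SchemaOrderSketch {
    static Map<String, List<String>> deps = new HashMap<>();   // node -> inputs
    static Map<String, String> schema = new HashMap<>();       // node -> resolved type

    // Broken ordering: reads the input's schema whether or not it has been
    // resolved yet, so an unresolved input degrades the type to "bytearray".
    static String schemaNoOrdering(String node) {
        List<String> in = deps.getOrDefault(node, Collections.emptyList());
        String s = in.isEmpty() ? "int" : schema.getOrDefault(in.get(0), "bytearray");
        schema.put(node, s);
        return s;
    }

    // Dependency-aware ordering: resolve all inputs first, then propagate
    // the first input's type (sources default to "int" in this toy model).
    static String schemaOf(String node) {
        if (schema.containsKey(node)) return schema.get(node);
        List<String> in = deps.getOrDefault(node, Collections.emptyList());
        for (String i : in) schemaOf(i);
        String s = in.isEmpty() ? "int" : schema.get(in.get(0));
        schema.put(node, s);
        return s;
    }
}
```

Either fix in the patch amounts to guaranteeing the second traversal order for the ReadScalars input.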

> casting relation to scalar- problem with handling of data from non PigStorage 
> loaders
> -
>
> Key: PIG-1595
> URL: https://issues.apache.org/jira/browse/PIG-1595
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1595.1.patch
>
>
> If load functions that don't follow the same bytearray format as PigStorage 
> for other supported datatypes, or those that don't implement the LoadCaster 
> interface are used in 'casting relation to scalar' (PIG-1434), it can cause 
> the query to fail or create incorrect results.
> The root cause of the problem is that there is a real dependency between the 
> ReadScalars udf that returns the scalar value and the LogicalOperator that 
> acts as its input. But the logicalplan does not capture this dependency. So 
> in SchemaResetter visitor used by the optimizer, the order in which schema is 
> reset and evaluated does not take this into consideration. If the schema of 
> the input LogicalOperator does not get evaluated before the ReadScalar udf, 
> the result type of the ReadScalar udf becomes bytearray. POUserFunc will convert 
> the input to bytearray using ' new DataByteArray(inp.toString().getBytes())'. 
> But this bytearray encoding of other supported types might not be same for 
> the LoadFunction associated with the column, and that can result in problems.




[jira] Commented: (PIG-1591) pig does not create a log file, if the MR job succeeds but front end fails.

2010-09-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905966#action_12905966
 ] 

Daniel Dai commented on PIG-1591:
-

+1. No unit test needed since it is about the error message. Manually tested 
and it works. Will commit it shortly.

> pig does not create a log file, if the MR job succeeds but front end fails.
> ---
>
> Key: PIG-1591
> URL: https://issues.apache.org/jira/browse/PIG-1591
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Attachments: pig_1591.patch
>
>
> When I run this script:
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C =COGROUP A by a1, B by b1;
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> dump D1;
> The MR job succeeds but the pig job fails with the following error:
> 2010-08-31 13:33:09,960 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics 
> - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - 
> already initialized
> 2010-08-31 13:33:09,962 [main] INFO  org.apache.pig.impl.io.InterStorage - 
> Pig Internal storage in use
> 2010-08-31 13:33:09,963 [main] INFO  org.apache.pig.impl.io.InterStorage - 
> Pig Internal storage in use
> 2010-08-31 13:33:09,963 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Success!
> 2010-08-31 13:33:09,964 [main] INFO  org.apache.pig.impl.io.InterStorage - 
> Pig Internal storage in use
> 2010-08-31 13:33:09,965 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics 
> - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - 
> already initialized
> 2010-08-31 13:33:09,969 [main] INFO  
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
> process : 1
> 2010-08-31 13:33:09,969 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
> paths to process : 1
> 2010-08-31 13:33:09,973 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple
> Since the MR job succeeded, Pig does not create any log file, but it should 
> still create one, giving the cause of the front-end failure.




[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905587#action_12905587
 ] 

Daniel Dai commented on PIG-1543:
-

test-patch result:

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All tests pass

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1543-1.patch
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2.
> The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C =COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
> COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script is done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script is done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect the stdout (the 2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns the correct value in limit_empty.output/d/*.
> The difference is that "LIMIT" was applied before using IsEmpty() in the d1 
> case.
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> Besides, there is an error that says:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple




[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-09-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>
> Error message:
> 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
> attempt_20100831093139211_0001_m_00_3: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
> [LookupInFiles : Cannot open file one]
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.io.IOException: LookupInFiles : Cannot open file one
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> ... 10 more
> Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
> does not exist
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
> ... 13 more




[jira] Commented: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-09-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905293#action_12905293
 ] 

Daniel Dai commented on PIG-1572:
-

Patch looks good. One minor doubt: when we migrate to the new logical plan, 
UserFuncExpression already has the necessary cast inserted, so it seems we do 
not need to change the new logical plan's 
UserFuncExpression.getFieldSchema(), am I right?

> change default datatype when relations are used as scalar to bytearray
> --
>
> Key: PIG-1572
> URL: https://issues.apache.org/jira/browse/PIG-1572
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1572.1.patch, PIG-1572.2.patch
>
>
> When relations are cast to scalar, the current default type is chararray. 
> This is inconsistent with the behavior in rest of pig-latin.




[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1543:


Status: Patch Available  (was: Open)

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1543-1.patch
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2.
> The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C =COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
> COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script is done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script is done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect the stdout (the 2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns the correct value in limit_empty.output/d/*.
> The difference is that "LIMIT" was applied before using IsEmpty() in the d1 
> case.
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> Besides, there is an error that says:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple




[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1543:


Attachment: PIG-1543-1.patch

This patch fixes the first issue. The problem is that we erroneously put a 
null in the bag where we expect an empty bag.

The second issue is a side effect of the first. BinInterSedes assumes that a 
bag contains only tuples, so it does not expect a null inside a bag. This 
issue is fixed automatically once the first fix is in.
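The failure mode can be modeled with plain Java lists (a toy stand-in, not Pig's actual DataBag or LIMIT implementation): if LIMIT emits a bag holding a single null instead of a truly empty bag, a size-based IsEmpty check reports the bag as non-empty.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitEmptyBagSketch {
    // A "bag" is just a list of tuples here; a tuple is an Object[].
    static List<Object[]> limitBuggy(List<Object[]> bag, int n) {
        List<Object[]> out = new ArrayList<>();
        if (bag.isEmpty()) {
            out.add(null);                      // the bug: a null where nothing belongs
        }
        for (int i = 0; i < Math.min(n, bag.size()); i++) {
            out.add(bag.get(i));
        }
        return out;
    }

    static List<Object[]> limitFixed(List<Object[]> bag, int n) {
        List<Object[]> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, bag.size()); i++) {
            out.add(bag.get(i));                // an empty input stays an empty bag
        }
        return out;
    }

    // IsEmpty-style check: only the element count is consulted, so a bag
    // containing one null element is reported as non-empty.
    static boolean isEmpty(List<Object[]> bag) {
        return bag.size() == 0;
    }
}
```

A serializer that assumes every bag element is a tuple (as BinInterSedes does) would also choke on the null, which is why the second issue disappears once the first is fixed.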

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1543-1.patch
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2.
> The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C =COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
> COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script is done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script is done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect the stdout (the 2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns the correct value in limit_empty.output/d/*.
> The difference is that "LIMIT" was applied before using IsEmpty() in the d1 
> case.
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> Besides, there is an error that says:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple




[jira] Updated: (PIG-1587) Cloning utility functions for new logical plan

2010-09-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1587:


Description: 
We sometimes need to copy a logical operator/plan when writing an optimization 
rule. Currently, copying an operator/plan is awkward. We need to write some 
utilities to facilitate this process. Swati contributed PIG-1510, but we feel 
it still cannot address most use cases. I propose to add some more utilities 
to the new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, 
uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema, uid 
related fields)
* Set the plan to newPlan;
* If the operator has an inner plan/expression plan, copy the whole inner plan 
with the same keepUid flag (in particular, LOInnerLoad will copy its inner 
project with the same keepUid flag)
* If keepUid is true, further copy uid related fields (LOUnion.uidMapping, 
LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, 
boolean keepUid);
LogicalExpressionPlan copyAbove(LogicalExpression leave, 
LogicalRelationalOperator attachedRelationalOp, boolean keepUid);
LogicalExpressionPlan copyBelow(LogicalExpression root, 
LogicalRelationalOperator attachedRelationalOp, boolean keepUid);
{code}
* Create a new logical expression plan and copy expression operators along 
with their connections, with the same keepUid flag
* Set all ProjectExpression.attachedRelationalOp to attachedRelationalOp 
parameter

{code}
Pair<List<Operator>, List<Operator>> merge(LogicalExpressionPlan plan, 
LogicalRelationalOperator attachedRelationalOp);
{code}
* Merge plan into the current logical expression plan as an independent tree
* attachedRelationalOp is the destination operator that the new logical 
expression plan is attached to
* return the sources/sinks of this independent tree


LogicalPlan.java
{code}
LogicalPlan copy(LOForEach foreach, boolean keepUid);
LogicalPlan copyAbove(LogicalRelationalOperator leave, LOForEach foreach, 
boolean keepUid);
LogicalPlan copyBelow(LogicalRelationalOperator root, LOForEach foreach, 
boolean keepUid);
{code}
* Main use case to copy inner plan of ForEach
* Create a new logical plan and copy relational operators along with their connections
* Copy all expression plans inside relational operator, set plan and 
attachedRelationalOp properly
* If the plan is a ForEach inner plan, the foreach param is the destination 
ForEach operator; otherwise, pass null

{code}
Pair<List<Operator>, List<Operator>> merge(LogicalPlan plan, LOForEach foreach);
{code}
* Merge plan into the current logical plan as an independent tree
* foreach is the destination LOForEach if the destination plan is a ForEach 
inner plan; otherwise, pass null
* return the sources/sinks of this independent tree


  was:
We sometimes need to copy a logical operator/plan when writing an optimization 
rule. Currently, copying an operator/plan is awkward. We need to write some 
utilities to facilitate this process. Swati contributed PIG-1510, but we feel 
it still cannot address most use cases. I propose to add some more utilities 
to the new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, 
uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema, uid 
related fields)
* Set the plan to newPlan;
* If the operator has an inner plan/expression plan, copy the whole inner plan 
with the same keepUid flag (in particular, LOInnerLoad will copy its inner 
project with the same keepUid flag)
* If keepUid is true, further copy uid related fields (LOUnion.uidMapping, 
LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, 
boolean keepUid);
{code}
* Copy expression operators along with their connections, with the same keepUid flag
* Set all ProjectExpression.attachedRelationalOp to attachedRelationalOp 
parameter

{code}
List<Operator> merge(LogicalExpressionPlan plan);
{code}
* Merge plan into the current logical expression plan as an independent tree
* return the sources of this independent tree


LogicalPlan.java
{code}
LogicalPlan copy(boolean keepUid);
{code}
* Main use case to copy inner plan of ForEach
* Copy all rel

[jira] Created: (PIG-1587) Cloning utility functions for new logical plan

2010-08-31 Thread Daniel Dai (JIRA)
Cloning utility functions for new logical plan
--

 Key: PIG-1587
 URL: https://issues.apache.org/jira/browse/PIG-1587
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.9.0


We sometimes need to copy a logical operator/plan when writing an optimization 
rule. Currently, copying an operator/plan is awkward. We need to write some 
utilities to facilitate this process. Swati contributed PIG-1510, but we feel 
it still cannot address most use cases. I propose to add some more utilities 
to the new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, 
uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema, uid 
related fields)
* Set the plan to newPlan;
* If the operator has an inner plan/expression plan, copy the whole inner plan 
with the same keepUid flag (in particular, LOInnerLoad will copy its inner 
project with the same keepUid flag)
* If keepUid is true, further copy uid related fields (LOUnion.uidMapping, 
LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, 
boolean keepUid);
{code}
* Copy expression operators along with their connections, with the same keepUid flag
* Set all ProjectExpression.attachedRelationalOp to attachedRelationalOp 
parameter

{code}
List<Operator> merge(LogicalExpressionPlan plan);
{code}
* Merge plan into the current logical expression plan as an independent tree
* return the sources of this independent tree


LogicalPlan.java
{code}
LogicalPlan copy(boolean keepUid);
{code}
* Main use case to copy inner plan of ForEach
* Copy all relational operators along with their connections
* Copy all expression plans inside relational operator, set plan and 
attachedRelationalOp properly

{code}
List<Operator> merge(LogicalPlan plan);
{code}
* Merge plan into the current logical plan as an independent tree
* return the sources of this independent tree





[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: PIG-1583-1.patch

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>
> Error message:
> 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
> attempt_20100831093139211_0001_m_00_3: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
> [LookupInFiles : Cannot open file one]
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.io.IOException: LookupInFiles : Cannot open file one
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> ... 10 more
> Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
> does not exist
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
> ... 13 more




[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: (was: PIG-1583-1.patch)

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>
> Error message:
> 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
> attempt_20100831093139211_0001_m_00_3: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
> [LookupInFiles : Cannot open file one]
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.io.IOException: LookupInFiles : Cannot open file one
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> ... 10 more
> Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
> does not exist
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
> ... 13 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: PIG-1583-1.patch

> piggybank unit test TestLookupInFiles is broken
> ---
>
> Key: PIG-1583
> URL: https://issues.apache.org/jira/browse/PIG-1583
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1583-1.patch
>
>
> Error message:
> 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
> attempt_20100831093139211_0001_m_00_3: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
> error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
> [LookupInFiles : Cannot open file one]
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.io.IOException: LookupInFiles : Cannot open file one
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> ... 10 more
> Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
> does not exist
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
> at 
> org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
> ... 13 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)
piggybank unit test TestLookupInFiles is broken
---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0
 Attachments: PIG-1583-1.patch

Error message:
10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
attempt_20100831093139211_0001_m_00_3: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
[LookupInFiles : Cannot open file one]
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: LookupInFiles : Cannot open file one
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
... 10 more
Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
does not exist
at 
org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
at 
org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
... 13 more


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-08-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904485#action_12904485
 ] 

Daniel Dai commented on PIG-1178:
-

PIG-1178-9.patch committed.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, 
> PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, 
> pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, 
> pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven to not be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause for these issues is that a number of 
> design decisions that were made as part of the 0.2 rewrite of the front end 
> have now proven to be sub-optimal. The heart of this proposal is to revisit a 
> number of those proposals and rebuild the logical plan with a simpler design 
> that will make it much easier to maintain the logical plan as well as extend 
> the logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-08-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1178:


Attachment: PIG-1178-9.patch

Update help message.

> LogicalPlan and Optimizer are too complex and hard to work with
> ---
>
> Key: PIG-1178
> URL: https://issues.apache.org/jira/browse/PIG-1178
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: expressions-2.patch, expressions.patch, lp.patch, 
> lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, 
> PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, 
> pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, 
> pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch
>
>
> The current implementation of the logical plan and the logical optimizer in 
> Pig has proven to not be easily extensible. Developer feedback has indicated 
> that adding new rules to the optimizer is quite burdensome. In addition, the 
> logical plan has been an area of numerous bugs, many of which have been 
> difficult to fix. Developers also feel that the logical plan is difficult to 
> understand and maintain. The root cause for these issues is that a number of 
> design decisions that were made as part of the 0.2 rewrite of the front end 
> have now proven to be sub-optimal. The heart of this proposal is to revisit a 
> number of those proposals and rebuild the logical plan with a simpler design 
> that will make it much easier to maintain the logical plan as well as extend 
> the logical optimizer. 
> See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
> details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-08-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904291#action_12904291
 ] 

Daniel Dai commented on PIG-1543:
-

This does not seem to be a logical layer problem, and the new optimizer does not 
address it. It might be related to 
[PIG-747|https://issues.apache.org/jira/browse/PIG-747]; further investigation is needed.

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2.
> The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C =COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
> COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect the stdout (2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns correctly in limit_empty.output/d/*.
> The difference is that one has been applied with "LIMIT" before using 
> IsEmpty().
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> Besides, there is an error saying:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput

2010-08-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1579:


Description: 
Error message:
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing 
function: Traceback (most recent call last):
  File "", line 5, in multStr
TypeError: can't multiply sequence by non-int of type 'NoneType'

at 
org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
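The traceback points at a Jython UDF multiplying a string by a null argument: Pig maps null fields to Python None, and 's * n' then raises the TypeError above. A minimal null-safe guard (the UDF body here is a hypothetical reconstruction, not the test's actual multStr):

```python
def mult_str(s, n):
    """Repeat string s n times, passing nulls through instead of raising.

    Without the guard, 's * n' with n = None fails with
    "can't multiply sequence by non-int of type 'NoneType'".
    """
    if s is None or n is None:
        return None  # propagate null, matching Pig's null semantics
    return s * n

print(mult_str("ab", 3))   # ababab
print(mult_str(None, 3))   # None
```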


> Intermittent unit test failure for 
> TestScriptUDF.testPythonScriptUDFNullInputOutput
> ---
>
> Key: PIG-1579
> URL: https://issues.apache.org/jira/browse/PIG-1579
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1579-1.patch
>
>
> Error message:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error 
> executing function: Traceback (most recent call last):
>   File "", line 5, in multStr
> TypeError: can't multiply sequence by non-int of type 'NoneType'
> at 
> org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput

2010-08-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1579:


Attachment: PIG-1579-1.patch

Attaching a fix. However, this fix is shallow and may need an in-depth look. 
Committing the temporary fix and leaving the Jira open.

> Intermittent unit test failure for 
> TestScriptUDF.testPythonScriptUDFNullInputOutput
> ---
>
> Key: PIG-1579
> URL: https://issues.apache.org/jira/browse/PIG-1579
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1579-1.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput

2010-08-30 Thread Daniel Dai (JIRA)
Intermittent unit test failure for 
TestScriptUDF.testPythonScriptUDFNullInputOutput
---

 Key: PIG-1579
 URL: https://issues.apache.org/jira/browse/PIG-1579
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly

2010-08-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1568:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

test-patch result:

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

Patch committed. Thanks Xuefu!

> Optimization rule FilterAboveForeach is too restrictive and doesn't handle 
> project * correctly
> --
>
> Key: PIG-1568
> URL: https://issues.apache.org/jira/browse/PIG-1568
> Project: Pig
>  Issue Type: Bug
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1568-1.patch, jira-1568-1.patch
>
>
> The FilterAboveForeach rule optimizes the plan by pushing a filter up above 
> the preceding foreach operator. However, during code review, two major problems 
> were found:
> 1. The current implementation assumes that if no projection is found in the 
> filter condition, then all columns from the foreach are projected. This issue 
> prevents the following optimization:
>   A = LOAD 'file.txt' AS (a(u,v), b, c);
>   B = FOREACH A GENERATE $0, b;
>   C = FILTER B BY 8 > 5;
>   STORE C INTO 'empty';
> 2. The current implementation doesn't handle the * projection, which means 
> project all columns. As a result, it wasn't able to optimize the following:
>   A = LOAD 'file.txt' AS (a(u,v), b, c);
>   B = FOREACH A GENERATE $0, b;
>   C = FILTER B BY Identity.class.getName(*) > 5;
>   STORE C INTO 'empty';
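The safety condition behind the two cases above can be illustrated outside Pig: pushing a filter above a projection preserves the result whenever the predicate reads only columns the FOREACH passes through unchanged, and a constant predicate such as 8 > 5 reads none at all. A minimal sketch over assumed toy data (names and values are hypothetical):

```python
# Toy relation standing in for A = LOAD 'file.txt' AS (a:(u,v), b, c);
rows = [{"a": (1, 2), "b": 3, "c": 9},
        {"a": (4, 5), "b": 8, "c": 0}]

def foreach(rel):
    # B = FOREACH A GENERATE $0, b;  -- keeps $0 (as "x0") and b
    return [{"x0": r["a"], "b": r["b"]} for r in rel]

def filt(rel, pred):
    # C = FILTER B BY <pred>;
    return [r for r in rel if pred(r)]

pred = lambda r: r["b"] > 5     # reads only 'b', which the FOREACH keeps
pred_const = lambda r: 8 > 5    # reads no columns at all

# Filter-after-foreach equals foreach-after-filter for both predicates;
# this commutativity is exactly what licenses pushing the filter up.
assert filt(foreach(rows), pred) == foreach(filt(rows, pred))
assert filt(foreach(rows), pred_const) == foreach(filt(rows, pred_const))
print(filt(foreach(rows), pred))  # [{'x0': (4, 5), 'b': 8}]
```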

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1574) Optimization rule PushUpFilter causes filter to be pushed up out joins

2010-08-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1574:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

test-patch result:
jira-1574-1.patch

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

This patch does not push the filter before the join if the join is an outer join. 
Actually, we can push the filter to the outer side of the join. I assume this will 
be addressed in PIG-1575.

Patch jira-1574-1.patch committed. Thanks Xuefu!

> Optimization rule PushUpFilter causes filter to be pushed up out joins
> --
>
> Key: PIG-1574
> URL: https://issues.apache.org/jira/browse/PIG-1574
> Project: Pig
>  Issue Type: Bug
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1574-1.patch
>
>
> The PushUpFilter optimization rule in the new logical plan moves the filter 
> up into one of the join branches. It does this aggressively by finding an operator 
> that has all the projection UIDs. However, it didn't consider that the found 
> operator might be another join. If that join is an outer join, then we cannot 
> simply move the filter to one of its branches.
> As an example, the following script will be erroneously optimized:
> A = load 'myfile' as (d1:int);
> B = load 'anotherfile' as (d2:int);
> C = join A by d1 full outer, B by d2;
> D = load 'xxx' as (d3:int);
> E = join C by d1, D by d3;
> F = filter E by d1 > 5;
> G = store F into 'dummy';
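The hazard described above can be reproduced in plain Python with a simplified two-relation version of the script (toy data and the naive join below are assumptions for illustration): pushing the filter into one input of a full outer join is not equivalent, because the join's null-padded rows, which the original filter would eliminate, survive the pushed-up plan.

```python
def full_outer_join(left, right):
    """Naive full outer join of two flat int relations on equality."""
    out, matched = [], set()
    for a in left:
        hits = [i for i, b in enumerate(right) if a == b]
        if hits:
            matched.update(hits)
            out.extend((a, right[i]) for i in hits)
        else:
            out.append((a, None))  # left row with no match: null-padded
    # right rows with no match: null-padded on the left
    out.extend((None, b) for i, b in enumerate(right) if i not in matched)
    return out

A, B = [1, 7], [7, 9]

# Correct plan: join first, then filter d1 > 5 (null d1 fails the predicate).
after = [r for r in full_outer_join(A, B) if r[0] is not None and r[0] > 5]

# Erroneously pushed plan: filter A first, then join.
pushed = full_outer_join([a for a in A if a > 5], B)

print(after)   # [(7, 7)]
print(pushed)  # [(7, 7), (None, 9)] -- extra null-padded row: not equivalent
assert after != pushed
```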

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-365) Map side optimization for Limit (top k case)

2010-08-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903996#action_12903996
 ] 

Daniel Dai commented on PIG-365:


Hi, Gianmarco,
Yes, you are right. This is a quite old Jira and it is no longer applicable. I 
will close it. A more recent limit optimization we are still looking at is 
[PIG-1270|https://issues.apache.org/jira/browse/PIG-1270]. 

> Map side optimization for Limit (top k case)
> 
>
> Key: PIG-365
> URL: https://issues.apache.org/jira/browse/PIG-365
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Minor
>
> On the map side, collect only the top k records to improve performance
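The idea in the description can be sketched with a bounded min-heap per mapper (function names and the "largest k" ordering are assumptions; the Jira does not specify a comparator):

```python
import heapq

def map_side_limit(records, k):
    """Keep only the top-k records inside the mapper instead of emitting
    everything; a size-k min-heap gives O(n log k) per mapper."""
    heap = []
    for r in records:
        if len(heap) < k:
            heapq.heappush(heap, r)
        elif r > heap[0]:
            heapq.heapreplace(heap, r)  # evict the smallest kept record
    return sorted(heap, reverse=True)

def reduce_side_limit(mapper_outputs, k):
    # The reducer merges the per-mapper top-k lists and applies the final LIMIT k.
    return sorted((r for out in mapper_outputs for r in out), reverse=True)[:k]

m1 = map_side_limit([5, 1, 9, 3], 2)   # [9, 5]
m2 = map_side_limit([8, 2, 7], 2)      # [8, 7]
print(reduce_side_limit([m1, m2], 2))  # [9, 8]
```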

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


