[jira] Commented: (PIG-1661) Add alternative search-provider to Pig site
[ https://issues.apache.org/jira/browse/PIG-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917133#action_12917133 ] Daniel Dai commented on PIG-1661: - The site looks good. I would vote yes. > Add alternative search-provider to Pig site > --- > > Key: PIG-1661 > URL: https://issues.apache.org/jira/browse/PIG-1661 > Project: Pig > Issue Type: Improvement > Components: documentation >Reporter: Alex Baranau >Priority: Minor > Attachments: PIG-1661.patch > > > Use search-hadoop.com service to make available search in Pig sources, MLs, > wiki, etc. > This was initially proposed on user mailing list. The search service was > already added in site's skin (common for all Hadoop related projects) via > AVRO-626 so this issue is about enabling it for Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1542) log level not propogated to MR task loggers
[ https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917079#action_12917079 ] Daniel Dai commented on PIG-1542: - Yes, -d xxx should be treated as -Ddebug=xxx. And system properties already have higher priority in the current code. (And in my mind, we should deprecate -d in favor of -Ddebug) > log level not propogated to MR task loggers > --- > > Key: PIG-1542 > URL: https://issues.apache.org/jira/browse/PIG-1542 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: PIG-1542.patch, PIG-1542_1.patch, PIG-1542_2.patch > > > Specifying "-d DEBUG" does not affect the logging of the MR tasks . > This was fixed earlier in PIG-882 . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
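As a rough illustration of the precedence described in the comment above, the sketch below shows a -Ddebug=... system property overriding a value derived from the -d flag. The property name "debug" comes from the comment; the class, method, and default level are hypothetical and are not Pig's actual implementation.
{code}
// Hypothetical sketch only: a system property set via -Ddebug=... wins over
// the value passed with -d, mirroring the precedence described in the comment.
public class LogLevelPrecedence {

    static String resolveLogLevel(String dashDValue) {
        String fromProperty = System.getProperty("debug"); // set via -Ddebug=DEBUG
        if (fromProperty != null) {
            return fromProperty;                            // system property has higher priority
        }
        return dashDValue != null ? dashDValue : "INFO";    // fall back to -d, then a default
    }

    public static void main(String[] args) {
        // e.g. java -Ddebug=DEBUG LogLevelPrecedence WARN   -> prints DEBUG
        System.out.println(resolveLogLevel(args.length > 0 ? args[0] : null));
    }
}
{code}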
[jira] Updated: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY
[ https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1659: Attachment: PIG-1659-1.patch > sortinfo is not set for store if there is a filter after ORDER BY > - > > Key: PIG-1659 > URL: https://issues.apache.org/jira/browse/PIG-1659 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1659-1.patch > > > This has caused 6 (of 7) failures in the Zebra test > TestOrderPreserveVariableTable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY
[ https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916998#action_12916998 ] Daniel Dai commented on PIG-1659: - We should set sortInfo after optimization. So we should add SetSortInfo after the optimization of the new logical plan. This code is missing. > sortinfo is not set for store if there is a filter after ORDER BY > - > > Key: PIG-1659 > URL: https://issues.apache.org/jira/browse/PIG-1659 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Daniel Dai > Fix For: 0.8.0 > > > This has caused 6 (of 7) failures in the Zebra test > TestOrderPreserveVariableTable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1638) sh output gets mixed up with the grunt prompt
[ https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1638: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. > sh output gets mixed up with the grunt prompt > - > > Key: PIG-1638 > URL: https://issues.apache.org/jira/browse/PIG-1638 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.8.0 >Reporter: niraj rai >Assignee: niraj rai >Priority: Minor > Fix For: 0.8.0 > > Attachments: PIG-1638_0.patch > > > Many times, the grunt prompt gets mixed up with the sh output.e.g. > grunt> sh ls > 000 > autocomplete > bin > build > build.xml > grunt> CHANGES.txt > conf > contrib > In the above case, grunt> is mixed up with the output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
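For context on the interleaving, the sketch below shows one way a sh-style command can avoid racing with the prompt: wait for the child process and drain its output before control returns to whatever prints "grunt> ". This is a generic illustration, not the patch that was committed for this issue.
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ShCommand {

    // Run the command, print all of its output, and only then return,
    // so the caller's prompt cannot appear in the middle of the output.
    public static void run(String... cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true);                   // merge stderr into stdout
        Process p = pb.start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);
            }
        }
        p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        run("ls");
        System.out.print("grunt> ");                    // prompt only after the output
    }
}
{code}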
[jira] Commented: (PIG-1638) sh output gets mixed up with the grunt prompt
[ https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916725#action_12916725 ] Daniel Dai commented on PIG-1638: - +1 > sh output gets mixed up with the grunt prompt > - > > Key: PIG-1638 > URL: https://issues.apache.org/jira/browse/PIG-1638 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.8.0 >Reporter: niraj rai >Assignee: niraj rai >Priority: Minor > Fix For: 0.8.0 > > Attachments: PIG-1638_0.patch > > > Many times, the grunt prompt gets mixed up with the sh output.e.g. > grunt> sh ls > 000 > autocomplete > bin > build > build.xml > grunt> CHANGES.txt > conf > contrib > In the above case, grunt> is mixed up with the output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1637. - Hadoop Flags: [Reviewed] Resolution: Fixed All tests pass except for TestSortedTableUnion / TestSortedTableUnionMergeJoin for zebra, which already fail and will be addressed by [PIG-1649|https://issues.apache.org/jira/browse/PIG-1649]. Patch committed to both trunk and 0.8 branch. > Combiner not use because optimizor inserts a foreach between group and > algebric function > > > Key: PIG-1637 > URL: https://issues.apache.org/jira/browse/PIG-1637 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1637-1.patch, PIG-1637-2.patch > > > The following script does not use combiner after new optimization change. > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > This is because after group, optimizer detect group key is not used > afterward, it add a foreach statement after C. This is how it looks like > after optimization: > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > C1 = foreach C generate B; > D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > That cancel the combiner optimization for D. > The way to solve the issue is to merge the C1 we inserted and D. Currently, > we do not merge these two foreach. The reason is that one output of the first > foreach (B) is referred twice in D, and currently rule assume after merge, we > need to calculate B twice in D. Actually, C1 is only doing projection, no > calculation of B. Merging C1 and D will not result calculating B twice. So C1 > and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1651) PIG class loading mishandled
[ https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915959#action_12915959 ] Daniel Dai commented on PIG-1651: - +1 > PIG class loading mishandled > > > Key: PIG-1651 > URL: https://issues.apache.org/jira/browse/PIG-1651 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1651.patch > > > If just having zebra.jar as being registered in a PIG script but not in the > CLASSPATH, the query using zebra fails since there appear to be multiple > classes loaded into JVM, causing static variable set previously not seen > after one instance of the class is created through reflection. (After the > zebra.jar is specified in CLASSPATH, it works fine.) The exception stack is > as follows: > ackend error message during job submission > --- > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to > create input splits for: hdfs://hostname/pathto/zebra_dir :: null > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284) > at > org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907) > at > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801) > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752) > at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) > at > org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) > at > org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) > at java.lang.Thread.run(Thread.java:619) > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123) > at > org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413) > at > org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718) > at > org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084) > at > org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919) > at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866) > at > org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) > at > org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863) > at > org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017) > at > org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269) > ... 7 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
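The sketch below reproduces the failure mode described in this issue in isolation: when the same class is loaded by two different classloaders (for example once from the registered zebra.jar and once from the CLASSPATH), each copy has its own static fields, so a value set on one copy is not visible to an instance obtained through reflection from the other. The jar path and the class/field names are placeholders, not values taken from the issue.
{code}
import java.net.URL;
import java.net.URLClassLoader;

public class TwoLoadersDemo {
    public static void main(String[] args) throws Exception {
        URL jar = new URL("file:///path/to/some.jar");           // placeholder jar
        String className = "example.ClassWithStaticConfig";      // placeholder class with a public static field "config"

        // Parent is null, so each loader resolves the class on its own.
        ClassLoader a = new URLClassLoader(new URL[] { jar }, null);
        ClassLoader b = new URLClassLoader(new URL[] { jar }, null);

        Class<?> fromA = Class.forName(className, true, a);
        Class<?> fromB = Class.forName(className, true, b);

        System.out.println(fromA == fromB);                      // false: two distinct Class objects
        fromA.getField("config").set(null, "some value");        // set static state on copy A
        System.out.println(fromB.getField("config").get(null));  // null: copy B never saw it
    }
}
{code}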
[jira] Commented: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915950#action_12915950 ] Daniel Dai commented on PIG-1637: - Yes, it could be improved as per Xuefu's suggestion. Anyway, the current patch solves the "combiner not used" issue, so I will commit this part first. I will open another Jira to improve it. Also, MergeForEach is a good example for exercising the cloning framework [PIG-1587|https://issues.apache.org/jira/browse/PIG-1587], so it is better to improve it once PIG-1587 is available. > Combiner not use because optimizor inserts a foreach between group and > algebric function > > > Key: PIG-1637 > URL: https://issues.apache.org/jira/browse/PIG-1637 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1637-1.patch, PIG-1637-2.patch > > > The following script does not use combiner after new optimization change. > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > This is because after group, optimizer detect group key is not used > afterward, it add a foreach statement after C. This is how it looks like > after optimization: > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > C1 = foreach C generate B; > D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > That cancel the combiner optimization for D. > The way to solve the issue is to merge the C1 we inserted and D. Currently, > we do not merge these two foreach. The reason is that one output of the first > foreach (B) is referred twice in D, and currently rule assume after merge, we > need to calculate B twice in D. Actually, C1 is only doing projection, no > calculation of B. Merging C1 and D will not result calculating B twice. So C1 > and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915941#action_12915941 ] Daniel Dai commented on PIG-1579: - Rolled back the change and ran the test many times; all tests pass. It seems some change between r990721 and now (r1002348) fixed this issue. Will roll back the change and close the Jira. > Intermittent unit test failure for > TestScriptUDF.testPythonScriptUDFNullInputOutput > --- > > Key: PIG-1579 > URL: https://issues.apache.org/jira/browse/PIG-1579 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1579-1.patch > > > Error message: > org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error > executing function: Traceback (most recent call last): > File "", line 5, in multStr > TypeError: can't multiply sequence by non-int of type 'NoneType' > at > org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915880#action_12915880 ] Daniel Dai commented on PIG-1637: - test-patch result for PIG-1637-2.patch: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. > Combiner not use because optimizor inserts a foreach between group and > algebric function > > > Key: PIG-1637 > URL: https://issues.apache.org/jira/browse/PIG-1637 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1637-1.patch, PIG-1637-2.patch > > > The following script does not use combiner after new optimization change. > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > This is because after group, optimizer detect group key is not used > afterward, it add a foreach statement after C. This is how it looks like > after optimization: > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > C1 = foreach C generate B; > D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > That cancel the combiner optimization for D. > The way to solve the issue is to merge the C1 we inserted and D. Currently, > we do not merge these two foreach. The reason is that one output of the first > foreach (B) is referred twice in D, and currently rule assume after merge, we > need to calculate B twice in D. Actually, C1 is only doing projection, no > calculation of B. Merging C1 and D will not result calculating B twice. So C1 > and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1653) Scripting UDF fails if the path to script is an absolute path
Scripting UDF fails if the path to script is an absolute path - Key: PIG-1653 URL: https://issues.apache.org/jira/browse/PIG-1653 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 The following script fail: {code} register '/homes/jianyong/pig/aaa/scriptingudf.py' using jython as myfuncs; a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age, gpa:double); b = foreach a generate myfuncs.square(gpa); dump b; {code} If we change the register to use relative path (such as "aaa/scriptingudf.py"), it success. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1637: Attachment: PIG-1637-2.patch A bug caught by Xuefu. Reattach the patch. > Combiner not use because optimizor inserts a foreach between group and > algebric function > > > Key: PIG-1637 > URL: https://issues.apache.org/jira/browse/PIG-1637 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1637-1.patch, PIG-1637-2.patch > > > The following script does not use combiner after new optimization change. > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > This is because after group, optimizer detect group key is not used > afterward, it add a foreach statement after C. This is how it looks like > after optimization: > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > C1 = foreach C generate B; > D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > That cancel the combiner optimization for D. > The way to solve the issue is to merge the C1 we inserted and D. Currently, > we do not merge these two foreach. The reason is that one output of the first > foreach (B) is referred twice in D, and currently rule assume after merge, we > need to calculate B twice in D. Actually, C1 is only doing projection, no > calculation of B. Merging C1 and D will not result calculating B twice. So C1 > and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug
TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug Key: PIG-1652 URL: https://issues.apache.org/jira/browse/PIG-1652 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to the input size estimation. Here is the stack of TestSortedTableUnionMergeJoin: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias records3 at org.apache.pig.PigServer.storeEx(PigServer.java:877) at org.apache.pig.PigServer.store(PigServer.java:815) at org.apache.pig.PigServer.openIterator(PigServer.java:727) at org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution. at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197) at org.apache.pig.PigServer.storeEx(PigServer.java:873) Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 69: org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file: at org.apache.hadoop.fs.Path.initialize(Path.java:140) at org.apache.hadoop.fs.Path.(Path.java:126) at org.apache.hadoop.fs.Path.(Path.java:50) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301) Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 69: org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file: at java.net.URI$Parser.fail(URI.java:2809) at java.net.URI$Parser.checkChars(URI.java:2982) at java.net.URI$Parser.parse(URI.java:3009) at java.net.URI.(URI.java:736) at org.apache.hadoop.fs.Path.initialize(Path.java:137) The reason is we are trying to do globStatus
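Although the message above is clipped, the stack trace already shows the mechanism: the value handed to globStatus is a comma-separated list of locations, and parsing it as a single URI fails because the comma lands inside what the parser takes to be the scheme. The self-contained sketch below reproduces that parse error with the JDK alone and shows the list being split into individual entries instead; the literal string is modeled on the trace but is only illustrative, and splitting is shown as the general idea rather than the actual fix.
{code}
import java.net.URI;
import java.net.URISyntaxException;

public class CommaJoinedPathDemo {
    public static void main(String[] args) {
        String joined = "org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:/tmp/input";
        try {
            new URI(joined);                         // one URI for the whole comma-joined list
        } catch (URISyntaxException e) {
            System.out.println(e.getMessage());      // "Illegal character in scheme name at index ..."
        }
        for (String part : joined.split(",")) {      // treating it as a list avoids the error
            System.out.println("would glob: " + part);
        }
    }
}
{code}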
[jira] Updated: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1637: Attachment: PIG-1637-1.patch > Combiner not use because optimizor inserts a foreach between group and > algebric function > > > Key: PIG-1637 > URL: https://issues.apache.org/jira/browse/PIG-1637 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1637-1.patch > > > The following script does not use combiner after new optimization change. > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > This is because after group, optimizer detect group key is not used > afterward, it add a foreach statement after C. This is how it looks like > after optimization: > {code} > A = load ':INPATH:/pigmix/page_views' using > org.apache.pig.test.udf.storefunc.PigPerformanceLoader() > as (user, action, timespent, query_term, ip_addr, timestamp, > estimated_revenue, page_info, page_links); > B = foreach A generate user, (int)timespent as timespent, > (double)estimated_revenue as estimated_revenue; > C = group B all; > C1 = foreach C generate B; > D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); > store D into ':OUTPATH:'; > {code} > That cancel the combiner optimization for D. > The way to solve the issue is to merge the C1 we inserted and D. Currently, > we do not merge these two foreach. The reason is that one output of the first > foreach (B) is referred twice in D, and currently rule assume after merge, we > need to calculate B twice in D. Actually, C1 is only doing projection, no > calculation of B. Merging C1 and D will not result calculating B twice. So C1 > and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1647) Logical simplifier throws a NPE
[ https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915365#action_12915365 ] Daniel Dai commented on PIG-1647: - +1. Please commit. > Logical simplifier throws a NPE > --- > > Key: PIG-1647 > URL: https://issues.apache.org/jira/browse/PIG-1647 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1647.patch, PIG-1647.patch > > > A query like: > A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray); > B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' > and ((d is not null and d != '') or (e is not null and e != '')); > will cause the logical expression simplifier to throw a NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1644. - Hadoop Flags: [Reviewed] Resolution: Fixed [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass. Patch committed to both trunk and 0.8 branch. > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch, PIG-1644-2.patch, PIG-1644-3.patch, > PIG-1644-4.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-4.patch PIG-1644-4.patch fix findbug warnings and additional unit failures. > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch, PIG-1644-2.patch, PIG-1644-3.patch, > PIG-1644-4.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1643. - Release Note: PIG-1643.4.patch committed to both trunk and 0.8 branch. Resolution: Fixed > join fails for a query with input having 'load using pigstorage without > schema' + 'foreach' > --- > > Key: PIG-1643 > URL: https://issues.apache.org/jira/browse/PIG-1643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch, > PIG-1643.4.patch > > > {code} > l1 = load 'std.txt'; > l2 = load 'std.txt'; > f1 = foreach l1 generate $0 as abc, $1 as def; > -- j = join f1 by $0, l2 by $0 using 'replicated'; > -- j = join l2 by $0, f1 by $0 using 'replicated'; > j = join l2 by $0, f1 by $0 ; > dump j; > {code} > the error - > {code} > 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2044: The type null cannot be collected as a Key type > {code} > The MR plan from explain - > {code} > #-- > # Map Reduce Plan > #-- > MapReduce node scope-21 > Map Plan > Union[tuple] - scope-22 > | > |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 > | | | > | | Project[bytearray][0] - scope-12 > | | > | |---l2: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-0 > | > |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 > | | > | Project[NULL][0] - scope-14 > | > |---f1: New For Each(false,false)[bag] - scope-6 > | | > | Project[bytearray][0] - scope-2 > | | > | Project[bytearray][1] - scope-4 > | > |---l1: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-1 > Reduce Plan > j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 > | > |---POJoinPackage(true,true)[tuple] - scope-23 > Global sort: false > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915037#action_12915037 ] Daniel Dai commented on PIG-1643: - [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass. > join fails for a query with input having 'load using pigstorage without > schema' + 'foreach' > --- > > Key: PIG-1643 > URL: https://issues.apache.org/jira/browse/PIG-1643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch, > PIG-1643.4.patch > > > {code} > l1 = load 'std.txt'; > l2 = load 'std.txt'; > f1 = foreach l1 generate $0 as abc, $1 as def; > -- j = join f1 by $0, l2 by $0 using 'replicated'; > -- j = join l2 by $0, f1 by $0 using 'replicated'; > j = join l2 by $0, f1 by $0 ; > dump j; > {code} > the error - > {code} > 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2044: The type null cannot be collected as a Key type > {code} > The MR plan from explain - > {code} > #-- > # Map Reduce Plan > #-- > MapReduce node scope-21 > Map Plan > Union[tuple] - scope-22 > | > |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 > | | | > | | Project[bytearray][0] - scope-12 > | | > | |---l2: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-0 > | > |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 > | | > | Project[NULL][0] - scope-14 > | > |---f1: New For Each(false,false)[bag] - scope-6 > | | > | Project[bytearray][0] - scope-2 > | | > | Project[bytearray][1] - scope-4 > | > |---l1: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-1 > Reduce Plan > j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 > | > |---POJoinPackage(true,true)[tuple] - scope-23 > Global sort: false > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1643: Attachment: PIG-1643.3.patch PIG-1643.3.patch is more general than PIG-1643.2.patch. It solves this null schema issue for all expressions. > join fails for a query with input having 'load using pigstorage without > schema' + 'foreach' > --- > > Key: PIG-1643 > URL: https://issues.apache.org/jira/browse/PIG-1643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch > > > {code} > l1 = load 'std.txt'; > l2 = load 'std.txt'; > f1 = foreach l1 generate $0 as abc, $1 as def; > -- j = join f1 by $0, l2 by $0 using 'replicated'; > -- j = join l2 by $0, f1 by $0 using 'replicated'; > j = join l2 by $0, f1 by $0 ; > dump j; > {code} > the error - > {code} > 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2044: The type null cannot be collected as a Key type > {code} > The MR plan from explain - > {code} > #-- > # Map Reduce Plan > #-- > MapReduce node scope-21 > Map Plan > Union[tuple] - scope-22 > | > |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 > | | | > | | Project[bytearray][0] - scope-12 > | | > | |---l2: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-0 > | > |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 > | | > | Project[NULL][0] - scope-14 > | > |---f1: New For Each(false,false)[bag] - scope-6 > | | > | Project[bytearray][0] - scope-2 > | | > | Project[bytearray][1] - scope-4 > | > |---l1: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-1 > Reduce Plan > j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 > | > |---POJoinPackage(true,true)[tuple] - scope-23 > Global sort: false > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1639: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. > New logical plan: PushUpFilter should not push before group/cogroup if filter > condition contains UDF > > > Key: PIG-1639 > URL: https://issues.apache.org/jira/browse/PIG-1639 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1639-1.patch > > > The following script fail: > {code} > a = load 'file' AS (f1, f2, f3); > b = group a by f1; > c = filter b by COUNT(a) > 1; > dump c; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
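To make the rule concrete: a filter condition such as COUNT(a) > 1 applies a UDF to the grouped bag, which does not exist before the group, so the optimizer must leave the filter where it is. The sketch below expresses that guard with a stand-in expression interface; it is illustrative only and does not use Pig's real PushUpFilter rule or expression classes.
{code}
import java.util.List;

public class PushUpFilterGuard {

    // Stand-in for an expression node; not Pig's actual expression operators.
    interface Expr {
        boolean isUserFunc();       // true for UDF calls such as COUNT(...)
        List<Expr> children();
    }

    // The filter may only be pushed above a group/cogroup if its condition
    // contains no UDF anywhere in the expression tree.
    static boolean safeToPushAboveGroup(Expr condition) {
        if (condition.isUserFunc()) {
            return false;
        }
        for (Expr child : condition.children()) {
            if (!safeToPushAboveGroup(child)) {
                return false;
            }
        }
        return true;
    }
}
{code}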
[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1643: Attachment: PIG-1643.2.patch Attach a fix. > join fails for a query with input having 'load using pigstorage without > schema' + 'foreach' > --- > > Key: PIG-1643 > URL: https://issues.apache.org/jira/browse/PIG-1643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1643.1.patch, PIG-1643.2.patch > > > {code} > l1 = load 'std.txt'; > l2 = load 'std.txt'; > f1 = foreach l1 generate $0 as abc, $1 as def; > -- j = join f1 by $0, l2 by $0 using 'replicated'; > -- j = join l2 by $0, f1 by $0 using 'replicated'; > j = join l2 by $0, f1 by $0 ; > dump j; > {code} > the error - > {code} > 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2044: The type null cannot be collected as a Key type > {code} > The MR plan from explain - > {code} > #-- > # Map Reduce Plan > #-- > MapReduce node scope-21 > Map Plan > Union[tuple] - scope-22 > | > |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 > | | | > | | Project[bytearray][0] - scope-12 > | | > | |---l2: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-0 > | > |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 > | | > | Project[NULL][0] - scope-14 > | > |---f1: New For Each(false,false)[bag] - scope-6 > | | > | Project[bytearray][0] - scope-2 > | | > | Project[bytearray][1] - scope-4 > | > |---l1: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-1 > Reduce Plan > j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 > | > |---POJoinPackage(true,true)[tuple] - scope-23 > Global sort: false > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai reopened PIG-1643: - The following script does not produce the right result after patch: {code} a = load '/grid/2/dev/pigqa/in/singlefile/studenttab10k'; b = foreach a generate *; store b into '/grid/2/dev/pigqa/out/log/hadoopqa.1285338379/Foreach_2.out'; {code} > join fails for a query with input having 'load using pigstorage without > schema' + 'foreach' > --- > > Key: PIG-1643 > URL: https://issues.apache.org/jira/browse/PIG-1643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1643.1.patch, PIG-1643.2.patch > > > {code} > l1 = load 'std.txt'; > l2 = load 'std.txt'; > f1 = foreach l1 generate $0 as abc, $1 as def; > -- j = join f1 by $0, l2 by $0 using 'replicated'; > -- j = join l2 by $0, f1 by $0 using 'replicated'; > j = join l2 by $0, f1 by $0 ; > dump j; > {code} > the error - > {code} > 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2044: The type null cannot be collected as a Key type > {code} > The MR plan from explain - > {code} > #-- > # Map Reduce Plan > #-- > MapReduce node scope-21 > Map Plan > Union[tuple] - scope-22 > | > |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 > | | | > | | Project[bytearray][0] - scope-12 > | | > | |---l2: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-0 > | > |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 > | | > | Project[NULL][0] - scope-14 > | > |---f1: New For Each(false,false)[bag] - scope-6 > | | > | Project[bytearray][0] - scope-2 > | | > | Project[bytearray][1] - scope-4 > | > |---l1: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-1 > Reduce Plan > j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 > | > |---POJoinPackage(true,true)[tuple] - scope-23 > Global sort: false > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914675#action_12914675 ] Daniel Dai commented on PIG-1635: - +1 for commit. > Logical simplifier does not simplify away constants under AND and OR; after > simplificaion the ordering of operands of AND and OR may get changed > > > Key: PIG-1635 > URL: https://issues.apache.org/jira/browse/PIG-1635 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > Fix For: 0.8.0 > > Attachments: PIG-1635.patch > > > b = FILTER a by (( f1 > 1) AND (1 == 1)) > or > b = FILTER a by ((f1 > 1) OR ( 1==0)) > should be simplified to > b = FILTER a by f1 > 1; > Regarding ordering change, an example is that > b = filter a by ((f1 is not null) AND (f2 is not null)); > Even without possible simplification, the expression is changed to > b = filter a by ((f2 is not null) AND (f1 is not null)); > Even though the ordering change in this case, and probably in most other > cases, does not create any difference, but for two reasons some users might > care about the ordering: if stateful UDFs are used as operands of AND or OR; > and if the ordering is intended by the application designer to maximize the > chances to shortcut the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-3.patch Found one bug introduced by the refactoring. Attaching PIG-1644-3.patch with the fix, and running the tests again. > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch, PIG-1644-2.patch, PIG-1644-3.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914662#action_12914662 ] Daniel Dai commented on PIG-1635: - +1, patch looks good. Also, can you review all connect/disconnect usage in ExpressionSimplifier, according to [PIG-1644|https://issues.apache.org/jira/browse/PIG-1644]? I see lots of misuse in other rules. > Logical simplifier does not simplify away constants under AND and OR; after > simplificaion the ordering of operands of AND and OR may get changed > > > Key: PIG-1635 > URL: https://issues.apache.org/jira/browse/PIG-1635 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Yan Zhou >Assignee: Yan Zhou >Priority: Minor > Fix For: 0.8.0 > > Attachments: PIG-1635.patch > > > b = FILTER a by (( f1 > 1) AND (1 == 1)) > or > b = FILTER a by ((f1 > 1) OR ( 1==0)) > should be simplified to > b = FILTER a by f1 > 1; > Regarding ordering change, an example is that > b = filter a by ((f1 is not null) AND (f2 is not null)); > Even without possible simplification, the expression is changed to > b = filter a by ((f2 is not null) AND (f1 is not null)); > Even though the ordering change in this case, and probably in most other > cases, does not create any difference, but for two reasons some users might > care about the ordering: if stateful UDFs are used as operands of AND or OR; > and if the ordering is intended by the application designer to maximize the > chances to shortcut the composite boolean evaluation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-2.patch Attaching the patch with the new methods and a refactoring of the existing code. > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch, PIG-1644-2.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914317#action_12914317 ] Daniel Dai commented on PIG-1644: - After looking into the existing code, seems insertBetween is a more useful method. So I want to drop insertBefore/insertAfter, and add insertBetween {code} insertBetween(Operator pred, Operator operatorToInsert, Operator succ) {code} > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
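Based on the disconnect/connect pattern quoted in this issue's description, the proposed insertBetween helper could look roughly like the sketch below. The OperatorPlan, Operator and Pair types and the connect/disconnect signatures come from the description itself; the plan.add call, the declared exception, and the method body as a whole are assumptions, not the committed implementation.
{code}
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.util.Pair;
import org.apache.pig.newplan.Operator;
import org.apache.pig.newplan.OperatorPlan;

public class PlanEdits {

    // Insert operatorToInsert on the existing edge pred -> succ, reusing the
    // edge positions returned by disconnect so that multi-input/multi-output
    // operators keep their operands in the original order.
    public static void insertBetween(OperatorPlan plan, Operator pred,
                                     Operator operatorToInsert, Operator succ)
            throws FrontendException {
        Pair<Integer, Integer> pos = plan.disconnect(pred, succ);
        plan.add(operatorToInsert);
        plan.connect(pred, pos.first, operatorToInsert, 0);
        plan.connect(operatorToInsert, 0, succ, pos.second);
    }
}
{code}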
[jira] Commented: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914154#action_12914154 ] Daniel Dai commented on PIG-1639: - +1 if all tests pass. > New logical plan: PushUpFilter should not push before group/cogroup if filter > condition contains UDF > > > Key: PIG-1639 > URL: https://issues.apache.org/jira/browse/PIG-1639 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1639-1.patch > > > The following script fail: > {code} > a = load 'file' AS (f1, f2, f3); > b = group a by f1; > c = filter b by COUNT(a) > 1; > dump c; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1639: Summary: New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF (was: New logical plan: PushUpFilter should not optimize if filter condition contains UDF) > New logical plan: PushUpFilter should not push before group/cogroup if filter > condition contains UDF > > > Key: PIG-1639 > URL: https://issues.apache.org/jira/browse/PIG-1639 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1639-1.patch > > > The following script fail: > {code} > a = load 'file' AS (f1, f2, f3); > b = group a by f1; > c = filter b by COUNT(a) > 1; > dump c; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
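As an aside, the guard this fix needs boils down to "does the filter's condition plan contain any UDF?". The sketch below shows one way such a check could look; the class and method names (LogicalExpressionPlan, UserFuncExpression, getOperators) follow the 0.8 new-logical-plan packages as I understand them, but the actual jira-1639-1.patch may implement the check differently.
{code}
// Sketch only: an assumed shape for the PushUpFilter guard, not the committed fix.
import java.util.Iterator;

import org.apache.pig.newplan.Operator;
import org.apache.pig.newplan.logical.expression.LogicalExpressionPlan;
import org.apache.pig.newplan.logical.expression.UserFuncExpression;

final class PushUpFilterGuard {
    // Returns true if the filter condition references a UDF such as COUNT.
    // In that case the filter must not be pushed above a group/cogroup,
    // because the UDF may depend on the bag produced by the grouping.
    static boolean conditionContainsUdf(LogicalExpressionPlan condition) {
        Iterator<Operator> ops = condition.getOperators();
        while (ops.hasNext()) {
            if (ops.next() instanceof UserFuncExpression) {
                return true;
            }
        }
        return false;
    }
}
{code}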
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914147#action_12914147 ] Daniel Dai commented on PIG-1644: - Yes, I think we can do replace/remove/insert. They should be simple and clear enough to use. Here are the new methods to add to OperatorPlan: {code} replace(Operator oldOperator, Operator newOperator) remove(Operator operatorToRemove) // Connect all its successors to the predecessor / connect all its predecessors to the successor insertBefore(Operator operatorToInsert, Operator pos) // Insert operatorToInsert before pos, connect all pos's predecessors to operatorToInsert insertAfter(Operator operatorToInsert, Operator pos) // Insert operatorToInsert after pos, connect operatorToInsert to all pos's successors {code} How does that sound? > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
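Continuing the sketch given after the insertBetween comment above (same toy Plan and Pair types), the replace and remove variants would carry the saved positions in the same way. The real methods would presumably look up the predecessor and successor from the plan rather than take them as arguments, and would have to handle operators with multiple predecessors or successors; this only shows the single-edge case from the patterns quoted in the issue description.
{code}
// Sketch only: single-predecessor/single-successor case, reusing the toy
// Plan and Pair types from the insertBetween sketch above.
final class MorePlanEdits {
    // replace(oldOp, newOp): keep the positions of both the incoming and the
    // outgoing edge, as in the "Replace a node" pattern.
    static <Op> void replace(Plan<Op> plan, Op pred, Op oldOp, Op newOp, Op succ) {
        Pair<Integer, Integer> pos1 = plan.disconnect(pred, oldOp);
        Pair<Integer, Integer> pos2 = plan.disconnect(oldOp, succ);
        plan.connect(pred, pos1.first, newOp, pos1.second);
        plan.connect(newOp, pos2.first, succ, pos2.second);
    }

    // remove(opToRemove): splice the operator out and reconnect its predecessor
    // to its successor at the positions the removed edges occupied.
    static <Op> void remove(Plan<Op> plan, Op pred, Op opToRemove, Op succ) {
        Pair<Integer, Integer> pos1 = plan.disconnect(pred, opToRemove);
        Pair<Integer, Integer> pos2 = plan.disconnect(opToRemove, succ);
        plan.connect(pred, pos1.first, succ, pos2.second);
    }
}
{code}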
[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914126#action_12914126 ] Daniel Dai commented on PIG-1643: - +1 if tests pass. > join fails for a query with input having 'load using pigstorage without > schema' + 'foreach' > --- > > Key: PIG-1643 > URL: https://issues.apache.org/jira/browse/PIG-1643 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1643.1.patch > > > {code} > l1 = load 'std.txt'; > l2 = load 'std.txt'; > f1 = foreach l1 generate $0 as abc, $1 as def; > -- j = join f1 by $0, l2 by $0 using 'replicated'; > -- j = join l2 by $0, f1 by $0 using 'replicated'; > j = join l2 by $0, f1 by $0 ; > dump j; > {code} > the error - > {code} > 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2044: The type null cannot be collected as a Key type > {code} > The MR plan from explain - > {code} > #-- > # Map Reduce Plan > #-- > MapReduce node scope-21 > Map Plan > Union[tuple] - scope-22 > | > |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 > | | | > | | Project[bytearray][0] - scope-12 > | | > | |---l2: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-0 > | > |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 > | | > | Project[NULL][0] - scope-14 > | > |---f1: New For Each(false,false)[bag] - scope-6 > | | > | Project[bytearray][0] - scope-2 > | | > | Project[bytearray][1] - scope-4 > | > |---l1: > Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) > - scope-1 > Reduce Plan > j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 > | > |---POJoinPackage(true,true)[tuple] - scope-23 > Global sort: false > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-1.patch Attaching a patch to address all such places in the new logical plan, except for ExpressionSimplifier. There is some work underway for ExpressionSimplifier ([PIG-1635|https://issues.apache.org/jira/browse/PIG-1635]) that includes some of these changes, and I don't want to conflict with that patch. So after PIG-1635, we may also review the connect/disconnect usage of ExpressionSimplifier. > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-1.patch > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: (was: PIG-1644-1.patch) > New logical plan: Plan.connect with position is misused in some places > -- > > Key: PIG-1644 > URL: https://issues.apache.org/jira/browse/PIG-1644 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1644-1.patch > > > When we replace/remove/insert a node, we will use disconnect/connect methods > of OperatorPlan. When we disconnect an edge, we shall save the position of > the edge in origination and destination, and use this position when connect > to the new predecessor/successor. Some of the pattens are: > Insert a new node: > {code} > Pair pos = plan.disconnect(pred, succ); > plan.connect(pred, pos.first, newnode, 0); > plan.connect(newnode, 0, succ, pos.second); > {code} > Remove a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToRemove); > Pair pos2 = plan.disconnect(nodeToRemove, succ); > plan.connect(pred, pos1.first, succ, pos2.second); > {code} > Replace a node: > {code} > Pair pos1 = plan.disconnect(pred, nodeToReplace); > Pair pos2 = plan.disconnect(nodeToReplace, succ); > plan.connect(pred, pos1.first, newNode, pos1.second); > plan.connect(newNode, pos2.first, succ, pos2.second); > {code} > There are couple of places of we does not follow this pattern, that results > some error. For example, the following script fail: > {code} > a = load '1.txt' as (a0, a1, a2, a3); > b = foreach a generate a0, a1, a2; > store b into 'aaa'; > c = order b by a2; > d = foreach c generate a2; > store d into 'bbb'; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 When we replace/remove/insert a node, we will use disconnect/connect methods of OperatorPlan. When we disconnect an edge, we shall save the position of the edge in origination and destination, and use this position when connect to the new predecessor/successor. Some of the pattens are: Insert a new node: {code} Pair pos = plan.disconnect(pred, succ); plan.connect(pred, pos.first, newnode, 0); plan.connect(newnode, 0, succ, pos.second); {code} Remove a node: {code} Pair pos1 = plan.disconnect(pred, nodeToRemove); Pair pos2 = plan.disconnect(nodeToRemove, succ); plan.connect(pred, pos1.first, succ, pos2.second); {code} Replace a node: {code} Pair pos1 = plan.disconnect(pred, nodeToReplace); Pair pos2 = plan.disconnect(nodeToReplace, succ); plan.connect(pred, pos1.first, newNode, pos1.second); plan.connect(newNode, pos2.first, succ, pos2.second); {code} There are couple of places of we does not follow this pattern, that results some error. For example, the following script fail: {code} a = load '1.txt' as (a0, a1, a2, a3); b = foreach a generate a0, a1, a2; store b into 'aaa'; c = order b by a2; d = foreach c generate a2; store d into 'bbb'; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1636) Scalar fail if the scalar variable is generated by limit
[ https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1636. - Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. > Scalar fail if the scalar variable is generated by limit > > > Key: PIG-1636 > URL: https://issues.apache.org/jira/browse/PIG-1636 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1636-1.patch > > > The following script fail: > {code} > a = load 'studenttab10k' as (name: chararray, age: int, gpa: float); > b = group a all; > c = foreach b generate SUM(a.age) as total; > c1= limit c 1; > d = foreach a generate name, age/(double)c1.total as d_sum; > store d into '111'; > {code} > The problem is we have a reference to c1 in d. In the optimizer, we push > limit before foreach, d still reference to limit, and we get the wrong schema > for the scalar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1636) Scalar fail if the scalar variable is generated by limit
[ https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913714#action_12913714 ] Daniel Dai commented on PIG-1636: - test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass. > Scalar fail if the scalar variable is generated by limit > > > Key: PIG-1636 > URL: https://issues.apache.org/jira/browse/PIG-1636 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1636-1.patch > > > The following script fail: > {code} > a = load 'studenttab10k' as (name: chararray, age: int, gpa: float); > b = group a all; > c = foreach b generate SUM(a.age) as total; > c1= limit c 1; > d = foreach a generate name, age/(double)c1.total as d_sum; > store d into '111'; > {code} > The problem is we have a reference to c1 in d. In the optimizer, we push > limit before foreach, d still reference to limit, and we get the wrong schema > for the scalar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1605. - Hadoop Flags: [Reviewed] Resolution: Fixed Release audit warning is due to jdiff. No new file added. Patch committed to both trunk and 0.8 branch. > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1605-1.patch, PIG-1605-2.patch > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft link, we can use scalar come from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1/B.count; ) > Currently, there are two cases we can use soft link: > 1. scalar dependency, where ReadScalar UDF will use a file generate by a > LOStore > 2. store-load dependency, where we will load a file which is generated by a > store in the same script. This happens in multi-store case. Currently we > solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Attachment: PIG-1605-2.patch PIG-1605-2.patch fix findbug warnings. test-patch result: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 455 release audit warnings (more than the trunk's current 453 warning s). > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1605-1.patch, PIG-1605-2.patch > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft link, we can use scalar come from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1/B.count; ) > Currently, there are two cases we can use soft link: > 1. scalar dependency, where ReadScalar UDF will use a file generate by a > LOStore > 2. store-load dependency, where we will load a file which is generated by a > store in the same script. This happens in multi-store case. Currently we > solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1598) Pig gobbles up error messages - Part 2
[ https://issues.apache.org/jira/browse/PIG-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1598: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch looks good. Committed to both trunk and 0.8 branch. > Pig gobbles up error messages - Part 2 > -- > > Key: PIG-1598 > URL: https://issues.apache.org/jira/browse/PIG-1598 > Project: Pig > Issue Type: Improvement >Reporter: Ashutosh Chauhan >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: PIG-1598_0.patch > > > Another case of PIG-1531 . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1639: Description: The following script fail: {code} a = load 'file' AS (f1, f2, f3); b = group a by f1; c = filter b by COUNT(a) > 1; dump c; {code} > New logical plan: PushUpFilter should not optimize if filter condition > contains UDF > --- > > Key: PIG-1639 > URL: https://issues.apache.org/jira/browse/PIG-1639 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > > The following script fail: > {code} > a = load 'file' AS (f1, f2, f3); > b = group a by f1; > c = filter b by COUNT(a) > 1; > dump c; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
New logical plan: PushUpFilter should not optimize if filter condition contains UDF --- Key: PIG-1639 URL: https://issues.apache.org/jira/browse/PIG-1639 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1636) Scalar fail if the scalar variable is generated by limit
[ https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1636: Attachment: PIG-1636-1.patch This patch depends on PIG-1605. > Scalar fail if the scalar variable is generated by limit > > > Key: PIG-1636 > URL: https://issues.apache.org/jira/browse/PIG-1636 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1636-1.patch > > > The following script fail: > {code} > a = load 'studenttab10k' as (name: chararray, age: int, gpa: float); > b = group a all; > c = foreach b generate SUM(a.age) as total; > c1= limit c 1; > d = foreach a generate name, age/(double)c1.total as d_sum; > store d into '111'; > {code} > The problem is we have a reference to c1 in d. In the optimizer, we push > limit before foreach, d still reference to limit, and we get the wrong schema > for the scalar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function
Combiner not used because optimizer inserts a foreach between group and algebraic function Key: PIG-1637 URL: https://issues.apache.org/jira/browse/PIG-1637 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 The following script does not use the combiner after the new optimization change. {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} This is because after the group, the optimizer detects that the group key is not used afterward, so it adds a foreach statement after C. This is how it looks after optimization: {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; C1 = foreach C generate B; D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} That cancels the combiner optimization for D. The way to solve the issue is to merge the inserted C1 with D. Currently, we do not merge these two foreach statements. The reason is that one output of the first foreach (B) is referred to twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 is only doing a projection, with no calculation of B. Merging C1 and D will not result in calculating B twice, so C1 and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
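For what it's worth, the merge condition described here boils down to: a duplicated output of the upstream foreach only matters if producing it involves computation. The toy check below models that decision; it does not use the real LOForEach/LOGenerate classes, and the actual rule change would have to inspect the inner plans to classify each output as a plain projection or not.
{code}
// Sketch only: a toy model of the merge decision, not the real optimizer rule.
import java.util.Map;

final class ForEachMergeCheck {
    // referenceCounts: how many times the downstream foreach references each
    // output of the upstream foreach (e.g. B is referenced twice in D above).
    // plainProjection: whether that output is just a projected column with no
    // computation (true for C1's only output in the example above).
    static boolean safeToMerge(Map<String, Integer> referenceCounts,
                               Map<String, Boolean> plainProjection) {
        for (Map.Entry<String, Integer> e : referenceCounts.entrySet()) {
            boolean usedMoreThanOnce = e.getValue() > 1;
            boolean computed = !plainProjection.getOrDefault(e.getKey(), false);
            // Only a duplicated *computed* output would be evaluated twice after
            // the merge; duplicated plain projections are safe to merge.
            if (usedMoreThanOnce && computed) {
                return false;
            }
        }
        return true;
    }
}
{code}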
[jira] Created: (PIG-1636) Scalar fail if the scalar variable is generated by limit
Scalar fail if the scalar variable is generated by limit Key: PIG-1636 URL: https://issues.apache.org/jira/browse/PIG-1636 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 The following script fail: {code} a = load 'studenttab10k' as (name: chararray, age: int, gpa: float); b = group a all; c = foreach b generate SUM(a.age) as total; c1= limit c 1; d = foreach a generate name, age/(double)c1.total as d_sum; store d into '111'; {code} The problem is we have a reference to c1 in d. In the optimizer, we push limit before foreach, d still reference to limit, and we get the wrong schema for the scalar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Attachment: PIG-1605-1.patch > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1605-1.patch > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft link, we can use scalar come from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1/B.count; ) > Currently, there are two cases we can use soft link: > 1. scalar dependency, where ReadScalar UDF will use a file generate by a > LOStore > 2. store-load dependency, where we will load a file which is generated by a > store in the same script. This happens in multi-store case. Currently we > solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1614) javacc.jar pulled twice from maven repository
javacc.jar pulled twice from maven repository - Key: PIG-1614 URL: https://issues.apache.org/jira/browse/PIG-1614 Project: Pig Issue Type: Bug Components: build Reporter: Daniel Dai Priority: Trivial ant pull javacc.jar twice from maven. One is javacc.jar, and the other is javacc-4.2.jar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1608: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk. Thanks Niraj! > pig should always include pig-default.properties and pig.properties in the > pig.jar > -- > > Key: PIG-1608 > URL: https://issues.apache.org/jira/browse/PIG-1608 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: niraj rai >Assignee: niraj rai > Fix For: 0.9.0 > > Attachments: PIG-1608_0.patch, PIG-1608_1.patch > > > pig should always include pig-default.properties and pig.properties as a part > of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1608: Fix Version/s: 0.9.0 Affects Version/s: 0.8.0 > pig should always include pig-default.properties and pig.properties in the > pig.jar > -- > > Key: PIG-1608 > URL: https://issues.apache.org/jira/browse/PIG-1608 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: niraj rai >Assignee: niraj rai > Fix For: 0.9.0 > > Attachments: PIG-1608_0.patch, PIG-1608_1.patch > > > pig should always include pig-default.properties and pig.properties as a part > of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909821#action_12909821 ] Daniel Dai commented on PIG-1608: - Two comments: 1. The "buildJar-withouthadoop" target should also include this change. 2. Formatting comment: use spaces instead of tabs. The "jar" and "package" targets look good. > pig should always include pig-default.properties and pig.properties in the > pig.jar > -- > > Key: PIG-1608 > URL: https://issues.apache.org/jira/browse/PIG-1608 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > Attachments: PIG-1608_0.patch > > > pig should always include pig-default.properties and pig.properties as a part > of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909007#action_12909007 ] Daniel Dai commented on PIG-1605: - The changes are reasonably small. Here is a summary: 1. Add the following methods to the plan (both old and new): {code} public void createSoftLink(E from, E to) public List getSoftLinkPredecessors(E op) public List getSoftLinkSuccessors(E op) {code} 2. All walkers need to change. When a walker gets predecessors/successors, it needs to get both soft and regular link predecessors/successors. The changes are straightforward, e.g. from: {code} Collection newSuccessors = mPlan.getSuccessors(suc); {code} to: {code} Collection newSuccessors = mPlan.getSuccessors(suc); newSuccessors.addAll(mPlan.getSoftLinkSuccessors(suc)); {code} 3. Change plan utility functions, such as replace, replaceAndAddSucessors, replaceAndAddPredecessors, etc. In the new logical plan, there is no change since we only have minimal utility functions. In the old logical plan, there should be some changes to make those utility functions aware of soft links, but if we decide not to support the old logical plan going forward, no change is needed; we only need to note that those utility functions do not deal with soft links. 4. Change scalar to use soft links. This includes creating soft links and maintaining them during transforms (migrating to the new plan, translating to the physical plan). 5. Change store-load to use soft links. This is an optional step. Currently we use a regular link; conceptually we should use a soft link. It is OK if we don't do this for now. Also note that in most cases there is no soft link and the plan behaves just like before, so this change should be safe enough. > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft link, we can use scalar come from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1/B.count; ) > Currently, there are two cases we can use soft link: > 1. scalar dependency, where ReadScalar UDF will use a file generate by a > LOStore > 2. store-load dependency, where we will load a file which is generated by a > store in the same script. This happens in multi-store case. 
Currently we > solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
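To make the shape of the change concrete, here is a small sketch of how soft links could be kept in a side structure next to the regular edges, matching the createSoftLink/getSoftLinkPredecessors/getSoftLinkSuccessors methods listed in the summary above. It is an illustration of the idea, not the code in the PIG-1605 patches; the real plan classes also track edge positions and have to handle link removal.
{code}
// Sketch only: soft links stored beside the plan's regular edges, visible to
// walkers but invisible to the rest of the logical plan.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SoftLinks<Op> {
    private final Map<Op, List<Op>> softSuccessors = new HashMap<>();
    private final Map<Op, List<Op>> softPredecessors = new HashMap<>();

    // createSoftLink(from, to): record the file/scalar dependency without
    // adding a regular data-flow edge.
    public void createSoftLink(Op from, Op to) {
        softSuccessors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        softPredecessors.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    public List<Op> getSoftLinkSuccessors(Op op) {
        return softSuccessors.getOrDefault(op, Collections.<Op>emptyList());
    }

    public List<Op> getSoftLinkPredecessors(Op op) {
        return softPredecessors.getOrDefault(op, Collections.<Op>emptyList());
    }
}
{code}
A dependency walker would then union getSuccessors(op) with getSoftLinkSuccessors(op), exactly as in the two-line change quoted in the summary, so the LOStore that produces a scalar is visited before the LOForEach that reads it.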
[jira] Commented: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909008#action_12909008 ] Daniel Dai commented on PIG-1605: - Yes, Thejas is right. The first 3 are the main reasons for the change. > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft link, we can use scalar come from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1/B.count; ) > Currently, there are two cases we can use soft link: > 1. scalar dependency, where ReadScalar UDF will use a file generate by a > LOStore > 2. store-load dependency, where we will load a file which is generated by a > store in the same script. This happens in multi-store case. Currently we > solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Description: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. By doing this, we can make sure we visit LOStore which generate scalar first, and then LOForEach which use the scalar. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. 4. With soft link, we can use scalar come from different sources in the same statement, which in my mind is not a rare use case. (eg: D = foreach C generate c0/A.total, c1/B.count; ) Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. was: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. By doing this, we can make sure we visit LOStore which generate scalar first, and then LOForEach which use the scalar. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. 4. With soft link, we can use scalar come from different sources in the same statement, which in my mind is not a rare use case. (eg: D = foreach C generate c0/A.total, c1/B.count;) Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. 
> Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft link, we can use scalar come from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Description: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. By doing this, we can make sure we visit LOStore which generate scalar first, and then LOForEach which use the scalar. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. 4. With soft link, we can use scalar come from different sources in the same statement, which in my mind is not a rare use case. (eg: D = foreach C generate c0/A.total, c1/B.count;) Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. was: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. By doing this, we can make sure we visit LOStore which generate scalar first, and then LOForEach which use the scalar. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. 
Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > 4. With soft link, we can use scalar come from different sources in the same > statement, which in my mind is not a rare use case. (eg: D = foreach C > generate c0/A.total, c1/B.count;) > Currently, there are two cases we can use soft link: > 1. scalar dependency, where ReadScalar UDF will use a file generate by a > LOStore > 2. store-load dependency, where
[jira] Commented: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908886#action_12908886 ] Daniel Dai commented on PIG-1608: - pig should include pig-default.properties into pig.jar, but not pig.properties, just like hadoop does for core-default.xml, core-site.xml. > pig should always include pig-default.properties and pig.properties in the > pig.jar > -- > > Key: PIG-1608 > URL: https://issues.apache.org/jira/browse/PIG-1608 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > > pig should always include pig-default.properties and pig.properties as a part > of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Description: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. By doing this, we can make sure we visit LOStore which generate scalar first, and then LOForEach which use the scalar. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. was: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. > Adding soft link to plan to solve input file dependency > --- > > Key: PIG-1605 > URL: https://issues.apache.org/jira/browse/PIG-1605 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > > In scalar implementation, we need to deal with implicit dependencies. > [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve > the problem by adding a LOScalar operator. Here is a different approach. We > will add a soft link to the plan, and soft link is only visible to the > walkers. By doing this, we can make sure we visit LOStore which generate > scalar first, and then LOForEach which use the scalar. 
All other part of the > logical plan does not know the existence of the soft link. The benefits are: > 1. Logical plan do not need to deal with LOScalar, this makes logical plan > cleaner > 2. Conceptually scalar dependency is different. Regular link represent a data > flow in pipeline. In scalar, the dependency means an operator depends on a > file generated by the other operator. It's different type of data dependency. > 3. Soft link can solve other dependency problem in the future. If we > introduce another UDF dependent on a file generated by another operator, we > can use this mechanism to solve it. > Currently, there are two cases we can use soft link: > 1. scalar dependency, where ReadScalar UDF will use a file generate by a > LOStore > 2. store-load dependency, where we will load a file which is generated by a > store in the same script. This happens in multi-store case. Currently we > solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1605) Adding soft link to plan to solve input file dependency
Adding soft link to plan to solve input file dependency --- Key: PIG-1605 URL: https://issues.apache.org/jira/browse/PIG-1605 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1604) 'relation as scalar' does not work with complex types
[ https://issues.apache.org/jira/browse/PIG-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908096#action_12908096 ] Daniel Dai commented on PIG-1604: - +1, patch looks good. > 'relation as scalar' does not work with complex types > -- > > Key: PIG-1604 > URL: https://issues.apache.org/jira/browse/PIG-1604 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1604.1.patch > > > Statement such as > sclr = limit b 1; > d = foreach a generate name, age/(double)sclr.mapcol#'it' as some_sum; > Results in the following parse error: > ERROR 1000: Error during parsing. Non-atomic field expected but found atomic > field -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
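For completeness, a self-contained sketch of the failing pattern quoted in the description (the load statements and schemas are assumed for illustration; only the map dereference on the scalar is taken from the issue):

{code}
a = load 'students' as (name:chararray, age:int);
b = load 'lookup' as (mapcol:map[]);
sclr = limit b 1;
d = foreach a generate name, age/(double)sclr.mapcol#'it' as some_sum;
{code}

Before the fix, dereferencing a complex-type field (here a map) on the scalar failed with the "Non-atomic field expected but found atomic field" parse error shown above.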
[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1437: Assignee: Xuefu Zhang Fix Version/s: 0.9.0 > [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct > - > > Key: PIG-1437 > URL: https://issues.apache.org/jira/browse/PIG-1437 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Ashutosh Chauhan >Assignee: Xuefu Zhang >Priority: Minor > Fix For: 0.9.0 > > > It's possible to rewrite queries like this > {code} > A = load 'data' as (name,age); > B = group A by (name,age); > C = foreach B generate group.name, group.age; > dump C; > {code} > or > {code} > A = load 'data' as (name,age); > B = group A by (name,age); > C = foreach B generate flatten(group); > dump C; > {code} > to > {code} > A = load 'data' as (name,age); > B = distinct A; > dump B; > {code} > This can only be done if no columns within the bags are referenced > subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be > executed more efficiently than a group-by, this will be a huge win. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1322) Logical Optimizer: change outer join into regular join
[ https://issues.apache.org/jira/browse/PIG-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1322: Assignee: Xuefu Zhang (was: Daniel Dai) Fix Version/s: 0.9.0 > Logical Optimizer: change outer join into regular join > -- > > Key: PIG-1322 > URL: https://issues.apache.org/jira/browse/PIG-1322 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Xuefu Zhang > Fix For: 0.9.0 > > > In some cases, we can change an outer join into a regular join. The benefit > is that a regular join is easier to optimize in subsequent optimizations. > Example: > C = join A by a0 LEFT OUTER, B by b0; > D = filter C by b0 > 0; > => > C = join A by a0, B by b0; > D = filter C by b0 > 0; > Because of this change, the PushUpFilter rule can further push the filter > in front of the regular join, which it otherwise could not. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
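To make the follow-on benefit concrete, here is a hedged sketch (equivalent Pig Latin, not actual optimizer output) of what PushUpFilter can do once the join is no longer outer:

{code}
-- After the outer join has been rewritten to a regular join,
-- the filter on b0 can be moved in front of the join:
B1 = filter B by b0 > 0;
C = join A by a0, B1 by b0;
{code}

This push is only safe because the join is no longer outer; with the original LEFT OUTER join, unmatched rows of A would carry a null b0, so filtering B early could change the result.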
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed > LogicalPlan and Optimizer are too complex and hard to work with > --- > > Key: PIG-1178 > URL: https://issues.apache.org/jira/browse/PIG-1178 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: expressions-2.patch, expressions.patch, lp.patch, > lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, > PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, > PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, > pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, > pig_1178_3.patch > > > The current implementation of the logical plan and the logical optimizer in > Pig has proven to not be easily extensible. Developer feedback has indicated > that adding new rules to the optimizer is quite burdensome. In addition, the > logical plan has been an area of numerous bugs, many of which have been > difficult to fix. Developers also feel that the logical plan is difficult to > understand and maintain. The root cause for these issues is that a number of > design decisions that were made as part of the 0.2 rewrite of the front end > have now proven to be sub-optimal. The heart of this proposal is to revisit a > number of those proposals and rebuild the logical plan with a simpler design > that will make it much easier to maintain the logical plan as well as extend > the logical optimizer. > See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full > details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907061#action_12907061 ] Daniel Dai commented on PIG-1178: - PIG-1178-11.patch committed to both trunk and 0.8 branch. > LogicalPlan and Optimizer are too complex and hard to work with > --- > > Key: PIG-1178 > URL: https://issues.apache.org/jira/browse/PIG-1178 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: expressions-2.patch, expressions.patch, lp.patch, > lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, > PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, > PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, > pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, > pig_1178_3.patch > > > The current implementation of the logical plan and the logical optimizer in > Pig has proven to not be easily extensible. Developer feedback has indicated > that adding new rules to the optimizer is quite burdensome. In addition, the > logical plan has been an area of numerous bugs, many of which have been > difficult to fix. Developers also feel that the logical plan is difficult to > understand and maintain. The root cause for these issues is that a number of > design decisions that were made as part of the 0.2 rewrite of the front end > have now proven to be sub-optimal. The heart of this proposal is to revisit a > number of those proposals and rebuild the logical plan with a simpler design > that will make it much easier to maintain the logical plan as well as extend > the logical optimizer. > See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full > details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Attachment: PIG-1178-11.patch PIG-1178-11.patch change the layout of explain, error code and comments, etc. No real functional changes. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 11 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. > LogicalPlan and Optimizer are too complex and hard to work with > --- > > Key: PIG-1178 > URL: https://issues.apache.org/jira/browse/PIG-1178 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: expressions-2.patch, expressions.patch, lp.patch, > lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, > PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, > PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, > pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, > pig_1178_3.patch > > > The current implementation of the logical plan and the logical optimizer in > Pig has proven to not be easily extensible. Developer feedback has indicated > that adding new rules to the optimizer is quite burdensome. In addition, the > logical plan has been an area of numerous bugs, many of which have been > difficult to fix. Developers also feel that the logical plan is difficult to > understand and maintain. The root cause for these issues is that a number of > design decisions that were made as part of the 0.2 rewrite of the front end > have now proven to be sub-optimal. The heart of this proposal is to revisit a > number of those proposals and rebuild the logical plan with a simpler design > that will make it much easier to maintain the logical plan as well as extend > the logical optimizer. > See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full > details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906932#action_12906932 ] Daniel Dai commented on PIG-1595: - +1 for the test failure fix. > casting relation to scalar- problem with handling of data from non PigStorage > loaders > - > > Key: PIG-1595 > URL: https://issues.apache.org/jira/browse/PIG-1595 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1595.1.patch, PIG-1595.2.patch > > > If load functions that don't follow the same bytearray format as PigStorage > for other supported datatypes, or those that don't implement the LoadCaster > interface are used in 'casting relation to scalar' (PIG-1434), it can cause > the query to fail or create incorrect results. > The root cause of the problem is that there is a real dependency between the > ReadScalars udf that returns the scalar value and the LogicalOperator that > acts as its input. But the logicalplan does not capture this dependency. So > in SchemaResetter visitor used by the optimizer, the order in which schema is > reset and evaluated does not take this into consideration. If the schema of > the input LogicalOperator does not get evaluated before the ReadScalar udf, > the resutltype of ReadScalar udf becomes bytearray. POUserFunc will convert > the input to bytearray using ' new DataByteArray(inp.toString().getBytes())'. > But this bytearray encoding of other supported types might not be same for > the LoadFunction associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1601) Make scalar work for secure hadoop
[ https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1601. - Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. > Make scalar work for secure hadoop > -- > > Key: PIG-1601 > URL: https://issues.apache.org/jira/browse/PIG-1601 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1601-1.patch > > > Error message: > open file > 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error > = > java.io.IOException: Delegation Token can be issued only with kerberos or web > authentication at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) > at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) at > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:396) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at > org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at > org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at > org.apache.hadoop.mapred.Child$4.run(Child.java:217) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:396) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) > at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1594) NullPointerException in new logical planner
[ https://issues.apache.org/jira/browse/PIG-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1594. - Resolution: Fixed This issue is fixed by PIG-1178-10.patch. > NullPointerException in new logical planner > --- > > Key: PIG-1594 > URL: https://issues.apache.org/jira/browse/PIG-1594 > Project: Pig > Issue Type: Bug >Reporter: Andrew Hitchcock >Assignee: Daniel Dai > Fix For: 0.8.0 > > > I've been testing the trunk version of Pig on Elastic MapReduce against our > log processing sample application(1). When I try to run the query it throws a > NullPointerException and suggests I disable the new logical plan. Disabling > it works and the script succeeds. Here is the query I'm trying to run: > {code} > register file:/home/hadoop/lib/pig/piggybank.jar > DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT(); > RAW_LOGS = LOAD '$INPUT' USING TextLoader as (line:chararray); > LOGS_BASE= foreach RAW_LOGS generate FLATTEN(EXTRACT(line, '^(\\S+) (\\S+) > (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" > "([^"]*)"')) as (remoteAddr:chararray, remoteLogname:chararray, > user:chararray, time:chararray, request:chararray, status:int, > bytes_string:chararray, referrer:chararray, browser:chararray); > REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer; > FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer > matches '.*google.*'; > SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, > '.*[&\\?]q=([^&]+).*')) as terms:chararray; > SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL; > SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE > $0, COUNT($1) as num; > SEARCH_TERMS_COUNT_SORTED = LIMIT(ORDER SEARCH_TERMS_COUNT BY num DESC) 50; > STORE SEARCH_TERMS_COUNT_SORTED into '$OUTPUT'; > {code} > And here is the stack trace that results: > {code} > ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false. > org.apache.pig.backend.executionengine.ExecException: ERROR 2042: Error in > new logical plan. Try -Dpig.usenewlogicalplan=false. 
> at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:285) > at org.apache.pig.PigServer.compilePp(PigServer.java:1301) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1154) > at org.apache.pig.PigServer.execute(PigServer.java:1148) > at org.apache.pig.PigServer.access$100(PigServer.java:123) > at org.apache.pig.PigServer$Graph.execute(PigServer.java:1464) > at org.apache.pig.PigServer.executeBatchEx(PigServer.java:350) > at org.apache.pig.PigServer.executeBatch(PigServer.java:324) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:111) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90) > at org.apache.pig.Main.run(Main.java:491) > at org.apache.pig.Main.main(Main.java:107) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:156) > Caused by: java.lang.NullPointerException > at org.apache.pig.EvalFunc.getSchemaName(EvalFunc.java:76) > at > org.apache.pig.piggybank.impl.ErrorCatchingBase.outputSchema(ErrorCatchingBase.java:76) > at > org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:111) > at > org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:175) > at > org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143) > at > org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:55) > at > org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:69) > at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) > at > org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:87) > at > org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:149) > at > org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74) > at > org.apache.pig.newpl
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906592#action_12906592 ] Daniel Dai commented on PIG-1178: - Patch PIG-1178-10.patch committed. > LogicalPlan and Optimizer are too complex and hard to work with > --- > > Key: PIG-1178 > URL: https://issues.apache.org/jira/browse/PIG-1178 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: expressions-2.patch, expressions.patch, lp.patch, > lp.patch, PIG-1178-10.patch, PIG-1178-4.patch, PIG-1178-5.patch, > PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, > pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, > pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch > > > The current implementation of the logical plan and the logical optimizer in > Pig has proven to not be easily extensible. Developer feedback has indicated > that adding new rules to the optimizer is quite burdensome. In addition, the > logical plan has been an area of numerous bugs, many of which have been > difficult to fix. Developers also feel that the logical plan is difficult to > understand and maintain. The root cause for these issues is that a number of > design decisions that were made as part of the 0.2 rewrite of the front end > have now proven to be sub-optimal. The heart of this proposal is to revisit a > number of those proposals and rebuild the logical plan with a simpler design > that will make it much easier to maintain the logical plan as well as extend > the logical optimizer. > See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full > details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Attachment: PIG-1178-10.patch Patch PIG-1178-10.patch address foreach user defined schema. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All test pass. > LogicalPlan and Optimizer are too complex and hard to work with > --- > > Key: PIG-1178 > URL: https://issues.apache.org/jira/browse/PIG-1178 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: expressions-2.patch, expressions.patch, lp.patch, > lp.patch, PIG-1178-10.patch, PIG-1178-4.patch, PIG-1178-5.patch, > PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, > pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, > pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch > > > The current implementation of the logical plan and the logical optimizer in > Pig has proven to not be easily extensible. Developer feedback has indicated > that adding new rules to the optimizer is quite burdensome. In addition, the > logical plan has been an area of numerous bugs, many of which have been > difficult to fix. Developers also feel that the logical plan is difficult to > understand and maintain. The root cause for these issues is that a number of > design decisions that were made as part of the 0.2 rewrite of the front end > have now proven to be sub-optimal. The heart of this proposal is to revisit a > number of those proposals and rebuild the logical plan with a simpler design > that will make it much easier to maintain the logical plan as well as extend > the logical optimizer. > See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full > details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases
[ https://issues.apache.org/jira/browse/PIG-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1575: Attachment: jira-1575-5.patch Patch looks good. Attach the final patch. test patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass. Patch committed to both trunk and 0.8 branch. > Complete the migration of optimization rule PushUpFilter including missing > test cases > - > > Key: PIG-1575 > URL: https://issues.apache.org/jira/browse/PIG-1575 > Project: Pig > Issue Type: Bug >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1575-1.patch, jira-1575-2.patch, jira-1575-3.patch, > jira-1575-4.patch, jira-1575-5.patch > > > The Optimization rule under the new logical plan, PushUpFilter, only does a > subset of optimization scenarios compared to the same rule under the old > logical plan. For instance, it only considers filter after join, but the old > optimization also considers other operators such as CoGroup, Union, Cross, > etc. The migration of the rule should be complete. > Also, the test cases created for testing the old PushUpFilter wasn't migrated > to the new logical plan code base. It should be also migrated. (A few has > been migrated in JIRA-1574.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
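As an illustration of the wider scope described above (Union being one of the additional operators; this is a hypothetical example, not a test case from the patch):

{code}
A = load 'a' as (x:int, y:int);
B = load 'b' as (x:int, y:int);
C = union A, B;
D = filter C by x > 0;
-- PushUpFilter can push the filter into both union branches:
A1 = filter A by x > 0;
B1 = filter B by x > 0;
C1 = union A1, B1;
{code}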
[jira] Updated: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases
[ https://issues.apache.org/jira/browse/PIG-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1575: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed > Complete the migration of optimization rule PushUpFilter including missing > test cases > - > > Key: PIG-1575 > URL: https://issues.apache.org/jira/browse/PIG-1575 > Project: Pig > Issue Type: Bug >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1575-1.patch, jira-1575-2.patch, jira-1575-3.patch, > jira-1575-4.patch, jira-1575-5.patch > > > The Optimization rule under the new logical plan, PushUpFilter, only does a > subset of optimization scenarios compared to the same rule under the old > logical plan. For instance, it only considers filter after join, but the old > optimization also considers other operators such as CoGroup, Union, Cross, > etc. The migration of the rule should be complete. > Also, the test cases created for testing the old PushUpFilter wasn't migrated > to the new logical plan code base. It should be also migrated. (A few has > been migrated in JIRA-1574.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906322#action_12906322 ] Daniel Dai commented on PIG-1595: - Patch break TestScalarAliases.testScalarErrMultipleRowsInInput. Comment out TestScalarAliases.testScalarErrMultipleRowsInInput temporarily. > casting relation to scalar- problem with handling of data from non PigStorage > loaders > - > > Key: PIG-1595 > URL: https://issues.apache.org/jira/browse/PIG-1595 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1595.1.patch > > > If load functions that don't follow the same bytearray format as PigStorage > for other supported datatypes, or those that don't implement the LoadCaster > interface are used in 'casting relation to scalar' (PIG-1434), it can cause > the query to fail or create incorrect results. > The root cause of the problem is that there is a real dependency between the > ReadScalars udf that returns the scalar value and the LogicalOperator that > acts as its input. But the logicalplan does not capture this dependency. So > in SchemaResetter visitor used by the optimizer, the order in which schema is > reset and evaluated does not take this into consideration. If the schema of > the input LogicalOperator does not get evaluated before the ReadScalar udf, > the resutltype of ReadScalar udf becomes bytearray. POUserFunc will convert > the input to bytearray using ' new DataByteArray(inp.toString().getBytes())'. > But this bytearray encoding of other supported types might not be same for > the LoadFunction associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906321#action_12906321 ] Daniel Dai commented on PIG-1548: - Patch break TestFRJoin2.testConcatenateJobForScalar3. Comment out TestFRJoin2.testConcatenateJobForScalar3 temporarily. > Optimize scalar to consolidate the part file > > > Key: PIG-1548 > URL: https://issues.apache.org/jira/browse/PIG-1548 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1548.patch, PIG-1548_1.patch > > > Current scalar implementation will write a scalar file onto dfs. When Pig > need the scalar, it will open the dfs file directly. Each scalar file > contains more than one part file though it contains only one record. This > puts a huge load to namenode. We should consolidate part file before open it. > Another optional step is put the consolicated file into distributed cache. > This further bring down the load of namenode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1601) Make scalar work for secure hadoop
[ https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1601: Attachment: PIG-1601-1.patch > Make scalar work for secure hadoop > -- > > Key: PIG-1601 > URL: https://issues.apache.org/jira/browse/PIG-1601 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1601-1.patch > > > Error message: > open file > 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error > = > java.io.IOException: Delegation Token can be issued only with kerberos or web > authentication at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) > at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) at > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:396) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at > org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at > org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at > org.apache.hadoop.mapred.Child$4.run(Child.java:217) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:396) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) > at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1601) Make scalar work for secure hadoop
Make scalar work for secure hadoop -- Key: PIG-1601 URL: https://issues.apache.org/jira/browse/PIG-1601 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1601-1.patch Error message: open file 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error = java.io.IOException: Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) at org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1543: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. > IsEmpty returns the wrong value after using LIMIT > - > > Key: PIG-1543 > URL: https://issues.apache.org/jira/browse/PIG-1543 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Justin Hu >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1543-1.patch > > > 1. Two input files: > 1a: limit_empty.input_a > 1 > 1 > 1 > 1b: limit_empty.input_b > 2 > 2 > 2. > The pig script: limit_empty.pig > -- A contains only 1's & B contains only 2's > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_a' as (b1:int); > C =COGROUP A by a1, B by b1; > D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), > COUNT(B); > store D into 'limit_empty.output/d'; > -- After the script done, we see the right results: > -- {(1),(1),(1)} {} 1 0 3 0 > -- {} {(2),(2)} 0 1 0 2 > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > store D1 into 'limit_empty.output/d1'; > -- After the script done, we see the unexpected results: > -- {(1)} {}1 1 1 0 > -- {} {(2)} 1 1 0 1 > dump D; > dump D1; > 3. Run the scrip and redirect the stdout (2 dumps) file. There are two issues: > The major one: > IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while > IsEmpty() returns correctly in limit_empty.output/d/*. > The difference is that one has been applied with "LIMIT" before using > IsEmpty(). > The minor one: > The redirected output only contains the first dump: > ({(1),(1),(1)},{},1,0,3L,0L) > ({},{(2),(2)},0,1,0L,2L) > We expect two more lines like: > ({(1)},{},1,1,1L,0L) > ({},{(2)},1,1,0L,1L) > Besides, there is error says: > [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1591) pig does not create a log file, if tje MR job succeeds but front end fails.
[ https://issues.apache.org/jira/browse/PIG-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1591. - Hadoop Flags: [Reviewed] Fix Version/s: 0.8.0 Resolution: Fixed Patch committed to both trunk and 0.8 branch. > pig does not create a log file, if tje MR job succeeds but front end fails. > --- > > Key: PIG-1591 > URL: https://issues.apache.org/jira/browse/PIG-1591 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: pig_1591.patch > > > When I run this script: > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_b' as (b1:int); > C =COGROUP A by a1, B by b1; > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > dump D1; > The MR job succeeds but the pig job fails with the fillowing error: > 2010-08-31 13:33:09,960 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics > - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - > already initialized > 2010-08-31 13:33:09,962 [main] INFO org.apache.pig.impl.io.InterStorage - > Pig Internal storage in use > 2010-08-31 13:33:09,963 [main] INFO org.apache.pig.impl.io.InterStorage - > Pig Internal storage in use > 2010-08-31 13:33:09,963 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Success! > 2010-08-31 13:33:09,964 [main] INFO org.apache.pig.impl.io.InterStorage - > Pig Internal storage in use > 2010-08-31 13:33:09,965 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics > - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - > already initialized > 2010-08-31 13:33:09,969 [main] INFO > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to > process : 1 > 2010-08-31 13:33:09,969 [main] INFO > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input > paths to process : 1 > 2010-08-31 13:33:09,973 [main] ERROR > org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple > since MR job is succeeded, so the pig does not create any log file, but it > should still create a log file, giving the cause of failure in the pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906116#action_12906116 ] Daniel Dai commented on PIG-1595: - Patch looks good. This patch addresses the problem that we cannot get the output schema of the scalar UDF at compile time. Another approach is to write ReadScalars.outputSchema() and use the input schema to figure out the output schema. But again, we would need to address the dependency to make sure the input schema is correctly set before calling outputSchema(). So both approaches should be equivalent. > casting relation to scalar- problem with handling of data from non PigStorage > loaders > - > > Key: PIG-1595 > URL: https://issues.apache.org/jira/browse/PIG-1595 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1595.1.patch > > > If load functions that don't follow the same bytearray format as PigStorage > for other supported datatypes, or those that don't implement the LoadCaster > interface are used in 'casting relation to scalar' (PIG-1434), it can cause > the query to fail or create incorrect results. > The root cause of the problem is that there is a real dependency between the > ReadScalars udf that returns the scalar value and the LogicalOperator that > acts as its input. But the logicalplan does not capture this dependency. So > in SchemaResetter visitor used by the optimizer, the order in which schema is > reset and evaluated does not take this into consideration. If the schema of > the input LogicalOperator does not get evaluated before the ReadScalars udf, > the result type of the ReadScalars udf becomes bytearray. POUserFunc will convert > the input to bytearray using ' new DataByteArray(inp.toString().getBytes())'. > But this bytearray encoding of other supported types might not be the same for > the LoadFunction associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
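A hedged sketch of the kind of script where this matters (the loader name is hypothetical; the point is that the scalar value is decoded through the loader's bytearray conventions, so its type must be resolved before the ReadScalars UDF):

{code}
-- CustomBinaryLoader stands in for any loader whose bytearray encoding
-- differs from PigStorage, or that ships its own LoadCaster.
A = load 'data' using CustomBinaryLoader() as (val:long);
B = group A all;
C = foreach B generate MAX(A.val) as maxval;
-- If C's schema is not evaluated before the ReadScalars UDF, maxval falls
-- back to bytearray here and may be decoded incorrectly:
D = foreach A generate val / (double) C.maxval;
{code}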
[jira] Commented: (PIG-1591) pig does not create a log file, if the MR job succeeds but front end fails.
[ https://issues.apache.org/jira/browse/PIG-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905966#action_12905966 ] Daniel Dai commented on PIG-1591: - +1. No unit test needed since it is about error message. Manually tested and it works. Will commit it shortly. > pig does not create a log file, if tje MR job succeeds but front end fails. > --- > > Key: PIG-1591 > URL: https://issues.apache.org/jira/browse/PIG-1591 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > Attachments: pig_1591.patch > > > When I run this script: > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_b' as (b1:int); > C =COGROUP A by a1, B by b1; > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > dump D1; > The MR job succeeds but the pig job fails with the fillowing error: > 2010-08-31 13:33:09,960 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics > - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - > already initialized > 2010-08-31 13:33:09,962 [main] INFO org.apache.pig.impl.io.InterStorage - > Pig Internal storage in use > 2010-08-31 13:33:09,963 [main] INFO org.apache.pig.impl.io.InterStorage - > Pig Internal storage in use > 2010-08-31 13:33:09,963 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Success! > 2010-08-31 13:33:09,964 [main] INFO org.apache.pig.impl.io.InterStorage - > Pig Internal storage in use > 2010-08-31 13:33:09,965 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics > - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - > already initialized > 2010-08-31 13:33:09,969 [main] INFO > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to > process : 1 > 2010-08-31 13:33:09,969 [main] INFO > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input > paths to process : 1 > 2010-08-31 13:33:09,973 [main] ERROR > org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple > since MR job is succeeded, so the pig does not create any log file, but it > should still create a log file, giving the cause of failure in the pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905587#action_12905587 ] Daniel Dai commented on PIG-1543: - test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass > IsEmpty returns the wrong value after using LIMIT > - > > Key: PIG-1543 > URL: https://issues.apache.org/jira/browse/PIG-1543 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Justin Hu >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1543-1.patch > > > 1. Two input files: > 1a: limit_empty.input_a > 1 > 1 > 1 > 1b: limit_empty.input_b > 2 > 2 > 2. > The pig script: limit_empty.pig > -- A contains only 1's & B contains only 2's > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_a' as (b1:int); > C =COGROUP A by a1, B by b1; > D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), > COUNT(B); > store D into 'limit_empty.output/d'; > -- After the script done, we see the right results: > -- {(1),(1),(1)} {} 1 0 3 0 > -- {} {(2),(2)} 0 1 0 2 > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > store D1 into 'limit_empty.output/d1'; > -- After the script done, we see the unexpected results: > -- {(1)} {}1 1 1 0 > -- {} {(2)} 1 1 0 1 > dump D; > dump D1; > 3. Run the scrip and redirect the stdout (2 dumps) file. There are two issues: > The major one: > IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while > IsEmpty() returns correctly in limit_empty.output/d/*. > The difference is that one has been applied with "LIMIT" before using > IsEmpty(). > The minor one: > The redirected output only contains the first dump: > ({(1),(1),(1)},{},1,0,3L,0L) > ({},{(2),(2)},0,1,0L,2L) > We expect two more lines like: > ({(1)},{},1,1,1L,0L) > ({},{(2)},1,1,0L,1L) > Besides, there is error says: > [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. > piggybank unit test TestLookupInFiles is broken > --- > > Key: PIG-1583 > URL: https://issues.apache.org/jira/browse/PIG-1583 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1583-1.patch > > > Error message: > 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from > attempt_20100831093139211_0001_m_00_3: > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught > error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles > [LookupInFiles : Cannot open file one] > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Caused by: java.io.IOException: LookupInFiles : Cannot open file one > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > ... 10 more > Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one > does not exist > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) > ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1572) change default datatype when relations are used as scalar to bytearray
[ https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905293#action_12905293 ] Daniel Dai commented on PIG-1572: - Patch looks good. One minor doubt: when we migrate to the new logical plan, UserFuncExpression already has the necessary cast inserted, so it seems we do not need to change the new logical plan's UserFuncExpression.getFieldSchema(), am I right? > change default datatype when relations are used as scalar to bytearray > -- > > Key: PIG-1572 > URL: https://issues.apache.org/jira/browse/PIG-1572 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1572.1.patch, PIG-1572.2.patch > > > When relations are cast to scalar, the current default type is chararray. > This is inconsistent with the behavior in the rest of pig-latin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
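For reference, a hypothetical script showing where the default scalar type surfaces; with the change, an untyped scalar behaves like any other untyped (bytearray) field, so an explicit cast states the intended type:

{code}
A = load 'data' as (name, gpa);   -- untyped fields default to bytearray
B = order A by gpa desc;
sclr = limit B 1;
-- sclr.gpa used as a scalar now defaults to bytearray (not chararray),
-- so cast it explicitly where a typed value is needed:
D = foreach A generate name, gpa / (double) sclr.gpa;
{code}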
[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1543: Status: Patch Available (was: Open) > IsEmpty returns the wrong value after using LIMIT > - > > Key: PIG-1543 > URL: https://issues.apache.org/jira/browse/PIG-1543 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Justin Hu >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1543-1.patch > > > 1. Two input files: > 1a: limit_empty.input_a > 1 > 1 > 1 > 1b: limit_empty.input_b > 2 > 2 > 2. > The pig script: limit_empty.pig > -- A contains only 1's & B contains only 2's > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_a' as (b1:int); > C =COGROUP A by a1, B by b1; > D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), > COUNT(B); > store D into 'limit_empty.output/d'; > -- After the script done, we see the right results: > -- {(1),(1),(1)} {} 1 0 3 0 > -- {} {(2),(2)} 0 1 0 2 > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > store D1 into 'limit_empty.output/d1'; > -- After the script done, we see the unexpected results: > -- {(1)} {}1 1 1 0 > -- {} {(2)} 1 1 0 1 > dump D; > dump D1; > 3. Run the scrip and redirect the stdout (2 dumps) file. There are two issues: > The major one: > IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while > IsEmpty() returns correctly in limit_empty.output/d/*. > The difference is that one has been applied with "LIMIT" before using > IsEmpty(). > The minor one: > The redirected output only contains the first dump: > ({(1),(1),(1)},{},1,0,3L,0L) > ({},{(2),(2)},0,1,0L,2L) > We expect two more lines like: > ({(1)},{},1,1,1L,0L) > ({},{(2)},1,1,0L,1L) > Besides, there is error says: > [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1543: Attachment: PIG-1543-1.patch This patch fixes the first issue. The problem is that we erroneously put a null in the bag when we expect an empty bag. The second issue is a side effect of the first issue: BinInterSedes assumes that a bag only contains tuples, so it does not expect a null inside a bag. This issue is fixed automatically once the first issue is in. > IsEmpty returns the wrong value after using LIMIT > - > > Key: PIG-1543 > URL: https://issues.apache.org/jira/browse/PIG-1543 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Justin Hu >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1543-1.patch > > > 1. Two input files: > 1a: limit_empty.input_a > 1 > 1 > 1 > 1b: limit_empty.input_b > 2 > 2 > 2. > The pig script: limit_empty.pig > -- A contains only 1's & B contains only 2's > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_a' as (b1:int); > C =COGROUP A by a1, B by b1; > D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), > COUNT(B); > store D into 'limit_empty.output/d'; > -- After the script done, we see the right results: > -- {(1),(1),(1)} {} 1 0 3 0 > -- {} {(2),(2)} 0 1 0 2 > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > store D1 into 'limit_empty.output/d1'; > -- After the script done, we see the unexpected results: > -- {(1)} {}1 1 1 0 > -- {} {(2)} 1 1 0 1 > dump D; > dump D1; > 3. Run the script and redirect the stdout (2 dumps) file. There are two issues: > The major one: > IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while > IsEmpty() returns correctly in limit_empty.output/d/*. > The difference is that one has been applied with "LIMIT" before using > IsEmpty(). > The minor one: > The redirected output only contains the first dump: > ({(1),(1),(1)},{},1,0,3L,0L) > ({},{(2),(2)},0,1,0L,2L) > We expect two more lines like: > ({(1)},{},1,1,1L,0L) > ({},{(2)},1,1,0L,1L) > Besides, there is an error that says: > [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1587) Cloning utility functions for new logical plan
[ https://issues.apache.org/jira/browse/PIG-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1587: Description: We sometimes need to copy a logical operator/plan when writing an optimization rule. Currently, copying an operator/plan is awkward. We need to write some utilities to facilitate this process. Swati contributed PIG-1510, but we feel it still cannot address most use cases. I propose to add some more utilities to the new logical plan: all LogicalExpressions: {code} copy(LogicalExpressionPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical expression operator (except for fieldSchema, uidOnlySchema, ProjectExpression.attachedRelationalOp) * Set the plan to newPlan * If keepUid is true, further copy uidOnlyFieldSchema all LogicalRelationalOperators: {code} copy(LogicalPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical relational operator (except for schema, uid related fields) * Set the plan to newPlan; * If the operator has an inner plan/expression plan, copy the whole inner plan with the same keepUid flag (in particular, LOInnerLoad will copy its inner project, with the same keepUid flag) * If keepUid is true, further copy uid related fields (LOUnion.uidMapping, LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids) LogicalExpressionPlan.java {code} LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, boolean keepUid); LogicalExpressionPlan copyAbove(LogicalExpression leave, LogicalRelationalOperator attachedRelationalOp, boolean keepUid); LogicalExpressionPlan copyBelow(LogicalExpression root, LogicalRelationalOperator attachedRelationalOp, boolean keepUid); {code} * Create a new logical expression plan and copy the expression operators along with their connections, with the same keepUid flag * Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp parameter {code} Pair, List> merge(LogicalExpressionPlan plan, LogicalRelationalOperator attachedRelationalOp); {code} * Merge plan into the current logical expression plan as an independent tree * attachedRelationalOp is the destination operator the new logical expression plan is attached to * return the sources/sinks of this independent tree LogicalPlan.java {code} LogicalPlan copy(LOForEach foreach, boolean keepUid); LogicalPlan copyAbove(LogicalRelationalOperator leave, LOForEach foreach, boolean keepUid); LogicalPlan copyBelow(LogicalRelationalOperator root, LOForEach foreach, boolean keepUid); {code} * Main use case is to copy the inner plan of a ForEach * Create a new logical plan and copy the relational operators along with their connections * Copy all expression plans inside the relational operators, set plan and attachedRelationalOp properly * If the plan is a ForEach inner plan, the foreach param is the destination ForEach operator; otherwise, pass null {code} Pair, List> merge(LogicalPlan plan, LOForEach foreach); {code} * Merge plan into the current logical plan as an independent tree * foreach is the destination LOForEach if the destination plan is a ForEach inner plan; otherwise, pass null * return the sources/sinks of this independent tree was: We sometimes need to copy a logical operator/plan when writing an optimization rule. Currently, copying an operator/plan is awkward. We need to write some utilities to facilitate this process. Swati contributed PIG-1510, but we feel it still cannot address most use cases. 
I propose to add some more utilities to the new logical plan: all LogicalExpressions: {code} copy(LogicalExpressionPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical expression operator (except for fieldSchema, uidOnlySchema, ProjectExpression.attachedRelationalOp) * Set the plan to newPlan * If keepUid is true, further copy uidOnlyFieldSchema all LogicalRelationalOperators: {code} copy(LogicalPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical relational operator (except for schema, uid related fields) * Set the plan to newPlan; * If the operator has an inner plan/expression plan, copy the whole inner plan with the same keepUid flag (in particular, LOInnerLoad will copy its inner project, with the same keepUid flag) * If keepUid is true, further copy uid related fields (LOUnion.uidMapping, LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids) LogicalExpressionPlan.java {code} LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, boolean keepUid); {code} * Copy the expression operators along with their connections, with the same keepUid flag * Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp parameter {code} List merge(LogicalExpressionPlan plan); {code} * Merge plan into the current logical expression plan as an independent tree * return the sources of this independent tree LogicalPlan.java {code} LogicalPlan copy(boolean keepUid); {code} * Main use case is to copy the inner plan of a ForEach * Copy all relational operators along with their connections * Copy all expression plans inside the relational operators, set plan and attachedRelationalOp properly {code} List merge(LogicalPlan plan); {code} * Merge plan into the current logical plan as an independent tree * return the sources of this independent tree
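As a rough illustration of how these utilities would be used, the sketch below clones a filter together with its expression plan. The copy(...) call follows the signature proposed above (it does not exist in 0.8 yet), and the surrounding rule fragment is hypothetical, not part of any patch.
{code}
import org.apache.pig.newplan.logical.expression.LogicalExpressionPlan;
import org.apache.pig.newplan.logical.relational.LOFilter;
import org.apache.pig.newplan.logical.relational.LogicalPlan;

public class CopyUtilitySketch {
    // Hypothetical fragment of an optimizer rule: clone a filter and its expression plan.
    public static LOFilter cloneFilter(LogicalPlan plan, LOFilter oldFilter) throws Exception {
        LOFilter newFilter = new LOFilter(plan);

        // Proposed utility: copy the expression plan, re-attaching every
        // ProjectExpression to newFilter and keeping uids (keepUid = true).
        LogicalExpressionPlan copied = oldFilter.getFilterPlan().copy(newFilter, true);

        newFilter.setFilterPlan(copied);
        plan.add(newFilter);
        return newFilter;
    }
}
{code}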
[jira] Created: (PIG-1587) Cloning utility functions for new logical plan
Cloning utility functions for new logical plan -- Key: PIG-1587 URL: https://issues.apache.org/jira/browse/PIG-1587 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.9.0 We sometimes need to copy a logical operator/plan when writing an optimization rule. Currently copy an operator/plan is awkward. We need to write some utilities to facilitate this process. Swati contribute PIG-1510 but we feel it still cannot address most use cases. I propose to add some more utilities into new logical plan: all LogicalExpressions: {code} copy(LogicalExpressionPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical expression operator (except for fieldSchema, uidOnlySchema, ProjectExpression.attachedRelationalOp) * Set the plan to newPlan * If keepUid is true, further copy uidOnlyFieldSchema all LogicalRelationalOperators: {code} copy(LogicalPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical relational operator (except for schema, uid related fields) * Set the plan to newPlan; * If the operator have inner plan/expression plan, copy the whole inner plan with the same keepUid flag (Especially, LOInnerLoad will copy its inner project, with the same keepUid flag) * If keepUid is true, further copy uid related fields (LOUnion.uidMapping, LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids) LogicalExpressionPlan.java {code} LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, boolean keepUid); {code} * Copy expression operator along with connection with the same keepUid flag * Set all ProjectExpression.attachedRelationalOp to attachedRelationalOp parameter {code} List merge(LogicalExpressionPlan plan); {code} * Merge plan into the current logical expression plan as an independent tree * return the sources of this independent tree LogicalPlan.java {code} LogicalPlan copy(boolean keepUid); {code} * Main use case to copy inner plan of ForEach * Copy all relational operator along with connection * Copy all expression plans inside relational operator, set plan and attachedRelationalOp properly {code} List merge(LogicalPlan plan); {code} * Merge plan into the current logical plan as an independent tree * return the sources of this independent tree -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583: Attachment: PIG-1583-1.patch > piggybank unit test TestLookupInFiles is broken > --- > > Key: PIG-1583 > URL: https://issues.apache.org/jira/browse/PIG-1583 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1583-1.patch > > > Error message: > 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from > attempt_20100831093139211_0001_m_00_3: > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught > error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles > [LookupInFiles : Cannot open file one] > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Caused by: java.io.IOException: LookupInFiles : Cannot open file one > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > ... 10 more > Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one > does not exist > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) > ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
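The root cause is simply that the side file never reaches DFS before the UDF tries to open it. The sketch below shows the read pattern the stack trace points at (FileLocalizer.openDFSFile called from LookupInFiles.init); the file name and the simplified one-argument signature are assumptions for illustration, not the exact piggybank code.
{code}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.pig.impl.io.FileLocalizer;

public class LookupFileSketch {
    // Roughly what LookupInFiles.init() does with each configured file name:
    // resolve it through FileLocalizer, which throws IOException when the path
    // (here the relative name "one") does not exist on the cluster.
    static BufferedReader open(String fileName) throws Exception {
        InputStream in = FileLocalizer.openDFSFile(fileName); // signature simplified for this sketch
        return new BufferedReader(new InputStreamReader(in));
    }
}
{code}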
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583: Attachment: (was: PIG-1583-1.patch) > piggybank unit test TestLookupInFiles is broken > --- > > Key: PIG-1583 > URL: https://issues.apache.org/jira/browse/PIG-1583 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1583-1.patch > > > Error message: > 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from > attempt_20100831093139211_0001_m_00_3: > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught > error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles > [LookupInFiles : Cannot open file one] > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Caused by: java.io.IOException: LookupInFiles : Cannot open file one > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > ... 10 more > Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one > does not exist > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) > ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583: Attachment: PIG-1583-1.patch > piggybank unit test TestLookupInFiles is broken > --- > > Key: PIG-1583 > URL: https://issues.apache.org/jira/browse/PIG-1583 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1583-1.patch > > > Error message: > 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from > attempt_20100831093139211_0001_m_00_3: > org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught > error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles > [LookupInFiles : Cannot open file one] > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Caused by: java.io.IOException: LookupInFiles : Cannot open file one > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > ... 10 more > Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one > does not exist > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) > at > org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) > at > org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) > ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1583) piggybank unit test TestLookupInFiles is broken
piggybank unit test TestLookupInFiles is broken --- Key: PIG-1583 URL: https://issues.apache.org/jira/browse/PIG-1583 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1583-1.patch Error message: 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from attempt_20100831093139211_0001_m_00_3: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles [LookupInFiles : Cannot open file one] at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.io.IOException: LookupInFiles : Cannot open file one at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) ... 10 more Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one does not exist at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904485#action_12904485 ] Daniel Dai commented on PIG-1178: - PIG-1178-9.patch committed. > LogicalPlan and Optimizer are too complex and hard to work with > --- > > Key: PIG-1178 > URL: https://issues.apache.org/jira/browse/PIG-1178 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: expressions-2.patch, expressions.patch, lp.patch, > lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, > PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, > pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, > pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch > > > The current implementation of the logical plan and the logical optimizer in > Pig has proven to not be easily extensible. Developer feedback has indicated > that adding new rules to the optimizer is quite burdensome. In addition, the > logical plan has been an area of numerous bugs, many of which have been > difficult to fix. Developers also feel that the logical plan is difficult to > understand and maintain. The root cause for these issues is that a number of > design decisions that were made as part of the 0.2 rewrite of the front end > have now proven to be sub-optimal. The heart of this proposal is to revisit a > number of those proposals and rebuild the logical plan with a simpler design > that will make it much easier to maintain the logical plan as well as extend > the logical optimizer. > See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full > details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Attachment: PIG-1178-9.patch Update help message. > LogicalPlan and Optimizer are too complex and hard to work with > --- > > Key: PIG-1178 > URL: https://issues.apache.org/jira/browse/PIG-1178 > Project: Pig > Issue Type: Improvement >Reporter: Alan Gates >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: expressions-2.patch, expressions.patch, lp.patch, > lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, > PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, > pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, > pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch > > > The current implementation of the logical plan and the logical optimizer in > Pig has proven to not be easily extensible. Developer feedback has indicated > that adding new rules to the optimizer is quite burdensome. In addition, the > logical plan has been an area of numerous bugs, many of which have been > difficult to fix. Developers also feel that the logical plan is difficult to > understand and maintain. The root cause for these issues is that a number of > design decisions that were made as part of the 0.2 rewrite of the front end > have now proven to be sub-optimal. The heart of this proposal is to revisit a > number of those proposals and rebuild the logical plan with a simpler design > that will make it much easier to maintain the logical plan as well as extend > the logical optimizer. > See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full > details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904291#action_12904291 ] Daniel Dai commented on PIG-1543: - This does not seem to be a logical layer problem, and the new optimizer does not address it. It might be related to [PIG-747|https://issues.apache.org/jira/browse/PIG-747]; it needs further investigation. > IsEmpty returns the wrong value after using LIMIT > - > > Key: PIG-1543 > URL: https://issues.apache.org/jira/browse/PIG-1543 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Justin Hu >Assignee: Daniel Dai > Fix For: 0.8.0 > > > 1. Two input files: > 1a: limit_empty.input_a > 1 > 1 > 1 > 1b: limit_empty.input_b > 2 > 2 > 2. > The pig script: limit_empty.pig > -- A contains only 1's & B contains only 2's > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_a' as (b1:int); > C =COGROUP A by a1, B by b1; > D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), > COUNT(B); > store D into 'limit_empty.output/d'; > -- After the script is done, we see the right results: > -- {(1),(1),(1)} {} 1 0 3 0 > -- {} {(2),(2)} 0 1 0 2 > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > store D1 into 'limit_empty.output/d1'; > -- After the script is done, we see the unexpected results: > -- {(1)} {}1 1 1 0 > -- {} {(2)} 1 1 0 1 > dump D; > dump D1; > 3. Run the script and redirect the stdout (2 dumps) to a file. There are two issues: > The major one: > IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while > IsEmpty() returns correctly in limit_empty.output/d/*. > The difference is that one has been applied with "LIMIT" before using > IsEmpty(). > The minor one: > The redirected output only contains the first dump: > ({(1),(1),(1)},{},1,0,3L,0L) > ({},{(2),(2)},0,1,0L,2L) > We expect two more lines like: > ({(1)},{},1,1,1L,0L) > ({},{(2)},1,1,0L,1L) > Besides, there is an error that says: > [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1579: Description: Error message: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function: Traceback (most recent call last): File "", line 5, in multStr TypeError: can't multiply sequence by non-int of type 'NoneType' at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) > Intermittent unit test failure for > TestScriptUDF.testPythonScriptUDFNullInputOutput > --- > > Key: PIG-1579 > URL: https://issues.apache.org/jira/browse/PIG-1579 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1579-1.patch > > > Error message: > org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error > executing function: Traceback (most recent call last): > File "", line 5, in multStr > TypeError: can't multiply sequence by non-int of type 'NoneType' > at > org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1579: Attachment: PIG-1579-1.patch Attaching a fix. However, this fix is shallow and may need an in-depth look. I will commit the temporary fix and leave the Jira open. > Intermittent unit test failure for > TestScriptUDF.testPythonScriptUDFNullInputOutput > --- > > Key: PIG-1579 > URL: https://issues.apache.org/jira/browse/PIG-1579 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1579-1.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput --- Key: PIG-1579 URL: https://issues.apache.org/jira/browse/PIG-1579 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1568: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Patch committed. Thanks Xuefu! > Optimization rule FilterAboveForeach is too restrictive and doesn't handle > project * correctly > -- > > Key: PIG-1568 > URL: https://issues.apache.org/jira/browse/PIG-1568 > Project: Pig > Issue Type: Bug >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1568-1.patch, jira-1568-1.patch > > > FilterAboveForeach rule is to optimize the plan by pushing up filter above > previous foreach operator. However, during code review, two major problems > were found: > 1. Current implementation assumes that if no projection is found in the > filter condition then all columns from foreach are projected. This issue > prevents the following optimization: > A = LOAD 'file.txt' AS (a(u,v), b, c); > B = FOREACH A GENERATE $0, b; > C = FILTER B BY 8 > 5; > STORE C INTO 'empty'; > 2. Current implementation doesn't handle * probjection, which means project > all columns. As a result, it wasn't able to optimize the following: > A = LOAD 'file.txt' AS (a(u,v), b, c); > B = FOREACH A GENERATE $0, b; > C = FILTER B BY Identity.class.getName(*) > 5; > STORE C INTO 'empty'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
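In plain terms, the corrected rule has to answer one question: does every column the filter condition references map back onto the FOREACH's input? A condition that references no columns at all (like 8 > 5 above) is trivially pushable. The sketch below is only an illustration of that containment check with made-up names, not Pig's actual rule code.
{code}
import java.util.Set;

public class PushUpCheckSketch {
    // "uids" stand in for the column identifiers Pig tracks across operators.
    static boolean canPushFilterAboveForeach(Set<Integer> uidsAvailableBeforeForeach,
                                             Set<Integer> uidsUsedByFilter) {
        // An empty set (e.g. the constant condition "8 > 5") is trivially contained,
        // so such a filter can always move above the FOREACH.
        return uidsAvailableBeforeForeach.containsAll(uidsUsedByFilter);
    }

    public static void main(String[] args) {
        System.out.println(canPushFilterAboveForeach(Set.of(1, 2, 3), Set.of())); // true
        System.out.println(canPushFilterAboveForeach(Set.of(1, 2), Set.of(5)));   // false
    }
}
{code}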
[jira] Updated: (PIG-1574) Optimization rule PushUpFilter causes filter to be pushed up out joins
[ https://issues.apache.org/jira/browse/PIG-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1574: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed test-patch result: jira-1574-1.patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. This patch does not push filter before join if the join is outer join. Actually we can push filter to the outer side of the join. I assume it will be addressed in PIG-1575. Patch jira-1574-1.patch committed. Thanks Xuefu! > Optimization rule PushUpFilter causes filter to be pushed up out joins > -- > > Key: PIG-1574 > URL: https://issues.apache.org/jira/browse/PIG-1574 > Project: Pig > Issue Type: Bug >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1574-1.patch > > > The PushUpFilter optimization rule in the new logical plan moves the filter > up to one of the join branch. It does this aggressively by find an operator > that has all the projection UIDs. However, it didn't consider that the found > operator might be another join. If that join is outer, then we cannot simply > move the filter to one of its branches. > As an example, the following script will be erroneously optimized: > A = load 'myfile' as (d1:int); > B = load 'anotherfile' as (d2:int); > C = join A by d1 full outer, B by d2; > D = load 'xxx' as (d3:int); > E = join C by d1, D by d3; > F = filter E by d1 > 5; > G = store F into 'dummy'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-365) Map side optimization for Limit (top k case)
[ https://issues.apache.org/jira/browse/PIG-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903996#action_12903996 ] Daniel Dai commented on PIG-365: Hi Gianmarco, yes, you are right. This is quite an old Jira and it is no longer applicable. I will close it. The more recent limit optimization we are still looking at is [PIG-1270|https://issues.apache.org/jira/browse/PIG-1270]. > Map side optimization for Limit (top k case) > > > Key: PIG-365 > URL: https://issues.apache.org/jira/browse/PIG-365 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.2.0 >Reporter: Daniel Dai >Assignee: Daniel Dai >Priority: Minor > > On the map side, collect only the top k records to improve performance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
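For context on the technique the old Jira described, a map-side top-k keeps a bounded min-heap of size k per map task, so only k records per mapper reach the shuffle. The sketch below is a plain-Java illustration of that idea; it is not code from Pig or from PIG-1270.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class MapSideTopK {
    private final int k;
    private final PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap of the best k seen

    MapSideTopK(int k) { this.k = k; }

    void offer(int value) {
        if (heap.size() < k) {
            heap.add(value);
        } else if (value > heap.peek()) { // candidate beats the current k-th largest
            heap.poll();
            heap.add(value);
        }
    }

    List<Integer> results() { // the k largest values seen by this "mapper"
        return new ArrayList<>(heap);
    }

    public static void main(String[] args) {
        MapSideTopK topK = new MapSideTopK(3);
        for (int v : new int[] {5, 1, 9, 7, 3, 8}) {
            topK.offer(v);
        }
        System.out.println(topK.results()); // 7, 8, 9 (in heap order)
    }
}
{code}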