[jira] Commented: (PIG-957) Tutorial is broken with 0.4 branch and trunk
[ https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754478#action_12754478 ]

Hadoop QA commented on PIG-957:

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12419363/PIG-957.patch
  against trunk revision 814075.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/6/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/6/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/6/console

This message is automatically generated.

> Tutorial is broken with 0.4 branch and trunk
>
>                 Key: PIG-957
>                 URL: https://issues.apache.org/jira/browse/PIG-957
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Olga Natkovich
>            Assignee: Pradeep Kamath
>             Fix For: 0.4.0
>
>         Attachments: PIG-957.patch
>
>
> As I was testing the Pig Tutorial in preparation for the release, I found
> that we broke the second script both in local mode and in MR mode. The issue
> has to do with schema and naming fields.
> Here is what I see:
>
> java -cp pig.jar org.apache.pig.Main -x local script2-local.pig
> 2009-09-11 12:52:46,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
> 1000: Error during parsing. Invalid alias: hour00::group::ngram in
> {group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram:
> chararray,hour: chararray,hour12::count: long}
> 09/09/11 12:52:46 ERROR grunt.Grunt: ERROR 1000: Error during parsing.
> Invalid alias: hour00::group::ngram in {group::ngram: chararray,group::hour:
> chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count:
> long}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-957) Tutorial is broken with 0.4 branch and trunk
[ https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754468#action_12754468 ]

Daniel Dai commented on PIG-957:

+1

> Tutorial is broken with 0.4 branch and trunk
>
>                 Key: PIG-957
>                 URL: https://issues.apache.org/jira/browse/PIG-957
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Olga Natkovich
>            Assignee: Pradeep Kamath
>             Fix For: 0.4.0
>
>         Attachments: PIG-957.patch
>
>
> As I was testing the Pig Tutorial in preparation for the release, I found
> that we broke the second script both in local mode and in MR mode. The issue
> has to do with schema and naming fields.
> Here is what I see:
>
> java -cp pig.jar org.apache.pig.Main -x local script2-local.pig
> 2009-09-11 12:52:46,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
> 1000: Error during parsing. Invalid alias: hour00::group::ngram in
> {group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram:
> chararray,hour: chararray,hour12::count: long}
> 09/09/11 12:52:46 ERROR grunt.Grunt: ERROR 1000: Error during parsing.
> Invalid alias: hour00::group::ngram in {group::ngram: chararray,group::hour:
> chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count:
> long}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754466#action_12754466 ]

Hadoop QA commented on PIG-955:

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12419352/PIG-955.patch2
  against trunk revision 814075.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 6 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/25/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/25/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/25/console

This message is automatically generated.

> Skewed join generates incorrect results
>
>                 Key: PIG-955
>                 URL: https://issues.apache.org/jira/browse/PIG-955
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Ying He
>         Attachments: PIG-955.patch, PIG-955.patch2
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first
> table) correctly. This can cause data loss.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-957) Tutorial is broken with 0.4 branch and trunk
[ https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-957: --- Status: Patch Available (was: Open) > Tutorial is broken with 0.4 branch and trunk > > > Key: PIG-957 > URL: https://issues.apache.org/jira/browse/PIG-957 > Project: Pig > Issue Type: Bug >Affects Versions: 0.3.0 >Reporter: Olga Natkovich >Assignee: Pradeep Kamath > Fix For: 0.4.0 > > Attachments: PIG-957.patch > > > As I was testing the Pig Tutorial in preparation for the release, I found > that we broke the second script both in local mode and in MR mode. The issue > has to do with schema and naming fields. > Here is what I see: > > java -cp pig.jar org.apache.pig.Main -x local script2-local.pig > 2009-09-11 12:52:46,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1000: Error during parsing. Invalid alias: hour00::group::ngram in > {group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram: > chararray,hour: chararray,hour12::count: long} > 09/09/11 12:52:46 ERROR grunt.Grunt: ERROR 1000: Error during parsing. > Invalid alias: hour00::group::ngram in {group::ngram: chararray,group::hour: > chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count: > long} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-957) Tutorial is broken with 0.4 branch and trunk
[ https://issues.apache.org/jira/browse/PIG-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-957:

    Attachment: PIG-957.patch

Attached patch to address the issue. LOJoin's getSchema() now keeps both the disambiguated (outeralias::inneralias) alias and the simple inner alias for non-duplicate columns coming out of the LOJoin.

> Tutorial is broken with 0.4 branch and trunk
>
>                 Key: PIG-957
>                 URL: https://issues.apache.org/jira/browse/PIG-957
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Olga Natkovich
>            Assignee: Pradeep Kamath
>             Fix For: 0.4.0
>
>         Attachments: PIG-957.patch
>
>
> As I was testing the Pig Tutorial in preparation for the release, I found
> that we broke the second script both in local mode and in MR mode. The issue
> has to do with schema and naming fields.
> Here is what I see:
>
> java -cp pig.jar org.apache.pig.Main -x local script2-local.pig
> 2009-09-11 12:52:46,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
> 1000: Error during parsing. Invalid alias: hour00::group::ngram in
> {group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram:
> chararray,hour: chararray,hour12::count: long}
> 09/09/11 12:52:46 ERROR grunt.Grunt: ERROR 1000: Error during parsing.
> Invalid alias: hour00::group::ngram in {group::ngram: chararray,group::hour:
> chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count:
> long}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
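The alias handling described in the PIG-957 patch comment above can be illustrated with a small sketch. This is not Pig's LOJoin code; it is a hypothetical Python model that assumes each join input is an outer alias plus a list of field names. The input names hour00 and hour12 and the first input's fields are taken from the error message in the report, while the second input's field list is an assumption.

```python
def join_schema(inputs):
    """Model the output schema of a join.

    inputs: list of (outer_alias, [field_name, ...]) pairs (hypothetical shape).
    Returns one set of accepted aliases per output column.
    """
    # Count how many inputs contribute each simple field name.
    counts = {}
    for _, fields in inputs:
        for field in fields:
            counts[field] = counts.get(field, 0) + 1

    columns = []
    for outer, fields in inputs:
        for field in fields:
            # The disambiguated alias is always accepted.
            aliases = {f"{outer}::{field}"}
            # The simple alias stays usable only when it is not duplicated.
            if counts[field] == 1:
                aliases.add(field)
            columns.append(aliases)
    return columns


def resolve(columns, alias):
    """Return the column index for an alias, or raise if it does not match exactly one column."""
    matches = [i for i, accepted in enumerate(columns) if alias in accepted]
    if len(matches) != 1:
        raise KeyError(f"Invalid alias: {alias}")
    return matches[0]


schema = join_schema([
    ("hour00", ["group::ngram", "group::hour", "count"]),
    ("hour12", ["ngram", "hour", "count"]),
])
```

Under this model both `hour00::group::ngram` and `group::ngram` resolve to the first column, while the bare `count` is rejected because it appears in both inputs and must be written as `hour00::count` or `hour12::count`.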
[jira] Created: (PIG-957) Tutorial is broken with 0.4 branch and trunk
Tutorial is broken with 0.4 branch and trunk

                 Key: PIG-957
                 URL: https://issues.apache.org/jira/browse/PIG-957
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.3.0
            Reporter: Olga Natkovich
            Assignee: Pradeep Kamath
             Fix For: 0.4.0

As I was testing the Pig Tutorial in preparation for the release, I found that we broke the second script both in local mode and in MR mode. The issue has to do with schema and naming fields.
Here is what I see:

java -cp pig.jar org.apache.pig.Main -x local script2-local.pig
2009-09-11 12:52:46,961 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: hour00::group::ngram in {group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count: long}
09/09/11 12:52:46 ERROR grunt.Grunt: ERROR 1000: Error during parsing. Invalid alias: hour00::group::ngram in {group::ngram: chararray,group::hour: chararray,hour00::count: long,ngram: chararray,hour: chararray,hour12::count: long}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-955: --- Status: Patch Available (was: Open) > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch, PIG-955.patch2 > > > SkewedPartitioner doesn't partition the skewed keys in partition table (first > table) correctly. This can cause data loss. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ying He updated PIG-955: Attachment: PIG-955.patch2 add Junit test > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch, PIG-955.patch2 > > > SkewedPartitioner doesn't partition the skewed keys in partition table (first > table) correctly. This can cause data loss. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
[ https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-954: --- Resolution: Fixed Fix Version/s: 0.4.0 Status: Resolved (was: Patch Available) patch committed. Thanks, Ying for a quick fix! > Skewed join fails when pig.skewedjoin.reduce.memusage is not configured > --- > > Key: PIG-954 > URL: https://issues.apache.org/jira/browse/PIG-954 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Fix For: 0.4.0 > > Attachments: PIG-954.patch, PIG-954.patch2 > > > query fails if pig.skewedjoin.reduce.memusage is not configured. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
[ https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754396#action_12754396 ] Hadoop QA commented on PIG-954: --- +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12419336/PIG-954.patch2 against trunk revision 814016. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/5/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/5/console This message is automatically generated. > Skewed join fails when pig.skewedjoin.reduce.memusage is not configured > --- > > Key: PIG-954 > URL: https://issues.apache.org/jira/browse/PIG-954 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-954.patch, PIG-954.patch2 > > > query fails if pig.skewedjoin.reduce.memusage is not configured. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-641) Fragment replicate join does not work in local mode
[ https://issues.apache.org/jira/browse/PIG-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath resolved PIG-641.

       Resolution: Fixed
    Fix Version/s: 0.4.0

This issue is fixed in current trunk since LocalLogToPhyTranslationVisitor always translates LOJoin into POCogroup followed by foreach flatten regardless of join type.

Here is a script I tried to validate:

[prade...@chargesize:~/dev/pig-apache/pig/trunk]cat a.txt
1 2 3
2 3 4
3 4 5
[prade...@chargesize:~/dev/pig-apache/pig/trunk]cat b.txt
3 a
1 x
4 b
[prade...@chargesize:~/dev/pig-apache/pig/trunk]cat c.txt
1 20 30
[prade...@chargesize:~/dev/pig-apache/pig/trunk]java -cp /tmp/svncheckout/trunk/pig.jar org.apache.pig.Main -x local -e "a = load 'a.txt'; b = load 'b.txt'; c = load 'c.txt'; d = join a by \$0, b by \$0 using \"replicated\"; dump d; e = join a by \$0, c by \$0 using \"replicated\"; dump e;"
2009-09-11 15:27:54,852 [main] INFO org.apache.pig.Main - Logging error messages to: /homes/pradeepk/dev/pig-apache/pig/trunk/pig_1252708074851.log
2009-09-11 15:27:55,217 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp-1388892738/tmp1991974517"
2009-09-11 15:27:55,218 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2
2009-09-11 15:27:55,218 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 0
2009-09-11 15:27:55,218 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-09-11 15:27:55,218 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(1,2,3,1,x)
(3,4,5,3,a)
2009-09-11 15:27:55,253 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp-1388892738/tmp84396309"
2009-09-11 15:27:55,253 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 1
2009-09-11 15:27:55,253 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 0
2009-09-11 15:27:55,254 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-09-11 15:27:55,254 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(1,2,3,1,20,30)
[prade...@chargesize:~/dev/pig-apache/pig/trunk]

> Fragment replicate join does not work in local mode
>
>                 Key: PIG-641
>                 URL: https://issues.apache.org/jira/browse/PIG-641
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Shubham Chopra
>             Fix For: 0.4.0
>
>         Attachments: 641.patch, 641.patch
>
>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
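The translation mentioned in the PIG-641 resolution above (a join expressed as a cogroup followed by a flatten) can be sketched in a few lines. This is a simplified Python model, not Pig's POCogroup operator; the function names are hypothetical, and the rows are the a.txt, b.txt and c.txt contents from the message, with fields kept as strings.

```python
from collections import defaultdict


def cogroup(left, right, key=0):
    """Group rows of both inputs by the join key: key -> (left_rows, right_rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return groups


def flatten_join(groups):
    """Flatten each group into the cross product of its two bags (inner join)."""
    joined = []
    for k in sorted(groups):
        left_rows, right_rows = groups[k]
        # A key missing from either side has an empty bag and yields nothing.
        for l in left_rows:
            for r in right_rows:
                joined.append(l + r)
    return joined


a = [("1", "2", "3"), ("2", "3", "4"), ("3", "4", "5")]
b = [("3", "a"), ("1", "x"), ("4", "b")]
c = [("1", "20", "30")]

d = flatten_join(cogroup(a, b))
e = flatten_join(cogroup(a, c))
```

With these inputs `d` holds the two joined rows and `e` the single joined row shown in the dump output of the message. Sorting the keys is only there to make the output order deterministic in this sketch.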
[jira] Updated: (PIG-929) Default value of memusage for skewed join is not correct
[ https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying He updated PIG-929:

    Description: The default value of pig.skewedjoin.reduce.memusage, which is used in skewed join, should be set to 0.3  (was: Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution.

We need to implement the skewed join in pig.)

> Default value of memusage for skewed join is not correct
>
>                 Key: PIG-929
>                 URL: https://issues.apache.org/jira/browse/PIG-929
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Ying He
>         Attachments: memusage.patch
>
>
> The default value of pig.skewedjoin.reduce.memusage, which is used in skewed
> join, should be set to 0.3

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
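The configuration behaviour discussed for pig.skewedjoin.reduce.memusage (a default of 0.3, and no failure when the property is absent) can be sketched as a lookup with a fallback. This is a hypothetical Python illustration, not Pig's code: the function name, the dictionary of properties and the range check are all assumptions; only the property name and the 0.3 default come from the messages.

```python
DEFAULT_MEMUSAGE = 0.3  # default value stated in the PIG-929 description


def reduce_memusage(props):
    """Read pig.skewedjoin.reduce.memusage, falling back to the default when unset."""
    raw = props.get("pig.skewedjoin.reduce.memusage")
    if raw is None or raw.strip() == "":
        # Property not configured: use the default instead of failing.
        return DEFAULT_MEMUSAGE
    value = float(raw)
    # Assumed sanity check: treat the value as a fraction of available memory.
    if not 0.0 < value <= 1.0:
        raise ValueError(f"memusage must be in (0, 1], got {value}")
    return value
```

Calling `reduce_memusage({})` returns the default, while an explicit setting such as `{"pig.skewedjoin.reduce.memusage": "0.5"}` overrides it.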
[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
[ https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ying He updated PIG-954: Description: query fails if pig.skewedjoin.reduce.memusage is not configured. (was: Fragmented replicated join has a few limitations: - One of the tables needs to be loaded into memory - Join is limited to two tables Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution. We need to implement the skewed join in pig.) > Skewed join fails when pig.skewedjoin.reduce.memusage is not configured > --- > > Key: PIG-954 > URL: https://issues.apache.org/jira/browse/PIG-954 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-954.patch, PIG-954.patch2 > > > query fails if pig.skewedjoin.reduce.memusage is not configured. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754370#action_12754370 ]

Ying He commented on PIG-955:

This is not related to replicate join. The original description is misleading. It came from the JIRA that this one is cloned from. I've updated it to the correct one.

> Skewed join generates incorrect results
>
>                 Key: PIG-955
>                 URL: https://issues.apache.org/jira/browse/PIG-955
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Ying He
>         Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first
> table) correctly. This can cause data loss.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying He updated PIG-955:

    Description: SkewedPartitioner doesn't the skewed keys in partition table correctly. This can cause data loss.  (was: Fragmented replicated join has a few limitations:
 - One of the tables needs to be loaded into memory
 - Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution.

We need to implement the skewed join in pig.)

> Skewed join generates incorrect results
>
>                 Key: PIG-955
>                 URL: https://issues.apache.org/jira/browse/PIG-955
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Ying He
>         Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't the skewed keys in partition table correctly. This
> can cause data loss.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying He updated PIG-955:

    Description: SkewedPartitioner doesn't partition the skewed keys in partition table (first table) correctly. This can cause data loss.  (was: SkewedPartitioner doesn't the skewed keys in partition table correctly. This can cause data loss.)

> Skewed join generates incorrect results
>
>                 Key: PIG-955
>                 URL: https://issues.apache.org/jira/browse/PIG-955
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Ying He
>         Attachments: PIG-955.patch
>
>
> SkewedPartitioner doesn't partition the skewed keys in partition table (first
> table) correctly. This can cause data loss.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
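To make the PIG-955 description above more concrete, here is a rough sketch of what partitioning skewed keys of the first table could look like. It is not Pig's SkewedPartitioner and does not reproduce the bug or the patch; it assumes a hypothetical key_ranges map, derived from the key histogram, that assigns each skewed key an inclusive range of reducers, and it assumes rows of the other table are copied to every reducer in that range.

```python
import zlib


class SkewedPartitionerSketch:
    """Spread tuples of skewed keys over a reducer range; hash everything else."""

    def __init__(self, num_reducers, key_ranges):
        self.num_reducers = num_reducers
        # Hypothetical structure: skewed key -> (first_reducer, last_reducer), inclusive.
        self.key_ranges = key_ranges
        # Per-key round-robin position inside the range.
        self._next_offset = {}

    def partition(self, key):
        if key not in self.key_ranges:
            # Ordinary key: stable hash partitioning.
            return zlib.crc32(key.encode("utf-8")) % self.num_reducers
        first, last = self.key_ranges[key]
        span = last - first + 1
        offset = self._next_offset.get(key, 0)
        self._next_offset[key] = (offset + 1) % span
        # Must stay inside [first, last]; a tuple sent elsewhere would find
        # no matching rows from the other table and be dropped from the join.
        return first + offset


partitioner = SkewedPartitionerSketch(num_reducers=10, key_ranges={"hot": (2, 4)})
```

The sketch does not handle ranges that wrap around the last reducer. It does show why a mistake in this step can lose data: the join only finds matches on reducers inside the range assigned to the key.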
[jira] Commented: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour
[ https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754353#action_12754353 ]

Jing Huang commented on PIG-949:

Thanks Alok. I am able to reproduce the problem. I was only using the i/o layer (not the pig loader) to test map split. This is what I did:

final static String STR_SCHEMA = "m1:map(string),m2:map(map(int))";
final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]";
...create table and insert data ..
load:
String projection = new String("m1#{a}");

I only got null returned.

Without the storage hint [m1], everything works fine, i.e.

final static String STR_STORAGE = "[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}]";
...create table and insert data ..
load:
String projection = new String("m1#{a}");

I am able to get the value of m1#{a}.

The Zebra team is working on the fix.

> Zebra Bug: splitting map into multiple column group using storage hint causes
> unexpected behaviour
>
>                 Key: PIG-949
>                 URL: https://issues.apache.org/jira/browse/PIG-949
>             Project: Pig
>          Issue Type: Bug
>         Environment: linux
>            Reporter: Alok Singh
>
> Hi
> The storage hint specification plays an important part in whether the
> output table is readable or not.
> Say we have the map 'map'.
> One can split the map into a column group using [map#{k1}, map#{k2}...];
> however, the remaining map fields will automatically be added to the default
> group.
> If a user tries to create a new column group for the remaining fields as
> follows
> [map#{k1}, map#{k2}, ..][map], i.e. create a separate column group,
> the table writer will create the table.
> However, if one tries to load the created table via pig or via map reduce
> using TableInputFormat,
> then the reader has a problem reading the map.
> We get the following stack trace
> 09/09/09 00:09:45 INFO mapred.JobClient: Task Id :
> attempt_200908191538_33939_m_21_2, Status : FAILED
> java.io.IOException: getValue() failed: null
>         at org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775)
>         at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717)
>         at org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Alok

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
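The two storage hints compared in the PIG-949 comment above differ only in the trailing [m1] group. A small sketch can make that structure visible by splitting a hint string into column groups. This is a hypothetical Python parser written for illustration, not Zebra's storage-hint parser; it only understands the bracketed form shown in the message (name, optional #{k1|k2} key list) and the helper names are assumptions.

```python
import re


def parse_storage_hint(hint):
    """Split a hint like "[m1#{a}];[m2#{x|y}]" into column groups.

    Returns a list of groups; each group is a list of (column, keys) pairs,
    where keys is a list of map keys or None for the whole column.
    """
    groups = []
    for body in re.findall(r"\[([^\]]*)\]", hint):
        columns = []
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            match = re.fullmatch(r"(\w+)(?:#\{([^}]*)\})?", item)
            if match is None:
                raise ValueError(f"cannot parse column spec: {item!r}")
            name, keys = match.group(1), match.group(2)
            columns.append((name, keys.split("|") if keys is not None else None))
        groups.append(columns)
    return groups


def has_explicit_remainder(groups, column):
    """True when a map column is split by key and also placed whole in some group."""
    specs = [keys for group in groups for name, keys in group if name == column]
    return any(k is None for k in specs) and any(k is not None for k in specs)


failing = parse_storage_hint("[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}];[m1]")
working = parse_storage_hint("[m1#{a}];[m2#{x|y}]; [m1#{b}, m2#{z}]")
```

Here `has_explicit_remainder(failing, "m1")` is true and `has_explicit_remainder(working, "m1")` is false, which is the pattern the comment associates with the null result.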
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754349#action_12754349 ]

Santhosh Srinivasan commented on PIG-955:

Hi Ying,

How are Fragment Replicate Join and Skewed Join related as you mention in your bug description? Also, skewed join has been part of trunk for more than a month now. Your bug description states that Pig needs skewed join.

Thanks,
Santhosh

> Skewed join generates incorrect results
>
>                 Key: PIG-955
>                 URL: https://issues.apache.org/jira/browse/PIG-955
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Ying He
>         Attachments: PIG-955.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase.
> It computes a histogram of the key space to account for skewing in the input
> records. Further, it adjusts the number of reducers depending on the key
> distribution.
> We need to implement the skewed join in pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754336#action_12754336 ] Olga Natkovich commented on PIG-955: Updated wrong JIRA > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
[ https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-954: --- Status: Patch Available (was: Open) > Skewed join fails when pig.skewedjoin.reduce.memusage is not configured > --- > > Key: PIG-954 > URL: https://issues.apache.org/jira/browse/PIG-954 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-954.patch, PIG-954.patch2 > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754335#action_12754335 ] Olga Natkovich commented on PIG-955: +1. Changes look good. Just need to wait for test results > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-955: --- Status: Open (was: Patch Available) > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
[ https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754338#action_12754338 ] Olga Natkovich commented on PIG-954: +1 on the code changes. Need to wait for test results > Skewed join fails when pig.skewedjoin.reduce.memusage is not configured > --- > > Key: PIG-954 > URL: https://issues.apache.org/jira/browse/PIG-954 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-954.patch, PIG-954.patch2 > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-955: --- Status: Patch Available (was: Open) > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-882) log level not propogated to loggers
[ https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-882: --- Resolution: Fixed Fix Version/s: 0.4.0 Status: Resolved (was: Patch Available) I don't have a unit test case for the same reason of the first patch. See my first comment. > log level not propogated to loggers > > > Key: PIG-882 > URL: https://issues.apache.org/jira/browse/PIG-882 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.3.0 >Reporter: Thejas M Nair >Assignee: Daniel Dai > Fix For: 0.4.0 > > Attachments: duplicate_message.patch, PIG-882-1.patch, > PIG-882-2.patch, PIG-882-3.patch, PIG-882-4.patch, PIG-882-5.patch > > > Pig accepts log level as a parameter. But the log level it captures is not > set appropriately, so that loggers in different classes log at the specified > level. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-954) Skewed join fails when pig.skewedjoin.reduce.memusage is not configured
[ https://issues.apache.org/jira/browse/PIG-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ying He updated PIG-954: Attachment: PIG-954.patch2 add JUnit test > Skewed join fails when pig.skewedjoin.reduce.memusage is not configured > --- > > Key: PIG-954 > URL: https://issues.apache.org/jira/browse/PIG-954 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-954.patch, PIG-954.patch2 > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754319#action_12754319 ] Ying He commented on PIG-955: - The sampling process generates a file that contains the skewed keys and their pre-allocated reducer indexes. Each (key, beginning index, ending index) is stored as a tuple. During the join, this file is loaded by SkewedPartitioner as a lookup table. For each tuple from the partitioned table, its key is matched against this lookup table. If a match is found, the partitioner returns a value in the range [beginning index, ending index] in round-robin fashion. If no match is found, it uses hash() to calculate the index. The problem is in SkewedPartitioner: when looking up the table, the PigNullableWritable form of the input tuple is used, while the lookup table uses the Pig type Tuple as keys. Therefore, no match is ever found, and the indexes are calculated using hash() even for skewed keys. This causes all the data for such a key to go to the same reducer. For the streaming table, however, if the key is a skewed key, each tuple is replicated to every reducer that was pre-allocated for that key during sampling. Because the reducer indexes are calculated incorrectly for skewed keys in the partitioned table, tuples from the first table are sent to the wrong reducers. If that reducer does not fall into the key's pre-calculated index range, the join with the second table produces an empty data set for that key. The query still appears to succeed, but it loses data. The fix is to change SkewedPartitioner to use the correct object type to look up the skewed key table. > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase.
> It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
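The lookup mismatch described in the PIG-955 comment above can be illustrated with a minimal Java sketch. This is not the actual Pig code: `WrappedKey` is a hypothetical stand-in for a writable wrapper around a key, `SkewedPartitionerSketch` and its method names are invented for illustration, and a plain `List<Object>` plays the role of a tuple. It only shows why looking up a map with the wrong object type always misses, so every skewed key falls back to the hash path.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a writable wrapper around a key (not a real Pig class).
final class WrappedKey {
    private final List<Object> value;

    WrappedKey(List<Object> value) { this.value = value; }

    // Returns the underlying tuple-like key.
    List<Object> getValueAsKey() { return value; }

    @Override public int hashCode() { return value.hashCode(); }

    @Override public boolean equals(Object o) {
        return o instanceof WrappedKey && ((WrappedKey) o).value.equals(value);
    }
}

// Illustrative sketch only; names and structure are assumptions, not Pig's implementation.
final class SkewedPartitionerSketch {
    // Skewed key -> {beginIndex, endIndex}, both inclusive.
    private final Map<Object, int[]> reducerRanges = new HashMap<>();
    // Skewed key -> number of tuples seen so far, used for round robin.
    private final Map<Object, Integer> seen = new HashMap<>();
    private final int numReducers;

    SkewedPartitionerSketch(int numReducers) { this.numReducers = numReducers; }

    void addSkewedKey(List<Object> key, int beginIndex, int endIndex) {
        reducerRanges.put(key, new int[] {beginIndex, endIndex});
    }

    // Buggy variant: the wrapper itself is the lookup key, so it never equals a stored list.
    int partitionBuggy(WrappedKey key) { return lookup(key, key); }

    // Fixed variant: unwrap first so the lookup key has the same type as the stored keys.
    int partitionFixed(WrappedKey key) { return lookup(key.getValueAsKey(), key); }

    private int lookup(Object lookupKey, WrappedKey original) {
        int[] range = reducerRanges.get(lookupKey);
        if (range == null) {
            // No match: fall back to hashing, which always yields the same reducer for a key.
            return Math.floorMod(original.hashCode(), numReducers);
        }
        int span = range[1] - range[0] + 1;
        int offset = seen.merge(lookupKey, 1, Integer::sum) - 1;
        return range[0] + offset % span; // round robin within the pre-allocated range
    }

    public static void main(String[] args) {
        SkewedPartitionerSketch p = new SkewedPartitionerSketch(10);
        List<Object> hot = Arrays.<Object>asList("hot");
        p.addSkewedKey(hot, 3, 5);
        WrappedKey wrapped = new WrappedKey(hot);
        System.out.println("buggy: " + p.partitionBuggy(wrapped) + ", " + p.partitionBuggy(wrapped));
        System.out.println("fixed: " + p.partitionFixed(wrapped) + ", " + p.partitionFixed(wrapped));
    }
}
```

In the buggy variant every tuple of the skewed key lands on one hash-derived reducer, while the fixed variant cycles through the range 3 to 5 that sampling reserved for the key.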
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754306#action_12754306 ] Olga Natkovich commented on PIG-955: Hi Ying, Thanks for the patch. From the description it is not clear what kind of scripts would be affected by this issue. Adding an example to the JIRA description would be helpful. Also, the patch needs a unit test. > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ying He updated PIG-955: Attachment: PIG-955.patch use tuple type to lookup skewed key map > Skewed join generates incorrect results > - > > Key: PIG-955 > URL: https://issues.apache.org/jira/browse/PIG-955 > Project: Pig > Issue Type: Improvement >Reporter: Ying He > Attachments: PIG-955.patch > > > Fragmented replicated join has a few limitations: > - One of the tables needs to be loaded into memory > - Join is limited to two tables > Skewed join partitions the table and joins the records in the reduce phase. > It computes a histogram of the key space to account for skewing in the input > records. Further, it adjusts the number of reducers depending on the key > distribution. > We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (PIG-956) Reduce patch testing time
[ https://issues.apache.org/jira/browse/PIG-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754298#action_12754298 ] Olga Natkovich edited comment on PIG-956 at 9/11/09 12:41 PM: -- My plan is to do the following: (1) Take all the tests that take 5 seconds or less and put them into 10 minute tests (2) Create a TestCheckin - that runs a few end-to-end tests (1) + (2) combined will be the Ten-minute test group. Going forward, any files (this is at the test file level) that take 5 seconds or less can be added to the Ten-minute tests. Also, when any really major feature is added, an end-2-end query can be added or existing one modified in the TestCheckin. was (Author: olgan): My plan is to do the following: (1) Take all the tests that take 5 seconds or less and put them into 10 minute tests (2) Create a TestCheckin - that runs a few end-to-end tests (1) + (2) combined will be the Ten-minute test group. Goint forward, any files (this is at the test file level) that take 5 seconds or less can be added to the Ten-minute tests. Also, when any really major feature is added, an end-2-end query can be added or existing one modified in the TestCheckin. > Reduce patch testing time > - > > Key: PIG-956 > URL: https://issues.apache.org/jira/browse/PIG-956 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.4.0 >Reporter: Olga Natkovich >Assignee: Olga Natkovich > Fix For: 0.6.0 > > > The proposal is to split the tests into 2 groups: > (1) Ten-minute tests - this is a set of tests that run with every patch > submission and takes approximately 10 minutes > (2) All tests - these include all tests and they will run nightly > This is similar to work done in Hadoop: > http://issues.apache.org/jira/browse/HDFS-458 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-956) Reduce patch testing time
[ https://issues.apache.org/jira/browse/PIG-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754298#action_12754298 ] Olga Natkovich commented on PIG-956: My plan is to do the following: (1) Take all the tests that take 5 seconds or less and put them into 10 minute tests (2) Create a TestCheckin - that runs a few end-to-end tests (1) + (2) combined will be the Ten-minute test group. Going forward, any files (this is at the test file level) that take 5 seconds or less can be added to the Ten-minute tests. Also, when any really major feature is added, an end-2-end query can be added or existing one modified in the TestCheckin. > Reduce patch testing time > - > > Key: PIG-956 > URL: https://issues.apache.org/jira/browse/PIG-956 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.4.0 >Reporter: Olga Natkovich >Assignee: Olga Natkovich > Fix For: 0.6.0 > > > The proposal is to split the tests into 2 groups: > (1) Ten-minute tests - this is a set of tests that run with every patch > submission and takes approximately 10 minutes > (2) All tests - these include all tests and they will run nightly > This is similar to work done in Hadoop: > http://issues.apache.org/jira/browse/HDFS-458 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
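The grouping rule in the plan above (test files that take 5 seconds or less, plus the TestCheckin end-to-end tests) could be sketched roughly as follows. The class name, the method name, the timing map and the sample test file names other than TestCheckin are assumptions made for illustration; the sketch says nothing about how Pig's build actually selects tests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the proposed split; all names here are hypothetical.
final class TestGroupingSketch {
    // Threshold taken from the plan: files that take 5 seconds or less qualify.
    static final double FAST_THRESHOLD_SECONDS = 5.0;

    // Builds the "Ten-minute" group: fast test files first, then the check-in tests.
    static List<String> tenMinuteGroup(Map<String, Double> secondsByTestFile,
                                       List<String> checkinTests) {
        List<String> group = new ArrayList<>();
        for (Map.Entry<String, Double> entry : secondsByTestFile.entrySet()) {
            if (entry.getValue() <= FAST_THRESHOLD_SECONDS) {
                group.add(entry.getKey());
            }
        }
        for (String test : checkinTests) {
            if (!group.contains(test)) { // avoid listing a test file twice
                group.add(test);
            }
        }
        return group;
    }

    public static void main(String[] args) {
        Map<String, Double> timings = new LinkedHashMap<>();
        timings.put("TestFast", 2.0);   // made-up timings for illustration
        timings.put("TestSlow", 61.5);
        System.out.println(tenMinuteGroup(timings, Arrays.asList("TestCheckin")));
    }
}
```

Everything left out of this group would still run in the nightly "All tests" pass described in the issue.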
[jira] Created: (PIG-956) Reduce patch testing time
Reduce patch testing time - Key: PIG-956 URL: https://issues.apache.org/jira/browse/PIG-956 Project: Pig Issue Type: Improvement Affects Versions: 0.4.0 Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.6.0 The proposal is to split the tests into 2 groups: (1) Ten-minute tests - this is a set of tests that run with every patch submission and takes approximately 10 minutes (2) All tests - these include all tests and they will run nightly This is similar to work done in Hadoop: http://issues.apache.org/jira/browse/HDFS-458 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-660: --- Affects Version/s: (was: 0.5.0) 0.4.0 Fix Version/s: 0.5.0 > Integration with Hadoop 0.20 > > > Key: PIG-660 > URL: https://issues.apache.org/jira/browse/PIG-660 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.4.0 > Environment: Hadoop 0.20 >Reporter: Santhosh Srinivasan >Assignee: Santhosh Srinivasan > Fix For: 0.5.0 > > Attachments: hadoop20.jar.gz, PIG-660-for-branch-0.3.patch, > PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, > PIG-660_4.patch, PIG-660_5.patch, PIG-660_trunk.patch, PIG-660_trunk_2.patch, > pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch > > > With Hadoop 0.20, it will be possible to query the status of each map and > reduce in a map reduce job. This will allow better error reporting. Some of > the other items that could be on Hadoop's feature requests/bugs are > documented here for tracking. > 1. Hadoop should return objects instead of strings when exceptions are thrown > 2. The JobControl should handle all exceptions and report them appropriately. > For example, when the JobControl fails to launch jobs, it should handle > exceptions appropriately and should support APIs that query this state, i.e., > failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-955) Skewed join generates incorrect results
Skewed join generates incorrect results - Key: PIG-955 URL: https://issues.apache.org/jira/browse/PIG-955 Project: Pig Issue Type: Improvement Reporter: Ying He Fragmented replicated join has a few limitations: - One of the tables needs to be loaded into memory - Join is limited to two tables Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution. We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Request for feedback: cost-based optimizer
Hi Alan, Thanks for the detailed review. After getting Daniel's feedback (and grokking the relationship between Pig's logical and physical operators, which is a little different than that described in the literature), we agree that the proper place to put the optimizer is at the logical layer, although we will need to compile to the physical layer to get cost estimates (for example, the number of generated MR jobs, which have associated network/queueing/startup costs). In order to adaptively adjust estimates, we will need to be able to trace back from an executed MR job ("job set", really, as some operations like order and join may require several jobs that are considered a single unit) to the logical operators this job covered. Adding that ability will have the additional benefit of enabling more helpful debugging output to end users by associating a failed MR job with what it was supposed to be doing. Totally agree with respect to PigServer and MapReduceLauncher. Making PigServer an actual "server" would be good, but is somewhat orthogonal to this work. Great to know you are working on statistics, looking forward to looking at the proposal. Are you working on just data stats or also execution stats (time per operator per record, that sort of thing)? Thanks -Dmitriy On Fri, Sep 11, 2009 at 1:56 PM, Alan Gates wrote: > This is a good start at adding a cost based optimizer to Pig. I have a > number of comments: > > 1) Your argument for putting it in the physical layer rather than the > logical is that the logical layer does not know physical statistics. This > need not be true. You suggest adding a getStatistics call to the loader to > give statistics. The logical layer can make this call and make decisions > based on the results without understanding the underlying physical layer. 
> It seems that the real reason you want to put the optimizer in the physical > layer is, rather than trying to do predictive statistics (such as we guess > this join will result in a 2x data explosion) you want to see the results of > actual MR jobs and then make decisions. This seems like a reasonable choice > for a couple of reasons: a) statistical guesses are hard to get right, and > Pig has limited statistics to begin with; b) since Pig Latin scripts can be > arbitrarily long, bad guesses at the beginning will have a worse ripple > effect than bad guesses in a SQL optimizer. > > 2) The changes you propose in Pig Server are quite complex. Would it be > possible instead to put the changes in MapReduceLauncher? It could run the > first MR job in a Pig Latin script, look at the results, and then rerun your > CBO on the remaining physical plan and re-translate this to a new MR plan > and resubmit. This would require annotations to the MR plan to indicate > where in a physical plan the MR boundaries fall, so that correct portions of > the original physical plan could be used for reoptimization and > recompilation. But it would contain the complexity of your changes to > MapReduceLauncher instead of scattering them through the entire system. > > 3) On adding getStatistics, I am currently working on a proposal to make a > number of changes to the load interface, including getStatistics. I hope to > publish that proposal by next week. Similarly I am working on a proposal of > how Pig will interact with metadata systems (such as Owl) which I also hope > to propose next week. We will be actively working in these areas because we > need them for our SQL implementation. So, one, you'll get a lot of this for > free; two, we should stay connected on these things so what we implement > works for what you need. > > Alan. 
> > On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote: > >> Whoops :-) >> Here's the Google doc: >> >> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en >> >> -Dmitriy >> >> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan >> wrote: >>> >>> Dmitriy and Gang, >>> >>> The mailing list does not allow attachments. Can you post it on a >>> website and just send the URL ? >>> >>> Thanks, >>> Santhosh >>> >>> -Original Message- >>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] >>> Sent: Tuesday, September 01, 2009 9:48 AM >>> To: pig-dev@hadoop.apache.org >>> Subject: Request for feedback: cost-based optimizer >>> >>> Hi everyone, >>> Attached is a (very) preliminary document outlining a rough design we >>> are proposing for a cost-based optimizer for Pig. >>> This is being done as a capstone project by three CMU Master's students >>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not >>> necessarily meant for immediate incorporation into the Pig codebase, >>> although it would be nice if it, or parts of it, are found to be useful >>> in the mainline. >>> >>> We would love to get some feedback from the developer community >>> regarding the ideas expressed in the document, any concerns about the >>> design, suggestions for improvement, etc. >>> >>> Thanks
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-660: --- Affects Version/s: (was: 0.2.0) 0.5.0 > Integration with Hadoop 0.20 > > > Key: PIG-660 > URL: https://issues.apache.org/jira/browse/PIG-660 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.5.0 > Environment: Hadoop 0.20 >Reporter: Santhosh Srinivasan >Assignee: Santhosh Srinivasan > Fix For: 0.4.0 > > Attachments: hadoop20.jar.gz, PIG-660-for-branch-0.3.patch, > PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, > PIG-660_4.patch, PIG-660_5.patch, PIG-660_trunk.patch, PIG-660_trunk_2.patch, > pig_660_shims.patch, pig_660_shims_2.patch, pig_660_shims_3.patch > > > With Hadoop 0.20, it will be possible to query the status of each map and > reduce in a map reduce job. This will allow better error reporting. Some of > the other items that could be on Hadoop's feature requests/bugs are > documented here for tracking. > 1. Hadoop should return objects instead of strings when exceptions are thrown > 2. The JobControl should handle all exceptions and report them appropriately. > For example, when the JobControl fails to launch jobs, it should handle > exceptions appropriately and should support APIs that query this state, i.e., > failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-892) Make COUNT and AVG deal with nulls accordingly with SQL standar
[ https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-892: --- Resolution: Fixed Status: Resolved (was: Patch Available) > Make COUNT and AVG deal with nulls accordingly with SQL standar > --- > > Key: PIG-892 > URL: https://issues.apache.org/jira/browse/PIG-892 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Olga Natkovich >Assignee: Olga Natkovich > Fix For: 0.4.0 > > Attachments: PIG-892.patch, PIG-892_v2.patch, PIG-892_v3.patch > > > both COUNT and AVG need to ignore nulls. Also add COUNT_STAR to match > COUNT(*) in SQL -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-895) Default parallel for Pig
[ https://issues.apache.org/jira/browse/PIG-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-895: --- Resolution: Fixed Status: Resolved (was: Patch Available) > Default parallel for Pig > > > Key: PIG-895 > URL: https://issues.apache.org/jira/browse/PIG-895 > Project: Pig > Issue Type: New Feature > Components: impl >Affects Versions: 0.3.0 >Reporter: Daniel Dai > Fix For: 0.4.0 > > Attachments: PIG-895-1.patch, PIG-895-2.patch, PIG-895-3.patch > > > For Hadoop 20, if the user does not specify the number of reducers, Hadoop will use > 1 reducer as the default value. This differs from previous versions of Hadoop, in > which the default number of reducers is usually good. 1 reducer is certainly not what the user > wants. Although the user can use the "parallel" keyword to specify the number of > reducers for each statement, it is wordy. We need a convenient way for users > to express a desired number of reducers. Here is my proposal: > 1. Add one property "default_parallel" to Pig. The user can set default_parallel > in the script. E.g.: >set default_parallel 10; > 2. default_parallel is a hint to Pig. Pig is free to optimize the number of > reducers (unlike the parallel keyword). Currently, since we do not have a > mechanism to determine the optimal number of reducers, default_parallel will > always be granted, unless it is overridden by the "parallel" keyword. > 3. If the user puts multiple default_parallel statements inside a script, the last entry will > be taken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
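The precedence rules in the PIG-895 proposal above can be summarized in a small Java sketch: an explicit "parallel" value wins, otherwise the last default_parallel value applies, otherwise the single-reducer default the issue attributes to Hadoop 20 is used. The class and method names are hypothetical and chosen for illustration; this is not how Pig stores or resolves the setting internally.

```java
// Illustrative sketch of the proposed precedence; names are assumptions, not Pig's API.
final class ParallelismSketch {
    private static final int UNSET = -1;
    // Fallback of 1 reducer, as described for Hadoop 20 in the issue text.
    private static final int HADOOP_DEFAULT_REDUCERS = 1;

    private int defaultParallel = UNSET;

    // Models "set default_parallel N;". A later call replaces an earlier one (last entry wins).
    void setDefaultParallel(int reducers) {
        if (reducers <= 0) {
            throw new IllegalArgumentException("reducer count must be positive");
        }
        defaultParallel = reducers;
    }

    // parallelClause is the value of a statement's "parallel" keyword, or null if absent.
    int reducersFor(Integer parallelClause) {
        if (parallelClause != null) {
            return parallelClause;          // explicit keyword overrides the hint
        }
        if (defaultParallel != UNSET) {
            return defaultParallel;         // hint is currently always granted
        }
        return HADOOP_DEFAULT_REDUCERS;
    }

    public static void main(String[] args) {
        ParallelismSketch script = new ParallelismSketch();
        script.setDefaultParallel(10);
        System.out.println(script.reducersFor(null)); // statement without "parallel"
        System.out.println(script.reducersFor(4));    // statement with "parallel 4"
    }
}
```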
proposed changes to Pig UDFs
Hi,

As you know, a lot of work this year went into performance optimization of Pig. One of the main sources of performance problems is high memory usage. In an effort to address this problem we propose switching internal implementation of strings from Java Strings to Hadoop Text because text has lower memory overhead.

Examples (assumes ASCII data; sizes are in bytes):

Real String    Java String    Hadoop Text
5              46             37
10             56             42
20             76             52
40             116            72
80             196            112

As the size of the strings grows so does the gap between the two implementations.

Making this change would have no impact on pig users; however, it will have impact on existing UDFs that work with Strings. Our question is whether UDF writers/owners are comfortable with the proposed transition and will update their UDFs.

Please, let us know by the end of next week if you strongly object to this proposal. Otherwise, we will go forward with this plan.

Thanks,
Olga
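The size figures quoted in the message above happen to fit a simple linear pattern: 36 + 2n bytes for a Java String and 32 + n bytes for Hadoop Text, where n is the number of ASCII characters. The sketch below only reproduces the quoted numbers from that fitted pattern. The constants are inferred from the five data points in the message and are an assumption, not figures taken from JVM or Hadoop documentation, and the class and method names are made up for illustration.

```java
// Illustrative arithmetic only; constants are fitted to the figures quoted in the message.
final class StringFootprintSketch {
    // Fitted pattern: fixed overhead of 36 bytes plus 2 bytes per ASCII character.
    static int javaStringBytes(int asciiChars) {
        return 36 + 2 * asciiChars;
    }

    // Fitted pattern: fixed overhead of 32 bytes plus 1 byte per ASCII character.
    static int hadoopTextBytes(int asciiChars) {
        return 32 + asciiChars;
    }

    public static void main(String[] args) {
        int[] lengths = {5, 10, 20, 40, 80};
        System.out.println("chars  String  Text  gap");
        for (int n : lengths) {
            int gap = javaStringBytes(n) - hadoopTextBytes(n);
            System.out.printf("%5d  %6d  %4d  %3d%n", n, javaStringBytes(n), hadoopTextBytes(n), gap);
        }
    }
}
```

Under this fitted pattern the gap is 4 + n bytes, which matches the observation in the message that the difference widens as strings get longer.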
Re: Request for feedback: cost-based optimizer
This is a good start at adding a cost based optimizer to Pig. I have a number of comments: 1) Your argument for putting it in the physical layer rather than the logical is that the logical layer does not know physical statistics. This need not be true. You suggest adding a getStatistics call to the loader to give statistics. The logical layer can make this call and make decisions based on the results without understanding the underlying physical layer. It seems that the real reason you want to put the optimizer in the physical layer is, rather than trying to do predictive statistics (such as we guess this join will result in a 2x data explosion) you want to see the results of actual MR jobs and then make decisions. This seems like a reasonable choice for a couple of reasons: a) statistical guesses are hard to get right, and Pig has limited statistics to begin with; b) since Pig Latin scripts can be arbitrarily long, bad guesses at the beginning will have a worse ripple effect than bad guesses in a SQL optimizer. 2) The changes you propose in Pig Server are quite complex. Would it be possible instead to put the changes in MapReduceLauncher? It could run the first MR job in a Pig Latin script, look at the results, and then rerun your CBO on the remaining physical plan and re-translate this to a new MR plan and resubmit. This would require annotations to the MR plan to indicate where in a physical plan the MR boundaries fall, so that correct portions of the original physical plan could be used for reoptimization and recompilation. But it would contain the complexity of your changes to MapReduceLauncher instead of scattering them through the entire system. 3) On adding getStatistics, I am currently working on a proposal to make a number of changes to the load interface, including getStatistics. I hope to publish that proposal by next week. 
Similarly I am working on a proposal of how Pig will interact with metadata systems (such as Owl) which I also hope to propose next week. We will be actively working in these areas because we need them for our SQL implementation. So, one, you'll get a lot of this for free; two, we should stay connected on these things so what we implement works for what you need. Alan. On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote: Whoops :-) Here's the Google doc: http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en -Dmitriy On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan wrote: Dmitriy and Gang, The mailing list does not allow attachments. Can you post it on a website and just send the URL ? Thanks, Santhosh -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Tuesday, September 01, 2009 9:48 AM To: pig-dev@hadoop.apache.org Subject: Request for feedback: cost-based optimizer Hi everyone, Attached is a (very) preliminary document outlining a rough design we are proposing for a cost-based optimizer for Pig. This is being done as a capstone project by three CMU Master's students (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not necessarily meant for immediate incorporation into the Pig codebase, although it would be nice if it, or parts of it, are found to be useful in the mainline. We would love to get some feedback from the developer community regarding the ideas expressed in the document, any concerns about the design, suggestions for improvement, etc. Thanks, Dmitriy, Ashutosh, Tejal
[jira] Resolved: (PIG-950) Pig Loader does not handle unix hidden files ( files starting with dot)
[ https://issues.apache.org/jira/browse/PIG-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-950. Resolution: Invalid Fix Version/s: 0.4.0 It is a limitation of Hadoop map-reduce, so we cannot solve it in Pig side. > Pig Loader does not handle unix hidden files ( files starting with dot) > --- > > Key: PIG-950 > URL: https://issues.apache.org/jira/browse/PIG-950 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.4.0 >Reporter: Jing Huang > Fix For: 0.4.0 > > > I am trying to load .btschema file using pig loader, ( .btschema is not an > empty file) > This is what I did: > grunt> a = load '.btschema'; > grunt> dump a; > 2009-09-09 17:41:21,170 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size before optimization: 1 > 2009-09-09 17:41:21,170 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size after optimization: 1 > 2009-09-09 17:41:23,092 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler > - Setting up single store job > 2009-09-09 17:41:23,106 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics > - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - > already initialized > 2009-09-09 17:41:23,127 [Thread-4] WARN org.apache.hadoop.mapred.JobClient - > Use GenericOptionsParser for parsing the arguments. Applications should > implement Tool for the same. 
> 2009-09-09 17:41:23,623 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - 0% complete > 2009-09-09 17:41:28,644 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - 100% complete > 2009-09-09 17:41:28,644 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Successfully stored result in: "file:/tmp/temp165972/tmp-527102439" > 2009-09-09 17:41:28,645 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Records written : 0 > 2009-09-09 17:41:28,645 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Bytes written : 0 > 2009-09-09 17:41:28,645 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Success! > grunt> > = > it dumps nothing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-891) Fixing dfs statement for Pig
[ https://issues.apache.org/jira/browse/PIG-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754233#action_12754233 ] Daniel Dai commented on PIG-891: Tested in local mode as well; the patch works well there too. We have discussed the issues in my previous comment. The suggestions are: 1. We can keep the existing file system commands for now 2. We should use "fs" instead of "dfs" to indicate a file system command, as the latest Hadoop does. Jeff, can you make this small change ("dfs"->"fs") and submit again? Thanks! > Fixing dfs statement for Pig > > > Key: PIG-891 > URL: https://issues.apache.org/jira/browse/PIG-891 > Project: Pig > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Daniel Dai >Assignee: Jeff Zhang >Priority: Minor > Fix For: 0.4.0 > > Attachments: Pig_891.patch > > > Several hadoop dfs commands are not supported or are restricted in current Pig. We > need to fix that. These include: > 1. Several commands are not supported: lsr, dus, count, rmr, expunge, put, > moveFromLocal, get, getmerge, text, moveToLocal, mkdir, touchz, test, stat, > tail, chmod, chown, chgrp. A reference for these commands can be found in > http://hadoop.apache.org/common/docs/current/hdfs_shell.html > 2. None of the existing dfs commands supports globbing. > 3. Pig should provide a programmatic way to perform dfs commands. Several of > them exist in PigServer, but not all of them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-882) log level not propogated to loggers
[ https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754219#action_12754219 ]

Olga Natkovich commented on PIG-882:

+1

> log level not propogated to loggers
>
> Key: PIG-882
> URL: https://issues.apache.org/jira/browse/PIG-882
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.3.0
> Reporter: Thejas M Nair
> Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: duplicate_message.patch, PIG-882-1.patch, PIG-882-2.patch, PIG-882-3.patch, PIG-882-4.patch, PIG-882-5.patch
>
> Pig accepts a log level as a parameter, but the level it captures is not applied to the loggers, so loggers in different classes do not log at the specified level.
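The behaviour PIG-882 asks for (one level, given as a parameter, taking effect in every class's logger) can be sketched with Python's standard `logging` module. This is an analogy only: Pig is written in Java and the actual patch is not shown here. The logger names and the `set_log_level` helper are hypothetical.

```python
import logging

# Hypothetical mapping from a command-line level name to a logging constant.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warn": logging.WARNING,
    "error": logging.ERROR,
}


def set_log_level(level_name, root_name="pig"):
    """Set the level once on a parent logger.

    Child loggers such as "pig.backend.Launcher" keep their own level unset,
    so they inherit the effective level of the nearest configured ancestor.
    """
    logging.getLogger(root_name).setLevel(LEVELS[level_name.lower()])


set_log_level("debug", root_name="pigdemo")
launcher_log = logging.getLogger("pigdemo.backend.Launcher")
effective = launcher_log.getEffectiveLevel()  # inherited from "pigdemo"
```

The design point is that the level is set in one place and inherited, rather than copied into each class's logger by hand.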
[jira] Resolved: (PIG-929) Default value of memusage for skewed join is not correct
[ https://issues.apache.org/jira/browse/PIG-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-929.

Resolution: Fixed

> Default value of memusage for skewed join is not correct
>
> Key: PIG-929
> URL: https://issues.apache.org/jira/browse/PIG-929
> Project: Pig
> Issue Type: Improvement
> Reporter: Ying He
> Attachments: memusage.patch
>
> Fragmented replicated join has a few limitations:
> - One of the tables needs to be loaded into memory
> - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution.
> We need to implement the skewed join in pig.
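The skewed-join idea in the PIG-929 description (build a histogram of the key space, then give heavily used keys more reducers) can be sketched as a toy. This is not Pig's implementation: the function names and the `max_per_reducer` threshold are hypothetical, and a real skewed join would also have to send the matching records of the other table to every reducer that holds a slice of a hot key.

```python
from collections import Counter
from math import ceil


def reducers_per_key(keys, max_per_reducer):
    """Count records per join key and derive how many reducers each key gets.

    `max_per_reducer` is an assumed cap on records of one key per reducer.
    """
    histogram = Counter(keys)
    return {key: max(1, ceil(count / max_per_reducer))
            for key, count in histogram.items()}


def assign_partitions(keys, max_per_reducer):
    """Spread the records of each key round-robin over that key's reducers.

    Returns one (key, slot) pair per input record, in input order.
    """
    allocation = reducers_per_key(keys, max_per_reducer)
    seen = Counter()
    assignment = []
    for key in keys:
        assignment.append((key, seen[key] % allocation[key]))
        seen[key] += 1
    return assignment


# Hypothetical input: key "a" is skewed (5 records), key "b" is not.
sample_keys = ["a"] * 5 + ["b"]
allocation = reducers_per_key(sample_keys, max_per_reducer=2)
```

With at most two records per reducer, the hot key `"a"` is split over three reducers while `"b"` stays on one, which is the reducer adjustment the description refers to.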
Double logs in grunt and ^d don't work ?
Hello, I'm new to Pig. I use it on MacOS, and I wonder whether there is a way to avoid the double log traces in the grunt console, and whether there is a way to make the ^D key work (the DEL key). I find this really inconvenient. Thanks for your answer.