[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777565#action_12777565 ]

Alan Gates commented on PIG-1090:
Patch looks good. What's the ReadToEndLoader? What's the plan for BinStorage? Are we going to write Input and Output Formats for it? If we have to do that, is there an existing binary storage format with existing input and output formats that we can use (like Avro or something)?

Update sources to reflect recent changes in load-store interfaces
Key: PIG-1090
URL: https://issues.apache.org/jira/browse/PIG-1090
Project: Pig
Issue Type: Sub-task
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Attachments: PIG-1090.patch

There have been some changes (as recorded in the Changes Section, Nov 2 2009 sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the load/store interfaces - this jira is to track the task of making those changes under src. Changes under test will be addressed in a different jira.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1077:
Attachment: patch_Pig1077

[Zebra] to support record(row)-based file split in Zebra's TableInputFormat
Key: PIG-1077
URL: https://issues.apache.org/jira/browse/PIG-1077
Project: Pig
Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
Fix For: 0.6.0
Attachments: patch_Pig1077

TFile currently supports split by record sequence number (see Jira HADOOP-6218). We want to use this to provide record(row)-based input split support in Zebra. One prominent benefit is that, for very large data files, we can create much more fine-grained input splits than before, when we could only create one big split for one big file.

In more detail, the new row-based getSplits() works by default (when the user does not specify the number of splits to be generated) as follows:
1) Select the biggest column group in terms of data size, split all of its TFiles according to hdfs block size (64 MB or 128 MB) and get a list of physical byte offsets as the output per TFile. For example, let us assume for the 1st TFile we get offset1, offset2, ..., offset10;
2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a key-value pair near a byte offset. For the example above, say we get recordNum1, recordNum2, ..., recordNum10;
3) Stitch [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] splits of all column groups, respectively, to form 11 record-based input splits for the 1st TFile.
4) For each input split, create a TFile scanner through TFile.createScannerByRecordNum(long beginRecNum, long endRecNum).

Note: conversion from byte offset to record number will be done by each mapper, rather than at the job initialization phase. This is for performance reasons, since the conversion incurs some TFile reading overhead.
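The default split computation described in PIG-1077 can be sketched roughly as below. This is only an illustration in Python, not the Zebra patch: block_offsets, record_splits and the fixed 10-byte record size are assumptions made up for the example, and the record_num_near callback merely stands in for TFile.getRecordNumNear(long offset). The sketch also does the offset-to-record conversion in one place, whereas the issue says each mapper performs it.

```python
from typing import Callable, List, Tuple


def block_offsets(file_length: int, block_size: int) -> List[int]:
    """Byte offsets of the block boundaries that fall strictly inside the file."""
    return list(range(block_size, file_length, block_size))


def record_splits(
    offsets: List[int],
    record_num_near: Callable[[int], int],  # stand-in for TFile.getRecordNumNear
    last_record_num: int,
) -> List[Tuple[int, int]]:
    """Turn byte offsets into inclusive [begin, end] record ranges."""
    splits = []
    begin = 0
    for offset in offsets:
        end = record_num_near(offset)
        if end < begin:
            # Two offsets mapped to the same record: skip the empty range.
            continue
        splits.append((begin, end))
        begin = end + 1
    if begin <= last_record_num:
        splits.append((begin, last_record_num))
    return splits


# Hypothetical file: 1050 records of 10 bytes each, 1000-byte "blocks".
RECORD_SIZE = 10
offsets = block_offsets(file_length=10500, block_size=1000)
splits = record_splits(offsets, lambda off: off // RECORD_SIZE, last_record_num=1049)
```

With ten block offsets the sketch yields eleven record ranges, matching the count in the description; every record falls in exactly one range.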
Re: FYI - forking TFile off Hadoop into Zebra
On Nov 11, 2009, at 4:13 PM, Ashutosh Chauhan wrote:
> On Wed, Nov 11, 2009 at 18:26, Chao Wang ch...@yahoo-inc.com wrote:
>> Last, we would like to point out that this is a short term solution for Zebra and we plan to:
>> 1) port all changes to Zebra TFile back into Hadoop TFile.
>> 2) in the long run have a single unified solution for this.
>
> Just for clarity, in long run as Zebra stabilizes and Pig adopts hadoop-0.22, Zebra will get rid of this fork?

I think the promise is they'll get rid of the fork at some point, not necessarily at 0.22 though.

Alan.

> Ashutosh
[jira] Assigned: (PIG-1091) [zebra] Exception when load with projection of map keys on a map column that is not map split
[ https://issues.apache.org/jira/browse/PIG-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou reassigned PIG-1091:
Assignee: Yan Zhou

[zebra] Exception when load with projection of map keys on a map column that is not map split
Key: PIG-1091
URL: https://issues.apache.org/jira/browse/PIG-1091
Project: Pig
Issue Type: Bug
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor

With a schema of f1:string, f2:map and storage info of [f1]; [f2], a projection of f2#{a} throws an exception.
[jira] Created: (PIG-1091) [zebra] Exception when load with projection of map keys on a map column that is not map split
[zebra] Exception when load with projection of map keys on a map column that is not map split
Key: PIG-1091
URL: https://issues.apache.org/jira/browse/PIG-1091
Project: Pig
Issue Type: Bug
Reporter: Yan Zhou
Priority: Minor

With a schema of f1:string, f2:map and storage info of [f1]; [f2], a projection of f2#{a} throws an exception.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777589#action_12777589 ]

Daniel Dai commented on PIG-1038:
Continuing from the last comment:
4. Strip secondary keys from the value
5. Write a byte version of OutputKeyComparator

Optimize nested distinct/sort to use secondary key
Key: PIG-1038
URL: https://issues.apache.org/jira/browse/PIG-1038
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, PIG-1038-4.patch, PIG-1038-5.patch

If a nested foreach plan contains sort/distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.

Eg1:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate group, D;
}
store C into 'myresult';

We can specify a secondary sort on A.$1, and drop "order A by $1".

Eg2:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = A.$1;
    E = distinct D;
    generate group, E;
}
store C into 'myresult';

We can specify a secondary sort key on A.$1, and simplify "D = A.$1; E = distinct D" to a special version of distinct that does not do the sorting.
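The idea behind Eg1 and Eg2 in PIG-1038 can be sketched outside Pig as follows. This is a toy Python model, not the patch: shuffle_with_secondary_key is an assumed stand-in for a shuffle that sorts on (group key, secondary key), and the two-field records are invented sample data. Once the values of each group arrive already sorted, the nested order needs no extra sort, and the nested distinct reduces to dropping values equal to the previous one.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_key(records):
    """Stand-in for the shuffle: sort on ($0, $1) instead of $0 only."""
    return sorted(records, key=itemgetter(0, 1))


def nested_order(records):
    """Eg1: each group's bag is already ordered by $1, so no per-group sort."""
    shuffled = shuffle_with_secondary_key(records)
    return [(key, list(rows)) for key, rows in groupby(shuffled, key=itemgetter(0))]


def nested_distinct(records):
    """Eg2: distinct on $1 by comparing each value with the previous one."""
    result = []
    for key, rows in groupby(shuffle_with_secondary_key(records), key=itemgetter(0)):
        distinct = []
        for _, value in rows:
            if not distinct or distinct[-1] != value:
                distinct.append(value)
        result.append((key, distinct))
    return result


records = [("a", 3), ("b", 1), ("a", 1), ("a", 3), ("b", 2)]
ordered = nested_order(records)
deduped = nested_distinct(records)
```

The streaming distinct keeps only the last seen value per group in memory, which is the point of avoiding a DistinctDataBag-style structure in this toy setting.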
[jira] Commented: (PIG-1093) pig.properties file is missing from distributions
[ https://issues.apache.org/jira/browse/PIG-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777598#action_12777598 ]

Alan Gates commented on PIG-1093:
This also affects the 0.6 release, and should be repaired before that release.

pig.properties file is missing from distributions
Key: PIG-1093
URL: https://issues.apache.org/jira/browse/PIG-1093
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.5.0, 0.6.0
Reporter: Alan Gates

pig.properties (in fact the entire conf directory) is not included in the jars distributed as part of the 0.5 release.
[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777599#action_12777599 ]

Pradeep Kamath commented on PIG-1090:
> What's the ReadToEndLoader?
This is an internal utility LoadFunc I wrote to make it easy to read side files. It encapsulates the real Loader. Though it has been implemented as a LoadFunc, the only LoadFunc method that is truly implemented is getNext(). The usage pattern is to construct an instance using the constructor, which takes a reference to the true LoadFunc (which can read the side file data), and then repeatedly call getNext() until it returns null. The implementation of ReadToEndLoader hides the work of getting InputSplits from the underlying InputFormat and then processing each split by getting the RecordReader and processing the data in the split before moving to the next.

> What's the plan for BinStorage?
An input and output format has already been created and checked in on this branch for BinStorage.
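The usage pattern described for ReadToEndLoader (wrap the real loader, then call getNext() until it returns null) might look like the following in miniature. This is a Python sketch under stated assumptions, not the Pig class: ListBackedLoader, get_splits and read_split are hypothetical names standing in for the real LoadFunc, its InputFormat splits and its RecordReader.

```python
class ListBackedLoader:
    """Hypothetical stand-in for the real loader: owns the splits and reads one."""

    def __init__(self, splits):
        self._splits = splits

    def get_splits(self):
        return list(self._splits)

    def read_split(self, split):
        # Plays the role of a RecordReader over a single split.
        return iter(split)


class ReadToEndLoaderSketch:
    """Hides split iteration; callers only see get_next()."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._pending = iter(wrapped.get_splits())
        self._reader = iter(())  # empty until the first split is opened

    def get_next(self):
        while True:
            try:
                return next(self._reader)
            except StopIteration:
                try:
                    # Current split is exhausted: move on to the next one.
                    self._reader = self._wrapped.read_split(next(self._pending))
                except StopIteration:
                    return None  # no splits left, signal end of data


loader = ReadToEndLoaderSketch(ListBackedLoader([[(1, "a"), (2, "b")], [], [(3, "c")]]))
rows = []
row = loader.get_next()
while row is not None:
    rows.append(row)
    row = loader.get_next()
```

The caller's loop never sees split boundaries, including the empty split in the middle, which is the convenience the comment describes for reading side files.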
[jira] Updated: (PIG-1090) Update sources to reflect recent changes in load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1090:
Resolution: Fixed
Hadoop Flags: [Incompatible change, Reviewed]
Status: Resolved (was: Patch Available)

Patch committed to load-store-redesign branch
[jira] Updated: (PIG-1064) Behvaiour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1064:
Attachment: PIG-1064-4.patch

Attached a patch to fix the TestSecondarySort unit test failure.

Behvaiour of COGROUP with and without schema when using * operator
Key: PIG-1064
URL: https://issues.apache.org/jira/browse/PIG-1064
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
Fix For: 0.6.0
Attachments: PIG-1064-2.patch, PIG-1064-3.patch, PIG-1064-4.patch, PIG-1064.patch

I have 2 tab separated files, 1.txt and 2.txt
$ cat 1.txt
1 2
2 3
$ cat 2.txt
1 2
2 3
I use the COGROUP feature of Pig in the following way:
$ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main
{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt' as (b0, b1);
grunt> C = cogroup A by *, B by *;
{code}
2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1012: Each COGroup input has to have the same number of inner plans
Details at logfile: pig_1256845224752.log
==
If I reverse the order of the schemas:
{code}
grunt> A = load '1.txt' as (a0, a1);
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
{code}
2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1013: Grouping attributes can either be star (*) or a list of expressions, but not both.
Details at logfile: pig_1256845224752.log
==
Now running without schema:
{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
grunt> dump C;
{code}
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: file:/tmp/temp-319926700/tmp-1990275961
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 154
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
((1,2),{(1,2)},{(1,2)})
((2,3),{(2,3)},{(2,3)})
==
Is this a bug or a feature?
Viraj
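For reference, the result of the schema-less run in PIG-1064 is what one would expect from grouping both inputs on the whole tuple. The Python sketch below only models that expected output; it is not how Pig implements COGROUP and it says nothing about why the two schema variants fail. The function name cogroup_by_star is an assumption made for the example.

```python
from collections import defaultdict


def cogroup_by_star(a_rows, b_rows):
    """Group both inputs on the entire tuple; emit (key, bag_from_A, bag_from_B)."""
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[tuple(row)][0].append(tuple(row))
    for row in b_rows:
        groups[tuple(row)][1].append(tuple(row))
    # Sort on the key only so the output order is deterministic.
    return [(key, bags[0], bags[1]) for key, bags in sorted(groups.items(), key=lambda kv: kv[0])]


# Contents of 1.txt and 2.txt from the report.
a_rows = [(1, 2), (2, 3)]
b_rows = [(1, 2), (2, 3)]
result = cogroup_by_star(a_rows, b_rows)
```

The two resulting records correspond to the two lines of the dump in the report, ((1,2),{(1,2)},{(1,2)}) and ((2,3),{(2,3)},{(2,3)}).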
[jira] Updated: (PIG-1064) Behvaiour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1064:
Status: Patch Available (was: Open)
[jira] Created: (PIG-1094) Fix unit tests corresponding to source changes so far
Fix unit tests corresponding to source changes so far
Key: PIG-1094
URL: https://issues.apache.org/jira/browse/PIG-1094
Project: Pig
Issue Type: Sub-task
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
[jira] Commented: (PIG-1064) Behvaiour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777699#action_12777699 ]

Hadoop QA commented on PIG-1064:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424878/PIG-1064-4.patch
against trunk revision 835499.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 15 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/155/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/155/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/155/console

This message is automatically generated.
[jira] Commented: (PIG-1064) Behvaiour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1247#action_1247 ]

Pradeep Kamath commented on PIG-1064:
I can't make out what is wrong with the unit tests from the report above. I am running them all on my local box and will update with the results.
[jira] Assigned: (PIG-1072) ReversibleLoadStoreFunc interface should be removed to enable different load and store implementation classes to be used in a reversible manner
[ https://issues.apache.org/jira/browse/PIG-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding reassigned PIG-1072:
Assignee: Richard Ding

ReversibleLoadStoreFunc interface should be removed to enable different load and store implementation classes to be used in a reversible manner
Key: PIG-1072
URL: https://issues.apache.org/jira/browse/PIG-1072
Project: Pig
Issue Type: Sub-task
Reporter: Pradeep Kamath
Assignee: Richard Ding
[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
Attachment: PIG-1062.patch.3

New patch after merge with latest changes to load-store-redesign branch. Incompatible with trunk. Pasting output of test-patch (test cases have not been updated):
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
Key: PIG-1062
URL: https://issues.apache.org/jira/browse/PIG-1062
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Attachments: PIG-1062.patch, PIG-1062.patch.3

This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses, RandomSampleLoader and PoissonSampleLoader, need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.
[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
Status: Patch Available (was: Open)
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1294#action_1294 ]

Hadoop QA commented on PIG-1062:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424927/PIG-1062.patch.3
against trunk revision 835499.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/156/console
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1293#action_1293 ]

Hadoop QA commented on PIG-1077:
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424874/patch_Pig1077
against trunk revision 835499.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 104 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/console