[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724769#action_12724769 ]
Hadoop QA commented on PIG-820:
-------------------------------

+1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12411945/pig-820_v6.patch
  against trunk revision 788174.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/console

This message is automatically generated.

> PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch,
>                      pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>
> Currently a sampling job requires that data already be stored in
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.
> For order by this has mostly been acceptable, because users tend to use
> order by at the end of their script, where other MR jobs have already
> operated on the data and thus it is already stored in BinaryStorage.  For
> Pig scripts that just do an order by, an entire MR job is required to read
> the data and write it out in BinaryStorage format.
>
> As we begin work on join algorithms that will require sampling, this
> requirement to read the entire input and write it back out will not be
> acceptable.  Join is often the first operation of a script, and thus is much
> more likely to trigger this useless up-front translation job.
>
> Instead, RandomSampleLoader can be changed to subsume an existing loader,
> using the user-specified loader to read the tuples while handling the
> skipping between tuples itself.  This will require the subsumed loader to
> implement a SamplableLoader interface that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped.  The return semantics are
>      * exactly the same as {@link java.io.InputStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check whether the loader being used to load the
> data implements the SamplableLoader interface.  If so, rather than creating
> an initial MR job to do the translation, it would create the sampling job
> directly, with RandomSampleLoader using the user-specified loader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
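
For readers following the issue description above, here is a rough, standalone sketch of the "read a tuple, skip ahead, repeat" idea and of the interface check the MRCompiler is described as doing. It is not code from the patch. Everything in it is an assumption made for illustration: SamplableSource is a hypothetical stand-in for the proposed SamplableLoader (without Pig's LoadFunc, so that it compiles on its own), and LineSource, SkippingSampler and SamplingPlanner are invented names. Records are assumed to be newline-delimited.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the proposed SamplableLoader, without Pig's LoadFunc. */
interface SamplableSource {
    /** Skip up to n bytes; same return contract as InputStream#skip(long). */
    long skip(long n) throws IOException;

    /** Number of bytes consumed from the underlying stream so far. */
    long getPosition() throws IOException;

    /** Next newline-delimited record, or null at end of input (assumed format). */
    String next() throws IOException;
}

/** Illustrative loader: newline-delimited records that tracks its byte position. */
class LineSource implements SamplableSource {
    private final InputStream in;
    private long position = 0;

    LineSource(InputStream in) {
        this.in = in;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        position += skipped;
        return skipped;
    }

    @Override
    public long getPosition() {
        return position;
    }

    @Override
    public String next() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawByte = false;
        int b;
        while ((b = in.read()) != -1) {
            position++;
            sawByte = true;
            if (b == '\n') {
                break;
            }
            buf.write(b);
        }
        return sawByte ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }
}

/** Illustrative sampler: the wrapped source reads records, the sampler does the skipping. */
class SkippingSampler {
    static List<String> sample(SamplableSource src, long gap) throws IOException {
        List<String> out = new ArrayList<>();
        String rec;
        while ((rec = src.next()) != null) {
            out.add(rec);
            src.skip(gap);  // jump ahead; this usually lands in the middle of a record
            src.next();     // discard the rest of that record to realign on a boundary
        }
        return out;
    }
}

/** Illustrative version of the check described for the MRCompiler. */
class SamplingPlanner {
    static boolean canSampleDirectly(Object loader) {
        return loader instanceof SamplableSource;
    }
}
```

With ten 3-byte records "r0" through "r9" and a gap of 4 bytes, SkippingSampler.sample returns r0, r3, r6 and r9. Skipping by bytes means the realigning read throws away a partial record each time, which is fine for sampling but is a design choice of this sketch, not something the issue specifies.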