[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Hadoop QA (JIRA) Fri, 26 Jun 2009 17:34:13 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724769#action_12724769
 ]


Hadoop QA commented on PIG-820:
-------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411945/pig-820_v6.patch
  against trunk revision 788174.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/console

This message is automatically generated.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
> another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
> pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>
> Currently a sampling job requires that the data already be stored in
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For
> order by, this has mostly been acceptable, because users tend to use order by
> at the end of their script, where other MR jobs have already operated on the
> data and it is therefore already stored in BinaryStorage.  For Pig scripts
> that do only an order by, an entire MR job is required just to read the data
> and write it back out in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this
> requirement to read the entire input and write it back out will not be
> acceptable.
> Join is often the first operation of a script, and thus is much more likely
> to trigger this useless up-front translation job.
> Instead, RandomSampleLoader can be changed to subsume an existing loader,
> using the user-specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a
> SamplableLoader interface, which will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>     
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped.  The return semantics are
>      * exactly the same as {@link java.io.InputStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>     
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check whether the loader being used to load data
> implements the SamplableLoader interface.  If so, rather than creating an
> initial MR job to do the translation, it would create the sampling job
> directly, having RandomSampleLoader use the user-specified loader.
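
The proposal quoted above can be illustrated with a small, self-contained Java sketch. It is not the code from the attached patches and does not use Pig's classes: `Loader` is a hypothetical stand-in for Pig's LoadFunc, `LineLoader` is a made-up line-oriented loader, `sample` is a simplified guess at how a sampling wrapper might combine `getNext`, `skip` and `getPosition`, and `planFor` only mimics the kind of check described for MRCompiler. The fixed skip interval and the string plan names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SamplingSketch {

    /** Hypothetical stand-in for Pig's LoadFunc: only the method this sketch needs. */
    public interface Loader {
        /** Returns the next record, or null at end of input. */
        String getNext() throws IOException;
    }

    /** Shape of the interface proposed in the issue, minus the Pig dependency. */
    public interface SamplableLoader extends Loader {
        long skip(long n) throws IOException;
        long getPosition() throws IOException;
    }

    /** Hypothetical newline-delimited loader that tracks its byte position. */
    public static class LineLoader implements SamplableLoader {
        private final InputStream in;
        private long pos = 0;

        public LineLoader(InputStream in) {
            this.in = in;
        }

        @Override
        public String getNext() throws IOException {
            StringBuilder sb = new StringBuilder();
            boolean sawByte = false;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                sawByte = true;
                if (b == '\n') {
                    break;
                }
                sb.append((char) b);
            }
            return sawByte ? sb.toString() : null;
        }

        @Override
        public long skip(long n) throws IOException {
            // Same contract as InputStream.skip: may skip fewer bytes than asked.
            long skipped = in.skip(n);
            pos += skipped;
            return skipped;
        }

        @Override
        public long getPosition() {
            return pos;
        }
    }

    /**
     * Simplified sampler: keep one record, skip ahead a fixed number of bytes,
     * drop the (probably partial) record the skip landed in, and repeat.
     */
    public static List<String> sample(SamplableLoader loader, long skipBytes)
            throws IOException {
        List<String> samples = new ArrayList<>();
        String rec;
        while ((rec = loader.getNext()) != null) {
            samples.add(rec);
            // Assumption: a short skip means end of input. That holds for
            // ByteArrayInputStream but not for every InputStream.
            if (loader.skip(skipBytes) < skipBytes) {
                break;
            }
            loader.getNext(); // discard the record we landed in the middle of
        }
        return samples;
    }

    /** Mimics the described check: sample directly only if the loader is samplable. */
    public static String planFor(Loader loader) {
        return (loader instanceof SamplableLoader)
                ? "sample-directly"
                : "translate-then-sample";
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "a1\nb2\nc3\nd4\ne5\nf6\n".getBytes(StandardCharsets.US_ASCII);
        SamplableLoader loader = new LineLoader(new ByteArrayInputStream(data));
        System.out.println(planFor(loader));
        System.out.println(sample(loader, 4) + " at byte " + loader.getPosition());
    }
}
```

In this sketch the wrapper never parses the input format itself; it relies on the wrapped loader for records and uses `skip` only to move between samples, which is the division of labour the issue describes. A loader that does not implement `SamplableLoader` would fall back to the translate-then-sample plan.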

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
