[ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652642#action_12652642
 ] 

Alan Gates commented on PIG-460:
--------------------------------

Here's a quick write up of what will need to be done to change order by from 
being a 3 mr job process to 2.  Currently sampling is done via 
org.apache.pig.impl.builtin.RandomSampleLoader.  Since this loader extends 
BinStorage the first mr job reads the data in whatever format and then stores 
it again using BinStorage.  It is then read in the second job using 
RandomSampleLoader.  The tuples that are selected by RandomSampleLoader are 
grouped into a single reducer and then fed to 
org.apache.pig.impl.builtin.FindQuantiles, which builds a side file containing 
partitioning information.  The third mr job again reads the data and uses the 
side file in the SortPartitioner.  (It may be helpful to do an explain on a 
simple order by query to see all this.)

What needs to change is that RandomSampleLoader should instead become an 
EvalFunc, RandomSampler.  The logic inside can remain the same.  The MRCompiler 
will need to change to create two mr jobs for the sort instead of 3.  The first 
job should contain a ForEach operator with the new RandomSampler function in 
the map.  It's reduce should look just like the reduce of the second mr job in 
the current system (that is, singular and having a ForEach operator that calls 
FindQuantiles).  The second job should remain exactly the same as the third job 
in the current system.  Take a look at MRCompiler.visitSort() for an idea of 
how sort jobs are constructed now.  It's this function and the functions it 
calls that you'll be changing in MRCompiler.

> PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
> ------------------------------------------------------------
>
>                 Key: PIG-460
>                 URL: https://issues.apache.org/jira/browse/PIG-460
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: types_branch
>
>
> Currently order by is done in three MR jobs:
> job 1: read data in whatever loader the user requests, store using BinStorage
> job 2: load using RandomSampleLoader, find quantiles
> job 3: load data again and sort
> It is done this way because RandomSampleLoader extends BinStorage, and so 
> needs the data in that format to read it.
> If the logic in RandomSampleLoader was made into an operator instead of being 
> in a loader then jobs 1 and 2 could be merged.  On average job 1 takes about 
> 15% of the time of an order by script.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to