[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652642#action_12652642 ]
Alan Gates commented on PIG-460:
--------------------------------

Here's a quick write-up of what will need to be done to change order by from a 3 MR job process to 2.

Currently sampling is done via org.apache.pig.impl.builtin.RandomSampleLoader. Since this loader extends BinStorage, the first MR job reads the data in whatever format and then stores it again using BinStorage. It is then read in the second job using RandomSampleLoader. The tuples that are selected by RandomSampleLoader are grouped into a single reducer and then fed to org.apache.pig.impl.builtin.FindQuantiles, which builds a side file containing partitioning information. The third MR job again reads the data and uses the side file in the SortPartitioner. (It may be helpful to do an explain on a simple order by query to see all this.)

What needs to change is that RandomSampleLoader should instead become an EvalFunc, RandomSampler. The logic inside can remain the same.

The MRCompiler will need to change to create two MR jobs for the sort instead of 3. The first job should contain a ForEach operator with the new RandomSampler function in the map. Its reduce should look just like the reduce of the second MR job in the current system (that is, a single reducer with a ForEach operator that calls FindQuantiles). The second job should remain exactly the same as the third job in the current system.

Take a look at MRCompiler.visitSort() for an idea of how sort jobs are constructed now. It's this function and the functions it calls that you'll be changing in MRCompiler.
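To make the proposed two-job flow concrete, here is a rough, self-contained Java sketch of the data flow only. It does not use Pig's classes or APIs: the class TwoJobSortSketch and the methods sample, findQuantiles, partition and sortWithTwoJobs are hypothetical stand-ins for RandomSampler, FindQuantiles and SortPartitioner, plain integers stand in for tuples, and an in-memory array stands in for the side file. The actual sampling and quantile logic in Pig may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TwoJobSortSketch {

    /** Job 1, map side: keep each record with probability sampleRate
     *  (hypothetical stand-in for the proposed RandomSampler function). */
    static List<Integer> sample(List<Integer> records, double sampleRate, Random rng) {
        List<Integer> kept = new ArrayList<>();
        for (Integer r : records) {
            if (rng.nextDouble() < sampleRate) {
                kept.add(r);
            }
        }
        return kept;
    }

    /** Job 1, single reducer: derive numPartitions - 1 cut points from the
     *  sample (stand-in for FindQuantiles; the array replaces the side file). */
    static int[] findQuantiles(List<Integer> sampled, int numPartitions) {
        if (sampled.isEmpty()) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        List<Integer> sorted = new ArrayList<>(sampled);
        Collections.sort(sorted);
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            cuts[i - 1] = sorted.get(Math.min(idx, sorted.size() - 1));
        }
        return cuts;
    }

    /** Job 2: route a key to a reducer index using the cut points
     *  (stand-in for the SortPartitioner). Smaller keys never land in a
     *  higher-numbered partition than larger keys. */
    static int partition(int key, int[] cuts) {
        int pos = Arrays.binarySearch(cuts, key);
        return pos >= 0 ? pos : -(pos + 1);
    }

    /** Runs both simulated jobs and returns the globally sorted output. */
    static List<Integer> sortWithTwoJobs(List<Integer> records, double sampleRate,
                                         int numPartitions, long seed) {
        // Job 1: sample in the map, compute quantiles in one reducer.
        List<Integer> sampled = sample(records, sampleRate, new Random(seed));
        if (sampled.isEmpty()) {
            sampled = records; // tiny inputs: fall back to using every record
        }
        int[] cuts = findQuantiles(sampled, numPartitions);

        // Job 2: read the data again, partition by quantile, sort per reducer.
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            reducers.add(new ArrayList<>());
        }
        for (Integer r : records) {
            reducers.get(partition(r, cuts)).add(r);
        }
        List<Integer> output = new ArrayList<>();
        for (List<Integer> reducerInput : reducers) {
            Collections.sort(reducerInput);
            output.addAll(reducerInput); // concatenating reducer outputs in order
        }
        return output;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add(rng.nextInt(10_000));
        }
        List<Integer> sorted = sortWithTwoJobs(records, 0.1, 4, 7L);
        System.out.println("records sorted: " + sorted.size());
    }
}
```

The point of the sketch is that sampling no longer depends on how the data was stored: the sampler is applied to records the first job has already loaded, so the extra BinStorage conversion job is not needed.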
> PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
> ------------------------------------------------------------
>
> Key: PIG-460
> URL: https://issues.apache.org/jira/browse/PIG-460
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: types_branch
>
>
> Currently order by is done in three MR jobs:
> job 1: read data in whatever loader the user requests, store using BinStorage
> job 2: load using RandomSampleLoader, find quantiles
> job 3: load data again and sort
> It is done this way because RandomSampleLoader extends BinStorage, and so
> needs the data in that format to read it.
> If the logic in RandomSampleLoader was made into an operator instead of being
> in a loader then jobs 1 and 2 could be merged. On average job 1 takes about
> 15% of the time of an order by script.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.