[jira] Updated: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-554:
---
Attachment: frjofflat.patch

Fragment Replicate Join
---
Key: PIG-554
URL: https://issues.apache.org/jira/browse/PIG-554
Project: Pig
Issue Type: New Feature
Affects Versions: types_branch
Reporter: Shravan Matthur Narayanamurthy
Assignee: Shravan Matthur Narayanamurthy
Fix For: types_branch
Attachments: frjofflat.patch

Fragment Replicate Join (FRJ) is useful when we want a join between a huge table and a very small table (small enough to fit in memory) and the join doesn't expand the data by much. The idea is to distribute the processing of the huge file by fragmenting it and replicating the small file to all machines receiving a fragment of the huge file. Because the entire small file is available on each machine, the join becomes a trivial task that needs no break in the pipeline.

Exhaustive tests have been done to determine the improvement we get out of FRJ. I will post the details in a wiki and add a link here.

The patch makes changes to parts of the code where new operators are introduced. Currently, when a new operator is introduced, its alias is not set. For schema computation I have modified this behaviour to set the alias of the new operator to that of its predecessor. The logical side of the patch mimics the cogroup behavior, as join syntax closely resembles that of cogroup. Currently, this patch doesn't support joins other than inner joins. The rest of the code has been documented.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
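The fragment-and-replicate idea in the PIG-554 description can be illustrated with a small in-memory sketch. This is not the code in frjofflat.patch; the function names, the tuple row format, and the round-robin fragmenting are all assumptions made for illustration. It shows an inner join only, matching the limitation stated in the issue.

```python
from collections import defaultdict


def build_replicated_index(small_rows, key_index):
    """Hash the small table by join key; every worker would hold a copy."""
    index = defaultdict(list)
    for row in small_rows:
        index[row[key_index]].append(row)
    return index


def join_fragment(fragment_rows, key_index, replicated_index):
    """Inner-join one fragment of the huge table against the replicated index."""
    for row in fragment_rows:
        # .get() avoids inserting empty entries for keys with no match.
        for match in replicated_index.get(row[key_index], ()):
            yield row + match


def fragment_replicate_join(huge_rows, small_rows, huge_key, small_key, num_fragments):
    """Hypothetical driver: split the huge table, join each fragment on its own."""
    index = build_replicated_index(small_rows, small_key)
    # Round-robin split stands in for the input splits a real job would receive.
    fragments = [huge_rows[i::num_fragments] for i in range(num_fragments)]
    joined = []
    for fragment in fragments:
        joined.extend(join_fragment(fragment, huge_key, index))
    return joined
```

Each fragment is joined without seeing any other fragment, which is why no shuffle or extra pipeline stage is needed once the small table is available everywhere.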
[jira] Commented: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652642#action_12652642 ]

Alan Gates commented on PIG-460:

Here's a quick write-up of what will need to be done to change order by from a 3 MR job process to 2.

Currently sampling is done via org.apache.pig.impl.builtin.RandomSampleLoader. Since this loader extends BinStorage, the first MR job reads the data in whatever format and then stores it again using BinStorage. It is then read in the second job using RandomSampleLoader. The tuples that are selected by RandomSampleLoader are grouped into a single reducer and then fed to org.apache.pig.impl.builtin.FindQuantiles, which builds a side file containing partitioning information. The third MR job again reads the data and uses the side file in the SortPartitioner. (It may be helpful to do an explain on a simple order by query to see all this.)

What needs to change is that RandomSampleLoader should instead become an EvalFunc, RandomSampler. The logic inside can remain the same. The MRCompiler will need to change to create two MR jobs for the sort instead of 3. The first job should contain a ForEach operator with the new RandomSampler function in the map. Its reduce should look just like the reduce of the second MR job in the current system (that is, a single reducer with a ForEach operator that calls FindQuantiles). The second job should remain exactly the same as the third job in the current system.

Take a look at MRCompiler.visitSort() for an idea of how sort jobs are constructed now. It's this function and the functions it calls that you'll be changing in MRCompiler.
PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

Key: PIG-460
URL: https://issues.apache.org/jira/browse/PIG-460
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch

Currently order by is done in three MR jobs:

job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort

It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it. If the logic in RandomSampleLoader was made into an operator instead of being in a loader then jobs 1 and 2 could be merged. On average job 1 takes about 15% of the time of an order by script.
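The sample, find-quantiles, then partition-and-sort flow discussed for PIG-460 can be sketched in a few lines. This is only an in-memory illustration under assumed names (random_sample, find_quantiles, partition_for, order_by); it is not Pig's RandomSampleLoader, FindQuantiles, or SortPartitioner, and the sampling fraction and quantile selection rule are assumptions. The sampler here is a plain function applied to records as they are read, which mirrors the point that sampling does not need its own load step.

```python
import bisect
import random


def random_sample(records, fraction, rng):
    """Keep each record with the given probability (stand-in for the sampler)."""
    return [r for r in records if rng.random() < fraction]


def find_quantiles(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from the sorted sample."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Route a key to the partition whose key range contains it."""
    return bisect.bisect_right(boundaries, key)


def order_by(records, num_partitions, fraction=0.1, seed=0):
    """Hypothetical two-phase sort: sample and find quantiles, then partition and sort."""
    rng = random.Random(seed)
    boundaries = find_quantiles(random_sample(records, fraction, rng), num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition_for(record, boundaries)].append(record)
    # Each partition is sorted independently; concatenating them in order
    # yields a total order because partition key ranges do not overlap.
    return [sorted(p) for p in partitions]
```

The boundaries list plays the role of the side file: it is the only information the final sort job needs from the sampling phase.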
Re: [VOTE] Release Pig 0.1.1 (candidate 0)
+1.

I downloaded the release and checked the signatures and checksums. All unit tests pass.

Arun

On Nov 25, 2008, at 3:58 PM, Olga Natkovich wrote:

Hi,

I have created a candidate build for Pig 0.1.1. This release is almost identical to Pig 0.1.0 with a few exceptions:

(1) It is integrated with hadoop 18
(2) It has one small bug fix (PIG-253)
(3) Several UDFs were added to piggybank - pig's UDF repository

The rat report is attached.

Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup .

Please download, test, and try it out: http://people.apache.org/~olga/pig-0.1.1-candidate-0

Should we release this? Vote closes on Wednesday, December 3rd.

Olga