[jira] Commented: (PIG-554) Fragment Replicate Join

Shravan Matthur Narayanamurthy (JIRA) Wed, 10 Dec 2008 04:31:11 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655188#action_12655188
 ]


Shravan Matthur Narayanamurthy commented on PIG-554:
----------------------------------------------------

(1) Have fixed in my local branch
(2) You are right. I missed that one, but it's a minor fix. Have fixed it in my 
branch
(3) I was copying some code from LOCogroup and copied the comment 
inadvertently. There is no such restriction
(4) Fixed in local branch
(5) We do support any number of replicated tables. I have added a number of 
test cases covering joins of 3 tables, joins with and without a schema, and 
schema computation for the FRJoin. Please take a look
(6) Yes. As I had mentioned in one of the meetings, if the FRJoin has n 
inputs (1 fragmented and n-1 replicated), then there will be n-1 map jobs that 
materialize the n-1 replicated inputs to files, so that they can then be read 
to construct the hash maps.
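
To illustrate the idea in (6), here is a rough sketch in Python, not the code 
in the patch. It assumes each replicated input has already been materialized 
and read back as a list of tuples, builds one in-memory hash map per replicated 
input, and streams a fragment of the large input against them. The names 
build_hash_map and fr_join are made up for this example.

```python
from collections import defaultdict
from itertools import product


def build_hash_map(rows, key_index):
    """Group the rows of one replicated input by their join key."""
    table = defaultdict(list)
    for row in rows:
        table[row[key_index]].append(row)
    return table


def fr_join(fragment, fragment_key, replicated):
    """Inner-join one fragment against every replicated input.

    `replicated` is a list of (rows, key_index) pairs, one pair per
    replicated input (hypothetical layout chosen for this sketch).
    """
    # One hash map per replicated input, built once and kept in memory.
    hash_maps = [build_hash_map(rows, k) for rows, k in replicated]
    for row in fragment:
        key = row[fragment_key]
        # .get avoids inserting empty entries into the defaultdicts.
        matches = [m.get(key, []) for m in hash_maps]
        # Inner join: the product is empty if any replicated input
        # has no row for this key, so the fragment row is dropped.
        for combo in product(*matches):
            out = tuple(row)
            for r in combo:
                out += tuple(r)
            yield out
```

Since every replicated input is held entirely in memory, the total size of the 
hash maps is what bounds how large the replicated files can be.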

I am not submitting the patch yet because I see "GC overhead limit exceeded" 
errors even with a 100 MB replicated file when the VM is started with 1 GB of 
heap space. I am still trying to figure out what is causing them. I noticed 
them while I was trying to determine the size limit for the replicated file.

> Fragment Replicate Join
> -----------------------
>
>                 Key: PIG-554
>                 URL: https://issues.apache.org/jira/browse/PIG-554
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: types_branch
>            Reporter: Shravan Matthur Narayanamurthy
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>
>         Attachments: frjofflat.patch
>
>
> Fragment Replicate Join (FRJ) is useful when we want a join between a huge 
> table and a very small table (small enough to fit in memory) and the join 
> doesn't expand the data by much. The idea is to distribute the processing of 
> the huge file by fragmenting it and replicating the small file to all 
> machines receiving a fragment of the huge file. Because the entire small file 
> is available on each machine, the join becomes a trivial task without needing 
> any break in the pipeline. Exhaustive tests have been done to determine the 
> improvement we get out of FRJ. Here are the details: 
> http://wiki.apache.org/pig/PigFRJoin
> The patch makes changes to parts of the code where new operators are 
> introduced. Currently, when a new operator is introduced, its alias is not 
> set. For schema computation I have modified this behaviour to set the alias 
> of the new operator to that of its predecessor. The logical side of the patch 
> mimics the cogroup behavior as join syntax closely resembles that of cogroup. 
> Currently, this patch doesn't have support for joins other than inner joins. 
> The rest of the code has been documented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
