[ 
https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4601:
----------------------------------
    Description: 
A regular cogroup operation requires a full map-reduce job. The goal of merge 
cogroup is to implement cogroup in a single (map) stage. In exchange, the 
input data must already be sorted by the cogroup key.

Performance improves when a big dataset A is merge-cogrouped with a small 
dataset B: we first generate an index file for A, then load A according to 
the index file and B into memory to do the cogroup. The gain comes from 
avoiding the cost of the reduce phase that a regular cogroup incurs.
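
The single-pass idea can be sketched roughly as follows. This is an 
illustrative Python sketch, not the code in the patch: the function name 
merge_cogroup, the tuple layout, and the in-memory lists are all assumptions 
made for the example, and it does not model the index file. It only shows why 
sorted inputs let cogroup run without a shuffle/reduce step.

```python
from itertools import groupby
from operator import itemgetter


def merge_cogroup(a_rows, b_rows, key=itemgetter(0)):
    """Cogroup two inputs that are ALREADY sorted by key, in a single pass.

    Hypothetical helper for illustration only. Yields (key, bag_a, bag_b).
    """
    # groupby only merges adjacent equal keys, which is enough for sorted input.
    a_groups = groupby(a_rows, key=key)
    b_groups = groupby(b_rows, key=key)
    a_cur = next(a_groups, None)
    b_cur = next(b_groups, None)

    while a_cur is not None or b_cur is not None:
        # Pick the smallest key at the head of either input.
        if b_cur is None or (a_cur is not None and a_cur[0] <= b_cur[0]):
            k = a_cur[0]
        else:
            k = b_cur[0]

        a_bag, b_bag = [], []
        if a_cur is not None and a_cur[0] == k:
            a_bag = list(a_cur[1])          # materialize before advancing
            a_cur = next(a_groups, None)
        if b_cur is not None and b_cur[0] == k:
            b_bag = list(b_cur[1])
            b_cur = next(b_groups, None)

        yield k, a_bag, b_bag


# Both inputs are sorted by their first field (c1).
A = [(1, "a"), (1, "b"), (3, "c")]
B = [(1, "x"), (2, "y")]
result = list(merge_cogroup(A, B))
```

Each input is read once, front to back, and no record has to be moved to 
another worker to meet its matching key. If either input were unsorted, equal 
keys could appear in separate runs and the output would be wrong.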

How to use:
{code}
C = cogroup A by c1, B by c1 using 'merge';
{code}

Here A and B must both be sorted by c1.



  was:
Implement single-stage ("map-side") co-group where all the input data sets are 
sorted by key:

{code}
C = cogroup A by c1, B by c1 using 'merge';
{code}


> Implement Merge CoGroup for Spark engine
> ----------------------------------------
>
>                 Key: PIG-4601
>                 URL: https://issues.apache.org/jira/browse/PIG-4601
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: Mohit Sabharwal
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4601_1.patch
>
>
> A regular cogroup operation requires a full map-reduce job. The goal of merge 
> cogroup is to implement cogroup in a single (map) stage. In exchange, the 
> input data must already be sorted by the cogroup key.
> Performance improves when a big dataset A is merge-cogrouped with a small 
> dataset B: we first generate an index file for A, then load A according to 
> the index file and B into memory to do the cogroup. The gain comes from 
> avoiding the cost of the reduce phase that a regular cogroup incurs.
> How to use:
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B must both be sorted by c1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
