[ https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152352#comment-15152352 ]
Xuefu Zhang commented on PIG-4601:
----------------------------------

Reverted the old commit and committed the new patch. Thanks, Liyun!

> Implement Merge CoGroup for Spark engine
> ----------------------------------------
>
>                 Key: PIG-4601
>                 URL: https://issues.apache.org/jira/browse/PIG-4601
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: Mohit Sabharwal
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4601_1.patch, PIG-4601_2.patch, PIG-4601_3.patch, PIG-4601_4.patch
>
>
> A regular cogroup operation requires a full map-reduce. The goal of merge
> cogroup is to implement cogroup in a single stage (map only). For this to
> work, the input data must be guaranteed to be sorted.
> There is a performance improvement when A (big dataset) is merge-cogrouped
> with B (small dataset), because we first generate an index file for A, then
> load A according to the index file and B into memory to do the cogroup.
> Performance improves because, unlike a regular cogroup, there is no reduce
> phase cost.
> How to use
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B are sorted.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
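
For readers unfamiliar with the idea, the reason sorted inputs allow a cogroup without a reduce phase can be sketched in a few lines of Python. This is only an illustration of the general single-pass merge technique, not the code in the PIG-4601 patches or anything from Pig's Spark engine. The names merge_cogroup and _next_group are hypothetical, and the index file on A mentioned in the description is not modelled here.

```python
from itertools import groupby
from operator import itemgetter

_DONE = object()  # sentinel marking an exhausted input


def _next_group(groups):
    """Return (key, records) for the next run of equal keys, or (_DONE, [])."""
    for k, records in groups:
        return k, list(records)
    return _DONE, []


def merge_cogroup(a, b, key=itemgetter(0)):
    """Cogroup two iterables that are already sorted ascending by `key`.

    Yields (key, records_from_a, records_from_b) in one pass over each
    input, so no shuffle or reduce step is needed.
    """
    ga, gb = groupby(a, key=key), groupby(b, key=key)
    ka, ra = _next_group(ga)
    kb, rb = _next_group(gb)
    while ka is not _DONE or kb is not _DONE:
        if kb is _DONE or (ka is not _DONE and ka < kb):
            # Key appears only in A (or B is exhausted).
            yield ka, ra, []
            ka, ra = _next_group(ga)
        elif ka is _DONE or kb < ka:
            # Key appears only in B (or A is exhausted).
            yield kb, [], rb
            kb, rb = _next_group(gb)
        else:
            # Same key on both sides: emit both bags together.
            yield ka, ra, rb
            ka, ra = _next_group(ga)
            kb, rb = _next_group(gb)


if __name__ == "__main__":
    A = [(1, "a1"), (1, "a2"), (3, "a3")]  # sorted by first field
    B = [(1, "b1"), (2, "b2")]             # sorted by first field
    for row in merge_cogroup(A, B):
        print(row)
```

If either input is not sorted on the key, the sketch silently produces split or misordered groups, which is why the sortedness requirement in the description matters.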