[ https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4601:
----------------------------------
    Attachment: PIG-4601_4.patch

[~xuefuz]: PIG-4601_3.patch is missing a new file, MergeCogroupConverter.java, which causes a compilation error. Could you revert the changes from PIG-4601_3.patch and use PIG-4601_4.patch instead? Very sorry for the inconvenience.

> Implement Merge CoGroup for Spark engine
> ----------------------------------------
>
>                 Key: PIG-4601
>                 URL: https://issues.apache.org/jira/browse/PIG-4601
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: Mohit Sabharwal
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4601_1.patch, PIG-4601_2.patch, PIG-4601_3.patch,
> PIG-4601_4.patch
>
>
> A regular cogroup operation requires a full map-reduce job. The goal of merge
> cogroup is to implement cogroup in a single stage (map), but this requires
> that the input data be sorted.
> Performance improves when A (a big dataset) is merge-cogrouped with B (a small
> dataset), because we first generate an index file for A, then load A
> (according to the index file) and B into memory to do the cogroup. The gain
> comes from avoiding the cost of the reduce phase that a regular cogroup incurs.
> How to use
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B are sorted.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)