[ https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated PIG-4601:
----------------------------------
    Comment: was deleted

(was: To explain further why loading multiple splits behaves differently in Spark and MR: in MR, org.apache.hadoop.mapreduce.JobSubmitter#writeNewSplits sorts the splits by size (biggest first), whereas Spark does no such sorting.
{code}
writeNewSplits () {
  List<InputSplit> splits = input.getSplits(job);
  //...
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
  // sort the splits into order based on size, so that the biggest
  // go first
  Arrays.sort(array, new SplitComparator());
  JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
      jobSubmitDir.getFileSystem(conf), array);
}
{code})

> Implement Merge CoGroup for Spark engine
> ----------------------------------------
>
>                 Key: PIG-4601
>                 URL: https://issues.apache.org/jira/browse/PIG-4601
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: Mohit Sabharwal
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4601_1.patch, PIG-4601_2.patch, PIG-4601_3.patch, PIG-4601_4.patch
>
>
> A regular cogroup operation requires a full map-reduce job. The goal of merge
> cogroup is to implement cogroup in a single stage (map only). In return, we
> need to guarantee that the input data are sorted.
> There is a performance improvement when A (a big dataset) is merge-cogrouped
> with B (a small dataset): we first generate an index file for A, then load A
> according to the index file and load B into memory to do the cogroup.
> Performance improves because, compared with a regular cogroup, there is no
> reduce-phase cost.
> How to use
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B are both already sorted.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
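
The single-stage idea in the issue description can be illustrated with a small standalone sketch. This is not the code from the attached patches and it does not model the index file for A. It only shows why sorted inputs make a reduce stage unnecessary: both inputs can be cogrouped in one forward pass. The names MergeCogroupSketch, Row, Group and mergeCogroup are hypothetical and chosen for illustration, and integer keys with string payloads are assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Pig's actual merge cogroup implementation.
class MergeCogroupSketch {

    // One input row: the cogroup key (c1 in the issue) plus an opaque payload.
    static final class Row {
        final int key;
        final String value;
        Row(int key, String value) { this.key = key; this.value = value; }
    }

    // One output group: the key and the bag of matching values from each input.
    static final class Group {
        final int key;
        final List<String> fromA = new ArrayList<>();
        final List<String> fromB = new ArrayList<>();
        Group(int key) { this.key = key; }
        @Override public String toString() {
            return "(" + key + ", " + fromA + ", " + fromB + ")";
        }
    }

    // Cogroups two inputs in a single pass.
    // Assumption: both lists are ALREADY sorted ascending by key.
    static List<Group> mergeCogroup(List<Row> a, List<Row> b) {
        List<Group> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            // The next group key is the smaller of the two current head keys.
            int key;
            if (j >= b.size()) key = a.get(i).key;
            else if (i >= a.size()) key = b.get(j).key;
            else key = Math.min(a.get(i).key, b.get(j).key);

            // Drain every row carrying that key from both inputs.
            Group g = new Group(key);
            while (i < a.size() && a.get(i).key == key) g.fromA.add(a.get(i++).value);
            while (j < b.size() && b.get(j).key == key) g.fromB.add(b.get(j++).value);
            out.add(g);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> a = Arrays.asList(new Row(1, "a1"), new Row(1, "a2"), new Row(3, "a3"));
        List<Row> b = Arrays.asList(new Row(1, "b1"), new Row(2, "b2"));
        for (Group g : mergeCogroup(a, b)) {
            System.out.println(g);
        }
    }
}
```

If either input is not sorted, this loop silently emits the same key in more than one group. That is the reason for the sortedness requirement, and also why the order in which splits are loaded matters for this operator.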