[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514797#comment-14514797 ]

Gopal V commented on TEZ-2358:
------------------------------

[~hitesh]: marking this as a blocker for 0.7.x, because it causes task failures for long-running jobs.

[~rajesh.balamohan]: Patch LGTM - +1

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -------------------------------------------------------------------------
>
>                 Key: TEZ-2358
>                 URL: https://issues.apache.org/jira/browse/TEZ-2358
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Gopal V
>            Assignee: Rajesh Balamohan
>            Priority: Blocker
>         Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between merge operations; this causes confusion when two merge sequences have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>     List inMemoryMapOutputs,
>     List onDiskMapOutputs
>   ...
>   if (inMemoryMapOutputs.size() > 0) {
>     int srcTaskId =
>         inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
>     ...
>     // must spill to disk, but can't retain in-mem for intermediate merge
>     final Path outputPath =
>         mapOutputFile.getInputFileForWrite(srcTaskId, inMemToDiskBytes).suffix(
>             Constants.MERGED_OUTPUT_PREFIX);
>     ...
> {code}
> This, or some related scenario, results in the following FileChunks list, which contains identically named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 Merge of the 6 files in-memory complete. Local file is /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicate that we pulled two pipelined chunks of the same shuffle src id: once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
> 2015-04-23 03:28:08,270 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0]
> 2015-04-23 03:28:08,272 INFO [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, targetIdx: 404, a
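The collision reported above can be sketched in miniature. This is an illustrative toy, not the Tez API or the actual fix in the attached patches: if merge output paths are keyed on srcTaskId alone, two merge passes that consume pipelined chunks from the same source task produce the same file name, while a per-manager merge counter (a hypothetical disambiguation scheme) keeps them distinct.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only (class and naming scheme are hypothetical, not
// Tez code): merge outputs named by srcTaskId alone collide when two merge
// passes each consume pipelined chunks from the same source task.
class MergeOutputNamer {

    private final AtomicInteger mergeSequence = new AtomicInteger();

    // Old scheme: every merge over src task 404 yields "spill_404.out.merged".
    static String collidingPath(int srcTaskId) {
        return "spill_" + srcTaskId + ".out.merged";
    }

    // Disambiguated scheme: append a monotonically increasing merge id,
    // so repeated merges over the same src task get unique paths.
    String uniquePath(int srcTaskId) {
        return "spill_" + srcTaskId + "_" + mergeSequence.getAndIncrement()
            + ".out.merged";
    }
}
```

With the old scheme two merges of task 404 both write `spill_404.out.merged`, which is exactly the duplicate `404.out` pattern visible in the logs above.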
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513830#comment-14513830 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728354/TEZ-2358.4.patch
against master revision 2935ef4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/541//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/541//console

This message is automatically generated.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513802#comment-14513802 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12728348/TEZ-2358.3.patch
against master revision 2935ef4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/540//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/540//console

This message is automatically generated.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513742#comment-14513742 ]

Gopal V commented on TEZ-2358:
------------------------------

[~rajesh.balamohan]: minor nit on the checkArgument (not-or-not) pattern - it gets complex to add another condition later.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513650#comment-14513650 ]

Gopal V commented on TEZ-2358:
------------------------------

The patch was again tested at 10Tb scale over the weekend - there seem to be no collisions in naming.

I looked at the logs and noticed that some earlier tasks did succeed despite the duplicate naming, because there were only a few spills, which ended up split between the disks and so did not collide in paths.

But to prevent future breakage, it would help to trigger an error when the no-duplicate rule for onDiskMapOutputs is violated (i.e. no two file chunks for merging can start at the same offset of the same file).

My original pre-conditions were wrong for auto-reducer parallelism: when it kicks in, we want to merge off a DISK_DIRECT input across two reducers, which means different index points into the same DISK_DIRECT file.
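The no-duplicate rule suggested above could be sketched roughly as follows. The class and method names here are hypothetical illustrations, not the actual Tez code: the guard rejects two merge inputs that start at the same offset of the same file, while still permitting distinct index points into one DISK_DIRECT file (the auto-reducer-parallelism case).

```java
import java.util.HashSet;
import java.util.Set;

// Rough sketch of the suggested precondition (hypothetical names, not Tez
// code): no two file chunks queued for merging may start at the same
// offset of the same file. Different offsets into the same file are fine.
class OnDiskChunkGuard {

    private final Set<String> seen = new HashSet<>();

    // Throws if a chunk with the same (path, startOffset) was already added.
    void add(String path, long startOffset) {
        String key = path + "@" + startOffset;
        if (!seen.add(key)) {
            throw new IllegalStateException(
                "Duplicate on-disk chunk for merge: " + key);
        }
    }
}
```

Keying on (path, offset) rather than path alone is what keeps the DISK_DIRECT case legal: two reducers reading different index points of the same file produce different keys.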
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511227#comment-14511227 ]

TezQA commented on TEZ-2358:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12727924/TEZ-2358.2.patch
against master revision 778c1f5.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/529//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/529//console

This message is automatically generated.
[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
[ https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510894#comment-14510894 ]

Gopal V commented on TEZ-2358:
------------------------------

[~rajesh.balamohan]: The patch is looking good in tests. The ETL succeeded (Time taken: 14184.052 seconds) with this fix and there were no task failures due to full disks - also tested with local.fetch/shared.fetch turned on.

FYI, I added a rough precondition to throw an exception if a duplicate name was added to FileChunks, which pretty much gave me the confidence that this isn't hit anymore.