[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task

2015-04-27 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514797#comment-14514797
 ] 

Gopal V commented on TEZ-2358:
--

[~hitesh]: marking this as blocker for 0.7.x, because it causes task failures 
for long running jobs.

[~rajesh.balamohan]: Patch LGTM - +1

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -
>
> Key: TEZ-2358
> URL: https://issues.apache.org/jira/browse/TEZ-2358
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Gopal V
>Assignee: Rajesh Balamohan
>Priority: Blocker
> Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, 
> TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between 
> merge operations, this results in some confusion when two merge sequences 
> have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>List inMemoryMapOutputs,
>List onDiskMapOutputs
> ...
>  if (inMemoryMapOutputs.size() > 0) {
>   int srcTaskId = 
> inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
> ...
>// must spill to disk, but can't retain in-mem for intermediate merge
> final Path outputPath = 
>   mapOutputFile.getInputFileForWrite(srcTaskId,
>  inMemToDiskBytes).suffix(
>  
> Constants.MERGED_OUTPUT_PREFIX);
> ...
> {code}
> This or some scenario related to this, results in the following FileChunks 
> list which contains identical named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: 
> Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down 
> to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 
> Merge of the 6 files in-memory complete. Local file is 
> /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged
>  of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs 
> and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 
> 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 
> 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
> 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicates that we pulled two pipelined 
> chunks of the same shuffle src id, once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: 
> attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
> 2015-04-23 03:28:08,270 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: 
> attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0]
> 2015-04-23 03:28:08,272 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, a

[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task

2015-04-27 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513830#comment-14513830
 ] 

TezQA commented on TEZ-2358:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12728354/TEZ-2358.4.patch
  against master revision 2935ef4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/541//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/541//console

This message is automatically generated.

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -
>
> Key: TEZ-2358
> URL: https://issues.apache.org/jira/browse/TEZ-2358
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Gopal V
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, 
> TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between 
> merge operations, this results in some confusion when two merge sequences 
> have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>List inMemoryMapOutputs,
>List onDiskMapOutputs
> ...
>  if (inMemoryMapOutputs.size() > 0) {
>   int srcTaskId = 
> inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
> ...
>// must spill to disk, but can't retain in-mem for intermediate merge
> final Path outputPath = 
>   mapOutputFile.getInputFileForWrite(srcTaskId,
>  inMemToDiskBytes).suffix(
>  
> Constants.MERGED_OUTPUT_PREFIX);
> ...
> {code}
> This or some scenario related to this, results in the following FileChunks 
> list which contains identical named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: 
> Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down 
> to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 
> Merge of the 6 files in-memory complete. Local file is 
> /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged
>  of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs 
> and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 
> 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 
> 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
> 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicates that we pulled two pipelined 
> chunks of the same shuffle src id, once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 I

[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task

2015-04-27 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513802#comment-14513802
 ] 

TezQA commented on TEZ-2358:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12728348/TEZ-2358.3.patch
  against master revision 2935ef4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/540//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/540//console

This message is automatically generated.

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -
>
> Key: TEZ-2358
> URL: https://issues.apache.org/jira/browse/TEZ-2358
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Gopal V
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, 
> TEZ-2358.4.patch, syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between 
> merge operations, this results in some confusion when two merge sequences 
> have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>List inMemoryMapOutputs,
>List onDiskMapOutputs
> ...
>  if (inMemoryMapOutputs.size() > 0) {
>   int srcTaskId = 
> inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
> ...
>// must spill to disk, but can't retain in-mem for intermediate merge
> final Path outputPath = 
>   mapOutputFile.getInputFileForWrite(srcTaskId,
>  inMemToDiskBytes).suffix(
>  
> Constants.MERGED_OUTPUT_PREFIX);
> ...
> {code}
> This or some scenario related to this, results in the following FileChunks 
> list which contains identical named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: 
> Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down 
> to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 
> Merge of the 6 files in-memory complete. Local file is 
> /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged
>  of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs 
> and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 
> 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 
> 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
> 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicates that we pulled two pipelined 
> chunks of the same shuffle src id, once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 I

[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task

2015-04-27 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513742#comment-14513742
 ] 

Gopal V commented on TEZ-2358:
--

[~rajesh.balamohan]: minor nit on the checkargument (not or not) pattern - it 
gets complex to add another condition later.

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -
>
> Key: TEZ-2358
> URL: https://issues.apache.org/jira/browse/TEZ-2358
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Gopal V
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, TEZ-2358.3.patch, 
> syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between 
> merge operations, this results in some confusion when two merge sequences 
> have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>List inMemoryMapOutputs,
>List onDiskMapOutputs
> ...
>  if (inMemoryMapOutputs.size() > 0) {
>   int srcTaskId = 
> inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
> ...
>// must spill to disk, but can't retain in-mem for intermediate merge
> final Path outputPath = 
>   mapOutputFile.getInputFileForWrite(srcTaskId,
>  inMemToDiskBytes).suffix(
>  
> Constants.MERGED_OUTPUT_PREFIX);
> ...
> {code}
> This or some scenario related to this, results in the following FileChunks 
> list which contains identical named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: 
> Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down 
> to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 
> Merge of the 6 files in-memory complete. Local file is 
> /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged
>  of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs 
> and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 
> 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 
> 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
> 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicates that we pulled two pipelined 
> chunks of the same shuffle src id, once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: 
> attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
> 2015-04-23 03:28:08,270 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: 
> attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0]
> 2015-04-23 03:28:08,272 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-

[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task

2015-04-27 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513650#comment-14513650
 ] 

Gopal V commented on TEZ-2358:
--

The patch was again tested at 10Tb scale over the weekend - there seems to be 
no collisions in naming.

I look at the logs and noticed that some earlier tasks did succeed with the 
duplicate naming, due to the fact that there were only a few spills, resulting 
in them being split between the disks & not colliding in paths.

But for the sake of preventing future breakage, it would help to have an error 
being triggered when someone violates the no-duplicate rule for 
onDiskMapOutputs (i.e no two file chunks for merging can start at the same 
offset of the same file). 

My original pre-conditions were wrong when auto-reducer parallelism kicks in as 
we want to merge off a DISK_DIRECT input across two reducers (when auto-reduce 
parallelism kicks in), which would be different index points into the same 
DISK_DIRECT file.

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -
>
> Key: TEZ-2358
> URL: https://issues.apache.org/jira/browse/TEZ-2358
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Gopal V
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, 
> syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between 
> merge operations, this results in some confusion when two merge sequences 
> have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>List inMemoryMapOutputs,
>List onDiskMapOutputs
> ...
>  if (inMemoryMapOutputs.size() > 0) {
>   int srcTaskId = 
> inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
> ...
>// must spill to disk, but can't retain in-mem for intermediate merge
> final Path outputPath = 
>   mapOutputFile.getInputFileForWrite(srcTaskId,
>  inMemToDiskBytes).suffix(
>  
> Constants.MERGED_OUTPUT_PREFIX);
> ...
> {code}
> This or some scenario related to this, results in the following FileChunks 
> list which contains identical named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: 
> Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down 
> to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 
> Merge of the 6 files in-memory complete. Local file is 
> /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged
>  of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs 
> and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 
> 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 
> 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
> 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicates that we pulled two pipelined 
> chunks of the same shuffle src id, once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: 
> attempt_1429683

[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task

2015-04-24 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511227#comment-14511227
 ] 

TezQA commented on TEZ-2358:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12727924/TEZ-2358.2.patch
  against master revision 778c1f5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/529//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/529//console

This message is automatically generated.

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -
>
> Key: TEZ-2358
> URL: https://issues.apache.org/jira/browse/TEZ-2358
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Gopal V
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2358.1.patch, TEZ-2358.2.patch, 
> syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between 
> merge operations, this results in some confusion when two merge sequences 
> have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>List inMemoryMapOutputs,
>List onDiskMapOutputs
> ...
>  if (inMemoryMapOutputs.size() > 0) {
>   int srcTaskId = 
> inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
> ...
>// must spill to disk, but can't retain in-mem for intermediate merge
> final Path outputPath = 
>   mapOutputFile.getInputFileForWrite(srcTaskId,
>  inMemToDiskBytes).suffix(
>  
> Constants.MERGED_OUTPUT_PREFIX);
> ...
> {code}
> This or some scenario related to this, results in the following FileChunks 
> list which contains identical named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: 
> Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down 
> to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 
> Merge of the 6 files in-memory complete. Local file is 
> /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged
>  of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs 
> and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 
> 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 
> 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
> 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicates that we pulled two pipelined 
> chunks of the same shuffle src id, once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 INFO 
> [TezTaskEventRouter[attempt_1

[jira] [Commented] (TEZ-2358) Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task

2015-04-24 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510894#comment-14510894
 ] 

Gopal V commented on TEZ-2358:
--

[~rajesh.balamohan]: The patch is looking good in tests.

The ETL succeeded (Time taken: 14184.052 seconds) with this fix and there were 
no task failures due to full disks - also tested with local.fetch/shared.fetch 
turned on.

FYI, I added a rough precondition to throw an exception if a duplicate name was 
added to FileChunks, which pretty much gave me the confidence that this isn't 
hit anymore.

> Pipelined Shuffle: MergeManager assumptions about 1 merge per source-task
> -
>
> Key: TEZ-2358
> URL: https://issues.apache.org/jira/browse/TEZ-2358
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Gopal V
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2358.1.patch, 
> syslog_attempt_1429683757595_0141_1_01_000143_0.syslog.bz2
>
>
> The Tez MergeManager code assumes that the src-task-id is unique between 
> merge operations, this results in some confusion when two merge sequences 
> have to process output from the same src-task-id.
> {code}
> private TezRawKeyValueIterator finalMerge(Configuration job, FileSystem fs,
>List inMemoryMapOutputs,
>List onDiskMapOutputs
> ...
>  if (inMemoryMapOutputs.size() > 0) {
>   int srcTaskId = 
> inMemoryMapOutputs.get(0).getAttemptIdentifier().getInputIdentifier().getInputIndex();
> ...
>// must spill to disk, but can't retain in-mem for intermediate merge
> final Path outputPath = 
>   mapOutputFile.getInputFileForWrite(srcTaskId,
>  inMemToDiskBytes).suffix(
>  
> Constants.MERGED_OUTPUT_PREFIX);
> ...
> {code}
> This or some scenario related to this, results in the following FileChunks 
> list which contains identical named paths with different lengths.
> {code}
> 2015-04-23 03:28:50,983 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: Initiating in-memory merge with 6 segments...
> 2015-04-23 03:28:50,987 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: 
> Merging 6 sorted segments
> 2015-04-23 03:28:50,988 INFO [MemtoDiskMerger [Map_1]] impl.TezMerger: Down 
> to the last merge-pass, with 6 segments left of total size: 1165944755 bytes
> 2015-04-23 03:28:58,495 INFO [MemtoDiskMerger [Map_1]] 
> orderedgrouped.MergeManager: attempt_1429683757595_0141_1_01_000143_0_10027 
> Merge of the 6 files in-memory complete. Local file is 
> /grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out.merged
>  of size 785583965
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: finalMerge called with 0 in-memory map-outputs 
> and 5 on-disk map-outputs
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 365232290 += 
> 365232290for/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_1023.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 730529899 += 
> 365297609for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> 2015-04-23 03:28:58,496 INFO [ShuffleAndMergeRunner [Map_1]] 
> orderedgrouped.MergeManager: GOPAL: onDiskBytes = 1095828683 += 
> 365298784for/grid/5/cluster/yarn/local/usercache/gopal/appcache/application_1429683757595_0141/attempt_1429683757595_0141_1_01_000143_0_10027_spill_404.out
> {code}
> The multiple instances of 404.out indicates that we pulled two pipelined 
> chunks of the same shuffle src id, once into memory and twice onto disk.
> {code}
> 2015-04-23 03:28:08,256 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: 
> attempt_1429683757595_0141_1_00_000404_0_10009_0, runDuration: 0]
> 2015-04-23 03:28:08,270 INFO 
> [TezTaskEventRouter[attempt_1429683757595_0141_1_01_000143_0]] 
> orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: DME srcIdx: 143, 
> targetIdx: 404, attemptNum: 0, payload: [hasEmptyPartitions: true, host: 
> cn047-10.l42scl.hortonworks.com, port: 13562, pathComponent: 
> attempt_1429683757595_0141_1_00_000404_0_10009_1, runDuration: 0]
> 2015-04-23 0