[jira] [Commented] (TEZ-3809) The buffer size allocated for InMemoryMapOutput can be optimized

2017-08-07 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117078#comment-16117078
 ] 

TezQA commented on TEZ-3809:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12880667/TEZ-3809.002.patch
  against master revision 8dcf8a1.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2601//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2601//console

This message is automatically generated.

> The buffer size allocated for InMemoryMapOutput can be optimized
> 
>
> Key: TEZ-3809
> URL: https://issues.apache.org/jira/browse/TEZ-3809
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
> Attachments: TEZ-3809.001.patch, TEZ-3809.002.patch
>
>
> Related jiras: TEZ-3752 and TEZ-3732.
> - When shuffling input to memory, the decompressed length is used to create 
> the InMemoryMapOutput object. However, IFile.Reader's readToMemory reads 4 
> bytes less (the IFile header). These 4 bytes can be optimized away and, in an 
> extreme case of 10,000,000 fetches, can save ~38 MB (TEZ-3732).
> - Memory-to-memory merge sums up the sizes of the input InMemoryMapOutput 
> buffers to allocate the new InMemoryMapOutput. However, each input has two 
> EOF_MARKERs, while only two are needed at the end of the merged buffer.
> - InMemoryWriter wraps the output BoundedByteArrayOutputStream in an 
> IFileOutputStream, which writes a checksum at close. This creates an 
> inconsistency between the primary input buffers, which don't have a checksum, 
> and the merged buffers, which do. The IFileOutputStream wrapper can be removed 
> to save 4 bytes per merged buffer.
> - InMemoryWriter does not account for the two EOF_MARKERs written at close(), 
> so the value returned by getRawLength() is off by two bytes.
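
The arithmetic behind the first two points of the description can be sketched as follows. This is only an illustration, not code from the patch: the class and method names are hypothetical, and it assumes that each EOF_MARKER occupies one byte (consistent with getRawLength() being off by two bytes for two markers) and that every input buffer size already includes its own pair of markers.

```java
// Hypothetical helper, not part of Tez: it only reproduces the arithmetic
// described in the issue text.
final class InMemoryBufferSizing {

    // From the description: readToMemory reads 4 bytes less (the IFile header).
    static final int IFILE_HEADER_BYTES = 4;

    // Assumption: one byte per EOF_MARKER, so a pair of markers takes 2 bytes.
    static final int EOF_MARKER_PAIR_BYTES = 2;

    /** Bytes saved if every fetched buffer is allocated without the header. */
    public static long headerBytesSaved(long fetches) {
        return fetches * IFILE_HEADER_BYTES;
    }

    /**
     * Size of a merged buffer when only one trailing pair of EOF markers is
     * kept. Each input size is assumed to include its own pair of markers.
     */
    public static long mergedBufferSize(long[] inputSizes) {
        long total = 0;
        for (long size : inputSizes) {
            total += size;
        }
        if (inputSizes.length > 1) {
            // Drop the marker pair of every input except the last one.
            total -= (long) (inputSizes.length - 1) * EOF_MARKER_PAIR_BYTES;
        }
        return total;
    }

    public static void main(String[] args) {
        long saved = headerBytesSaved(10_000_000L);
        // 40,000,000 bytes is roughly 38 MiB, matching the "~38 MB" estimate.
        System.out.println(saved + " bytes, about "
            + (saved / (1024.0 * 1024.0)) + " MiB");
        System.out.println(mergedBufferSize(new long[] {100L, 200L, 300L}));
    }
}
```

With 10,000,000 fetches the header saving is 40,000,000 bytes, which is where the ~38 MB figure comes from. For the merge case, three inputs of 100, 200 and 300 bytes would need 596 bytes under the stated assumptions rather than 600.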



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3809) The buffer size allocated for InMemoryMapOutput can be optimized

2017-08-06 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116030#comment-16116030
 ] 

Rajesh Balamohan commented on TEZ-3809:
---

Thanks for the patch [~samirkhan]. The patch looks good to me. Minor comments:

1. Can you fix the method names in IFile and InMemoryWriter?
2. Remove the unwanted imports in FetcherOrderedGrouped and InMemoryWriter.




[jira] [Commented] (TEZ-3809) The buffer size allocated for InMemoryMapOutput can be optimized

2017-08-03 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113369#comment-16113369
 ] 

Muhammad Samir Khan commented on TEZ-3809:
--

Request for comments [~rajesh.balamohan]/[~sseth]



[jira] [Commented] (TEZ-3809) The buffer size allocated for InMemoryMapOutput can be optimized

2017-08-03 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113366#comment-16113366
 ] 

Muhammad Samir Khan commented on TEZ-3809:
--

Also tested with memory-to-memory merger to check if the output is the same. 
For the unordered case, used filterLinesByWord and compared output before and 
after.



[jira] [Commented] (TEZ-3809) The buffer size allocated for InMemoryMapOutput can be optimized

2017-08-02 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111794#comment-16111794
 ] 

Muhammad Samir Khan commented on TEZ-3809:
--

Took a heap dump on ordered word count before the final merge. In the "after" 
case, one of the outputs was written to disk instead of being kept in memory, 
which is why it has 37 entries. 

Before:
Class Name                                                                                                  | Shallow Heap | Retained Heap | Percentage
-------------------------------------------------------------------------------------------------------------------------------------------------------
java.lang.Thread @ 0x5d2c473f8  ShuffleAndMergeRunner {Tokenizer} Thread                                    |          120 | 2,229,207,992 |     96.48%
|- java.util.ArrayList @ 0x73f978f10                                                                        |           24 | 2,229,206,760 |     96.48%
|  '- java.lang.Object[38] @ 0x73f979130                                                                    |          168 | 2,229,206,736 |     96.48%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x5e4a88898 |           32 |    68,078,192 |      2.95%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631e0b260 |           32 |    67,839,520 |      2.94%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x5e4a888b8 |           32 |    67,700,608 |      2.93%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x73f9db168 |           32 |    67,500,816 |      2.92%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x60ab36218 |           32 |    67,408,704 |      2.92%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631deed28 |           32 |    67,367,424 |      2.92%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x743b86ee0 |           32 |    67,337,936 |      2.91%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x60af3a698 |           32 |    67,300,896 |      2.91%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631e0c5b8 |           32 |    67,282,464 |      2.91%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x60ab33140 |           32 |    67,264,304 |      2.91%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x5e4a88878 |           32 |    67,127,368 |      2.91%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631e0b218 |           32 |    67,098,216 |      2.90%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631e0c6c8 |           32 |    67,064,504 |      2.90%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x5d239a6c8 |           32 |    67,003,776 |      2.90%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x5d23b7e10 |           32 |    66,965,296 |      2.90%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631def2b8 |           32 |    66,928,032 |      2.90%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x60ab351d0 |           32 |    66,916,896 |      2.90%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x74805dfb8 |           32 |    66,886,272 |      2.89%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x60af39598 |           32 |    66,718,800 |      2.89%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x73fb0fb78 |           32 |    66,688,296 |      2.89%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631e0c4b0 |           32 |    66,656,312 |      2.88%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x60af39578 |           32 |    66,629,936 |      2.88%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631deec30 |           32 |    66,584,576 |      2.88%
| |- org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput @ 0x631e0c680 |           32 |    66,537,624 |      2.88%
| |- 

[jira] [Commented] (TEZ-3809) The buffer size allocated for InMemoryMapOutput can be optimized

2017-08-01 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109968#comment-16109968
 ] 

TezQA commented on TEZ-3809:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12879916/TEZ-3809.001.patch
  against master revision 2358521.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 1 
warning message.
See 
https://builds.apache.org/job/PreCommit-TEZ-Build/2596//artifact/patchprocess/diffJavadocWarnings.txt
 for details.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2596//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2596//console

This message is automatically generated.
