[jira] [Commented] (TEZ-3752) Reduce Object size of InMemoryMapOutput for large jobs

2017-07-27 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104002#comment-16104002
 ] 

TezQA commented on TEZ-3752:


{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12879018/TEZ-3752.001.patch
  against master revision 4b5448d.

{color:green}+1 @author{color}.  The patch does not contain any @author tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.tez.runtime.library.common.writers.TestUnorderedPartitionedKVWriter

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2590//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2590//console

This message is automatically generated.

> Reduce Object size of InMemoryMapOutput for large jobs
> --
>
> Key: TEZ-3752
> URL: https://issues.apache.org/jira/browse/TEZ-3752
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Muhammad Samir Khan
> Attachments: TEZ-3752.001.patch
>
>
> Follow-on jira from TEZ-3732. The InMemoryMapOutput has a 
> BoundedByteArrayOutputStream that is only used in the Merged MapOutput case. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3752) Reduce Object size of InMemoryMapOutput for large jobs

2017-07-27 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103880#comment-16103880
 ] 

Muhammad Samir Khan commented on TEZ-3752:
--

However, this test doesn't actually hit the RLE case: InMemoryWriter has RLE 
turned off, since the Writer constructor it calls sets the rle flag to false.
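
For readers unfamiliar with the RLE case mentioned here: it concerns consecutive records in a sorted stream that share a key. The following is only a toy sketch of that idea, not Tez's IFile or Writer code; the class name `SameKeyRunCount` and its method are hypothetical. It counts how many records repeat the key of the record immediately before them, which is the situation a same-key encoding is meant to exploit.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration only: hypothetical names, unrelated to Tez's real classes.
final class SameKeyRunCount {

    // Counts records whose key equals the key of the record just before it.
    static long countSameKeyRecords(List<String> sortedKeys) {
        long sameKey = 0;
        String previous = null;
        for (String key : sortedKeys) {
            if (key.equals(previous)) {
                sameKey++;
            }
            previous = key;
        }
        return sameKey;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "a", "b", "c", "c");
        System.out.println(countSameKeyRecords(keys));
    }
}
```

If such a count is zero for a test input, the same-key path is never exercised, regardless of whether the flag is on.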



[jira] [Commented] (TEZ-3752) Reduce Object size of InMemoryMapOutput for large jobs

2017-07-27 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103716#comment-16103716
 ] 

Muhammad Samir Khan commented on TEZ-3752:
--

Ran orderedwordcount with the following options:
-Dtez.shuffle-vertex-manager.enable.auto-parallel=true 
-Dtez.runtime.io.sort.factor=4 
-Dtez.runtime.shuffle.memory-to-memory.enable=true
Then sorted the output (via sort) and diffed it against the output of 
orderedwordcount run without the changes.
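
The sort-then-diff check can be approximated in Java as follows. This is a minimal sketch under assumptions: the class name `SortedOutputDiff` is hypothetical, both job outputs are assumed to be available as local text files, and only the set of lines with their multiplicities is compared, which is what sorting before diffing effectively tests.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of Tez or its examples.
final class SortedOutputDiff {

    // Returns true when both lists hold the same lines, ignoring their order.
    static boolean sameLinesIgnoringOrder(List<String> a, List<String> b) {
        List<String> left = new ArrayList<>(a);
        List<String> right = new ArrayList<>(b);
        Collections.sort(left);
        Collections.sort(right);
        return left.equals(right);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SortedOutputDiff <baseline-output> <patched-output>");
            return;
        }
        List<String> baseline = Files.readAllLines(Paths.get(args[0]));
        List<String> patched = Files.readAllLines(Paths.get(args[1]));
        System.out.println(sameLinesIgnoringOrder(baseline, patched)
                ? "outputs match" : "outputs differ");
    }
}
```

Sorting first matters because the order of lines in the job output may differ between runs even when the content is identical.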

Also turned on the '"writeFile SAME_KEY count=" + count' log line in 
TezMerger.writeFile to confirm that the in-memory merge hits the RLE case:
2017-07-27 18:19:18,128 [INFO] [MemToMemMerger [Tokenizer]] |orderedgrouped.MergeManager|: Tokenizer: Initiating Memory-to-Memory merge with 4 segments of total-size: 22182024
2017-07-27 18:19:18,770 [INFO] [MemToMemMerger [Tokenizer]] |impl.TezMerger|: writeFile SAME_KEY count=1544269
2017-07-27 18:19:18,771 [INFO] [MemToMemMerger [Tokenizer]] |orderedgrouped.MergeManager|: Tokenizer Memory-to-Memory merge of the 4 files in-memory complete with mergeOutputSize=22182024
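
To check a larger task log for this line without reading it by eye, something like the following could be used. It is a sketch only: the class name `SameKeyLogScan` is hypothetical, and the pattern assumes the message text is exactly the "writeFile SAME_KEY count=" form shown in the log excerpt above.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log-scanning helper; not part of Tez.
final class SameKeyLogScan {

    // Matches the message form shown in the log excerpt.
    private static final Pattern SAME_KEY =
            Pattern.compile("writeFile SAME_KEY count=(\\d+)");

    // Returns the reported count, or empty when the line is not a SAME_KEY line.
    static OptionalLong sameKeyCount(String logLine) {
        Matcher m = SAME_KEY.matcher(logLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        String line = "2017-07-27 18:19:18,770 [INFO] [MemToMemMerger [Tokenizer]] "
                + "|impl.TezMerger|: writeFile SAME_KEY count=1544269";
        OptionalLong count = sameKeyCount(line);
        // A positive count indicates that records with repeated keys were seen.
        System.out.println(count.isPresent() && count.getAsLong() > 0);
    }
}
```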



[jira] [Commented] (TEZ-3752) Reduce Object size of InMemoryMapOutput for large jobs

2017-07-26 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102127#comment-16102127
 ] 

Jonathan Eagles commented on TEZ-3752:
--

This approach and implementation look correct. Can you post some results of 
running jobs with RLE to verify merge is correct in that scenario?



[jira] [Commented] (TEZ-3752) Reduce Object size of InMemoryMapOutput for large jobs

2017-07-26 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102046#comment-16102046
 ] 

Muhammad Samir Khan commented on TEZ-3752:
--

JOL dump:
Before:
-internals:
{code}
# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

Instantiated the sample instance via 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput(org.apache.tez.runtime.library.common.InputAttemptIdentifier,org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetchedInputAllocatorOrderedGrouped,long,boolean,org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$1)

org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput
 object internals:
 OFFSET  SIZE  TYPE DESCRIPTION                      VALUE
      0     4  (object header)                       01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4  (object header)                       00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4  (object header)                       78 12 01 f8 (01111000 00010010 00000001 11111000) (-134147464)
     12     4  int MapOutput.id                      1
     16     1  boolean MapOutput.primaryMapOutput    false
     17     3  (alignment/padding gap)
     20     4  org.apache.tez.runtime.library.common.InputAttemptIdentifier MapOutput.attemptIdentifier  null
     24     4  org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetchedInputAllocatorOrderedGrouped MapOutput.callback  null
     28     4  org.apache.hadoop.io.BoundedByteArrayOutputStream InMemoryMapOutput.byteStream  (object)
Instance size: 32 bytes
Space losses: 3 bytes internal + 0 bytes external = 3 bytes total
{code}
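
The "Instance size" and "Space losses" figures in the dump above follow from simple offset arithmetic. The sketch below replays it under stated assumptions: a 12-byte header, the field order and sizes as reported by JOL, each field aligned to its own size, and 8-byte object alignment. The class name `LayoutArithmetic` is hypothetical, and this is not how HotSpot chooses a field order; it only re-adds the numbers already shown.

```java
import java.util.Arrays;

// Hypothetical helper that replays the offsets from the JOL dump.
final class LayoutArithmetic {

    // Rounds offset up to the next multiple of alignment.
    static int alignUp(int offset, int alignment) {
        return (offset + alignment - 1) / alignment * alignment;
    }

    // Returns {instanceSize, internalLoss, externalLoss} in bytes.
    static int[] layout(int headerBytes, int[] fieldSizes, int objectAlignment) {
        int offset = headerBytes;
        int internalLoss = 0;
        for (int size : fieldSizes) {
            int aligned = alignUp(offset, size); // assumption: align each field to its size
            internalLoss += aligned - offset;
            offset = aligned + size;
        }
        int instanceSize = alignUp(offset, objectAlignment);
        return new int[] {instanceSize, internalLoss, instanceSize - offset};
    }

    public static void main(String[] args) {
        // int id, boolean primaryMapOutput, then three 4-byte compressed references.
        int[] result = layout(12, new int[] {4, 1, 4, 4, 4}, 8);
        System.out.println(Arrays.toString(result));
    }
}
```

With these inputs the boolean at offset 16 is followed by a 3-byte gap so that the next reference starts at offset 20, and the last reference ends exactly at 32, so no external padding is needed.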

-footprint:
{code}
# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

Instantiated the sample instance via 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput(org.apache.tez.runtime.library.common.InputAttemptIdentifier,org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetchedInputAllocatorOrderedGrouped,long,boolean,org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$1)

org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput@10bdf5e5d
 footprint:
 COUNT   AVG   SUM   DESCRIPTION
     1    16    16   [B
     1    32    32   org.apache.hadoop.io.BoundedByteArrayOutputStream
     1    32    32   org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput
     3          80   (total)
{code}

After:
-internals:
{code}
# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

Instantiated the sample instance via 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput(org.apache.tez.runtime.library.common.InputAttemptIdentifier,org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetchedInputAllocatorOrderedGrouped,long,boolean,org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$1)

org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput$InMemoryMapOutput
 object internals:
 OFFSET  SIZE  TYPE DESCRIPTION                      VALUE
  0 4