[ 
https://issues.apache.org/jira/browse/TEZ-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated TEZ-3809:
-------------------------------------
    Description: 
Related jiras: TEZ-3752 and TEZ-3732.

-When shuffling input to memory, the decompressed length is used to create the 
InMemoryMapOutput object. However, IFile.Reader's readToMemory reads 4 bytes 
fewer (the IFile header). These 4 bytes can be optimized away and, in an extreme 
case of 10,000,000 fetches, can save ~38 MB (TEZ-3732).

-Memory-to-memory merge sums up the sizes of input InMemoryMapOutput buffers to 
allocate the new InMemoryMapOutput. However, each input has two EOF_MARKERs 
while only two are needed at the end.

-InMemoryWriter wraps the output BoundedByteArrayOutputStream in 
IFileOutputStream, which writes a checksum at close. This creates an 
inconsistency between the primary input buffers, which don't have a checksum, 
and the merged buffers, which do. The IFileOutputStream wrapper can be removed 
to save 4 bytes per merged buffer.

-InMemoryWriter does not account for the two EOF_MARKERs written at close(), so 
the value returned by getRawLength() is off by two bytes.

  was:
Related jiras: TEZ-3752 and TEZ-3732.

-When shuffling input to memory, the decompressed length is used to create the 
InMemoryMapOutput object. However, IFile.Reader's readToMemory reads 4 bytes 
fewer (the IFile header). These 4 bytes can be optimized away and, in an extreme 
case of 10,000,000 fetches, can save ~38 MB (TEZ-3732).

-Memory-to-memory merge sums up the sizes of input InMemoryMapOutput buffers to 
allocate the new InMemoryMapOutput. However, each input has two EOF_MARKERs 
while only two are needed at the end.

-InMemoryWriter wraps the output BoundedByteArrayOutputStream in 
IFileOutputStream, which writes a checksum at close. This can create a 
mismatch between the primary input buffers, which don't have a checksum, and 
the merged buffers, which do. The IFileOutputStream wrapper can be removed to 
save 4 bytes per merged buffer.

-InMemoryWriter does not account for two EOF_MARKERs written at close() in its 
accounting so that the getRawLength() method is off by two bytes.


> The buffer size allocated for InMemoryMapOutput can be optimized
> ----------------------------------------------------------------
>
>                 Key: TEZ-3809
>                 URL: https://issues.apache.org/jira/browse/TEZ-3809
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Muhammad Samir Khan
>            Assignee: Muhammad Samir Khan
>         Attachments: TEZ-3809.001.patch
>
>
> Related jiras: TEZ-3752 and TEZ-3732.
> -When shuffling input to memory, the decompressed length is used to create 
> the InMemoryMapOutput object. However, IFile.Reader's readToMemory reads 4 
> bytes fewer (the IFile header). These 4 bytes can be optimized away and, in an 
> extreme case of 10,000,000 fetches, can save ~38 MB (TEZ-3732).
> -Memory-to-memory merge sums up the sizes of input InMemoryMapOutput buffers 
> to allocate the new InMemoryMapOutput. However, each input has two 
> EOF_MARKERs while only two are needed at the end.
> -InMemoryWriter wraps the output BoundedByteArrayOutputStream in 
> IFileOutputStream, which writes a checksum at close. This creates an 
> inconsistency between the primary input buffers, which don't have a checksum, 
> and the merged buffers, which do. The IFileOutputStream wrapper can be 
> removed to save 4 bytes per merged buffer.
> -InMemoryWriter does not account for two EOF_MARKERs written at close() in 
> its accounting so that the getRawLength() method is off by two bytes.
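
The size arithmetic in the description can be sketched as follows. This is an illustration only, not code from the attached patch: the class InMemoryBufferSizing and its methods are hypothetical names, and the constants (a 4-byte IFile header, one byte per EOF_MARKER, two markers per buffer) are assumptions read off the figures quoted above.

```java
import java.util.List;

/** Hypothetical helper; not part of Tez. Models only the byte counts from the description. */
public class InMemoryBufferSizing {
    // Assumed sizes, inferred from the issue text.
    static final int IFILE_HEADER_BYTES = 4;      // header that readToMemory does not copy
    static final int EOF_MARKER_BYTES = 1;        // "two EOF_MARKERs ... off by two bytes"
    static final int EOF_MARKERS_PER_BUFFER = 2;

    /** Bytes saved across all fetches if the header is not allocated per fetch. */
    static long headerSavings(long fetches) {
        return fetches * IFILE_HEADER_BYTES;
    }

    /** Merged buffer size when only one trailing pair of EOF markers is kept. */
    static long mergedBufferSize(List<Long> inputSizes) {
        final long trailer = (long) EOF_MARKERS_PER_BUFFER * EOF_MARKER_BYTES;
        long payload = 0;
        for (long size : inputSizes) {
            if (size < trailer) {
                throw new IllegalArgumentException("input smaller than its EOF markers: " + size);
            }
            payload += size - trailer; // drop each input's own pair of markers
        }
        return payload + trailer;      // one pair at the end of the merged buffer
    }

    public static void main(String[] args) {
        long saved = headerSavings(10_000_000L);
        System.out.println(saved + " bytes = " + saved / (1024.0 * 1024.0) + " MiB");
        System.out.println(mergedBufferSize(List.of(100L, 200L, 300L)));
    }
}
```

Under these assumptions, merging n inputs needs 2 * (n - 1) fewer bytes than the plain sum of the input sizes, and 10,000,000 fetches at 4 bytes each come to 40,000,000 bytes, which is the ~38 MB quoted above.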



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
