[ 
https://issues.apache.org/jira/browse/TEZ-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125983#comment-16125983
 ] 

Rohini Palaniswamy commented on TEZ-3159:
-----------------------------------------

General comments:
  1) We need an equivalent class for DataInputBuffer as well (a sketch of what 
that could look like is below). 
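  A hypothetical sketch (names are not from the patch) of what the input-side 
counterpart could look like, assuming the output buffer exposes its data as a 
list of fixed-size chunks:
{code:java}
import java.io.InputStream;
import java.util.List;

// Hypothetical input-side counterpart: reads across the list of chunks that
// the output buffer produced, so deserialization also avoids one huge
// contiguous array.
class ChunkedByteArrayInputStream extends InputStream {
  private final List<byte[]> chunks;
  private final int lastChunkLength;  // valid bytes in the final chunk
  private int chunkIndex = 0;
  private int posInChunk = 0;

  ChunkedByteArrayInputStream(List<byte[]> chunks, int lastChunkLength) {
    this.chunks = chunks;
    this.lastChunkLength = lastChunkLength;
  }

  @Override
  public int read() {
    while (chunkIndex < chunks.size()) {
      byte[] chunk = chunks.get(chunkIndex);
      int limit = (chunkIndex == chunks.size() - 1) ? lastChunkLength
                                                    : chunk.length;
      if (posInChunk < limit) {
        return chunk[posInChunk++] & 0xff;
      }
      chunkIndex++;   // current chunk exhausted, move to the next one
      posInChunk = 0;
    }
    return -1;        // end of data
  }
}
{code}
  Wrapping it in a java.io.DataInputStream gives back the DataInput API, the 
same way DataInputBuffer extends DataInputStream today.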

FlexibleByteArrayOutputStream.java:
   1) MAX_BUFFER_SIZE should not be required anymore after this patch (after 
DataInputBuffer also has multiple buffers). Can you rename the class to 
UnboundedDataOutputBuffer and extend DataOutputStream instead of OutputStream?
   2) I have been giving this some thought, and I think we should use either 
1MB or 2MB as DEFAULT_CAPACITY_OF_SINGLE_BUFFER instead of 64MB. Only the 
first buffer should grow from 32 bytes up to that capacity; subsequent buffers 
should just be created at full capacity. It will be more efficient that way 
(see the sketch below).
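   A minimal sketch of that growth policy (class and field names are 
illustrative, not from the patch), assuming the data is kept as a list of 
byte[] chunks:
{code:java}
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggested policy: only the first chunk grows (32 bytes,
// doubling up to the per-chunk capacity); every later chunk is allocated at
// full capacity, so no copy is ever larger than one chunk.
public class ChunkedByteArrayOutputStream extends OutputStream {
  private static final int INITIAL_SIZE = 32;
  private final int chunkCapacity;                  // e.g. 1MB or 2MB
  private final List<byte[]> fullChunks = new ArrayList<>();
  private byte[] current = new byte[INITIAL_SIZE];  // only this buffer grows
  private int posInCurrent = 0;

  public ChunkedByteArrayOutputStream(int chunkCapacity) {
    this.chunkCapacity = chunkCapacity;
  }

  @Override
  public void write(int b) {
    if (posInCurrent == current.length) {
      if (fullChunks.isEmpty() && current.length < chunkCapacity) {
        // First chunk only: double, capped at the chunk capacity.
        byte[] bigger = new byte[Math.min(current.length * 2, chunkCapacity)];
        System.arraycopy(current, 0, bigger, 0, posInCurrent);
        current = bigger;
      } else {
        // Chunk is full: seal it and start the next one at full capacity.
        fullChunks.add(current);
        current = new byte[chunkCapacity];
        posInCurrent = 0;
      }
    }
    current[posInCurrent++] = (byte) b;
  }
}
{code}
   Wrapping this in a java.io.DataOutputStream (or having 
UnboundedDataOutputBuffer extend DataOutputStream over an internal chunked 
stream, the way DataOutputBuffer is structured today) would give the 
serializers the DataOutput interface.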

IFile.java:
   1) I find it confusing to see keyData being passed as null to writeKVPair 
from the writeValue methods. It would be cleaner to have separate methods as 
before, instead of checking for null and having multiple useRLE branches in 
the writeKVPair method (see the sketch below).
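   Purely a structural sketch (the names and byte layout are illustrative, 
not IFile's actual format) of how separate entry points keep the null check 
and the RLE branching out of a single shared method:
{code:java}
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Illustrative writer shape: writeKVPair is the only place that looks at the
// key or at RLE; writeValue never needs a null keyData or an RLE branch.
class WriterSketch {
  private final DataOutputStream out;
  private final boolean useRLE;
  private byte[] previousKey = new byte[0];

  WriterSketch(DataOutputStream out, boolean useRLE) {
    this.out = out;
    this.useRLE = useRLE;
  }

  void writeKVPair(byte[] key, byte[] value) throws IOException {
    boolean sameKey = useRLE && Arrays.equals(key, previousKey);
    if (sameKey) {
      out.writeInt(-1);              // illustrative repeat-key marker only
    } else {
      out.writeInt(key.length);
      out.write(key);
      previousKey = key.clone();
    }
    writeValueInternal(value);
  }

  void writeValue(byte[] value) throws IOException {
    writeValueInternal(value);       // value-only path, no key handling
  }

  private void writeValueInternal(byte[] value) throws IOException {
    out.writeInt(value.length);
    out.write(value);
  }
}
{code}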
  

> Reduce memory utilization while serializing keys and values
> -----------------------------------------------------------
>
>                 Key: TEZ-3159
>                 URL: https://issues.apache.org/jira/browse/TEZ-3159
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Muhammad Samir Khan
>         Attachments: TEZ-3159.001.patch, TEZ-3159.002.patch, 
> TEZ-3159.003.patch
>
>
>   Currently DataOutputBuffer is used for serializing. The underlying buffer 
> keeps doubling in size when it reaches capacity. In some of the Pig scripts 
> which serialize big bags, we end up with OOM in Tez as there is no space to 
> double the array size. MapReduce mode runs fine in those cases with 1G heap. 
> The scenarios are:
>     - When the combiner runs in the reducer and some of the fields after 
> combining are still big bags (e.g. distinct). Currently, with MapReduce, the 
> combiner does not run in the reducer - MAPREDUCE-5221. Since the input sort 
> buffers hold a good amount of memory at that time, it can easily go OOM.
>     - While serializing output with bags, when there are multiple inputs and 
> outputs and the sort buffers for those take up space.
> It is a pain especially after the buffer size hits 128MB. Doubling at 128MB 
> requires 128MB (existing array) + 256MB (new array). Any doubling after that 
> requires even more space, but most of the time the data is probably not 
> going to fill up that 256MB, leading to wastage.
> 
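For context on the doubling cost described above, a tiny standalone 
illustration (not Tez code); run it with a small heap, e.g. -Xmx384m, and the 
copy below is where the OutOfMemoryError surfaces:
{code:java}
import java.util.Arrays;

public class DoublingCost {
  public static void main(String[] args) {
    byte[] buf = new byte[128 << 20];  // existing 128MB buffer, nearly full
    // Arrays.copyOf allocates the new 256MB array while the old 128MB one is
    // still referenced, so peak demand is ~384MB even if only a few bytes
    // beyond 128MB will ever be written.
    buf = Arrays.copyOf(buf, buf.length * 2);
    System.out.println("new capacity = " + buf.length + " bytes");
  }
}
{code}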



