[ 
https://issues.apache.org/jira/browse/TEZ-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126065#comment-16126065
 ] 

Rohini Palaniswamy commented on TEZ-3159:
-----------------------------------------

Some more comments:
   Just to be clear: when you implement FlexibleDataInputBuffer, instead of 
reading into a byte array and then resetting the value onto that byte array, as 
is currently done
{code}
int i = readData(valBytes, 0, currentValueLength);
if (i != currentValueLength) {
  throw new IOException(String.format(INCOMPLETE_READ, currentValueLength, i));
}
value.reset(valBytes, currentValueLength);
{code}
  you will have to read directly into the FlexibleDataInputBuffer.
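A minimal sketch of what reading directly into a chunked buffer could look like. The class and method names below (ChunkedInputBuffer, readFully) are placeholders, not the patch's actual API; the point is only that the bytes land in the buffer's own chunks with no intermediate valBytes array:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch: the value bytes are read straight into the
// buffer's internal chunks instead of being staged in a separate byte[].
class ChunkedInputBuffer {
  private static final int CHUNK_SIZE = 4096; // illustrative chunk size
  private byte[][] chunks = new byte[0][];
  private int length;

  /** Fill the buffer with exactly len bytes read from in. */
  void readFully(DataInput in, int len) throws IOException {
    int numChunks = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
    chunks = new byte[numChunks][];
    int remaining = len;
    for (int i = 0; i < numChunks; i++) {
      int toRead = Math.min(remaining, CHUNK_SIZE);
      chunks[i] = new byte[toRead];
      in.readFully(chunks[i], 0, toRead); // no intermediate copy
      remaining -= toRead;
    }
    length = len;
  }

  int getLength() {
    return length;
  }
}
```

Because readFully already throws on a short read, the `i != currentValueLength` check in the caller also becomes unnecessary.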
FlexibleByteArrayOutputStream.java:
   1) You should also keep a totalCount, which is the total number of bytes 
written, and return that in size(). Ensure it always stays below 
MAX_BUFFER_SIZE while writing the data itself.
   2) Nitpick. Place the default empty constructor first.
   3) Rename DEFAULT_CAPACITY_OF_SINGLE_BUFFER and 
DEFAULT_INITIAL_SIZE_OF_SINGLE_BUFFER to BUFFER_SIZE_DEFAULT and 
BUFFER_INITIAL_SIZE_DEFAULT to be consistent with the naming convention we 
generally use.
   4) The code should get a lot simpler if we double the buffer size from 32 
bytes up to the full buffer capacity only for the first buffer, and create all 
subsequent buffers at the full buffer capacity. Keep separate code paths for 
the two cases (index == 0) so that the code is simpler and easier to read.
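Putting 1) and 4) together, a rough sketch of the intended shape (the constants and field names here are illustrative, not the patch's actual code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the growth strategy: the first buffer doubles
// from 32 bytes up to the full buffer size; every later buffer is
// allocated at full size with no copying; totalCount tracks the overall
// number of bytes written and is capped at MAX_BUFFER_SIZE.
class FlexibleByteArrayOutputStream extends OutputStream {
  static final int BUFFER_SIZE_DEFAULT = 4096;       // illustrative
  static final int BUFFER_INITIAL_SIZE_DEFAULT = 32; // illustrative
  static final int MAX_BUFFER_SIZE = Integer.MAX_VALUE - 8;

  private final List<byte[]> buffers = new ArrayList<>();
  private int countInCurrent; // bytes used in the current buffer
  private int totalCount;     // total bytes written across all buffers

  FlexibleByteArrayOutputStream() {
    buffers.add(new byte[BUFFER_INITIAL_SIZE_DEFAULT]);
  }

  @Override
  public void write(int b) throws IOException {
    if (totalCount >= MAX_BUFFER_SIZE) {
      throw new IOException("Exceeded max buffer size " + MAX_BUFFER_SIZE);
    }
    byte[] current = buffers.get(buffers.size() - 1);
    if (countInCurrent == current.length) {
      if (buffers.size() == 1 && current.length < BUFFER_SIZE_DEFAULT) {
        // First buffer only: double in place, up to the full buffer size.
        byte[] bigger =
            new byte[Math.min(current.length * 2, BUFFER_SIZE_DEFAULT)];
        System.arraycopy(current, 0, bigger, 0, countInCurrent);
        buffers.set(0, bigger);
        current = bigger;
      } else {
        // Subsequent buffers: allocate at full size, never copy.
        current = new byte[BUFFER_SIZE_DEFAULT];
        buffers.add(current);
        countInCurrent = 0;
      }
    }
    current[countInCurrent++] = (byte) b;
    totalCount++;
  }

  /** Total number of bytes written. */
  public int size() {
    return totalCount;
  }
}
```

With the two paths split like this, the doubling-and-copying logic is confined to the first buffer, and steady-state writes never reallocate.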

> Reduce memory utilization while serializing keys and values
> -----------------------------------------------------------
>
>                 Key: TEZ-3159
>                 URL: https://issues.apache.org/jira/browse/TEZ-3159
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Muhammad Samir Khan
>         Attachments: TEZ-3159.001.patch, TEZ-3159.002.patch, 
> TEZ-3159.003.patch
>
>
>   Currently DataOutputBuffer is used for serializing. The underlying buffer 
> keeps doubling in size when it reaches capacity. In some of the Pig scripts 
> which serialize big bags, we end up with OOM in Tez as there is no space to 
> double the array size. Mapreduce mode runs fine in those cases with 1G heap. 
> The scenarios are
>     - When the combiner runs in the reducer and some of the fields after 
> combining are still big bags (e.g. distinct). With MapReduce, the combiner 
> currently does not run in the reducer (MAPREDUCE-5221). Since the input sort 
> buffers hold a good amount of memory at that time, it can easily go OOM.
>     - While serializing output with bags when there are multiple inputs and 
> outputs and the sort buffers for those take up space.
> It is a pain especially after the buffer size hits 128 MB. Doubling at 128 MB 
> requires 128 MB (existing array) + 256 MB (new array) live at once. Any 
> doubling after that requires even more space. But most of the time the data 
> is probably not going to fill up that 256 MB, leading to wastage.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)