[ 
https://issues.apache.org/jira/browse/HADOOP-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HADOOP-2054:
----------------------------------

    Fix Version/s:     (was: 0.16.0)

Pushing this to 0.17.0 and beyond...

> Improve memory model for map-side sorts
> ---------------------------------------
>
>                 Key: HADOOP-2054
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2054
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>
> {{MapTask#MapOutputBuffer}} uses a plain {{DataOutputBuffer}}, which 
> defaults to a 32-byte buffer, and the {{DataOutputBuffer#write}} 
> call doubles the underlying byte array whenever it needs more space.
> However, for maps that output any decent amount of data (e.g. 128MB in 
> examples/Sort.java) this means the buffer grows painfully slowly from 2^6 to 
> 2^28, and each step creates a new array, followed by an 
> array copy:
> {noformat}
>     public void write(DataInput in, int len) throws IOException {
>       int newcount = count + len;
>       if (newcount > buf.length) {
>         byte newbuf[] = new byte[Math.max(buf.length << 1, newcount)];
>         System.arraycopy(buf, 0, newbuf, 0, count);
>         buf = newbuf;
>       }
>       in.readFully(buf, count, len);
>       count = newcount;
>     }
> {noformat}
> I reckon we could do much better in the {{MapTask}} specifically.
> For example, we could start with a buffer of size 1/4KB and quadruple it, rather than 
> double it, up to, say, 4/8/16MB, and then resume doubling (or grow by less).
> The buffer then ramps up quickly at first, which minimizes the number of {{System.arraycopy}} 
> calls and the small buffers left for GC; later, doubling keeps it 
> from growing too quickly, which minimizes memory wasted to fragmentation.
> Of course, this issue is about benchmarking to figure out whether all this is worth 
> it and, if so, what the right set of trade-offs is.
> Thoughts?
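
To make the trade-off in the description concrete, here is a minimal sketch that simulates the two growth policies and counts reallocations and bytes copied. It is not Hadoop code: the class BufferGrowthSketch, its methods, and the chosen sizes (1KB start, 4MB switch-over) are hypothetical, picked from the ranges floated above. It assumes writes are small, so every reallocation multiplies the capacity by a fixed factor, and it assumes the buffer is full each time it is copied (a worst case).

```java
// Hypothetical simulation of buffer growth policies; not part of Hadoop.
final class BufferGrowthSketch {

    /** Outcome of growing a buffer from an initial capacity to a target capacity. */
    static final class Stats {
        final int reallocations;   // number of new arrays allocated
        final long bytesCopied;    // total bytes moved by array copies (worst case)
        final long finalCapacity;  // capacity once the target is reached

        Stats(int reallocations, long bytesCopied, long finalCapacity) {
            this.reallocations = reallocations;
            this.bytesCopied = bytesCopied;
            this.finalCapacity = finalCapacity;
        }
    }

    /**
     * Grows a buffer until it can hold {@code target} bytes.
     * While the capacity is below {@code quadrupleBelow} it is quadrupled;
     * otherwise it is doubled. Pass 0 for pure doubling.
     */
    static Stats simulate(long initial, long target, long quadrupleBelow) {
        long capacity = initial;
        long copied = 0;
        int reallocations = 0;
        while (capacity < target) {
            copied += capacity; // assumption: the old array is full when copied
            capacity = (capacity < quadrupleBelow) ? capacity << 2 : capacity << 1;
            reallocations++;
        }
        return new Stats(reallocations, copied, capacity);
    }

    public static void main(String[] args) {
        long target = 128L << 20; // 128MB, the Sort example from the description

        Stats doubling = simulate(32L, target, 0L);           // 32-byte default, always double
        Stats hybrid = simulate(1L << 10, target, 4L << 20);  // 1KB start, quadruple below 4MB

        System.out.println("doubling: " + doubling.reallocations
                + " reallocations, " + doubling.bytesCopied + " bytes copied");
        System.out.println("hybrid:   " + hybrid.reallocations
                + " reallocations, " + hybrid.bytesCopied + " bytes copied");
    }
}
```

Under these assumptions the hybrid policy halves the number of reallocations (11 versus 22), but the total bytes copied barely change, because the last few copies of the largest arrays dominate. That is one reason the benchmarking mentioned above matters more than the allocation count alone.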

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
