[
https://issues.apache.org/jira/browse/HADOOP-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun C Murthy updated HADOOP-2054:
----------------------------------
Fix Version/s: (was: 0.16.0)
Pushing this to 0.17.0 and beyond...
> Improve memory model for map-side sorts
> ---------------------------------------
>
> Key: HADOOP-2054
> URL: https://issues.apache.org/jira/browse/HADOOP-2054
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: Arun C Murthy
> Assignee: Arun C Murthy
>
> {{MapTask#MapOutputBuffer}} uses a plain {{DataOutputBuffer}}, which
> defaults to a 32-byte buffer, and the {{DataOutputBuffer#write}}
> call doubles the underlying byte array whenever it needs more space.
> However, for maps which output any decent amount of data (e.g. 128MB in
> examples/Sort.java), this means the buffer grows painfully slowly from 2^5 to
> 2^28 bytes, and each growth step creates a new array, followed by an
> array copy:
> {noformat}
> public void write(DataInput in, int len) throws IOException {
>   int newcount = count + len;
>   if (newcount > buf.length) {
>     byte newbuf[] = new byte[Math.max(buf.length << 1, newcount)];
>     System.arraycopy(buf, 0, newbuf, 0, count);
>     buf = newbuf;
>   }
>   in.readFully(buf, count, len);
>   count = newcount;
> }
> {noformat}
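
To put a rough number on the cost described above, here is a minimal sketch that counts reallocations under plain doubling. It is a simplified model, not Hadoop code: the class and method names are hypothetical, every write is assumed to be small enough that growth is always a doubling, and the old buffer is assumed to be full each time it is copied (so the byte count is an upper bound).

```java
class DoublingCost {
    /**
     * Simulates a buffer that starts at `initial` bytes and doubles until it
     * can hold `target` bytes. Returns {reallocations, bytesCopied}, where
     * bytesCopied assumes the old array is full at every copy (upper bound).
     */
    static long[] cost(long initial, long target) {
        long capacity = initial;
        long reallocations = 0;
        long bytesCopied = 0;
        while (capacity < target) {
            bytesCopied += capacity; // old contents copied into the new array
            capacity <<= 1;          // same growth step as buf.length << 1
            reallocations++;
        }
        return new long[] {reallocations, bytesCopied};
    }

    public static void main(String[] args) {
        long[] r = cost(32, 128L << 20); // 32-byte default, 128MB of map output
        // prints: reallocations=22 bytesCopied=134217696
        System.out.println("reallocations=" + r[0] + " bytesCopied=" + r[1]);
    }
}
```

Under these assumptions, reaching 128MB from 32 bytes takes 22 allocate-and-copy rounds and copies just under another 128MB in total, with every intermediate array left behind for the GC.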
> I reckon we could do much better in the {{MapTask}}, specifically...
> E.g. we start with a buffer of size 1/4KB and quadruple, rather than
> double, up to, say, 4/8/16MB. Then we resume doubling (or less).
> This means the buffer ramps up quickly at first, minimizing the number of
> {{System.arraycopy}} calls and small buffers left for GC; later it switches to
> doubling so that it doesn't ramp up too quickly, minimizing memory wasted
> due to fragmentation.
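
The growth rule proposed above could look roughly like the sketch below. This is only an illustration under stated assumptions, not a patch: `GrowthPolicy`, `nextCapacity` and `reallocations` are hypothetical names, and the 4KB starting size and 16MB switch-over point are just one pick from the ranges floated in the description.

```java
final class GrowthPolicy {
    static final int INITIAL = 4 * 1024;           // assumed starting size: 4KB
    static final int THRESHOLD = 16 * 1024 * 1024; // assumed switch-over: 16MB

    /**
     * Next capacity for a buffer of `current` bytes that must hold at least
     * `needed` bytes: quadruple below THRESHOLD, double at or above it.
     */
    static int nextCapacity(int current, int needed) {
        long grown = current < THRESHOLD ? (long) current << 2
                                         : (long) current << 1;
        long next = Math.max(grown, needed);
        return (int) Math.min(next, Integer.MAX_VALUE); // stay within int range
    }

    /** Counts growth steps needed to go from `initial` to at least `target`. */
    static int reallocations(int initial, int target) {
        int capacity = initial;
        int steps = 0;
        while (capacity < target) {
            capacity = nextCapacity(capacity, capacity + 1);
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // 4KB -> 16MB by quadrupling (6 steps), then doubling to 128MB (3 steps)
        System.out.println(reallocations(INITIAL, 128 << 20)); // prints 9
    }
}
```

With these particular constants the buffer reaches 128MB in 9 growth steps. Whether that actually beats plain doubling in wall-clock time and heap behaviour is exactly what the benchmarking would have to show.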
> Of course, this issue is about benchmarking and figuring out whether all this
> is worth it and, if so, what the right set of trade-offs is.
> Thoughts?