[ https://issues.apache.org/jira/browse/HADOOP-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-2054: ---------------------------------- Fix Version/s: (was: 0.16.0) Pushing this to 0.17.0 and beyond... > Improve memory model for map-side sorts > --------------------------------------- > > Key: HADOOP-2054 > URL: https://issues.apache.org/jira/browse/HADOOP-2054 > Project: Hadoop > Issue Type: Improvement > Components: mapred > Reporter: Arun C Murthy > Assignee: Arun C Murthy > > {{MapTask#MapOutputBuffer}} uses a plain-jane {{DataOutputBuffer}} which > defaults to a buffer of size 32-bytes, and the {{DataOutputBuffer#write}} > call doubles the underlying byte-array when it needs more space. > However for maps which output any decent amount of data (e.g. 128MB in > examples/Sort.java) this means the buffer grows painfully slowly from 2^6 to > 2^28, and each time this results in a new array being created, followed by an > array-copy: > {noformat} > public void write(DataInput in, int len) throws IOException { > int newcount = count + len; > if (newcount > buf.length) { > byte newbuf[] = new byte[Math.max(buf.length << 1, newcount)]; > System.arraycopy(buf, 0, newbuf, 0, count); > buf = newbuf; > } > in.readFully(buf, count, len); > count = newcount; > } > {noformat} > I reckon we could do much better in the {{MapTask}}, specifically... > For e.g. we start with a buffer of size 1/4KB and quadruple, rather than > double, upto, say 4/8/16MB. Then we resume doubling (or less). > This means that it quickly ramps up to minimize no. of {{System.arrayCopy}} > calls and small-sized buffers to GC; and later start doubling to ensure we > don't ramp-up too quickly to minimize memory wastage due to fragmentation. > Of course, this issue is about benchmarking and figuring if all this is worth > it, and, if so, what are the right set of trade-offs to make. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.