[ https://issues.apache.org/jira/browse/MAPREDUCE-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767693#action_12767693 ]
Todd Lipcon commented on MAPREDUCE-64:
--------------------------------------

Thanks for those great diagrams - they really helped me understand things much better! A picture is worth 1000 lines of code or something :)

I applied your patch just now and ran it through clover for coverage analysis. Here are a couple of things I think we should cover before committing:

- We don't currently run any tests with job.getCompressMapOutput returning true. This caused an issue or two with the shuffle in the past, so we should get at least one test that uses a codec.
- Since we're using the Local Runner for these tests, it's all a single partition. This is probably OK, since I imagine other tests throughout Hadoop exercise those paths (I'm only looking at coverage from TestMapCollection here).
- Line 1097 ("if (bufindex + headbytelen < avail) {" in void reset()) is always true in our tests. We should get a test case to exercise the other half of this branch.
- Line 1365 (the kvstart >= kvend ternary in sortAndSpill) is always true. We should exercise the other half of that as well.

On a code level, one more thing I noticed - can you put in a small comment describing the synchronization policy for the various offsets? Those used to be volatile and now they're under a lock, so it would be good to note that in the code.

I'll try to get a chance to run some basic benchmarks later this week.

> Map-side sort is hampered by io.sort.record.percent
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-64
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-64
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Arun C Murthy
>            Assignee: Chris Douglas
>         Attachments: M64-0.patch, M64-0i.png, M64-1.patch, M64-1i.png,
>                      M64-2.patch, M64-2i.png, M64-3.patch
>
>
> Currently io.sort.record.percent is a fairly obscure, per-job configurable,
> expert-level parameter which controls how much accounting space is available
> for records in the map-side sort buffer (io.sort.mb).
> Typical values for io.sort.mb (100) and io.sort.record.percent (0.05) imply
> that we can store ~350,000 records in the buffer before necessitating a
> sort/combine/spill.
> However, for many applications which deal with small records, e.g. the
> world-famous wordcount and its family, this implies we can only use 5-10% of
> io.sort.mb (i.e. 5-10M) before we spill, in spite of having _much_ more
> memory available in the sort buffer. Wordcount, for example, results in ~12
> spills (given an hdfs block size of 64M). The presence of a combiner
> exacerbates the problem by piling on serialization/deserialization of
> records too...
> Sure, jobs can configure io.sort.record.percent, but it's tedious and
> obscure; we really can do better by getting the framework to automagically
> pick it by using all available memory (up to io.sort.mb) for either the data
> or accounting.
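
For reference, the arithmetic behind the record-count and 5-10% figures in the description can be sketched roughly as below. This is a back-of-the-envelope estimate, not the MapTask code: the class and method names are hypothetical, the 16 bytes of accounting space per record is an assumption, and the spill threshold (io.sort.spill.percent) is ignored.

```java
// Rough estimate only; names are hypothetical and this is not the MapTask code.
class SortBufferEstimate {
    // Assumption: each collected record costs 16 bytes of accounting space.
    static final int ACCOUNTING_BYTES_PER_RECORD = 16;

    // Number of records the accounting region can track before it is full.
    static long recordCapacity(int sortMb, double recordPercent) {
        long totalBytes = (long) sortMb << 20;                      // io.sort.mb in bytes
        long accountingBytes = (long) (totalBytes * recordPercent); // io.sort.record.percent share
        return accountingBytes / ACCOUNTING_BYTES_PER_RECORD;
    }

    // Serialized data bytes held in the buffer when the accounting region fills up.
    static long dataBytesAtRecordLimit(int sortMb, double recordPercent, int avgRecordBytes) {
        return recordCapacity(sortMb, recordPercent) * avgRecordBytes;
    }

    public static void main(String[] args) {
        System.out.println("record capacity: " + recordCapacity(100, 0.05));
        // Small records, as in wordcount: 16 and 32 bytes are illustrative sizes.
        for (int size : new int[] {16, 32}) {
            long used = dataBytesAtRecordLimit(100, 0.05, size);
            System.out.println(size + "-byte records: " + used
                    + " data bytes buffered when accounting space runs out");
        }
    }
}
```

Under these assumptions the defaults give 327,680 records, in the same range as the ~350,000 quoted, and records of 16 to 32 bytes fill only about 5 to 10 MB of the 100 MB buffer before a spill is forced, which matches the 5-10% figure in the description.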