[ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12443043 ] Sameer Paranjpye commented on HADOOP-331: -----------------------------------------
Doug: > But then we cannot use SequenceFile's merger, since keys wouldn't be > independently comparable, right? Yes, you're right, they wouldn't. This optimization is probably not worth the trouble. ----- To summarize, this is what we appear to have ended up with. - A couple of key variants class KeyByteOffset { WritableComparable key; int offset; int length; } class PartKey { int partition; WritableComparable key; } - In RAM data structures List<KeyByteOffset>[NumReduces] keyLists; // one list of keys per reduce byte[] valueBuffer; // serialized values are appended to this buffer - For each <key, value> pair. Determine partition for the key. Append the key to keyLists[partition], append the value to valueBuffer. If #records is greater than #maxRecords or #valueBytes is greater than #maxValueBytes OR We have no more records to process. SPILL - The spill is a block compressed sequence file of <PartKey, value> pairs, with a <part, offset> index. Compressed blocks in this file must not span partitions. - At the end, if #numSpills > 1, merge spills > map outputs should be written to a single output file with an index > ------------------------------------------------------------------- > > Key: HADOOP-331 > URL: http://issues.apache.org/jira/browse/HADOOP-331 > Project: Hadoop > Issue Type: Improvement > Components: mapred > Affects Versions: 0.3.2 > Reporter: eric baldeschwieler > Assigned To: Devaraj Das > > The current strategy of writing a file per target map is consuming a lot of > unused buffer space (causing out of memory crashes) and puts a lot of burden > on the FS (many opens, inodes used, etc). > I propose that we write a single file containing all output and also write an > index file IDing which byte range in the file goes to each reduce. This will > remove the issue of buffer waste, address scaling issues with number of open > files and generally set us up better for scaling. It will also have > advantages with very small inputs, since the buffer cache will reduce the > number of seeks needed and the data serving node can open a single file and > just keep it open rather than needing to do directory and open ops on every > request. > The only issue I see is that in cases where the task output is substantiallyu > larger than its input, we may need to spill multiple times. In this case, we > can do a merge after all spills are complete (or during the final spill). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira