Optimize MapTask.MapOutputBuffer.spill() by not deserializing/serializing
keys/values but using appendRaw
---------------------------------------------------------------------------------------------------
Key: HADOOP-1609
URL: https://issues.apache.org/jira/browse/HADOOP-1609
Project: Hadoop
Issue Type: Improvement
Components: mapred
Affects Versions: 0.14.0
Reporter: Espen Amble Kolstad
Attachments: spill.patch
In MapTask.MapOutputBuffer.spill(), every key and value is deserialized from the
buffer and then re-serialized to the spill file with append(key, value):
{code}
DataInputBuffer keyIn = new DataInputBuffer();
DataInputBuffer valIn = new DataInputBuffer();
DataOutputBuffer valOut = new DataOutputBuffer();
while (resultIter.next()) {
  keyIn.reset(resultIter.getKey().getData(),
              resultIter.getKey().getLength());
  key.readFields(keyIn);
  valOut.reset();
  (resultIter.getValue()).writeUncompressedBytes(valOut);
  valIn.reset(valOut.getData(), valOut.getLength());
  value.readFields(valIn);
  writer.append(key, value);
  reporter.progress();
}
{code}
With complex objects, like Nutch's ParseData or Inlinks, this round trip takes
time and creates a lot of garbage.
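For illustration, a minimal sketch of the appendRaw idea (assuming resultIter.getKey() returns the already-serialized key as a DataOutputBuffer, resultIter.getValue() returns a SequenceFile.ValueBytes, and writer is a SequenceFile.Writer with appendRaw(byte[], int, int, ValueBytes)); this is only a sketch of the idea, not the attached patch:
{code}
while (resultIter.next()) {
  // Hand the raw serialized bytes straight to the writer; no readFields()
  // on the key/value objects and no re-serialization in append().
  DataOutputBuffer rawKey = resultIter.getKey();
  writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(),
                   resultIter.getValue());
  reporter.progress();
}
{code}
This keeps the spill path byte-to-byte and avoids allocating intermediate key/value objects per record.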
I've created a patch; it seems to work, but it has only been tested on 0.13.0.
It's a bit clumsy, since ValueBytes is cast to UncompressedBytes/CompressedBytes
in SequenceFile.Writer.
Thoughts?