Optimize MapTask.MapOutputBuffer.spill() by using appendRaw instead of 
deserializing/serializing keys and values
---------------------------------------------------------------------------------------------------

                 Key: HADOOP-1609
                 URL: https://issues.apache.org/jira/browse/HADOOP-1609
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.14.0
            Reporter: Espen Amble Kolstad
         Attachments: spill.patch

In MapTask.MapOutputBuffer.spill(), every key and value is deserialized from 
the sort buffer and then re-serialized to the spill file via append(key, value):

{code}
      DataInputBuffer keyIn = new DataInputBuffer();
      DataInputBuffer valIn = new DataInputBuffer();
      DataOutputBuffer valOut = new DataOutputBuffer();
      while (resultIter.next()) {
        // deserialize the key from its raw bytes
        keyIn.reset(resultIter.getKey().getData(), 
                    resultIter.getKey().getLength());
        key.readFields(keyIn);
        // copy the raw value bytes out, then deserialize them as well
        valOut.reset();
        (resultIter.getValue()).writeUncompressedBytes(valOut);
        valIn.reset(valOut.getData(), valOut.getLength());
        value.readFields(valIn);
        // append() then serializes both objects all over again
        writer.append(key, value);
        reporter.progress();
      }
{code}
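
For reference, here is a minimal sketch of what an appendRaw-based loop could 
look like. This is not the attached patch; it assumes resultIter and writer 
keep the types used above, and that the raw key bytes and ValueBytes can be 
handed to SequenceFile.Writer.appendRaw(byte[], int, int, ValueBytes) directly:

{code}
      // Sketch only: pass the already-serialized key bytes and the ValueBytes
      // straight through, skipping the readFields()/append() round trips.
      while (resultIter.next()) {
        DataOutputBuffer rawKey = resultIter.getKey();
        writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(),
                         resultIter.getValue());
        reporter.progress();
      }
{code}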

With complex value objects, such as Nutch's ParseData or Inlinks, this 
deserialize/serialize round trip takes time and creates a lot of garbage.

I've created a patch (attached as spill.patch) that seems to work, though it 
has only been tested on 0.13.0. It's a bit clumsy, since ValueBytes has to be 
cast to UncompressedBytes/CompressedBytes in SequenceFile.Writer.

Thoughts?

