Add Writable for very large lists of key / value pairs
------------------------------------------------------

                 Key: HADOOP-2853
                 URL: https://issues.apache.org/jira/browse/HADOOP-2853
             Project: Hadoop Core
          Issue Type: New Feature
          Components: io
    Affects Versions: 0.17.0
            Reporter: Andrzej Bialecki 
             Fix For: 0.17.0
         Attachments: sequenceWritable-v1.patch

Some map-reduce jobs need to aggregate and process very long lists as a single 
value. This usually happens when keys from a large domain are mapped into a 
small domain, so the associated values cannot be reduced to a few aggregates 
but must be preserved as members of one large list. Currently this can be 
implemented with a MapWritable or an ArrayWritable; however, Hadoop has to 
deserialize the current key and value completely into memory, which for 
extremely large values causes frequent OutOfMemoryError exceptions. In 
practice this approach only works for relatively small lists (on the order of 
1000 records).
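
For illustration, the problematic pattern looks roughly like the hypothetical 
reducer below (classic org.apache.hadoop.mapred API; the class name is mine, 
not from any patch). Every value for a key has to be materialized on the heap 
before the ArrayWritable can be emitted, and any later consumer has to 
deserialize the whole value back into memory in one piece:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical reducer: collects all values for a key into one
    // ArrayWritable. The whole list lives on the heap, so a key with
    // millions of values fails with an OutOfMemoryError.
    public class CollectAllReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, ArrayWritable> {

      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, ArrayWritable> output, Reporter reporter)
          throws IOException {
        List<Text> all = new ArrayList<Text>();
        while (values.hasNext()) {
          // The framework may reuse the value instance; copy it.
          all.add(new Text(values.next()));
        }
        output.collect(key,
            new ArrayWritable(Text.class, all.toArray(new Text[all.size()])));
      }
    }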

This patch is an implementation of a Writable that can handle arbitrarily long 
lists. Initially it keeps records in an internal buffer (which can be 
(de)serialized in the ordinary way); if the list size exceeds a certain 
threshold, the list is spilled to an external SequenceFile (hence the name) on 
a configured FileSystem. The content of this Writable can be iterated over, 
and the data is pulled either from the internal buffer or from the external 
file transparently.
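
The exact API is in the attached sequenceWritable-v1.patch; the following is 
only a minimal sketch of the buffer-then-spill idea itself, assuming a 
hypothetical SpillingList helper with a caller-supplied spill Path and 
threshold, and NullWritable keys in the spill file:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Hypothetical sketch of the spilling idea, not the patch's API:
    // values accumulate in a small in-memory list and move to an
    // external SequenceFile once their count exceeds a threshold.
    public class SpillingList {
      private final Configuration conf;
      private final FileSystem fs;
      private final Path spillFile;
      private final int threshold;
      private final List<Text> buffer = new ArrayList<Text>();
      private SequenceFile.Writer writer;   // open while spilling
      private boolean spilled = false;

      public SpillingList(Configuration conf, FileSystem fs,
          Path spillFile, int threshold) {
        this.conf = conf;
        this.fs = fs;
        this.spillFile = spillFile;
        this.threshold = threshold;
      }

      /** Adds a value; on first overflow the buffer spills to disk. */
      public void add(Text value) throws IOException {
        if (!spilled && buffer.size() < threshold) {
          buffer.add(new Text(value));      // defensive copy, in memory
          return;
        }
        if (!spilled) {
          // First overflow: open the spill file, flush the buffer to it.
          writer = SequenceFile.createWriter(fs, conf, spillFile,
              NullWritable.class, Text.class);
          for (Text buffered : buffer) {
            writer.append(NullWritable.get(), buffered);
          }
          buffer.clear();
          spilled = true;
        }
        writer.append(NullWritable.get(), value);
      }

      /** Call once after the last add(); closes any open spill file. */
      public void doneAdding() throws IOException {
        if (writer != null) {
          writer.close();
          writer = null;
        }
      }

      /** Streams values back without materializing the whole list. */
      public void forEach(ValueCallback callback) throws IOException {
        if (!spilled) {
          for (Text t : buffer) {
            callback.handle(t);
          }
          return;
        }
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, spillFile, conf);
        try {
          NullWritable key = NullWritable.get();
          Text value = new Text();
          while (reader.next(key, value)) {
            callback.handle(value);
          }
        } finally {
          reader.close();
        }
      }

      public interface ValueCallback {
        void handle(Text value) throws IOException;
      }
    }

A real implementation additionally has to record the spill file location in 
its serialized form so that readers can find the external data; the sketch 
above sidesteps (de)serialization entirely.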

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
