Hi, I'm creating multiple sequence files as the output of a large MR job (with SequenceFileOutputFormat). As expected, the keys in these sequence files are nicely ordered, since the reduce step does that for us. However, when I run a second MR job to insert the data from these sequence files into HBase, the sorted keys pose a problem: all mappers start writing to the same HBase region, because the keys are ordered and Hadoop splits each file into contiguous parts starting at the beginning. Randomizing the file names helps a little, but there is still a good chance that large parts of the key space get inserted into the same region, causing slowdowns.
Is there a way to randomize the keys in these sequence files? I could simply put a random value in front of the key (like "%RND-keyname"), roughly like the sketch below, but I'm wondering if there is a less dirty method, like a random partitioner class ;-)
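To be concrete, the hack I have in mind looks something like this (just a rough sketch assuming Text keys and the new mapreduce API; SaltingMapper and the four-digit bucket format are names I made up, not anything from an existing library):

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SaltingMapper extends Mapper<Text, Text, Text, Text> {
      private final Random rnd = new Random();
      private final Text salted = new Text();

      @Override
      protected void map(Text key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // Prefix each key with a random bucket so the shuffle sorts on the
        // salt instead of the real key; the original keys then end up
        // scattered throughout the output sequence files.
        salted.set(String.format("%04d-%s", rnd.nextInt(10000), key));
        ctx.write(salted, value);
      }
    }

The insert job's mapper would then strip everything up to the first '-' before building the Put, so the actual HBase row keys stay untouched. It works, but it feels like something a partitioner or input format should be able to do for me. -- Eric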