Hi, I'm creating multiple sequence files as the output of a large MR job (with SequenceFileOutputFormat). As expected, the keys in these sequence files are nicely ordered, since the reduce step does that for us. However, when I run a second MR job to insert the data from these sequence files into HBase, the sorted keys pose a problem: all mappers start writing to the same HBase region, because the keys are ordered and Hadoop splits each file into contiguous parts starting at the beginning. Randomizing the file names helps a little, but there is still a good chance that large parts of the key space get inserted into the same region, causing slowdowns.
Is there a way to randomize the keys in these sequence files? I could simply put a random value in front of the key (like "%RND-keyname"), roughly like the sketch below, but I'm wondering if there is a less dirty method, like a random partitioner class ;-)
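To be concrete, the hack I have in mind looks something like this (just a rough sketch assuming Text keys and the new mapreduce API; SaltingMapper and the four-digit bucket format are names I made up, not anything from an existing library):

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SaltingMapper extends Mapper<Text, Text, Text, Text> {
      private final Random rnd = new Random();
      private final Text salted = new Text();

      @Override
      protected void map(Text key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // Prefix each key with a random bucket so the shuffle sorts on the
        // salt instead of the real key; the original keys then end up
        // scattered throughout the output sequence files.
        salted.set(String.format("%04d-%s", rnd.nextInt(10000), key));
        ctx.write(salted, value);
      }
    }

The insert job's mapper would then strip everything up to the first '-' before building the Put, so the actual HBase row keys stay untouched. It works, but it feels like something a partitioner or input format should be able to do for me. -- Eric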