Hi All, I am using HBase 0.92.1. I am trying to break HBase bulk loading into multiple MR jobs because I want to populate more than one HBase table from a single CSV file. I have looked into the MultiTableOutputFormat class, but it doesn't solve my problem because it does not generate HFiles.
I modified HBase's bulk loader job and removed the reducer phase so that I can generate <ImmutableBytesWritable, Put> output for multiple tables in one MR job (phase 1). I then wrote an input format that reads <ImmutableBytesWritable, Put> pairs, so that a second job can read the output of the phase 1 mappers and generate the HFiles for each table. I implemented a RecordReader on the assumption that I can use readFields(DataInput) to read the ImmutableBytesWritable and the Put in turn. As I understand it, the format of the input file (the output files of the phase 1 mappers) is <serialized ImmutableBytesWritable><serialized Put>.

However, when I try to read the file that way, the size of the ImmutableBytesWritable comes out wrong and the job throws an OOM because of it. The ImmutableBytesWritable (row key) should never be larger than 32 bytes in my use case, but according to the input it is 808460337 bytes. I am fairly sure that either my understanding of the input format is wrong or my RecordReader implementation has a bug.

Can someone tell me the correct way to deserialize the mapper output files? Or is there a problem with my code? Here is a link to my initial stab at the RecordReader: https://dl.dropbox.com/u/64149128/ImmutableBytesWritable_Put_RecordReader.java

--
Thanks & Regards,
Anil Gupta
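A note on the reported size, as a possible debugging aid: 808460337 is 0x30302031 in hex, and those four bytes are the ASCII characters '0', '0', ' ' and '1'. That hints that the 4-byte length prefix may be getting read from text rather than from binary data, which would be the case if phase 1 wrote its output with a text-based output format. This is only an assumption; the post does not say which output format phase 1 uses. The sketch below uses plain java.io only (no HBase classes). The class name LengthPrefixCheck, its helper methods, and the sample row key are hypothetical and exist just for illustration. It mimics a length-prefixed layout (a 4-byte big-endian length followed by the raw bytes, which as far as I know is what ImmutableBytesWritable.write emits) and then shows what readInt() returns when it is pointed at text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixCheck {

    // Hypothetical helper: serializes a row key as a 4-byte big-endian
    // length followed by the raw bytes.
    static byte[] writeLengthPrefixed(byte[] rowKey) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(rowKey.length);
        out.write(rowKey);
        out.flush();
        return buf.toByteArray();
    }

    // Hypothetical helper: reads only the 4-byte length prefix, so nothing
    // is allocated based on a possibly bogus size.
    static int readLengthPrefix(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return in.readInt();
    }

    public static void main(String[] args) throws IOException {
        // Binary data: the prefix is the real key length.
        byte[] binary = writeLengthPrefixed(
                "row-0001".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readLengthPrefix(binary)); // prints 8

        // Text data: the ASCII characters "00 1" are misread as a length.
        byte[] text = "00 1".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readLengthPrefix(text)); // prints 808460337
    }
}
```

If the phase 1 output does turn out to be text, writing it in Hadoop's binary SequenceFile container (SequenceFileOutputFormat in phase 1, SequenceFileInputFormat in phase 2) may remove the need for a custom RecordReader, though that depends on details of the job that are not shown here. Either way, checking the length against a sane upper bound before allocating the buffer would turn the OOM into a clearer error.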