Hi,

I'm running some experiments on importing large datasets into HBase using a
Map job. Before posting some numbers, here is a summary of my test cluster:

I have 7 regionservers and 1 master. I also run HDFS datanodes and Hadoop
tasktrackers on the same 7 regionservers. Similarly, I run the Hadoop
namenode on the same machine as the HBase master. Each machine is an
IBM e325 node that has two 2.2 GHz AMD64 processors, 4 GB RAM, and 80 GB
local disk.

My dataset is simply the output of another map reduce job, consisting of 7
sequence files with a total size of 40 GB. Each file contains (key, value)
records of the form (Text, LongWritable). The keys are sentences or phrases
extracted from sentences and the values are frequencies. The total number of
records is roughly 420m and an average key is around 100 bytes (40 GB / 420m,
ignoring the LongWritable values).

I tried to randomize the (key,value) pairs with another map reduce job and I
also set:

            table.setAutoFlush(false);
            table.setWriteBufferSize(1024*1024*10);

based on some advice that I read before on the list. My Map function that
imports data to HBase is as follows:

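    public void map(Text key, LongWritable value,
            OutputCollector<NullWritable, NullWritable> output,
            Reporter reporter) throws IOException {

        // The phrase is the row key; the frequency goes into frequency:value.
        BatchUpdate update = new BatchUpdate(key.toString());
        update.put("frequency:value", Bytes.toBytes(value.get()));

        // With auto-flush off, this goes through the client-side write buffer.
        table.commit(update);
    }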

So far the import reaches 20% in 40-45 minutes, so importing the whole
data set will presumably take roughly 3.5 hours. I tried different write
buffer sizes between 5 MB and 20 MB and didn't get any significant
improvement. I ran my experiments with 1 or 2 mappers per node, and 1
mapper per node seemed to do better than 2. When I refresh the HBase
master web interface during my imports, I see the requests are generally
divided equally among the 7 regionservers, and as I keep hitting the refresh
button, I see anywhere from 10000 to 70000 requests at once.
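
To put the 20% in 40-45 minutes figure in records per second, here is the
arithmetic as a small Java sketch. The class and method names are made up
for illustration, and it assumes the import rate stays constant for the
whole run and that load is spread evenly over the 7 regionservers, which
may not hold once regions start splitting. It works out to roughly 31,000
to 35,000 records per second in total, and a projected 3.3 to 3.75 hours
for the full data set.

```java
public class ImportEstimate {

    /** Records written per second, given the fraction done after elapsedMinutes. */
    public static double recordsPerSecond(long totalRecords, double fractionDone,
                                          double elapsedMinutes) {
        return totalRecords * fractionDone / (elapsedMinutes * 60.0);
    }

    /** Projected hours for the whole import, assuming a constant rate. */
    public static double projectedHours(double fractionDone, double elapsedMinutes) {
        return elapsedMinutes / fractionDone / 60.0;
    }

    public static void main(String[] args) {
        long totalRecords = 420_000_000L; // "roughly 420m" records
        int regionServers = 7;            // assumes evenly spread load
        double fractionDone = 0.20;       // 20% of the import

        // Evaluate both ends of the observed 40-45 minute range.
        for (double minutes : new double[] {40.0, 45.0}) {
            double rate = recordsPerSecond(totalRecords, fractionDone, minutes);
            System.out.printf(
                "%.0f min: %.0f rec/s total, %.0f rec/s per regionserver, %.2f h projected%n",
                minutes, rate, rate / regionServers,
                projectedHours(fractionDone, minutes));
        }
    }
}
```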

I read some earlier posts from Ryan and Stack, and I was actually expecting
at least twice this performance, so I decided to ask the list whether
this is the expected performance or way below it.

I'd appreciate any comments/suggestions.


Thanks,
Jim
