Re: HBase Performance Improvements?

2012-05-16 Thread Something Something
Hello Bryan & Oliver, I am using suggestions from both of you to do the bulk upload. The problem I am running into is that the job that uses 'HFileOutputFormat.configureIncrementalLoad' is taking a very long time to complete. One thing I noticed is that it's using only 1 Reducer. When I looked at the...
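
A likely culprit: configureIncrementalLoad sets the number of reducers to the number of regions in the target table, so a single reducer usually means the table has only one region. A minimal sketch of the job wiring under that reading (the table name "my_table", the paths, and the class names below are hypothetical, not from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AvroToHFilesJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "avro-to-hfiles");
        job.setJarByClass(AvroToHFilesJob.class);
        // A mapper emitting (rowKey, KeyValue) pairs is assumed here.
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);

        // Sets the partitioner, the total-order sort, and -- crucially --
        // the number of reducers to the number of regions in the target
        // table. A table with a single region yields a single reducer.
        HTable table = new HTable(conf, "my_table");
        HFileOutputFormat.configureIncrementalLoad(job, table);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

So the fix is to create the target table pre-split into many regions before running the job, which is exactly the region-split problem discussed in the replies.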

Re: HBase Performance Improvements?

2012-05-10 Thread Something Something
Thank you Tim & Bryan for the responses. Sorry for the delayed response. Got busy with other things. Bryan - I decided to focus on the region split problem first. The challenge here is to find the correct start key for each region, right? Here are the steps I could think of: 1) Sort the keys...
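
A minimal sketch of that idea, assuming a sorted sample of row keys is already in hand (the class and method names are hypothetical): pick every N/R-th key as a region start key.

    import java.util.List;

    public class SplitPoints {
      /**
       * From a sorted sample of row keys, pick numRegions - 1 evenly
       * spaced keys to serve as region start keys (split points).
       */
      public static byte[][] compute(List<byte[]> sortedKeys, int numRegions) {
        if (sortedKeys.size() < numRegions) {
          throw new IllegalArgumentException("sample smaller than region count");
        }
        byte[][] splits = new byte[numRegions - 1][];
        int step = sortedKeys.size() / numRegions;
        for (int i = 1; i < numRegions; i++) {
          splits[i - 1] = sortedKeys.get(i * step);
        }
        return splits;
      }
    }

Sorting a sample of the keys (say, every 1000th) rather than all billions of them keeps this step cheap.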

Re: HBase Performance Improvements?

2012-05-10 Thread Oliver Meyn (GBIF)
Heya Something, I've put my ImportAvro class up for your amusement. It's a maven project so you should be able to check it out, build the jar with dependencies, and then just run it. See the Readme for more details. http://code.google.com/p/gbif-labs/source/browse/import-avro/ For your...

Re: HBase Performance Improvements?

2012-05-10 Thread Bryan Beaudreault
Since our key was ImmutableBytesWritable (representing a rowKey) and the value was KeyValue, there could be many KeyValues per row key (thus many values per Hadoop key in the reducer). So yes, what we did is very much the same as what you described. Hadoop will sort the ImmutableBytesWritable keys...
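
The mapper shape being described is roughly the following; the input format, the column family "cf", and the field handling are illustrative assumptions, not code from the thread:

    import java.io.IOException;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CellMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Emit one (rowKey, KeyValue) pair per cell; Hadoop sorts by row
        // key, and the reducer set up by configureIncrementalLoad writes
        // the HFiles. Several cells may share the same row key.
        String[] fields = line.toString().split("\t");
        byte[] row = Bytes.toBytes(fields[0]);
        KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
            Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
        context.write(new ImmutableBytesWritable(row), kv);
      }
    }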

Re: HBase Performance Improvements?

2012-05-10 Thread Something Something
I am beginning to get a sinking feeling about this :( But I won't give up! The problem is that when I use one Reducer the job runs for a long time. I killed it after about an hour. Keep in mind, we do have a decent cluster size. The Map stage completes in a minute when I set the no. of reducers to 0...

Re: HBase Performance Improvements?

2012-05-10 Thread Bryan Beaudreault
I don't think there is. You need to have a table seeded with the right regions in order to run the bulk-loader jobs. My machines are sufficiently fast that it did not take that long to sort. One thing I did do to speed this up was to add a mapper to the job that generates the splits, which would...
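
Seeding the table with the right regions amounts to creating it pre-split; a sketch using the HBaseAdmin API of that era, with a placeholder table name and placeholder split keys (real split keys would come from the key-sampling step above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("my_table");
        desc.addFamily(new HColumnDescriptor("cf"));
        // Placeholder split keys; 3 split keys produce 4 regions.
        byte[][] splits = {
            Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
        };
        admin.createTable(desc, splits);
      }
    }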

HBase Performance Improvements?

2012-05-09 Thread Something Something
I ran the following MR job that reads Avro files and puts them into HBase. The files have tons of data (billions of records). We have a fairly decent-sized cluster. When I ran this MR job, it brought down HBase. When I commented out the Puts to HBase, the job completed in 45 seconds (yes, that's seconds).
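
For context, the write path described here would look roughly as follows (all names hypothetical): each record becomes a live Put sent through TableOutputFormat to the region servers, which at billions of records is what overwhelms the cluster.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PutMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Every record becomes an RPC-backed write against the live
        // cluster -- the pattern the bulk-loading replies avoid.
        String[] fields = line.toString().split("\t");
        Put put = new Put(Bytes.toBytes(fields[0]));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
            Bytes.toBytes(fields[1]));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
      }
    }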

Re: HBase Performance Improvements?

2012-05-09 Thread Oliver Meyn (GBIF)
Heya Something, I had a similar task recently, and by far the best way to go about this is with bulk loading after pre-splitting your target table. As you know, ImportTsv doesn't understand Avro files, so I hacked together my own ImportAvro class to create the HFiles that I eventually moved into...
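
Once the HFiles are generated, the step that moves them into HBase is the complete-bulk-load step; a sketch with a hypothetical output path and table name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class CompleteBulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Moves the generated HFiles under the table's regions without
        // going through the normal write path.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/tmp/hfile-output"),
            new HTable(conf, "my_table"));
      }
    }

The same step also ships with HBase as the completebulkload command-line tool.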

Re: HBase Performance Improvements?

2012-05-09 Thread Something Something
Hey Oliver, Thanks a billion for the response -:) I will take any code you can provide even if it's a hack! I will even send you an Amazon gift card - not that you care or need it -:) Can you share some performance statistics? Thanks again. On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF)...

Re: HBase Performance Improvements?

2012-05-09 Thread Tim Robertson
Hey Something, We can share everything, and even our Ganglia is public [1]. We are just setting up a new cluster with Puppet, and the HBase master just came up. The HBase RSs will be up probably tomorrow, when the first task will be a bulk load of 400M records - we're just finishing our working day...

Re: HBase Performance Improvements?

2012-05-09 Thread Bryan Beaudreault
I also recently had this problem, trying to index 6+ billion records into HBase. The job would take about 4 hours before it brought down the entire cluster, at only around 60% complete. After trying a bunch of things, we went to bulk loading. This is actually pretty easy, though the hardest...