Hello Bryan and Oliver,
I am using suggestions from both of you to do the bulk upload. The problem
I am running into is that the job that uses 'HFileOutputFormat.
configureIncrementalLoad' is taking a very long time to complete. One thing I
noticed is that it's using only 1 Reducer.
When I looked at the …
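To make the reducer count concrete: as far as I know, configureIncrementalLoad
sets the number of reduce tasks to the number of regions in the target table,
so an unsplit table gets exactly one reducer. A minimal driver sketch, assuming
the older HTable-based API (class, table name, and paths are placeholders, not
from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "avro-to-hfiles");
    job.setJarByClass(BulkLoadDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    HTable table = new HTable(conf, "my_table"); // placeholder table name
    // Configures the TotalOrderPartitioner, the sorting reducer, and,
    // crucially, sets the number of reduce tasks to the number of
    // regions in the table: one region means one reducer.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}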
Thank you Tim and Bryan for the responses. Sorry for the delayed reply;
I got busy with other things.
Bryan - I decided to focus on the region split problem first. The
challenge here is to find the correct start key for each region, right?
Here are the steps I could think of:
1) Sort the keys. …
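To sketch where those steps could lead (a toy helper under my own assumptions,
not Bryan's code): given a sample of row keys, sort them and pick evenly
spaced keys as the region start keys.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitPoints {
  /**
   * Picks numRegions - 1 split keys from a sample of row keys.
   * Assumes the sample is much larger than numRegions.
   */
  public static List<byte[]> pick(List<byte[]> sampleKeys, int numRegions) {
    List<byte[]> sorted = new ArrayList<byte[]>(sampleKeys);
    Collections.sort(sorted, Bytes.BYTES_COMPARATOR); // step 1: sort the keys
    List<byte[]> splits = new ArrayList<byte[]>();
    int step = sorted.size() / numRegions;
    for (int i = 1; i < numRegions; i++) {
      splits.add(sorted.get(i * step)); // start key of region i
    }
    return splits;
  }
}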
Heya Something,
I've put my ImportAvro class up for your amusement. It's a Maven project, so
you should be able to check it out, build the jar with dependencies, and then
just run it. See the Readme for more details.
http://code.google.com/p/gbif-labs/source/browse/import-avro/
For your …
Since our Key was ImmutableBytesWritable (representing a rowKey) and the
Value was KeyValue, there could be many KeyValues per row key (thus many
values per Hadoop key in the reducer). So yes, what we did is very much the
same as what you described. Hadoop will sort the ImmutableBytesWritable keys …
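A rough sketch of that map side, assuming tab-separated text input purely to
keep it short (the real input here was Avro, and the column family is a
placeholder):

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordToKeyValueMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  private static final byte[] FAMILY = Bytes.toBytes("d"); // placeholder

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Expect lines of the form: rowKey <TAB> qualifier <TAB> value
    String[] fields = line.toString().split("\t");
    byte[] row = Bytes.toBytes(fields[0]);
    KeyValue kv = new KeyValue(row, FAMILY, Bytes.toBytes(fields[1]),
        Bytes.toBytes(fields[2]));
    // Many KeyValues can share one row key; Hadoop sorts by the
    // ImmutableBytesWritable row key on the way to the reducer.
    context.write(new ImmutableBytesWritable(row), kv);
  }
}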
I am beginning to get a sinking feeling about this :( But I won't give up!
The problem is that when I use one Reducer, the job runs for a long time; I
killed it after about an hour. Keep in mind, we do have a decent cluster
size. The Map stage completes in a minute when I set the number of reducers
to 0.
I don't think there is. You need to have a table seeded with the right
regions in order to run the bulk loader jobs.
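Concretely, "seeding" can be done with the createTable overload that takes
split keys; a sketch with the older HBaseAdmin API, where the table name,
family, and split keys are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class SeedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("my_table"); // placeholder
    desc.addFamily(new HColumnDescriptor("d"));               // placeholder
    // Each split key becomes the start key of a region; 3 splits = 4 regions.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
    admin.createTable(desc, splits);
    admin.close();
  }
}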
My machines are sufficiently fast that it did not take that long to sort.
One thing I did do to speed this up was add a mapper to the job that
generates the splits, which would …
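The message cuts off there, but a split-generating mapper might look roughly
like the sketch below (all names are placeholders): emit every Nth row key and
let the shuffle sort the sample, which can then feed something like the
SplitPoints helper above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeySampleMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private static final int SAMPLE_EVERY = 1000; // placeholder sampling rate
  private long seen = 0;

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Emit every Nth row key; the shuffle sorts the sampled keys for us.
    if (seen++ % SAMPLE_EVERY == 0) {
      String rowKey = line.toString().split("\t")[0]; // placeholder layout
      context.write(new Text(rowKey), NullWritable.get());
    }
  }
}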
I ran the following MR job that reads Avro files and puts them into HBase.
The files have tons of data (billions of records). We have a fairly
decent-sized cluster. When I ran this MR job, it brought down HBase. When I
commented out the Puts to HBase, the job completed in 45 seconds (yes, that's
seconds).
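For contrast with bulk loading, a job of that shape, writing one live Put per
record through TableOutputFormat, might look roughly like this (text input and
all names are placeholders; the real job read Avro). Billions of individual
Puts go through the region servers' full write path (WAL, memstore, flushes,
compactions), which is what bulk loading sidesteps.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NaivePutMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t"); // rowKey <TAB> value
    Put put = new Put(Bytes.toBytes(fields[0]));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(fields[1]));
    // With TableOutputFormat, every record becomes a live write against
    // the region servers instead of an offline HFile.
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}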
Heya Something,
I had a similar task recently, and by far the best way to go about this is
with bulk loading after pre-splitting your target table. As you know,
ImportTsv doesn't understand Avro files, so I hacked together my own
ImportAvro class to create the HFiles that I eventually moved into …
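The snippet cuts off, but the step of moving generated HFiles into a table is
typically done with LoadIncrementalHFiles (the completebulkload tool). A
sketch with the older API; the table name is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // args[0] is the HFile output directory produced by the MR job.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path(args[0]), new HTable(conf, "my_table"));
  }
}

I believe the same class is what the command-line completebulkload tool runs.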
Hey Oliver,
Thanks a billion for the response :-) I will take any code you can provide,
even if it's a hack! I will even send you an Amazon gift card - not that you
care or need it :-)
Can you share some performance statistics? Thanks again.
On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF)
Hey Something,
We can share everything, and even our Ganglia is public [1]. We are just
setting up a new cluster with Puppet, and the HBase master just came up. The
HBase RSs (region servers) will probably be up tomorrow, and the first task
will be a bulk load of 400M records - we're just finishing our working day …
I also recently had this problem, trying to index 6+ billion records into
HBase. The job would take about 4 hours before it brought down the entire
cluster, at only around 60% complete.
After trying a bunch of things, we went to bulk loading. This is actually
pretty easy, though the hardest …