We are bulk loading 1 billion rows into HBase. The 1-billion-row dataset was
split into 20 files of ~22.5 GB each. Copying a file into HDFS takes ~2
minutes. Ingesting the first file into HBase took ~3 hours, the next took
~5 hours, and the time keeps increasing. By the sixth or seventh file the
ingestion simply stops (the MapReduce bulk load stalls at 99% of the map
phase and around 22% of the reduce phase). We also noticed that as soon as
the reducers start, the progress of the job slows down.

The logs did not show any problem and we do not see any hot-spotting (the
table is already salted). We are running out of ideas. A few questions to
get started:
1- Is the increasing MR time expected? Does MR need to sort the new data
against the data that is already ingested? (Our driver follows the usual
incremental-load pattern; see the first sketch after this list.)
2- Is there a way to speed this up, especially since our data is already
sorted? Going from 2 minutes on HDFS to 5 hours on HBase is a big gap; a
word-count MapReduce job on 24 GB took only ~7 minutes. Removing the
reducers from the existing CSV bulk load will not help, as the mappers emit
the data in a random order. (A pre-splitting sketch follows the list.)
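
For reference, our bulk-load driver is essentially the stock incremental-load
pattern. A minimal sketch, assuming the HBase 2.x client API (my_table, the
cf family, and the two-column CSV layout are placeholders for our real
schema):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvBulkLoad {

  // Placeholder mapper: rowkey in the first CSV field, the rest in one cell.
  public static class CsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",", 2);
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"),
          Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "csv-bulk-load");
    job.setJarByClass(CsvBulkLoad.class);
    job.setMapperClass(CsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class); // selects PutSortReducer below
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    TableName name = TableName.valueOf("my_table");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(name);
         RegionLocator locator = conn.getRegionLocator(name)) {
      // Sorts only this job's output (not rows already in the table), but
      // partitions by the table's CURRENT region boundaries, i.e. one
      // reducer per region.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
    }
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

As far as we can tell, configureIncrementalLoad() wires in a
TotalOrderPartitioner keyed to the region boundaries at submission time,
which would mean the reduce parallelism is capped by the region count.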
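In case that reducer-per-region cap is the issue: a salted but unsplit table
would funnel everything through a handful of reducers. A sketch of
pre-splitting the salted table so each salt bucket gets its own region (the
two-digit 00..99 salt is an assumption; it would be adjusted to our real
prefix scheme):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSaltedTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // One split point per salt bucket, so the bulk-load reduce phase gets
      // one reducer per bucket instead of a few for the whole table.
      int buckets = 100; // assumption: two-digit salt prefixes 00..99
      byte[][] splits = new byte[buckets - 1][];
      for (int i = 1; i < buckets; i++) {
        splits[i - 1] = Bytes.toBytes(String.format("%02d", i));
      }
      admin.createTable(
          TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table"))
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
              .build(),
          splits);
    }
  }
}

Would pre-splitting along these lines be the right direction, or is the
growing job time expected for some other reason?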

regards,

Dillon

Dillon Chrimes (PhD)
University of Victoria
Victoria BC Canada