On Mon, Dec 27, 2010 at 1:54 AM, Nanheng Wu <[email protected]> wrote:
> I am running some tests to load data from HDFS into HBase in a MR job.
> I am pretty new to HBase and I have some questions regarding bulk load
> performance: I have a small cluster with 4 nodes, I set up one node to
> run Namenode/JobTracker/ZK, and the other three nodes all run
> TaskTracker/DataNode/HRegion. During my test I am seeing about 1300
> inserts per second total and it feels kind of slow.

I don't know what your hardware is like but yeah, it sounds kinda slow.


> My rows are pretty
> small ~250 bytes. I am wondering if it is a good idea to be running MR
> on all nodes. Would it be better if I run MR load job on separate
> nodes?

Well, where do you think the time is being spent?  What do you think
is holding up the job?  Is your MR job doing any massaging of the
data?  Do you have many concurrent mappers running at the same time on
each node?  Does your MR job do a map and reduce or just a map?  Is it
the insert into hbase that is slow?  What do the hbase logs say?  Are
the regionservers blocking because they are flushing memory?
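
For a sense of scale, here is a back-of-the-envelope sketch using only
the two figures from your mail (about 1300 inserts per second, rows of
about 250 bytes).  The class and method names are made up for
illustration, and it ignores key, column and RPC overhead, so treat the
result as a rough lower bound on the bytes actually moved.

```java
public class InsertThroughput {
    // Raw payload rate in bytes per second for a given insert rate and row size.
    static long bytesPerSecond(long rowsPerSecond, long bytesPerRow) {
        return rowsPerSecond * bytesPerRow;
    }

    public static void main(String[] args) {
        // 1300 rows/s and 250 bytes/row are the figures quoted in the question.
        long rate = bytesPerSecond(1300, 250);
        System.out.printf("%d bytes/s (about %.0f KB/s)%n", rate, rate / 1000.0);
    }
}
```

That works out to roughly 325 KB/s of row data across the whole
cluster, which is why the questions above are about where the time
goes rather than about raw data volume.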

> Also I observe that one task tracker's CPU usage was twice as
> high as the other two.

Maybe it's the one that is doing the inserting?  How many regions are
in your hbase cluster?  When you look at the hbase UI, is load being
spread across the hbase cluster or are you just hitting one node?
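
To illustrate why the region count matters: each region serves a
contiguous range of row keys, so a table with a single region sends
every insert to one regionserver.  The sketch below is plain Java, not
the HBase client API.  The class name, the region start keys and the
row keys are all hypothetical, and it assumes ASCII keys so that Java
string ordering matches byte ordering.

```java
import java.util.Map;
import java.util.TreeMap;

public class RegionSpread {
    // Count how many row keys fall into each region. A region is identified
    // by its start key and covers keys up to the next region's start key.
    // The first start key must be "" so that every row key has a region.
    static Map<String, Integer> countPerRegion(String[] regionStartKeys, String[] rowKeys) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String start : regionStartKeys) {
            counts.put(start, 0);
        }
        for (String key : rowKeys) {
            // floorKey finds the greatest start key <= the row key.
            String region = counts.floorKey(key);
            counts.merge(region, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] rows = {"a1", "h1", "h2", "h3", "z9"}; // hypothetical row keys
        // One region: every row lands in the same place.
        System.out.println(countPerRegion(new String[] {""}, rows));
        // Three regions: rows are divided by key range.
        System.out.println(countPerRegion(new String[] {"", "g", "n"}, rows));
    }
}
```

With one region all five rows go to the same place; with three regions
they split 1/3/1, which is still skewed if most keys share a prefix.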

St.Ack

> I can't figure out why that is, does that
> indicate some hot spots in the cluster? I'd really appreciate some
> ideas, and please let me know if my description is not specific or
> detailed enough and what other information I can provide to help
> diagnose the problem. Thanks!
>
