Thanks for the answers. I will use them as the basis for my investigation. I am running a map-only job. Is it better to write to HBase with the HBase client directly, or to use TableOutputFormat?
On Mon, Dec 27, 2010 at 8:38 AM, Stack <[email protected]> wrote:
> On Mon, Dec 27, 2010 at 1:54 AM, Nanheng Wu <[email protected]> wrote:
>> I am running some tests to load data from HDFS into HBase in a MR job.
>> I am pretty new to HBase and I have some questions regarding bulk load
>> performance: I have a small cluster with 4 nodes, I set up one node to
>> run Namenode/JobTracker/ZK, and the other three nodes all run
>> TaskTracker/DataNode/HRegion. During my test I am seeing about 1300
>> inserts per second total and it feels kind of slow.
>
> I don't know what your hardware is like but yeah, it sounds kinda slow.
>
>> My rows are pretty
>> small ~250 bytes. I am wondering if it is a good idea to be running MR
>> on all nodes. Would it be better if I run MR load job on separate
>> nodes?
>
> Well, where do you think the time is being spent? What is holding up
> the job do you think? Is your MR job doing any massaging of the data.
> Do you have many concurrent mappers run at same time on each node?
> Does your MR job do a map and reduce or just a map? Is it the insert
> into hbase that is slow? What do the hbase logs say? Are they
> blocking because they are flushing memory?
>
>> Also I observe that one task tracker's CPU usage was twice as
>> high as the other two.
>
> Maybe its the one that is doing the inserting? How many regions in
> your hbase cluster? When you look at hbase UI, is load being spread
> across the hbase cluster or you just hitting one node?
>
> St.Ack
>
>> I can't figure out why that is, does that
>> indicate some hot spots in the cluster? I'd really appreciate some
>> ideas, and please let me know if my description is not specific or
>> detailed enough and what other information I can provide to help
>> diagnose the problem. Thanks!
>>
>
