Hi there,

The HBase RefGuide has a comprehensive case study on a case like this one. It might not be the exact problem, but the diagnostic approach should help.
http://hbase.apache.org/book.html#casestudies.slownode

On 1/4/13 10:37 PM, "Liu, Raymond" <raymond....@intel.com> wrote:

>Hi
>
>I've run into a weird issue where some map tasks lag behind the others:
>
>I have a small Hadoop/HBase cluster with 1 master node and 4 regionserver
>nodes, all with 16 CPUs and with the map and reduce slots set to 24.
>
>A few tables are created with their regions distributed evenly across the
>region nodes (say 16 regions for each region server). Each region also has
>almost the same number of KVs, with very similar sizes. All tables had a
>major_compact done to ensure data locality.
>
>I have an MR job which simply does a local region scan in every map task
>(so 16 map tasks for each regionserver node).
>
>In theory, every map task should finish in a similar time.
>
>But in practice, some regions on the same region server always lag behind
>a lot, taking 150%~250% of the average time of the other map tasks.
>
>If this happened to a single region server for every table, I might
>suspect a disk issue or some other reason that degrades the performance
>of this region server.
>
>But the weird thing is that, for any single table, almost all the map
>tasks on one particular regionserver lag behind, yet for different
>tables, the lagging regionserver is different! And the regions and region
>sizes are distributed evenly, which I have double-checked many times.
>(I even tried setting the replication factor to 4 to ensure every node
>has a local copy of the data.)
>
>Say for table 1, all map tasks on regionserver node 2 are slow, while for
>table 2, maybe all map tasks on regionserver node 3 are slow. With table 1,
>it will always be regionserver node 2 that is slow regardless of cluster
>restarts, and the slowest map task will always be the very same one. And
>it doesn't go away even if I do a major compact again.....
>
>So, could anyone give me some clue about what might lead to this weird
>behavior? Any wild guess is welcome!
>
>(BTW, I didn't encounter this issue a few days ago with the same table.
>I did restart the cluster and make a few changes to the config file
>during that period, but restoring the config file didn't help.)
>
>Best Regards,
>Raymond Liu
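To check whether the stragglers really cluster on one regionserver per table, it can help to put numbers on the per-task durations rather than eyeballing them. Below is a minimal sketch, assuming you have already exported each map task's regionserver host and run time (for example from the job history). The helper names (`lag_ratios`, `lagging_tasks`, `ratio_by_server`) and the `SAMPLE` data are hypothetical, made up for illustration. It flags a task when its duration exceeds 1.5x the average of all the *other* tasks, mirroring the "150~250% of the other map tasks' average" description in the quoted message.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input shape: task id -> (regionserver host, duration in seconds).
SAMPLE = {
    "m_000": ("rs1", 100), "m_001": ("rs1", 100),
    "m_002": ("rs2", 250), "m_003": ("rs2", 250),
    "m_004": ("rs3", 100), "m_005": ("rs3", 100),
    "m_006": ("rs4", 100), "m_007": ("rs4", 100),
}

def lag_ratios(durations):
    """Ratio of each task's duration to the mean duration of all *other* tasks."""
    n = len(durations)
    if n < 2:
        raise ValueError("need at least two tasks to compare")
    total = sum(sec for _, sec in durations.values())
    return {tid: sec / ((total - sec) / (n - 1))
            for tid, (_, sec) in durations.items()}

def lagging_tasks(durations, threshold=1.5):
    """Task ids that took more than `threshold` times the others' average."""
    ratios = lag_ratios(durations)
    return sorted(tid for tid, r in ratios.items() if r > threshold)

def ratio_by_server(durations):
    """Mean lag ratio per regionserver, to see whether stragglers share a node."""
    ratios = lag_ratios(durations)
    grouped = defaultdict(list)
    for tid, (server, _) in durations.items():
        grouped[server].append(ratios[tid])
    return {server: mean(rs) for server, rs in grouped.items()}

if __name__ == "__main__":
    print(lagging_tasks(SAMPLE))             # ['m_002', 'm_003']
    by_server = ratio_by_server(SAMPLE)
    print(max(by_server, key=by_server.get))  # rs2
```

Comparing each task against the others (rather than against an average that includes the straggler itself) keeps a few very slow tasks from inflating the baseline they are measured against. Running this per table would show whether the slow node really moves from table to table, as described above.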