Schubert, yeah, latency has always been an issue in HBase since reliability was our focus. This is going to be fixed in 0.20 with a new file format and caching of hot cells.
Regarding your problem, please enable DEBUG and look in your regionserver logs to see what's happening. Usually, if you get a RetriesExhaustedException, it's because something is taking too long to get a region reassigned. I also suggest that you upgrade to HBase 0.19.1 candidate 2, which has two fixes for massive uploads like yours. I also see that there was some swap; please try setting the swappiness to something like 20. It helped me a lot! http://www.sollers.ca/blog/2008/swappiness/

J-D

On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]> wrote:
> This is the "top" info of a regionserver/datanode/tasktracker node. We can
> see the HRegionServer node is very heavily loaded.
>
> top - 15:44:17 up 23:46, 3 users, load average: 1.72, 0.55, 0.23
> Tasks: 109 total, 1 running, 108 sleeping, 0 stopped, 0 zombie
> Cpu(s): 56.2%us, 3.1%sy, 0.0%ni, 26.6%id, 12.7%wa, 0.2%hi, 1.2%si, 0.0%st
> Mem:  4043484k total, 2816028k used, 1227456k free,   18308k buffers
> Swap: 2097144k total,   22944k used, 2074200k free, 1601760k cached
>
>   PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
> 21797 schubert 24  0 3414m 796m 9848 S  218 20.2  2:55.21 java (HRegionServer)
> 20545 schubert 24  0 3461m  98m 9356 S   15  2.5 10:49.64 java (DataNode)
> 22677 schubert 24  0 1417m 107m 9224 S    9  2.7  0:08.86 java (MapReduce Child)
>
> On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]> wrote:
>
>> Hi all,
>>
>> I am running a MapReduce job to read files and insert rows into an HBase
>> table. It is like an ETL procedure in the database world.
>>
>> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master, DELL 2950 servers with
>> 4GB memory and a 1TB disk on each node.
>>
>> Issue 1.
>>
>> Each run of the MapReduce job eats some files, and I want to keep this
>> pipeline balanced (I have defined the HBase table with a TTL), since many
>> new incoming files need to be eaten. But sometimes many of the following
>> RetriesExhaustedExceptions happen and the job is blocked for a long time.
>> And then the MapReduce job falls behind and cannot catch up with the speed
>> of the incoming files.
>>
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
>> region server Some server for region TABA,113976305...@2009-03-17
>> 13:11:04.135,1237268430402, row '113976399...@2009-03-17 13:40:12.213', but
>> failed after 10 attempts.
>>
>> Exceptions:
>>
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>> at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>> at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>> at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>> at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>> at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>> at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
>> at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>> at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> I checked the code and found that the retry policy for this exception is
>> like the BSD TCP syn backoff policy:
>>
>> /**
>>  * This is a retry backoff multiplier table similar to the BSD TCP syn
>>  * backoff table, a bit more aggressive than simple exponential backoff.
>>  */
>> public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
>>
>> So 10 retries will take a long time: the multipliers sum to 71, and with
>> the default pause of 2 seconds that is 71*2 = 142 seconds. So I think the
>> following two parameters can be changed, and I reduced them:
>> hbase.client.pause = 1 (second)
>> hbase.client.retries.number = 4
>> Now the blocking time should be 5 seconds (1+1+1+2).
>>
>> In fact, I have sometimes also met such exceptions in:
>> getRegionServerWithRetries, and
>> getRegionLocationForRowWithRetries, and
>> processBatchOfRows
>>
>> I want to know what causes the above exception. Region splitting?
>> Compaction? Or a region moving to another regionserver?
>> This exception happens very frequently in my cluster.
>>
>> Issue 2: Also in the above workflow, I found the regionserver nodes are
>> very busy with a very heavy load.
>> The HRegionServer process takes almost all of the used CPU.
>> In my previous test round, I ran 4 map tasks to insert data on each node;
>> now I just run 2 map tasks on each node. I think the performance of
>> HRegionServer is not good enough for some of my applications, because
>> besides the insert/input procedure, we will query rows from the same table
>> for online applications, which need low latency. In my test experience,
>> the online query (scan of about 50 rows with a startRow and endRow) latency
>> is 2-6 seconds for the first query, and about 500ms for a re-query. And
>> when the cluster is busy inserting data into the table, the query latency
>> may be as long as 10-30 seconds. And when the table is splitting/compacting,
>> maybe longer.
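The backoff arithmetic and the retry loop it drives can be sanity-checked with a short standalone sketch. This is plain Java with no HBase dependency; apart from the RETRY_BACKOFF table quoted above, the method and class names here are my own, not HBase's:

```java
import java.util.concurrent.Callable;

// Standalone sketch of the client retry behaviour discussed in the thread:
// before retry i the client sleeps pause * RETRY_BACKOFF[i], and once the
// configured attempts are used up it gives up with a "retries exhausted"
// failure like the one in the stack trace above.
public class RetrySketch {
    // The multiplier table quoted in the mail, from HBase 0.19.
    static final int[] RETRY_BACKOFF = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

    // Total seconds the client can spend sleeping, for a given
    // hbase.client.pause (expressed in seconds here) and
    // hbase.client.retries.number.
    static int totalSleepSeconds(int pauseSeconds, int retries) {
        int total = 0;
        for (int i = 0; i < retries && i < RETRY_BACKOFF.length; i++) {
            total += pauseSeconds * RETRY_BACKOFF[i];
        }
        return total;
    }

    // The shape of the retry wrappers (getRegionServerWithRetries etc.):
    // try the call, sleep on failure, give up after the last attempt
    // (failures here stand in for a region splitting or being reassigned).
    static <T> T callWithRetries(Callable<T> op, int retries, long pauseMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < retries; i++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (i + 1 < retries) {
                    Thread.sleep(pauseMs
                            * RETRY_BACKOFF[Math.min(i, RETRY_BACKOFF.length - 1)]);
                }
            }
        }
        throw new Exception("failed after " + retries + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        // Defaults in the mail: pause = 2s, 10 retries -> 71 * 2 = 142 seconds.
        System.out.println(totalSleepSeconds(2, 10));
        // Reduced settings: pause = 1s, 4 retries -> 1+1+1+2 = 5 seconds.
        System.out.println(totalSleepSeconds(1, 4));
        // A call that succeeds on its third attempt recovers transparently.
        int[] calls = {0};
        System.out.println(callWithRetries(
                () -> {
                    if (++calls[0] < 3) throw new RuntimeException("region moving");
                    return "ok";
                }, 4, 1));
    }
}
```

Running it prints 142 and 5, matching the two blocking-time figures worked out in the mail.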
