Schubert,

Yeah, latency has always been an issue in HBase since reliability was our
focus. This is going to be addressed in 0.20 with a new file format and
caching of hot cells.

Regarding your problem, please enable DEBUG and look in your
regionserver logs to see what's happening. Usually, if you get a
RetriesExhaustedException, it's because something is taking too long to
get a region reassigned.
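One way to enable that (assuming the stock conf/log4j.properties layout that
ships with HBase) is to bump the HBase logger on each regionserver and restart
it:

```properties
# conf/log4j.properties on each regionserver: DEBUG for all HBase classes
log4j.logger.org.apache.hadoop.hbase=DEBUG
```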

I also suggest that you upgrade to HBase 0.19.1 candidate 2, which has
a fix for massive uploads like yours.

I also see that there was some swap; please try setting the swappiness
to something like 20. It helped me a lot!
http://www.sollers.ca/blog/2008/swappiness/
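Concretely (a sketch; the value 20 is just the suggestion above, tune to
taste), that would look like:

```shell
# Lower swappiness so the kernel prefers dropping page cache over
# swapping out JVM heap pages (which stalls the regionserver).
sysctl -w vm.swappiness=20

# Persist the setting across reboots.
echo "vm.swappiness = 20" >> /etc/sysctl.conf
```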

J-D

On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]> wrote:
> This is the "top" info of a regionserver/datanode/tasktracker node. We can
> see the HRegionServer node is very heavily loaded.
>
> top - 15:44:17 up 23:46,  3 users,  load average: 1.72, 0.55, 0.23
> Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
> Cpu(s): 56.2%us,  3.1%sy,  0.0%ni, 26.6%id, 12.7%wa,  0.2%hi,  1.2%si,
>  0.0%st
> Mem:   4043484k total,  2816028k used,  1227456k free,    18308k buffers
> Swap:  2097144k total,    22944k used,  2074200k free,  1601760k cached
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>
> 21797 schubert  24   0 3414m 796m 9848 S  218 20.2   2:55.21 java
> (HRegionServer)
> 20545 schubert  24   0 3461m  98m 9356 S   15  2.5  10:49.64 java (DataNode)
>
> 22677 schubert  24   0 1417m 107m 9224 S    9  2.7   0:08.86 java (MapReduce
> Child)
>
> On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]> wrote:
>
>> Hi all,
>>
>> I am running a MapReduce job to read files and insert rows into an HBase
>> table. It is like an ETL procedure in the database world.
>>
>> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master, DELL 2950 servers with
>> 4GB memory and 1TB disk on each node.
>>
>> Issue 1.
>>
>> Each run of the MapReduce job ingests a batch of files, and I want to keep
>> this pipeline balanced (I have defined the HBase table with a TTL) since
>> many new incoming files need to be ingested. But sometimes many of the
>> following RetriesExhaustedExceptions happen and the job is blocked for a
>> long time, so the MapReduce job falls behind and cannot catch up with the
>> rate of incoming files.
>>
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact 
>> region server Some server for region TABA,113976305...@2009-03-17 
>> 13:11:04.135,1237268430402, row '113976399...@2009-03-17 13:40:12.213', but 
>> failed after 10 attempts.
>>
>> Exceptions:
>>
>>       at 
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>>       at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>>       at 
>> net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>>       at 
>> net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>>       at 
>> org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
>>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>>       at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> I checked the code and found that the retry policy for this exception is
>> like the BSD TCP syn backoff policy:
>>
>>   /**
>>    * This is a retry backoff multiplier table similar to the BSD TCP syn
>>    * backoff table, a bit more aggressive than simple exponential backoff.
>>    */
>>   public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
>>
>> So 10 retries will take a long time: 71*2=142 seconds. So I think the
>> following two parameters can be changed, and I reduced them:
>> hbase.client.pause = 1 (second)
>> hbase.client.retries.number = 4
>> Now the blocking time should be 5 seconds.
>>
>> In fact, I have also sometimes met such exceptions in:
>> getRegionServerWithRetries,
>> getRegionLocationForRowWithRetries, and
>> processBatchOfRows.
>>
>> I want to know what causes the above exception. Region splitting?
>> Compaction? Or a region moving to another regionserver?
>> This exception happens very frequently in my cluster.
>>
>> Issue 2: In the same workflow, I found the regionserver nodes are very
>> busy, with a very heavy load.
>> The HRegionServer process takes almost all of the used CPU.
>> In my previous test round, I ran 4 map tasks to insert data on each node;
>> now I just run 2 map tasks on each node. I think the performance of
>> HRegionServer is not good enough for some of my applications, because
>> besides the insert/input procedure, we will query rows from the same table
>> for online applications, which need low latency. In my tests, I find the
>> online query (a scan of about 50 rows with startRow and endRow) latency is
>> 2-6 seconds for the first query, and about 500ms for a re-query. And when
>> the cluster is busy inserting data into the table, the query latency may be
>> as long as 10-30 seconds. And when the table is splitting/compacting, maybe
>> longer.
>>
>>
>>
>
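For what it's worth, the backoff arithmetic quoted above checks out. A minimal
sketch (the class and method names are mine, not HBase's; the pause of 2
seconds is what the 71*2=142 figure in the mail implies for the default
hbase.client.pause):

```java
// Sketch of the client-side blocking time implied by the RETRY_BACKOFF
// table quoted above: each attempt i sleeps pause * RETRY_BACKOFF[i].
public class BackoffMath {
    static final int[] RETRY_BACKOFF = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

    static int totalWaitSeconds(int pauseSeconds, int retries) {
        int total = 0;
        for (int i = 0; i < retries && i < RETRY_BACKOFF.length; i++) {
            total += pauseSeconds * RETRY_BACKOFF[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Defaults implied by the mail: pause = 2s, 10 retries -> 2*71 = 142s.
        System.out.println(totalWaitSeconds(2, 10)); // 142
        // Schubert's reduced settings: pause = 1s, 4 retries -> 1+1+1+2 = 5s.
        System.out.println(totalWaitSeconds(1, 4)); // 5
    }
}
```

So cutting hbase.client.pause and hbase.client.retries.number does shrink the
worst-case blocking from ~142 seconds to ~5, at the cost of giving the cluster
less time to reassign a region before the client gives up.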
