Hi all,

I am running a MapReduce job that reads files and inserts rows into an
HBase table. It is like an ETL procedure in the database world.

Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master, DELL2950 servers
with 4GB memory and 1TB disk on each node.

Issue 1.

Each run of the MapReduce job consumes a batch of files, and I want to
keep this pipeline balanced (I have defined the HBase table with a TTL),
since new incoming files constantly need to be consumed. But sometimes
many RetriesExhaustedExceptions like the following occur and the job is
blocked for a long time, so the MapReduce job falls behind and cannot
keep up with the rate of incoming files.

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
contact region server Some server for region
TABA,113976305...@2009-03-17 13:11:04.135,1237268430402, row
'113976399...@2009-03-17 13:40:12.213', but failed after 10 attempts.

Exceptions:

        at 
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
        at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
        at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
        at 
net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
        at 
net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
        at 
org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
        at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
        at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

I checked the code and found that the retry policy for this exception is
similar to the BSD TCP SYN backoff policy:

  /**
   * This is a retry backoff multiplier table similar to the BSD TCP syn
   * backoff table, a bit more aggressive than simple exponential backoff.
   */
  public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

So 10 retries will take a long time: 71*2 = 142 seconds. So I think the
following two parameters can be changed, and I reduced them:
hbase.client.pause = 1 (second)
hbase.client.retries.number = 4
Now the blocked time should be 5 seconds.
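To double-check those numbers, here is a minimal sketch of the wait-time arithmetic, assuming the default hbase.client.pause is 2 seconds and that the client sleeps pause * RETRY_BACKOFF[i] before each retry (the class and method names here are just for illustration, not part of HBase):

```java
public class RetryBackoffEstimate {
    // Multiplier table copied from HBase 0.19's HConnectionManager
    static final int[] RETRY_BACKOFF = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

    /** Total sleep time in seconds across `retries` attempts at the given base pause. */
    static int totalSleepSeconds(int pauseSeconds, int retries) {
        int total = 0;
        for (int i = 0; i < retries && i < RETRY_BACKOFF.length; i++) {
            total += pauseSeconds * RETRY_BACKOFF[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Defaults (pause = 2s, 10 retries): (1+1+1+2+2+4+4+8+16+32) * 2 = 142
        System.out.println(totalSleepSeconds(2, 10)); // prints 142
        // Reduced settings (pause = 1s, 4 retries): (1+1+1+2) * 1 = 5
        System.out.println(totalSleepSeconds(1, 4));  // prints 5
    }
}
```

So with pause = 1 and retries = 4, a client is blocked at most about 5 seconds before the RetriesExhaustedException surfaces, versus roughly 142 seconds under the defaults.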

In fact, I have sometimes also seen this exception from:
getRegionServerWithRetries, and
getRegionLocationForRowWithRetries, and
processBatchOfRows

I want to know what causes the above exception. Region splitting?
Compaction? Or a region moving to another region server?
This exception happens very frequently in my cluster.

Issue 2: In the same workflow, I found the region server nodes are very
busy with a very heavy load.
The HRegionServer process takes almost all of the CPU.
In my previous test round, I ran 4 map tasks inserting data on each node;
now I run just 2 map tasks per node. I think the performance of
HRegionServer is not good enough for some of my applications, because
besides the insert/input procedure, we also query rows from the same
table for online applications, which need low latency. In my tests, the
online query (a scan of about 50 rows with startRow and endRow) has a
latency of 2-6 seconds for the first query and about 500ms for a repeated
query. And when the cluster is busy inserting data into the table, the
query latency may be as long as 10-30 seconds; when the table is
splitting/compacting, maybe longer.
