Hi All-

I have a repeatable and troublesome HBase interaction that I would like some 
advice on.  

I am running a 5-node cluster on v0.94 on cdh3u3 and accessing it through the 
Java client API. Each RS has 32G of RAM and runs with a 16G heap, 4G of which 
is allocated to the block cache. Used heap on each RS is well below the 16G 
available. 

My client code has a set of deletes to carry out.  After successfully issuing 
19 such deletes, the client begins logging HBase errors while trying to 
complete the remaining deletes.  It logs an ERROR every 60s, 10 times in 
total, and then gives up. 

I estimate that the client successfully deleted about 270MB of data in the 
first 19 deletes.  Each batch delete covers about 144 rows, with a row size of 
about 100KB.  
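
For reference, the deletes are issued roughly like the sketch below, assuming 
the stock 0.94 Java client API; the table name and the key-listing helper are 
placeholders, not my actual code.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchDeleteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table"); // hypothetical table name
    try {
      // Roughly 144 rows per batch delete, each row about 100KB.
      List<Delete> batch = new ArrayList<Delete>();
      for (String rowKey : keysForOneBatch()) {
        batch.add(new Delete(Bytes.toBytes(rowKey)));
      }
      // The client groups the batch into multi-calls per region server;
      // each call is subject to hbase.rpc.timeout.
      table.delete(batch);
    } finally {
      table.close();
    }
  }

  // Placeholder for however the client derives the ~144 row keys in a batch.
  private static List<String> keysForOneBatch() {
    return new ArrayList<String>();
  }
}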

Here is the first of the 10 ERRORs logged in the client: 
http://pastebin.com/QMJsbgkZ.  The client errors come one per minute between 
00:22:48 and 00:32:58, with the final error being: 
http://pastebin.com/ajaVxYUZ

Ultimately, the RS became responsive again. Looking at monitoring, I see a 
spike in CPU utilization on the unresponsive node; it goes from 2% utilization 
to 20% and stays there for a few minutes.  None of the other nodes in the 
cluster appear busy at this time. 

Logs from the unresponsive RS are here: http://pastebin.com/z9qxGuJS.  There 
are no ERRORs in the log around the time of the unresponsiveness.

It appears from the server log that the "responseTooSlow" operation completed 
about 7min after the client gave up.  

So, any ideas what was making the RS unresponsive? Did it really take 17min to 
delete 280MB of data?  

I can easily change the client RPC timeouts and the number of retries, but I 
feel there is something I am missing.  Any suggestions?
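
If it helps, the client-side knobs I'd be adjusting look roughly like this, 
assuming the standard 0.94 property names; the values shown are just examples, 
not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientTimeoutSketch {
  public static Configuration tunedClientConf() {
    Configuration conf = HBaseConfiguration.create();
    // How long a single RPC may run before the client times it out (default 60s).
    conf.setInt("hbase.rpc.timeout", 120000);
    // How many times the client retries a failed operation before giving up (default 10).
    conf.setInt("hbase.client.retries.number", 20);
    // Base pause between retries, scaled by the client's backoff table.
    conf.setInt("hbase.client.pause", 1000);
    return conf;
  }
}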

Thanks,
Ted


