Hi all, I have a repeatable and troublesome HBase interaction that I would like some advice on.
I am running a 5-node cluster on v0.94 on cdh3u3 and accessing it through the Java client API. Each RS has 32G of RAM and runs with a 16G heap, 4G of which is block cache. Used heap on each RS is well below the 16G available.

My client code has a set of deletes to carry out. After successfully issuing 19 such deletes, the client begins logging HBase errors while trying to complete the remaining deletes. It logs an ERROR every 60s, 10 times, and then gives up. I estimate the client successfully deleted about 270MB of data in those first 19 deletes; each batch delete covered about 144 rows, with a row size of about 100KB.

Here is the first of the 10 ERRORs logged by the client: http://pastebin.com/QMJsbgkZ. The client errors arrive one per minute between 00:22:48 and 00:32:58, with the final error being: http://pastebin.com/ajaVxYUZ. Ultimately, the RS became responsive again.

Looking at monitoring, I see a spike in CPU utilization on the unresponsive node: it goes from 2% utilization to 20% and sticks there for a few minutes. None of the other nodes in the cluster appear busy at this time. Logs from the unresponsive RS are here: http://pastebin.com/z9qxGuJS. There are no ERRORs in that log around the time of the unresponsiveness. It appears from the server log that the "responseTooSlow" operation completed about 7 minutes after the client gave up.

So, any ideas what was making the RS unresponsive? Did it really take 17 minutes to delete ~280MB of data? I can easily change the client RPC timeout and number of retries, but I feel there is something I am missing. Any suggestions?

Thanks,
Ted
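P.S. For reference, these are the client-side knobs I am referring to. A minimal sketch of what I would override in the client's hbase-site.xml, assuming the 0.94 property names hbase.rpc.timeout and hbase.client.retries.number (the values below are illustrative, not a recommendation):

```xml
<!-- Client-side hbase-site.xml overrides (illustrative values only) -->
<property>
  <name>hbase.rpc.timeout</name>
  <!-- Default is 60000 ms, which would match the one-error-per-minute pattern above -->
  <value>300000</value>
</property>
<property>
  <name>hbase.client.retries.number</name>
  <!-- Default is 10, which would match the client giving up after 10 ERRORs -->
  <value>20</value>
</property>
```

If I am reading the defaults right, the 60s spacing and the 10 retries I observed are just these two settings at their default values, which is why bumping them feels like treating the symptom rather than the cause.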