I'm having an issue where a Scan gets in a loop, and never completes and never times out. I've seen it run for hours, and is repeatable on my system. After looking through hbase code and logs, I think I understand what is going on.
I'm using hbase 0.96.0-hadoop2, running on Hadoop 2.2.0. I'm using the same version of the HBase client. We do an hourly incremental backup using a simple Java class that sets up a Scan with a time range. We're only adding about 250,000 cells ~100MB per hour across 46 regions on 8 region servers, so there isn't much, but it'll be skipping a lot of data (after a compaction). This Scan will regularly get 'stuck' where we'll be alerted that it has been running for hours. Turns out that one of the scan rpc calls is taking a long time (> rpcTimeout which we have set to 10s), and the ClientScanner.next call loops forever retrying because for some reason, when the RpcRetryingCaller retries after the timeout, it gets an OutOfOrderScannerNextException which the ClientScanner treats as needing to reset the scan. So without another break condition, the ClientScanner.next keeps resetting the scanner, only to get the exact same result. Here is a rough sequence: 1 - the scan gets near the end of a region, and we process those results 2 - then it goes back for more via ClientScanner.next(), and there isn't any new data at the end of that region, so the rpc call comes back to the client (but no new keys are in the result) 3 - then it moves the scan forward to the next region, and goes to fetch more data. This time though, this next region has recently been compacted and it cannot prune store files by the timestamp range, and has a large amount of data to scan through (again filtering by timestamp). 4 - We have our rpcTimeout set to 10s. This scan rpc call takes longer than that b/c of all the data it is churning through. 5 - The RetryingRpcHandler gets a timeout exception, and will retry 6 - The server almost immediately notices this (usually within 10-20ms). I think this indicates that the scan is actively progressing pretty quickly through the data because it is able to check and notice the client has gone away so quickly...there is just a lot of data to get through. 7 - The client retries after 100ms pause time (from the RetryingRpcHandler). But the server immediately rejects the scan returning an OutOfOrderScannerNextException. The client thinks the call next_call_seq=0, server says next_call_seq=1. This maybe shouldn't matter much, it just means the client would reset the scanner and we've only had one extra rpc call here since the ClientScanner treats OutOfOrderScannerNextException as a signal to reset the scanner. But there is no other condition to stop the loop via a timeout or retry count. 8 - So the client side resets the scanner back to the last key from the previous results, and goes to fetch data. And we are back to #1. Right here, there seems to be no check to decide if the ClientScanner.next has taken too long overall. Here are some annotated logs: https://gist.github.com/housejester/a0a530279eaa868db877 Increasing our rpcTimeout fixes the issue, since we give the Scan enough time to complete (it didn't take much, 12s was fine, but we've bumped it up to 60s). However, it seems like the ClientScanner.next should either have a timeout or max reset count? Maybe rpcTimeout * retries? I think if the OutOfOrderScannerNextException didn't occur, then the normal timeout path would have worked. James
