Slow Scan can loop forever

James Estes Tue, 14 Oct 2014 12:59:17 -0700

I'm having an issue where a Scan gets in a loop, and never completes and
never times out. I've seen it run for hours, and is repeatable on my
system. After looking through hbase code and logs, I think I understand
what is going on.


I'm using hbase 0.96.0-hadoop2, running on Hadoop 2.2.0. I'm using the same
version of the HBase client.

We do an hourly incremental backup using a simple Java class that sets up a
Scan with a time range. We're only adding about 250,000 cells ~100MB per
hour across 46 regions on 8 region servers, so there isn't much, but it'll
be skipping a lot of data (after a compaction). This Scan will regularly
get 'stuck' where we'll be alerted that it has been running for hours.

Turns out that one of the scan rpc calls is taking a long time (>
rpcTimeout which we have set to 10s), and the ClientScanner.next call loops
forever retrying because for some reason, when the RpcRetryingCaller
retries after the timeout, it gets an OutOfOrderScannerNextException which
the ClientScanner treats as needing to reset the scan. So without another
break condition, the ClientScanner.next keeps resetting the scanner, only
to get the exact same result.

Here is a rough sequence:
1 - the scan gets near the end of a region, and we process those results
2 - then it goes back for more via ClientScanner.next(), and there isn't
any new data at the end of that region, so the rpc call comes back to the
client (but no new keys are in the result)
3 - then it moves the scan forward to the next region, and goes to fetch
more data. This time though, this next region has recently been compacted
and it cannot prune store files by the timestamp range, and has a large
amount of data to scan through (again filtering by timestamp).
4 - We have our rpcTimeout set to 10s. This scan rpc call takes longer than
that b/c of all the data it is churning through.
5 - The RetryingRpcHandler gets a timeout exception, and will retry
6 - The server almost immediately notices this (usually within 10-20ms). I
think this indicates that the scan is actively progressing pretty quickly
through the data because it is able to check and notice the client has gone
away so quickly...there is just a lot of data to get through.
7 - The client retries after 100ms pause time (from the
RetryingRpcHandler). But the server immediately rejects the scan returning
an OutOfOrderScannerNextException. The client thinks the call
next_call_seq=0, server says next_call_seq=1. This maybe shouldn't matter
much, it just means the client would reset the scanner and we've only had
one extra rpc call here since the ClientScanner treats
OutOfOrderScannerNextException as a signal to reset the scanner. But there
is no other condition to stop the loop via a timeout or retry count.
8 - So the client side resets the scanner back to the last key from the
previous results, and goes to fetch data. And we are back to #1. Right
here, there seems to be no check to decide if the ClientScanner.next has
taken too long overall.

Here are some annotated logs:
https://gist.github.com/housejester/a0a530279eaa868db877

Increasing our rpcTimeout fixes the issue, since we give the Scan enough
time to complete (it didn't take much, 12s was fine, but we've bumped it up
to 60s). However, it seems like the ClientScanner.next should either have a
timeout or max reset count? Maybe rpcTimeout * retries? I think if the
OutOfOrderScannerNextException didn't occur, then the normal timeout path
would have worked.

James

Slow Scan can loop forever

Reply via email to