This sounds similar to HBASE-13262, but on versions that expressly have that fix in place.
Mind putting up a jira with the problem reproduction?

On Thu, Jul 30, 2015 at 1:13 PM, James Estes <[email protected]> wrote:

> All,
>
> If a full GC happens on the client when a scan is in progress, the scan can
> be missing rows. I have a test that repros this almost every time.
>
> The test runs against a local standalone server with 10g heap, using
> jdk1.7.0_45.
>
> The Test:
> - run with -Xmx1900m to restrict client heap
> - run with -verbose:gc to see the GCs
> - connect and create a new table with one CF
> - add 99 cells, 9mb each to that CF to the same row (individual PUTs in a
>   loop).
> - full-scan the table, only setting the maxResultSize to 2mb (no batch
>   size)
> - if no data, sleep 5s and try to scan again.
>
> Running this test, it fails the first scan. There is no exception, just no
> results returned (results.hasNext is false). The test then sleeps 5s and
> tries the scan again, and it usually succeeds on the 2nd or 3rd attempt.
> Looking at the logs, we see several full GCs during the scan (but no OOME
> stacks before the first failure). Then a curious message:
>
> 2015-07-30 10:42:10,815 [main] DEBUG
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
> - Removed 192.168.1.131:53244 as a location of
> big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1.
> for tableName=big_row_1438274455440 from cache
>
> As if the client has somehow decided the region location is bad/gone? After
> that, the scan completes with no results.
> After a sleep, it tries again,
> and it usually passes, but oddly there are also actual OOMEs in the client
> log just before the scan finishes successfully:
>
> 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to
> /192.168.1.131:53244 from james] WARN org.apache.hadoop.ipc.RpcClient -
> IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> unexpected exception receiving call responses
> java.lang.OutOfMemoryError: Java heap space
> 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to
> /192.168.1.131:53244 from james] DEBUG org.apache.hadoop.ipc.RpcClient -
> IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> closing ipc connection to /192.168.1.131:53244: Unexpected exception
> receiving call responses
> java.io.IOException: Unexpected exception receiving call responses
>     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>
> It seems like the rpc winds up retrying after catching Throwable.
>
> This test is single threaded, and the single row is large, causing several
> full GCs while receiving data. I suspect the same thing may happen if there
> are multiple threads scanning, causing mem pressure elsewhere, leading to a
> GC and may cause partial results (but I've not proven that). I can make the
> tests pass by setting batch size to 10, reducing the mem pressure from this
> one row, but again I'm not sure if a full GC were to happen for other
> activity in the JVM, the scan wouldn't wind up behaving the same and
> missing data.
>
> I tested the following combinations of client/server versions:
>
> Repro'ed in:
> - 0.98.12 client/server
> - 0.98.13 client 0.98.12 server
> - 0.98.13 client/server
> - 1.1.0 client 0.98.13 server
> - 0.98.13 client and 1.1.0 server
> - 0.98.12 client and 1.1.0 server
>
> NOT repro'ed in
> - 1.1.0 client/server
>
> I'm not sure why 1.1.0 client would fail the same way against a 0.98.13
> server, but not a 1.1.0 server. But, more reason for my team to get up to
> 1.1 fully :)
>
> I have not yet run the test against a full cluster. I can provide the test
> and logs from my testing if requested.
>
> Thanks,
> James
>

--
Sean
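
For a sense of scale, the memory pressure in the quoted test can be estimated with some quick arithmetic. This is only a sketch under stated assumptions: "mb" is taken to mean MiB, and it is assumed that with no batch size set the whole row comes back as a single Result (that is, maxResultSize is applied between rows rather than inside one). The class and method names are hypothetical and are not part of the test or of HBase.

```java
public class BigRowEstimate {
    static final long MIB = 1024L * 1024L;

    // Raw payload, in bytes, of one Result holding `cellsPerResult` cells
    // of `cellMiB` MiB each. Key and RPC framing overhead is ignored.
    static long resultBytes(int cellsPerResult, int cellMiB) {
        return cellsPerResult * cellMiB * MIB;
    }

    public static void main(String[] args) {
        long heap = 1900 * MIB;              // client heap from -Xmx1900m
        long wholeRow = resultBytes(99, 9);  // assumption: no batch, row returned whole
        long batched = resultBytes(10, 9);   // batch size 10, as in the workaround

        System.out.println("heap MiB:      " + heap / MIB);
        System.out.println("whole row MiB: " + wholeRow / MIB);
        System.out.println("batched MiB:   " + batched / MIB);
    }
}
```

Under those assumptions the unbatched row is 99 x 9 = 891 MiB of raw cell data against a 1900 MiB heap, while a batch of 10 is 90 MiB per Result. The real client footprint may be higher, since the received response and the decoded cells can be live at the same time, which would be consistent with the full GCs reported during the scan.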
