Thanks Sean. Filed: https://issues.apache.org/jira/browse/HBASE-14177
It does sound similar. The difference here is that my test is a single,
wide row, and attempts to run the same scan over the same data eventually
succeed. If I understand correctly, HBASE-13262 would be missing data more
or less consistently if no data is being added and no splits are occurring.

Blaming GC sounds crazy, I know. But if I run my test with -Xms4g -Xmx4g,
the test has always passed on the first scan attempt. So my concern is that
any full GC could cause a scan to be missing data. Maybe there are weak
references in play, or some pause timeout silently failing the scan?

James

On Thu, Jul 30, 2015 at 5:13 PM, Sean Busbey <[email protected]> wrote:

> This sounds similar to HBASE-13262, but on versions that expressly have
> that fix in place.
>
> Mind putting up a jira with the problem reproduction?
>
> On Thu, Jul 30, 2015 at 1:13 PM, James Estes <[email protected]>
> wrote:
>
> > All,
> >
> > If a full GC happens on the client while a scan is in progress, the
> > scan can be missing rows. I have a test that reproduces this almost
> > every time.
> >
> > The test runs against a local standalone server with a 10g heap, using
> > jdk1.7.0_45.
> >
> > The test:
> > - run with -Xmx1900m to restrict the client heap
> > - run with -verbose:gc to see the GCs
> > - connect and create a new table with one CF
> > - add 99 cells, 9MB each, to that CF in the same row (individual Puts
> >   in a loop)
> > - full-scan the table, setting only maxResultSize to 2MB (no batch
> >   size)
> > - if no data comes back, sleep 5s and scan again
> >
> > Running this test, the first scan fails. There is no exception, just
> > no results returned (results.hasNext is false). The test then sleeps
> > 5s and tries the scan again, and it usually succeeds on the 2nd or 3rd
> > attempt. Looking at the logs, we see several full GCs during the scan
> > (but no OOME stacks before the first failure).
> > Then a curious message:
> >
> > 2015-07-30 10:42:10,815 [main] DEBUG
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
> > - Removed 192.168.1.131:53244 as a location of
> > big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1.
> > for tableName=big_row_1438274455440 from cache
> >
> > As if the client has somehow decided the region location is bad/gone?
> > After that, the scan completes with no results. After a sleep, it
> > tries again, and it usually passes, but oddly there are also actual
> > OOMEs in the client log just before the scan finishes successfully:
> >
> > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to
> > /192.168.1.131:53244 from james] WARN org.apache.hadoop.ipc.RpcClient -
> > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > unexpected exception receiving call responses
> > java.lang.OutOfMemoryError: Java heap space
> > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to
> > /192.168.1.131:53244 from james] DEBUG org.apache.hadoop.ipc.RpcClient -
> > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > closing ipc connection to /192.168.1.131:53244: Unexpected exception
> > receiving call responses
> > java.io.IOException: Unexpected exception receiving call responses
> >   at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
> > Caused by: java.lang.OutOfMemoryError: Java heap space
> >
> > It seems like the rpc winds up retrying after catching Throwable.
> >
> > This test is single threaded, and the single row is large, causing
> > several full GCs while receiving data. I suspect the same thing may
> > happen if there are multiple threads scanning, causing mem pressure
> > elsewhere, leading to a GC that may cause partial results (but I've
> > not proven that).
> > I can make the tests pass by setting the batch size to 10, reducing
> > the mem pressure from this one row. But again, I'm not sure that, if a
> > full GC were to happen due to other activity in the JVM, the scan
> > wouldn't wind up behaving the same way and missing data.
> >
> > I tested the following combinations of client/server versions:
> >
> > Repro'ed in:
> > - 0.98.12 client/server
> > - 0.98.13 client, 0.98.12 server
> > - 0.98.13 client/server
> > - 1.1.0 client, 0.98.13 server
> > - 0.98.13 client, 1.1.0 server
> > - 0.98.12 client, 1.1.0 server
> >
> > NOT repro'ed in:
> > - 1.1.0 client/server
> >
> > I'm not sure why a 1.1.0 client would fail the same way against a
> > 0.98.13 server but not against a 1.1.0 server. But, more reason for my
> > team to get fully up to 1.1 :)
> >
> > I have not yet run the test against a full cluster. I can provide the
> > test and logs from my testing if requested.
> >
> > Thanks,
> > James
> >
>
> --
> Sean
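For anyone who wants to try this locally before the actual test is attached
to HBASE-14177, here is a minimal sketch of the repro steps described in the
quoted message. It uses the 1.x client API (ConnectionFactory/Table); on
0.98 the equivalents are HConnectionManager/HTableInterface and Put.add
instead of Put.addColumn. Table, row, CF, and qualifier names are
illustrative, not taken from the original test. It needs a running
standalone HBase and the client JVM started with -Xmx1900m -verbose:gc to
trigger the failure.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigRowScanRepro {
  static final byte[] CF = Bytes.toBytes("f");
  static final int NUM_CELLS = 99;
  static final int CELL_SIZE = 9 * 1024 * 1024;          // 9 MB per cell
  static final long MAX_RESULT_SIZE = 2L * 1024 * 1024;  // 2 MB per RPC

  public static void main(String[] args) throws IOException, InterruptedException {
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("big_row_" + System.currentTimeMillis());
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      try (Admin admin = conn.getAdmin()) {
        HTableDescriptor desc = new HTableDescriptor(name);
        desc.addFamily(new HColumnDescriptor(CF));
        admin.createTable(desc);
      }
      try (Table table = conn.getTable(name)) {
        // One wide row: 99 cells of 9 MB each, written one Put at a time.
        byte[] row = Bytes.toBytes("the-row");
        byte[] value = new byte[CELL_SIZE];
        Arrays.fill(value, (byte) 'x');
        for (int i = 0; i < NUM_CELLS; i++) {
          Put put = new Put(row);
          put.addColumn(CF, Bytes.toBytes("q" + i), value);
          table.put(put);
        }
        // Full scan, capping only the per-RPC result size (no batch).
        // Per the thread, adding scan.setBatch(10) here makes the test pass.
        for (int attempt = 1; ; attempt++) {
          Scan scan = new Scan();
          scan.setMaxResultSize(MAX_RESULT_SIZE);
          try (ResultScanner scanner = table.getScanner(scan)) {
            if (scanner.iterator().hasNext()) {
              System.out.println("Got data on attempt " + attempt);
              break;
            }
          }
          // Failure mode from the thread: no exception, just hasNext == false.
          System.out.println("No results on attempt " + attempt + "; sleeping 5s");
          Thread.sleep(5000);
        }
      }
    }
  }
}
```

On an affected version, the failure shows up as one or more "No results"
lines (with full GCs visible in the -verbose:gc output) before a later
attempt succeeds.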
