Yeah, that's what it sounds like. Having a test should make it much easier to chase down; thanks for isolating things.
On Fri, Jul 31, 2015 at 2:14 PM, James Estes <[email protected]> wrote:

> Thanks Sean.
>
> Filed: https://issues.apache.org/jira/browse/HBASE-14177
>
> It does sound similar. The difference here is that my test is a single,
> wide row, and attempts to run the same scan over the same data eventually
> will succeed. If I understand correctly, HBASE-13262 sounds like it would
> be missing data more or less consistently if no data is added or splits
> are occurring.
>
> Blaming GC sounds crazy, I know. But if I run my test with -Xms4g -Xmx4g,
> the test has always passed on the first scan attempt. So my concern is
> that any full GC could cause a scan to be missing data. Maybe there are
> weak references in play, or some pause timeout silently failing the scan?
>
> James
>
> On Thu, Jul 30, 2015 at 5:13 PM, Sean Busbey <[email protected]> wrote:
>
> > This sounds similar to HBASE-13262, but on versions that expressly have
> > that fix in place.
> >
> > Mind putting up a jira with the problem reproduction?
> >
> > On Thu, Jul 30, 2015 at 1:13 PM, James Estes <[email protected]>
> > wrote:
> >
> > > All,
> > >
> > > If a full GC happens on the client while a scan is in progress, the
> > > scan can be missing rows. I have a test that repros this almost every
> > > time.
> > >
> > > The test runs against a local standalone server with a 10g heap, using
> > > jdk1.7.0_45.
> > >
> > > The test:
> > > - run with -Xmx1900m to restrict the client heap
> > > - run with -verbose:gc to see the GCs
> > > - connect and create a new table with one CF
> > > - add 99 cells, 9mb each, to that CF in the same row (individual Puts
> > >   in a loop)
> > > - full-scan the table, only setting maxResultSize to 2mb (no batch
> > >   size)
> > > - if no data comes back, sleep 5s and try the scan again
> > >
> > > Running this test, the first scan fails. There is no exception, just
> > > no results returned (results.hasNext is false). The test then sleeps
> > > 5s and tries the scan again, and it usually succeeds on the 2nd or 3rd
> > > attempt. Looking at the logs, we see several full GCs during the scan
> > > (but no OOME stacks before the first failure). Then a curious message:
> > >
> > > 2015-07-30 10:42:10,815 [main] DEBUG
> > > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
> > > - Removed 192.168.1.131:53244 as a location of
> > > big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1.
> > > for tableName=big_row_1438274455440 from cache
> > >
> > > As if the client has somehow decided the region location is bad/gone?
> > > After that, the scan completes with no results. After a sleep, it
> > > tries again, and it usually passes, but oddly there are also actual
> > > OOMEs in the client log just before the scan finishes successfully:
> > >
> > > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
> > > 192.168.1.131:53244 from james] WARN org.apache.hadoop.ipc.RpcClient -
> > > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > > unexpected exception receiving call responses
> > > java.lang.OutOfMemoryError: Java heap space
> > > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
> > > 192.168.1.131:53244 from james] DEBUG org.apache.hadoop.ipc.RpcClient -
> > > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > > closing ipc connection to /192.168.1.131:53244: Unexpected exception
> > > receiving call responses
> > > java.io.IOException: Unexpected exception receiving call responses
> > > at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > >
> > > It seems like the RPC winds up retrying after catching Throwable.
> > >
> > > This test is single-threaded, and the single row is large, causing
> > > several full GCs while receiving data. I suspect the same thing may
> > > happen with multiple threads scanning: memory pressure elsewhere could
> > > lead to a GC and cause partial results (but I've not proven that). I
> > > can make the test pass by setting the batch size to 10, reducing the
> > > memory pressure from this one row, but again I'm not sure that, if a
> > > full GC were triggered by other activity in the JVM, the scan wouldn't
> > > wind up behaving the same way and missing data.
> > >
> > > I tested the following combinations of client/server versions.
> > >
> > > Repro'ed in:
> > > - 0.98.12 client/server
> > > - 0.98.13 client, 0.98.12 server
> > > - 0.98.13 client/server
> > > - 1.1.0 client, 0.98.13 server
> > > - 0.98.13 client, 1.1.0 server
> > > - 0.98.12 client, 1.1.0 server
> > >
> > > NOT repro'ed in:
> > > - 1.1.0 client/server
> > >
> > > I'm not sure why the 1.1.0 client would fail the same way against a
> > > 0.98.13 server, but not a 1.1.0 server. But, more reason for my team
> > > to get fully up to 1.1 :)
> > >
> > > I have not yet run the test against a full cluster. I can provide the
> > > test and logs from my testing if requested.
> > >
> > > Thanks,
> > > James
> >
> > --
> > Sean

-- 
Sean
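[Editor's note: the repro steps described in the thread above could be sketched roughly as below against the HBase 1.x client API. This is an untested sketch, not the reporter's actual test: table, family, and row names are made up, and it assumes a running standalone HBase reachable via the default configuration.]

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigRowScanRepro {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Run the client JVM with -Xmx1900m -verbose:gc to restrict the heap
    // and observe full GCs, as described in the thread.
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("big_row_" + System.currentTimeMillis());
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      try (Admin admin = conn.getAdmin()) {
        HTableDescriptor desc = new HTableDescriptor(name);
        desc.addFamily(new HColumnDescriptor("f"));  // one column family
        admin.createTable(desc);
      }
      byte[] row = Bytes.toBytes("the-one-wide-row");
      byte[] fam = Bytes.toBytes("f");
      try (Table table = conn.getTable(name)) {
        // 99 cells of 9 MB each, all in the same row, one Put at a time.
        for (int i = 0; i < 99; i++) {
          byte[] value = new byte[9 * 1024 * 1024];
          table.put(new Put(row).addColumn(fam, Bytes.toBytes("q" + i), value));
        }
        // Full scan with only maxResultSize set (no batch size);
        // if the scan comes back empty, sleep 5s and retry.
        for (int attempt = 1; ; attempt++) {
          Scan scan = new Scan();
          scan.setMaxResultSize(2 * 1024 * 1024);
          try (ResultScanner scanner = table.getScanner(scan)) {
            if (scanner.iterator().hasNext()) {
              System.out.println("Got results on attempt " + attempt);
              break;
            }
          }
          System.out.println("Empty scan on attempt " + attempt + "; sleeping 5s");
          Thread.sleep(5000);
        }
      }
    }
  }
}
```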
