This sounds similar to HBASE-13262, but on versions that expressly have
that fix in place.

Mind filing a JIRA with the problem reproduction?

On Thu, Jul 30, 2015 at 1:13 PM, James Estes <[email protected]> wrote:

> All,
>
> If a full GC happens on the client when a scan is in progress, the scan can
> be missing rows. I have a test that repros this almost every time.
>
> The test runs against a local standalone server with 10g heap, using
> jdk1.7.0_45.
>
> The Test:
> - run with -Xmx1900m to restrict client heap
> - run with -verbose:gc to see the GCs
> - connect and create a new table with one CF
> - add 99 cells, 9MB each, to that CF in the same row (individual Puts in a
> loop).
> - full-scan the table, setting only maxResultSize, to 2MB (no batch
> size)
> - if no data, sleep 5s and try to scan again.
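>
> The steps above can be sketched roughly as follows (this is my reconstruction
> against the 0.98-era client API; the table name, CF name, and qualifier
> naming are assumptions, not the actual test code):

```java
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BigRowScanRepro {
  public static void main(String[] args) throws Exception {
    // Run the client JVM with -Xmx1900m -verbose:gc to constrain the heap
    // and observe the full GCs during the scan.
    Configuration conf = HBaseConfiguration.create();
    byte[] cf = Bytes.toBytes("f");
    byte[] row = Bytes.toBytes("big_row");

    // Create a new table with one CF.
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("big_row_test"));
    desc.addFamily(new HColumnDescriptor(cf));
    admin.createTable(desc);
    admin.close();

    HConnection conn = HConnectionManager.createConnection(conf);
    HTableInterface table = conn.getTable("big_row_test");

    // 99 cells of 9MB each, all in the same row, written one Put at a time.
    byte[] value = new byte[9 * 1024 * 1024];
    for (int i = 0; i < 99; i++) {
      Put put = new Put(row);
      put.add(cf, Bytes.toBytes("q" + i), value);
      table.put(put);
    }

    // Full scan with only maxResultSize set (no batch size); if no results
    // come back, sleep 5s and scan again.
    while (true) {
      Scan scan = new Scan();
      scan.setMaxResultSize(2L * 1024 * 1024);
      ResultScanner scanner = table.getScanner(scan);
      Iterator<Result> results = scanner.iterator();
      boolean gotData = results.hasNext(); // false on the failing first attempt
      scanner.close();
      if (gotData) {
        break;
      }
      Thread.sleep(5000);
    }
    table.close();
    conn.close();
  }
}
```

This cannot run without a live HBase deployment, so it is illustrative only.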
>
> Running this test, the first scan fails. There is no exception, just no
> results returned (results.hasNext is false). The test then sleeps 5s and
> tries the scan again, and it usually succeeds on the 2nd or 3rd attempt.
> Looking at the logs, we see several full GCs during the scan (but no OOME
> stacks before the first failure). Then a curious message:
> 2015-07-30 10:42:10,815 [main] DEBUG
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
>  - Removed 192.168.1.131:53244 as a location of
>
> big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1.
> for tableName=big_row_1438274455440 from cache
>
> As if the client has somehow decided the region location is bad/gone? After
> that, the scan completes with no results. After a sleep, it tries again,
> and it usually passes, but oddly there are also actual OOMEs in the client
> log just before the scan finishes successfully:
>
> 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
> 192.168.1.131:53244 from james] WARN  org.apache.hadoop.ipc.RpcClient  -
> IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> unexpected exception receiving call responses
> java.lang.OutOfMemoryError: Java heap space
> 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
> 192.168.1.131:53244 from james] DEBUG org.apache.hadoop.ipc.RpcClient  -
> IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> closing ipc connection to /192.168.1.131:53244: Unexpected exception
> receiving call responses
> java.io.IOException: Unexpected exception receiving call responses
> at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>
> It seems like the RPC layer winds up retrying after catching Throwable.
>
> This test is single threaded, and the single row is large, causing several
> full GCs while receiving data. I suspect the same thing may happen if
> multiple threads are scanning, causing memory pressure elsewhere that
> triggers a GC and leads to partial results (but I've not proven that). I can
> make the test pass by setting the batch size to 10, reducing the memory
> pressure from this one row, but even then I'm not sure that a full GC
> triggered by other activity in the JVM wouldn't make the scan behave the
> same way and miss data.
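>
> The batch-size workaround mentioned above looks roughly like this (a sketch,
> not the actual test code; the 2MB value is the one used in the failing case):

```java
import org.apache.hadoop.hbase.client.Scan;

public class BatchedScanWorkaround {
  public static Scan buildScan() {
    Scan scan = new Scan();
    scan.setMaxResultSize(2L * 1024 * 1024); // same 2MB cap as the failing scan
    scan.setBatch(10); // cap cells per Result so one 9MB-cell row
                       // doesn't force huge client-side allocations
    return scan;
  }
}
```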
>
> I tested the following combinations of client/server versions:
>
> Repro'ed in:
>  - 0.98.12 client/server
>  - 0.98.13 client 0.98.12 server
>  - 0.98.13 client/server
>  - 1.1.0 client 0.98.13 server
>  - 0.98.13 client and 1.1.0 server
>  - 0.98.12 client and 1.1.0 server
>
> NOT repro'ed in:
>  - 1.1.0 client/server
>
> I'm not sure why 1.1.0 client would fail the same way against a 0.98.13
> server, but not a 1.1.0 server. But, more reason for my team to get up to
> 1.1 fully :)
>
> I have not yet run the test against a full cluster. I can provide the test
> and logs from my testing if requested.
>
> Thanks,
> James
>



-- 
Sean