[
https://issues.apache.org/jira/browse/PHOENIX-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042350#comment-15042350
]
Samarth Jain edited comment on PHOENIX-2408 at 12/4/15 10:34 PM:
-----------------------------------------------------------------
Spent the last couple of days trying to figure out what is going on here. On my
laptop (1 region server), I loaded a table with 400 million rows distributed
over 8 regions. I added logging in a few places to trace the failure. I see
errors like these in my logs on the server side:
Exception caught in post scanner open for scan: 4. Exception:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting on region
TESTXYZ,\x04\x00\x00\x00\x00\x00\x00\x00\x00,1449215361195.5fa492cebc9f25b9602ecaf1d4601daf.,
call org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl@3efcf4dd
after 121324 ms, since caller disconnected
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:4144)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4061)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4048)
    at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:288)
    at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:191)
    at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$52.call(RegionCoprocessorHost.java:1305)
    at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1619)
    at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1694)
    at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperationWithResult(RegionCoprocessorHost.java:1658)
    at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1300)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3214)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30946)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2093)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
It looks like org.apache.hadoop.hbase.ipc.CallerDisconnectedException is a
regular IOException and not a DoNotRetryIOException. As a result, the
BaseScannerRegionObserver#doPostScannerOpen() re-throws a regular IOException
back to the client, which triggers retries. These retries, however, never
succeed, and we end up retrying the default number of times (31).
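To illustrate why the exception type matters here, below is a minimal, self-contained sketch. It does not use the real HBase or Phoenix classes: DoNotRetryIOException and CallerDisconnectedException are simplified stand-ins declared locally, and ScanCall, attemptsUsed and wrapDisconnects are hypothetical names invented for this example. It only shows the general pattern of a retry loop that keeps retrying a plain IOException but stops on a do-not-retry type; it is not the fix applied for this issue.

```java
import java.io.IOException;

class RetryClassificationSketch {
    // Simplified stand-ins for the HBase exception types named above.
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String msg, Throwable cause) { super(msg, cause); }
    }
    static class CallerDisconnectedException extends IOException {
        CallerDisconnectedException(String msg) { super(msg); }
    }

    // Hypothetical abstraction of one scan RPC.
    interface ScanCall { void run() throws IOException; }

    // Hypothetical client-side retry loop: a plain IOException is retried,
    // a DoNotRetryIOException ends the loop immediately.
    static int attemptsUsed(ScanCall call, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                call.run();
                return attempts;          // success
            } catch (DoNotRetryIOException e) {
                return attempts;          // give up without retrying
            } catch (IOException e) {
                // retriable: fall through to the next attempt
            }
        }
        return attempts;
    }

    // Hypothetical wrapper that reclassifies a disconnect as non-retriable.
    static ScanCall wrapDisconnects(ScanCall inner) {
        return () -> {
            try {
                inner.run();
            } catch (CallerDisconnectedException e) {
                throw new DoNotRetryIOException("caller disconnected", e);
            }
        };
    }

    public static void main(String[] args) {
        ScanCall failing = () -> {
            throw new CallerDisconnectedException("since caller disconnected");
        };
        // Plain IOException: every attempt is used up.
        System.out.println(attemptsUsed(failing, 31));                  // prints 31
        // Reclassified exception: the loop stops after the first attempt.
        System.out.println(attemptsUsed(wrapDisconnects(failing), 31)); // prints 1
    }
}
```

Whether reclassifying the exception is appropriate for UPDATE STATISTICS is a separate question; the sketch only shows why a call that always fails with a retriable exception consumes all 31 attempts.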
One thought I had was that I may be maxing out the I/O on my laptop SSD. But
reducing the number of region server handler threads from the default to 2
(to limit the I/O) didn't help either.
Will keep digging.
> Update statistics fails to complete
> -----------------------------------
>
> Key: PHOENIX-2408
> URL: https://issues.apache.org/jira/browse/PHOENIX-2408
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Assignee: Samarth Jain
> Fix For: 4.7.0
>
>
> On a production cluster, when UPDATE STATISTICS is run, it fails to complete.