[ https://issues.apache.org/jira/browse/PHOENIX-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042350#comment-15042350 ]
Samarth Jain edited comment on PHOENIX-2408 at 12/4/15 10:34 PM:
-----------------------------------------------------------------

Spent the last couple of days trying to figure out what is going on here. On my laptop (1 region server), I loaded a table with 400 million rows distributed over 8 regions. I added logging in a few places to see what is going on. I see errors like these in my logs on the server side:

Exception caught in post scanner open for scan: 4. Exception: org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting on region TESTXYZ,\x04\x00\x00\x00\x00\x00\x00\x00\x00,1449215361195.5fa492cebc9f25b9602ecaf1d4601daf., call org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl@3efcf4dd after 121324 ms, since caller disconnected
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:4144)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4061)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4048)
	at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:288)
	at org.apache.phoenix.coprocessor.BaseScannerRegionObserver.postScannerOpen(BaseScannerRegionObserver.java:191)
	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$52.call(RegionCoprocessorHost.java:1305)
	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1619)
	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1694)
	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperationWithResult(RegionCoprocessorHost.java:1658)
	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1300)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3214)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30946)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2093)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)

It looks like org.apache.hadoop.hbase.ipc.CallerDisconnectedException is a regular IOException and not a DoNotRetryIOException. As a result, BaseScannerRegionObserver#doPostScannerOpen() re-throws a regular IOException back to the client, which causes the client to retry. These retries, however, are never successful, and we end up retrying the default number of times (31).

One thought I had was that I may be maxing out the I/O on my laptop SSD. But reducing the number of region server handler threads from the default to 2 (to limit the I/O) didn't help either. Will keep digging.

> Update statistics fails to complete
> -----------------------------------
>
>                 Key: PHOENIX-2408
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2408
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Samarth Jain
>             Fix For: 4.7.0
>
>
> On a production cluster, when UPDATE STATISTICS is run, it fails to complete.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
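
The retry behavior described in the comment can be illustrated with a small, self-contained sketch. It is not the HBase client's actual retry code. The classes DoNotRetryException and CallerDisconnected are hypothetical stand-ins that only mirror the inheritance relationship mentioned in the comment (both are IOExceptions, but only one is marked as non-retriable), and attemptsUsed is an assumed, simplified retry loop. The limit of 31 is the figure quoted in the comment.

```java
import java.io.IOException;

class RetrySketch {
    // Hypothetical stand-in for an exception type the client must not retry.
    static class DoNotRetryException extends IOException {
        DoNotRetryException(String msg) { super(msg); }
    }

    // Hypothetical stand-in for CallerDisconnectedException: a plain IOException.
    static class CallerDisconnected extends IOException {
        CallerDisconnected(String msg) { super(msg); }
    }

    // A unit of work that may fail with an IOException.
    interface ScanCall {
        void run() throws IOException;
    }

    // Simplified retry loop: returns how many attempts were made before the
    // call succeeded, failed with a non-retriable error, or ran out of attempts.
    static int attemptsUsed(ScanCall call, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                call.run();
                return attempts;              // success, stop here
            } catch (DoNotRetryException e) {
                return attempts;              // non-retriable, fail fast
            } catch (IOException e) {
                // Any other IOException is assumed transient, so loop again.
            }
        }
        return attempts;                      // retries exhausted
    }

    public static void main(String[] args) {
        int maxAttempts = 31; // retry count quoted in the comment

        int plain = attemptsUsed(
            () -> { throw new CallerDisconnected("caller disconnected"); },
            maxAttempts);
        int failFast = attemptsUsed(
            () -> { throw new DoNotRetryException("do not retry"); },
            maxAttempts);

        System.out.println("plain IOException attempts: " + plain);
        System.out.println("non-retriable attempts: " + failFast);
    }
}
```

In this model, a call that always fails with the plain IOException stand-in uses up every attempt, while the non-retriable stand-in stops after the first one. That matches the pattern reported above, where the failure keeps being retried until the limit is reached.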