Could it be that there are times in the TaskManager where there are large pauses between one inputFormat.nextRecord() call and the next?
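One way to check would be to wrap the record iterator and log any gap between successive next() calls that exceeds a threshold. A rough sketch (GapLogger is a hypothetical helper, not part of Flink or HBase):

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper: wraps any record iterator and reports gaps between
// successive next() calls that exceed warnMillis, tracking the worst gap seen.
public class GapLogger<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final long warnMillis;
    private long lastCall = -1;
    private long worstGap = 0;

    public GapLogger(Iterator<T> delegate, long warnMillis) {
        this.delegate = delegate;
        this.warnMillis = warnMillis;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        long now = System.currentTimeMillis();
        if (lastCall >= 0) {
            long gap = now - lastCall;
            worstGap = Math.max(worstGap, gap);
            if (gap > warnMillis) {
                System.err.println("WARN: " + gap + "ms since previous record");
            }
        }
        lastCall = now;
        return delegate.next();
    }

    public long worstGap() {
        return worstGap;
    }

    public static void main(String[] args) {
        // Usage sketch: in the real job this would wrap the HBase record stream;
        // here a small list stands in for it.
        GapLogger<Integer> it = new GapLogger<>(Arrays.asList(1, 2, 3).iterator(), 900_000L);
        int sum = 0;
        while (it.hasNext()) {
            sum += it.next();
        }
        System.out.println("sum=" + sum + ", worst gap=" + it.worstGap() + "ms");
    }
}
```

If the worst observed gap ever approaches the 900,000 ms lease, that would point at the TaskManager (e.g. GC or backpressure) rather than at HBase itself.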
On Thu, Nov 27, 2014 at 3:44 PM, Stefano Bortoli <[email protected]> wrote:

> hi all,
>
> I am facing an odd issue while running a quite complex duplicate
> detection process.
>
> The code runs like a charm on a dataset of a million records with few
> duplicates (3 minutes), but hits the scanner timeout over a dataset of 9.2M.
>
> The problem happens randomly, and I don't think it is related to the
> business logic, or to the scan configuration for that matter.
>
> The caching block is set to 100, and the scan timeout is 900,000
> milliseconds (15 min). The job runs normally in around 0.5 seconds on
> 100 entries... therefore I must be hitting something deep, something
> related to how Hadoop and HBase work together.
>
> My problem is that it may fail or it may not. Yesterday I could complete
> the whole scan without problems, then the job failed on another error.
> Today, the same code failed after 3.5 h, a little before completion of the
> first phase.
>
> I think it may be something about GC.
>
> I log the execution time of every single map, and everything finishes
> within milliseconds. Even then the exception happens (I catch it,
> print it, and throw it again).
>
> Any idea of where the issue could be?
>
> Thanks a lot for the support. Stack trace appended.
> saluti,
> Stefano
>
> Error: org.apache.hadoop.hbase.client.ScannerTimeoutException: 2387347ms
> passed since the last invocation, timeout is currently set to 900000
>     at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:352)
>     at org.apache.flink.addons.hbase.TableInputFormat.nextRecord(TableInputFormat.java:106)
>     at org.apache.flink.addons.hbase.TableInputFormat.nextRecord(TableInputFormat.java:48)
>     at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:195)
>     at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:246)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.UnknownScannerException:
> org.apache.hadoop.hbase.UnknownScannerException: Name: 291, already closed?
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3043)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>     at java.lang.Thread.run(Thread.java:745)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:283)
>     at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:198)
>     at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:57)
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:90)
>     at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:336)
>     ... 5 more
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.UnknownScannerException):
> org.apache.hadoop.hbase.UnknownScannerException: Name: 291, already closed?
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3043)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>     at java.lang.Thread.run(Thread.java:745)
>
>     at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1458)
>     at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1662)
>     at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1720)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:29900)
>     at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:168)
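For what it's worth: if long pauses between next() calls are confirmed, a stopgap until the root cause is fixed could be to raise the scanner timeout on the client (and the matching lease on the region servers). The property name below assumes a recent HBase (0.96+); on older versions the server-side setting was hbase.regionserver.lease.period. Sketch for hbase-site.xml, value illustrative only:

```xml
<!-- Sketch: raise the scanner timeout beyond the longest expected pause
     between successive next() calls. 1800000 ms = 30 min, illustrative. -->
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>1800000</value>
</property>
```

Lowering the scanner caching below 100 could also help, since fewer rows per batch means the client goes back to the region server more often, renewing the lease more frequently.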
