Hi, I'm running snapshots on 7 tables every hour, the last run managed 15 rounds of snapshots before starting to fail.
I did notice the following while getting the success count 2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1] snapshot.FlushSnapshotSubprocedure: Closing region operation on rawScans,250000000000,1408114528441.ddb14a3d8f90cc18b6dfd4f28df5c534. 2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1] snapshot.FlushSnapshotSubprocedure: Starting region operation on rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9. 2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1] snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9. started... 2014-09-10 11:01:55,454 WARN [member: '##############,16020,1410126483043' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Rethrowing ForeignException from SnapshotSubprocedurePool java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:188) at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:295) at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118) at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83) at org.apache.hadoop.hbase.regionserver.HRegion.addRegionToSnapshot(HRegion.java:2816) at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:83) at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ... 4 more Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) Is there a configuration option that I need to increase to allow the snapshots to run for more than 60 seconds on a given region? That said, I'm guessing the above shouldn't cause the SubProcedure pool to fill up completely. -Ian On Tuesday 23 September 2014 04:33:39 Ted Yu wrote: > Subprocedure pool is full. > > How many snapshot requests did you submit before seeing the following ? > > Cheers > > On Sep 23, 2014, at 2:28 AM, Ian Brooks <i.bro...@sensewhere.com> wrote: > > > Hi, > > > > I'm seeing an issue on our hbase cluster which is preventing snapshots from > > working. So far the only way i can get it working again is to restart all > > the regionservers which is not ideal. > > > > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are > > > > From host given the snapshot command via hbase shell > > > > 2014-09-23 10:00:16,116 INFO [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > > event: /hbase/online-snapshot/acquired > > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Looking for new procedures under > > znode:'/hbase/online-snapshot/acquired' > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Found procedure znode: > > /hbase/online-snapshot/acquired/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: start proc data length is 59 > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Found data for > > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] > > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot > > offlineScans-20140923-100009 from table offlineScans > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] > > procedure.ProcedureMember: Submitting new > > Subprocedure:offlineScans-20140923-100009 > > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' > > subprocedure-pool1-thread-314] procedure.Subprocedure: Starting > > subprocedure 'offlineScans-20140923-100009' with timeout 60000ms > > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' > > subprocedure-pool1-thread-314] errorhandling.TimeoutExceptionInjector: > > Scheduling process timer to run in: 60000 ms > > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' > > subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure > > 'offlineScans-20140923-100009' starting 'acquire' stage > > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' > > subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure > > 'offlineScans-20140923-100009' locally acquired > > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' > > subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Member: > > '##########,16020,1410126483043' joining acquired barrier for procedure > > (offlineScans-20140923-100009) in zk > > 2014-09-23 10:00:16,123 DEBUG [member: '############,16020,1410126483043' > > subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Watch for > > global barrier > > reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,124 DEBUG [member: '############,16020,1410126483043' > > subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure > > 'offlineScans-20140923-100009' coordinator notified of 'acquire', waiting > > on 'reached' or 'abort' from coordinator > > 2014-09-23 10:00:16,133 INFO [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Received created > > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode > > /hbase/online-snapshot/abort/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > > procedure.ProcedureMember: Request received to abort procedure > > offlineScans-20140923-100009 > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > > ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > > > > at > > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94) > > at > > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327) > > at > > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) > > at > > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > > Caused by: > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > > at > > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270) > > at > > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107) > > at > > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348) > > ... 2 more > > > > > > > > From the regionserver it was trying to contact > > > > > > 2014-09-23 10:00:16,124 INFO [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > > event: /hbase/online-snapshot/acquired > > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Looking for new procedures under > > znode:'/hbase/online-snapshot/acquired' > > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Found procedure znode: > > /hbase/online-snapshot/acquired/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: start proc data length is 59 > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Found data for > > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot > > offlineScans-20140923-100009 from table offlineScans > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > > procedure.ProcedureMember: Submitting new > > Subprocedure:offlineScans-20140923-100009 > > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread] > > procedure.Subprocedure: Subprocedure pool is full! > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] > > errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009 > > accepting received exception > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Aborting procedure > > (offlineScans-20140923-100009) in zk > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Finished creating abort > > znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread] > > procedure.ProcedureMember: Failed to start subprocedure > > 'offlineScans-20140923-100009' > > 2014-09-23 10:00:16,134 INFO [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Received created > > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode > > /hbase/online-snapshot/abort/offlineScans-20140923-100009 > > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread] > > procedure.ProcedureMember: Request received to abort procedure > > offlineScans-20140923-100009 > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > > ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > > at > > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94) > > at > > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327) > > at > > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) > > at > > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > > Caused by: > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > > at > > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270) > > at > > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55) > > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107) > > at > > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348) > > ... 2 more > > > > Any ideas on how I can resolve this? > > > > -Ian > >