Hi all,

I am running HBase 0.94.6 on a 20-node cluster and am taking daily snapshots 
of our single table (keeping only the snapshots from the last 3 days). 
Yesterday, I started seeing the following messages on one of the region 
servers that had to be restarted:


2014-10-22 08:29:19,982 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server 
handler 1 on 60020: starting
2014-10-22 08:29:19,982 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server 
handler 2 on 60020: starting
2014-10-22 08:29:19,986 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as 
njhdslave40,60020,1413980958234, RPC listening on 
njhdslave40/172.30.120.180:60020, sessionid=0x2482ca09984b22a
2014-10-22 08:29:19,986 INFO 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker 
njhdslave40,60020,1413980958234 starting
2014-10-22 08:29:19,988 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Registered RegionServer 
MXBean
2014-10-22 08:29:20,024 INFO org.apache.hadoop.hbase.procedure.ProcedureMember: 
Received abort on procedure with no local subprocedure upd-2014_10_19, ignoring 
it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
timer-java.util.Timer@2cef133c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
 org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! 
Source:Timeout caused Foreign Exception Start:1413728408736, End:1413728468758, 
diff:60022, max:60000 ms
    at 
org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
    at 
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
    at 
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143)
    at 
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.start(ZKProcedureMemberRpcs.java:340)
    at 
org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager.start(RegionServerSnapshotManager.java:141)
    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:734)
    at java.lang.Thread.run(Thread.java:662)
Caused by: 
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! 
Source:Timeout caused Foreign Exception Start:1413728408736, End:1413728468758, 
diff:60022, max:60000 ms
    at 
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2014-10-22 08:29:29,526 WARN org.apache.hadoop.conf.Configuration: 
hadoop.native.lib is deprecated. Instead, use io.native.lib.available




After looking at the code, I see that the RegionServerSnapshotManager watches 
for aborted procedure nodes when it starts, and reports the exception above. 
A few days ago, we had some issues with one of the servers, and I guess the 
creation of the daily snapshot was aborted. Indeed, looking in ZooKeeper, we 
see a record of an aborted snapshot from 19 October.
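
As a sanity check on the date, the Start and End values in the exception look 
like epoch milliseconds. Here is a minimal sketch that decodes them, assuming 
they are milliseconds since the Unix epoch and reading them as UTC (the 
variable and function names are mine, not from HBase):

```python
from datetime import datetime, timezone

# Values copied from the TimeoutException in the log above.
# Assumption: they are milliseconds since the Unix epoch.
START_MS = 1413728408736
END_MS = 1413728468758
MAX_MS = 60000


def to_utc(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


start = to_utc(START_MS)
end = to_utc(END_MS)
elapsed_ms = END_MS - START_MS

print("start:     ", start.isoformat())
print("end:       ", end.isoformat())
print("elapsed ms:", elapsed_ms, "limit ms:", MAX_MS)
```

Both timestamps fall on 2014-10-19 (UTC), which matches the upd-2014_10_19 
procedure name in the log, and the elapsed time is 60022 ms against the 
60000 ms limit. So the exception seems to be a replay of the original timeout 
from that day rather than a new failure.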



Here is a dump from a ZooKeeper node:

[zk: slave:2181 (CONNECTED) 2] ls /hbase/online-snapshot/abort
[upd-2014_10_19]

Just to confirm, I restarted another region server, and saw the same error. It 
seems that the cluster is working correctly, and new snapshots are being 
created. 


My questions are: are these error messages expected, what process is 
responsible for automatically cleaning up the 'abort' node, and are there any 
orphaned HLogs from the aborted snapshot that need to be cleaned up manually?

Thanks,
Hayden


