Hi Ted,

We'll be upgrading to CDH5 in the coming months, but we're unfortunately stuck on 0.94.6 at the moment.
The RS logs were empty around the time of the failed snapshot restore operation, but the following errors were in the master log. The node 'RS-1' is the only node indicated in the logs. These errors occurred throughout the duration of the snapshot restore operation.

Sep 24, 9:51:41.655 PM INFO org.apache.hadoop.hdfs.DFSClient Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as [RS-1]:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1117)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
Sep 24, 9:51:41.664 PM INFO org.apache.hadoop.hdfs.DFSClient Abandoning BP-1769805853-namenode1-1354129919031:blk_8350736734896383334_327644360
Sep 24, 9:51:41.678 PM INFO org.apache.hadoop.hdfs.DFSClient Excluding datanode RS-1:50010
Sep 24, 9:52:58.954 PM INFO org.apache.hadoop.hdfs.DFSClient Exception in createBlockOutputStream
java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/hmaster1:59726 remote=/RS-1:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:117)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:169)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1106)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
Sep 24, 9:52:58.956 PM INFO org.apache.hadoop.hdfs.DFSClient Abandoning BP-1769805853-namenode1-1354129919031:blk_-6817802178798905477_327644519
Sep 24, 9:52:59.011 PM INFO org.apache.hadoop.hdfs.DFSClient Excluding datanode RS-1:50010
Sep 24, 9:54:22.963 PM WARN org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector Timer already marked completed, ignoring!
Sep 24, 9:54:22.964 PM ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler Failed taking snapshot { ss=** table=** type=DISABLED } due to exception:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1443145459635, End:1443146059635, diff:600000, max:600000 ms

Thanks!

On Thu, Sep 24, 2015 at 6:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> 0.94.6 is really old. There have been quite a few bug fixes / improvements
> to the snapshot feature since its release.
>
> The error happens when the SnapshotDescription corresponding to
> kiji.prod.table.site.DI...
> was not found by the ProcedureCoordinator.
>
> bq.
does the timeout necessarily mean that the restore failed or could it
> still be running asynchronously
>
> Can you check the region server logs around the time the TimeoutException was
> thrown to see which server was the straggler?
>
> Thanks
>
> On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand <
> alexandre.norm...@gmail.com> wrote:
>
> > Hey,
> > We're trying to restore a snapshot of a relatively big table (20TB) using
> > hbase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We increased the
> > timeout configurations (hbase.snapshot.master.timeoutMillis,
> > hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis) to
> > 10 minutes, but we're still experiencing the timeouts. Here's the error and
> > stack trace (table name obfuscated just because):
> >
> > ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
> > org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
> > ss*****-1443136710408 table=******** type=FLUSH } had an error.
> > kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
> >         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
> >         at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
> >         at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >         at java.lang.reflect.Method.invoke(Method.java:606)
> >         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> > timer-java.util.Timer@8ad0d5c:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
> >         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
> >         at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
> >         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
> >         ... 6 more
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
> >         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
> >         at java.util.TimerThread.mainLoop(Timer.java:555)
> >         at java.util.TimerThread.run(Timer.java:505)
> >
> > We could increase the timeout again, but we'd like to solicit some feedback
> > before trying that. First, does the timeout necessarily mean that the
> > restore failed, or could it still be running asynchronously and eventually
> > complete? Second, what's involved in the snapshot restore that could help us
> > determine an appropriate timeout value for this operation?
> >
> > Thanks!
> >
> > --
> > Alex
>

-- 
*Alexandre Normand*
Software Maker
Opower <http://www.opower.com>
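
[For anyone else following this thread: the three timeout properties mentioned above go in hbase-site.xml on the master (hbase.snapshot.region.timeout is read on the region servers as well), followed by a restart of the affected daemons. A minimal sketch, using the property names as given earlier in the thread and the 10-minute (600000 ms) values that were tried here; the right value for a 20TB table may need to be considerably larger:]

```xml
<!-- hbase-site.xml: snapshot timeout settings discussed in this thread.
     600000 ms = 10 minutes; values are illustrative, not a recommendation. -->
<property>
  <name>hbase.snapshot.master.timeoutMillis</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.snapshot.region.timeout</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.snapshot.master.timeout.millis</name>
  <value>600000</value>
</property>
```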