Hi Ted,

We'll be upgrading to CDH5 in the coming months, but we're unfortunately stuck on 0.94.6 at the moment.
The RS logs were empty around the time of the failed snapshot restore operation, but the following errors were in the master log. The node 'RS-1' is the only node indicated in the logs. These errors occurred throughout the duration of the snapshot restore operation.

Sep 24, 9:51:41.655 PM INFO org.apache.hadoop.hdfs.DFSClient Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as [RS-1]:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1117)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
Sep 24, 9:51:41.664 PM INFO org.apache.hadoop.hdfs.DFSClient Abandoning BP-1769805853-namenode1-1354129919031:blk_8350736734896383334_327644360
Sep 24, 9:51:41.678 PM INFO org.apache.hadoop.hdfs.DFSClient Excluding datanode RS-1:50010
Sep 24, 9:52:58.954 PM INFO org.apache.hadoop.hdfs.DFSClient Exception in createBlockOutputStream
java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/hmaster1:59726 remote=/RS-1:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:117)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:169)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1106)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
Sep 24, 9:52:58.956 PM INFO org.apache.hadoop.hdfs.DFSClient Abandoning BP-1769805853-namenode1-1354129919031:blk_-6817802178798905477_327644519
Sep 24, 9:52:59.011 PM INFO org.apache.hadoop.hdfs.DFSClient Excluding datanode RS-1:50010
Sep 24, 9:54:22.963 PM WARN org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector Timer already marked completed, ignoring!
Sep 24, 9:54:22.964 PM ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler Failed taking snapshot { ss=** table=** type=DISABLED } due to exception:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1443145459635, End:1443146059635, diff:600000, max:600000 ms

Thanks!

On Thu, Sep 24, 2015 at 6:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> 0.94.6 is really old. There have been quite a few bug fixes / improvements
> to the snapshot feature since its release.
>
> The error happens when the SnapshotDescription corresponding to
> kiji.prod.table.site.DI...
> was not found by the ProcedureCoordinator.
>
> bq.
does the timeout necessarily mean that the restore failed or could it
> still be running asynchronously
>
> Can you check the region server logs around the time the TimeoutException was
> thrown to see which server was the straggler?
>
> Thanks
>
> On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand <
> alexandre.norm...@gmail.com> wrote:
>
> > Hey,
> > We're trying to restore a snapshot of a relatively big table (20TB) using
> > hbase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We increased the
> > timeout configurations (hbase.snapshot.master.timeoutMillis,
> > hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis) to
> > 10 minutes, but we're still experiencing the timeouts. Here's the error and
> > stack trace (table name obfuscated just because):
> >
> > ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
> > org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
> > ss*****-1443136710408 table=******** type=FLUSH } had an error.
> > kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
> >         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
> >         at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
> >         at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >         at java.lang.reflect.Method.invoke(Method.java:606)
> >         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> > timer-java.util.Timer@8ad0d5c:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
> >         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
> >         at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
> >         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
> >         ... 6 more
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
> >         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
> >         at java.util.TimerThread.mainLoop(Timer.java:555)
> >         at java.util.TimerThread.run(Timer.java:505)
> >
> > We could increase the timeout again, but we'd like to solicit some feedback
> > before trying that. First, does the timeout necessarily mean that the
> > restore failed, or could it still be running asynchronously and eventually
> > complete? Second, what's involved in the snapshot restore that could help us
> > determine an appropriate timeout value for this operation?
> >
> > Thanks!
> >
> > --
> > Alex
>

-- 
*Alexandre Normand*
Software Maker
Opower <http://www.opower.com>
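
[For anyone else following this thread: the three timeout properties mentioned above go in hbase-site.xml on the master (hbase.snapshot.region.timeout is read on the region servers as well), followed by a restart of the affected daemons. A minimal sketch, using the property names as given earlier in the thread and the 10-minute (600000 ms) values that were tried here; the right value for a 20TB table may need to be considerably larger:]

```xml
<!-- hbase-site.xml: snapshot timeout settings discussed in this thread.
     600000 ms = 10 minutes; values are illustrative, not a recommendation. -->
<property>
  <name>hbase.snapshot.master.timeoutMillis</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.snapshot.region.timeout</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.snapshot.master.timeout.millis</name>
  <value>600000</value>
</property>
```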