bq. Excluding datanode RS-1:50010

Was RS-1 the only datanode to be excluded in that timeframe? Have you run
fsck to see if HDFS is healthy?
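For example, a minimal check (a sketch, assuming a CDH-style install with
the hdfs command on the PATH, run as the HDFS superuser):

    # Check the health of the whole namespace; a healthy run ends with
    # "The filesystem under path '/' is HEALTHY".
    sudo -u hdfs hdfs fsck /

    # Narrow down to files with missing or corrupt blocks, if any:
    sudo -u hdfs hdfs fsck / -list-corruptfileblocks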
Cheers

On Thu, Sep 24, 2015 at 7:47 PM, Alexandre Normand
<alexandre.norm...@opower.com> wrote:

> Hi Ted,
> We'll be upgrading to cdh5 in the coming months but we're unfortunately
> stuck on 0.94.6 at the moment.
>
> The RS logs were empty around the time of the failed snapshot restore
> operation, but the following errors were in the master log. The node
> 'RS-1' is the only node indicated in the logs. These errors occurred
> throughout the duration of the snapshot_restore operation.
>
> Sep 24, 9:51:41.655 PM INFO org.apache.hadoop.hdfs.DFSClient
> Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as [RS-1]:50010
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1117)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>
> Sep 24, 9:51:41.664 PM INFO org.apache.hadoop.hdfs.DFSClient
> Abandoning BP-1769805853-namenode1-1354129919031:blk_8350736734896383334_327644360
>
> Sep 24, 9:51:41.678 PM INFO org.apache.hadoop.hdfs.DFSClient
> Excluding datanode RS-1:50010
>
> Sep 24, 9:52:58.954 PM INFO org.apache.hadoop.hdfs.DFSClient
> Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 75000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/hmaster1:59726 remote=/RS-1:50010]
>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:117)
>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>   at java.io.FilterInputStream.read(FilterInputStream.java:83)
>   at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:169)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1106)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>
> Sep 24, 9:52:58.956 PM INFO org.apache.hadoop.hdfs.DFSClient
> Abandoning BP-1769805853-namenode1-1354129919031:blk_-6817802178798905477_327644519
>
> Sep 24, 9:52:59.011 PM INFO org.apache.hadoop.hdfs.DFSClient
> Excluding datanode RS-1:50010
>
> Sep 24, 9:54:22.963 PM WARN org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector
> Timer already marked completed, ignoring!
>
> Sep 24, 9:54:22.964 PM ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler
> Failed taking snapshot { ss=** table=** type=DISABLED } due to
> exception: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
> elapsed! Source: Timeout caused Foreign Exception Start:1443145459635,
> End:1443146059635, diff:600000, max:600000 ms
>
> Thanks!
>
> On Thu, Sep 24, 2015 at 6:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > 0.94.6 is really old. There have been quite a few bug fixes /
> > improvements to the snapshot feature since its release.
> >
> > The error happens when the SnapshotDescription corresponding to
> > kiji.prod.table.site.DI... was not found by the ProcedureCoordinator.
> >
> > bq. does the timeout necessarily mean that the restore failed or could
> > it be still running asynchronously
> >
> > Can you check the region server logs around the time the
> > TimeoutException was thrown to see which server was the straggler?
> >
> > Thanks
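As an aside on checking the region server logs: a minimal sketch of what
to run on each region server (the log path and timestamp format are
assumptions based on CDH defaults; adjust the window to bracket the
TimeoutException):

    # Pull WARN/ERROR entries from the region server log around the time
    # the snapshot timed out (roughly 21:44-21:54 in the master log above):
    grep -E '^2015-09-24 21:(4[4-9]|5[0-4])' /var/log/hbase/hbase-*-regionserver-*.log \
      | grep -E 'WARN|ERROR'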
> > On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand
> > <alexandre.norm...@gmail.com> wrote:
> >
> > > Hey,
> > > We're trying to restore a snapshot of a relatively big table (20TB)
> > > using hbase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We
> > > increased the timeout configurations (hbase.snapshot.master.timeoutMillis,
> > > hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis)
> > > to 10 minutes but we're still experiencing the timeouts. Here's the
> > > error and stack trace (table name obfuscated just because):
> > >
> > > ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
> > > org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
> > > ss*****-1443136710408 table=******** type=FLUSH } had an error.
> > > kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
> > >   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
> > >   at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
> > >   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
> > >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >   at java.lang.reflect.Method.invoke(Method.java:606)
> > >   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
> > > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> > > timer-java.util.Timer@8ad0d5c: org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > > Timeout elapsed! Source: Timeout caused Foreign Exception
> > > Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
> > >   at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
> > >   at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
> > >   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
> > >   ... 6 more
> > > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > > Timeout elapsed! Source: Timeout caused Foreign Exception
> > > Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
> > >   at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
> > >   at java.util.TimerThread.mainLoop(Timer.java:555)
> > >   at java.util.TimerThread.run(Timer.java:505)
> > >
> > > We could increase the timeout again, but we'd like to solicit some
> > > feedback before trying that. First, does the timeout necessarily mean
> > > that the restore failed, or could it still be running asynchronously
> > > and eventually completing? Second, what's involved in the snapshot
> > > restore that would help inform what timeout value would be
> > > appropriate for this operation?
> > >
> > > Thanks!
> > >
> > > --
> > > Alex
>
> --
> *Alexandre Normand*
> Software Maker
> Opower <http://www.opower.com>
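For reference, the timeout settings mentioned in this thread live in
hbase-site.xml on the master. A minimal sketch, setting all three property
names named above as the original poster did (the values are illustrative,
here 30 minutes, i.e. 1800000 ms; which key the running 0.94 code path
actually consults is an assumption worth verifying against your version,
and a master restart is needed for the change to take effect):

    <!-- hbase-site.xml: snapshot timeouts, in milliseconds -->
    <property>
      <name>hbase.snapshot.master.timeoutMillis</name>
      <value>1800000</value>
    </property>
    <property>
      <name>hbase.snapshot.region.timeout</name>
      <value>1800000</value>
    </property>
    <property>
      <name>hbase.snapshot.master.timeout.millis</name>
      <value>1800000</value>
    </property>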