With a newer version of HBase, would the restore be faster?
At 2015-09-26 01:21:41, "Alexandre Normand" <alexandre.norm...@opower.com> wrote:

> We tried decommissioning the slow node and that still didn't help. We then
> increased the timeouts to 90 minutes and we had a successful restore after
> 32 minutes.
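>
> (For anyone curious, the decommission itself was the standard HDFS flow.
> A rough sketch follows; the exclude-file path is an assumption — it's
> whatever dfs.hosts.exclude points at in your hdfs-site.xml:)
>
>     # Add the slow node to the excludes file read by the namenode,
>     # then have the namenode re-read it to start decommissioning
>     echo "RS-1" >> /etc/hadoop/conf/dfs.hosts.exclude
>     hdfs dfsadmin -refreshNodes
>
>     # Watch progress until the node reports "Decommissioned"
>     hdfs dfsadmin -report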
>
> Better a slow restore than a failed restore.
>
> Cheers!
>
> On Thu, Sep 24, 2015 at 8:34 PM, Alexandre Normand
> <alexandre.norm...@opower.com> wrote:
>
>> Yes, RS-1 was the only data node to be excluded.
>>
>> The result of 'hdfs fsck /' is healthy. Blocks are reasonably replicated
>> and we don't have any missing or corrupted blocks.
>>
>> Thanks again.
>>
>> On Thu, Sep 24, 2015 at 7:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> bq. Excluding datanode RS-1:50010
>>>
>>> Was RS-1 the only data node to be excluded in that timeframe?
>>> Have you run fsck to see if hdfs is healthy?
>>>
>>> Cheers
>>>
>>> On Thu, Sep 24, 2015 at 7:47 PM, Alexandre Normand
>>> <alexandre.norm...@opower.com> wrote:
>>>
>>>> Hi Ted,
>>>> We'll be upgrading to cdh5 in the coming months but we're unfortunately
>>>> stuck on 0.94.6 at the moment.
>>>>
>>>> The RS logs were empty around the time of the failed snapshot restore
>>>> operation, but the following errors were in the master log. The node
>>>> 'RS-1' is the only node indicated in the logs. These errors occurred
>>>> throughout the duration of the snapshot_restore operation.
>>>>
>>>> Sep 24, 9:51:41.655 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Exception in createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as [RS-1]:50010
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1117)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>>> Sep 24, 9:51:41.664 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_8350736734896383334_327644360
>>>> Sep 24, 9:51:41.678 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Excluding datanode RS-1:50010
>>>> Sep 24, 9:52:58.954 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Exception in createBlockOutputStream
>>>> java.net.SocketTimeoutException: 75000 millis timeout while waiting for
>>>> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
>>>> local=/hmaster1:59726 remote=/RS-1:50010]
>>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
>>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:117)
>>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>     at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:169)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1106)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>>> Sep 24, 9:52:58.956 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_-6817802178798905477_327644519
>>>> Sep 24, 9:52:59.011 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Excluding datanode RS-1:50010
>>>> Sep 24, 9:54:22.963 PM WARN org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector
>>>> Timer already marked completed, ignoring!
>>>> Sep 24, 9:54:22.964 PM ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler
>>>> Failed taking snapshot { ss=** table=** type=DISABLED } due to
>>>> exception:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
>>>> elapsed! Source:Timeout caused Foreign Exception Start:1443145459635,
>>>> End:1443146059635, diff:600000, max:600000 ms
>>>>
>>>> Thanks!
>>>>
>>>> On Thu, Sep 24, 2015 at 6:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> 0.94.6 is really old. There have been quite a few bug fixes and
>>>>> improvements to the snapshot feature since its release.
>>>>>
>>>>> The error happens when the SnapshotDescription corresponding to
>>>>> kiji.prod.table.site.DI... was not found by the ProcedureCoordinator.
>>>>>
>>>>> bq. does the timeout necessarily mean that the restore failed or could it
>>>>> be still running asynchronously
>>>>>
>>>>> Can you check the region server logs around the time the TimeoutException
>>>>> was thrown to see which server was the straggler?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand
>>>>> <alexandre.norm...@gmail.com> wrote:
>>>>>
>>>>>> Hey,
>>>>>> We're trying to restore a snapshot of a relatively big table (20TB)
>>>>>> using hbase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We
>>>>>> increased the timeout configurations (hbase.snapshot.master.timeoutMillis,
>>>>>> hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis)
>>>>>> to 10 minutes but we're still experiencing the timeouts.
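>>>>>>
>>>>>> (Concretely, the overrides in our hbase-site.xml look roughly like
>>>>>> this — a sketch with the 10-minute values; we set all three property
>>>>>> names listed above to be safe, since the spelling has varied across
>>>>>> versions:)
>>>>>>
>>>>>>     <property>
>>>>>>       <name>hbase.snapshot.master.timeoutMillis</name>
>>>>>>       <value>600000</value>
>>>>>>     </property>
>>>>>>     <property>
>>>>>>       <name>hbase.snapshot.master.timeout.millis</name>
>>>>>>       <value>600000</value>
>>>>>>     </property>
>>>>>>     <property>
>>>>>>       <name>hbase.snapshot.region.timeout</name>
>>>>>>       <value>600000</value>
>>>>>>     </property>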
>>>>>>
>>>>>> Here's the error and stack trace (table name obfuscated just because):
>>>>>>
>>>>>> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
>>>>>> org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
>>>>>> ss*****-1443136710408 table=******** type=FLUSH } had an error.
>>>>>> kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
>>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
>>>>>>     at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
>>>>>>     at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
>>>>>> timer-java.util.Timer@8ad0d5c:org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>>     at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>>     at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
>>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
>>>>>>     ... 6 more
>>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>>     at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>>     at java.util.TimerThread.mainLoop(Timer.java:555)
>>>>>>     at java.util.TimerThread.run(Timer.java:505)
>>>>>>
>>>>>> We could increase the timeout again, but we'd like to solicit some
>>>>>> feedback before trying that. First, does the timeout necessarily mean
>>>>>> that the restore failed, or could it still be running asynchronously
>>>>>> and eventually complete? Second, what happens during a snapshot
>>>>>> restore that could help us pick an appropriate timeout value for this
>>>>>> operation?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Alex
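>>>>>>
>>>>>> P.S. In case it matters, the restore itself is just the standard
>>>>>> HBase shell sequence — a sketch, with placeholder table and snapshot
>>>>>> names:
>>>>>>
>>>>>>     hbase> list_snapshots
>>>>>>     hbase> disable 'the_table'
>>>>>>     hbase> restore_snapshot 'the_snapshot'
>>>>>>     hbase> enable 'the_table'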