With a newer version of HBase, would the restore be faster?
At 2015-09-26 01:21:41, "Alexandre Normand" <alexandre.norm...@opower.com> wrote:

> We tried decommissioning the slow node and that still didn't help. We then
> increased the timeouts to 90 minutes and we had a successful restore after
> 32 minutes.
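>
> (For anyone curious, the decommission itself was the standard HDFS flow.
> A rough sketch follows; the exclude-file path is an assumption — it's
> whatever dfs.hosts.exclude points at in your hdfs-site.xml:)
>
>     # Add the slow node to the excludes file read by the namenode,
>     # then have the namenode re-read it to start decommissioning
>     echo "RS-1" >> /etc/hadoop/conf/dfs.hosts.exclude
>     hdfs dfsadmin -refreshNodes
>
>     # Watch progress until the node reports "Decommissioned"
>     hdfs dfsadmin -report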
>
> Better a slow restore than a failed restore.
>
> Cheers!
>
> On Thu, Sep 24, 2015 at 8:34 PM, Alexandre Normand
> <alexandre.norm...@opower.com> wrote:
>
>> Yes, RS-1 was the only data node to be excluded.
>>
>> The result of 'hdfs fsck /' is healthy. Blocks are reasonably replicated
>> and we don't have any missing or corrupted blocks.
>>
>> Thanks again.
>>
>> On Thu, Sep 24, 2015 at 7:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> bq. Excluding datanode RS-1:50010
>>>
>>> Was RS-1 the only data node to be excluded in that timeframe?
>>> Have you run fsck to see if hdfs is healthy?
>>>
>>> Cheers
>>>
>>> On Thu, Sep 24, 2015 at 7:47 PM, Alexandre Normand
>>> <alexandre.norm...@opower.com> wrote:
>>>
>>>> Hi Ted,
>>>> We'll be upgrading to cdh5 in the coming months but we're unfortunately
>>>> stuck on 0.94.6 at the moment.
>>>>
>>>> The RS logs were empty around the time of the failed snapshot restore
>>>> operation, but the following errors were in the master log. The node
>>>> 'RS-1' is the only node indicated in the logs. These errors occurred
>>>> throughout the duration of the snapshot_restore operation.
>>>>
>>>> Sep 24, 9:51:41.655 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Exception in createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as [RS-1]:50010
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1117)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>>> Sep 24, 9:51:41.664 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_8350736734896383334_327644360
>>>> Sep 24, 9:51:41.678 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Excluding datanode RS-1:50010
>>>> Sep 24, 9:52:58.954 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Exception in createBlockOutputStream
>>>> java.net.SocketTimeoutException: 75000 millis timeout while waiting for
>>>> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
>>>> local=/hmaster1:59726 remote=/RS-1:50010]
>>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
>>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:117)
>>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>     at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:169)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1106)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1040)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:488)
>>>> Sep 24, 9:52:58.956 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Abandoning BP-1769805853-namenode1-1354129919031:blk_-6817802178798905477_327644519
>>>> Sep 24, 9:52:59.011 PM INFO org.apache.hadoop.hdfs.DFSClient
>>>> Excluding datanode RS-1:50010
>>>> Sep 24, 9:54:22.963 PM WARN org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector
>>>> Timer already marked completed, ignoring!
>>>> Sep 24, 9:54:22.964 PM ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler
>>>> Failed taking snapshot { ss=** table=** type=DISABLED } due to
>>>> exception:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
>>>> elapsed! Source:Timeout caused Foreign Exception Start:1443145459635,
>>>> End:1443146059635, diff:600000, max:600000 ms
>>>>
>>>> Thanks!
>>>>
>>>> On Thu, Sep 24, 2015 at 6:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> 0.94.6 is really old. There have been quite a few bug fixes and
>>>>> improvements to the snapshot feature since its release.
>>>>>
>>>>> The error happens when the SnapshotDescription corresponding to
>>>>> kiji.prod.table.site.DI... was not found by the ProcedureCoordinator.
>>>>>
>>>>> bq. does the timeout necessarily mean that the restore failed or could it
>>>>> be still running asynchronously
>>>>>
>>>>> Can you check the region server logs around the time the TimeoutException
>>>>> was thrown to see which server was the straggler?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand
>>>>> <alexandre.norm...@gmail.com> wrote:
>>>>>
>>>>>> Hey,
>>>>>> We're trying to restore a snapshot of a relatively big table (20TB)
>>>>>> using hbase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We
>>>>>> increased the timeout configurations (hbase.snapshot.master.timeoutMillis,
>>>>>> hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis)
>>>>>> to 10 minutes but we're still experiencing the timeouts.
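>>>>>>
>>>>>> (Concretely, the overrides in our hbase-site.xml look roughly like
>>>>>> this — a sketch with the 10-minute values; we set all three property
>>>>>> names listed above to be safe, since the spelling has varied across
>>>>>> versions:)
>>>>>>
>>>>>>     <property>
>>>>>>       <name>hbase.snapshot.master.timeoutMillis</name>
>>>>>>       <value>600000</value>
>>>>>>     </property>
>>>>>>     <property>
>>>>>>       <name>hbase.snapshot.master.timeout.millis</name>
>>>>>>       <value>600000</value>
>>>>>>     </property>
>>>>>>     <property>
>>>>>>       <name>hbase.snapshot.region.timeout</name>
>>>>>>       <value>600000</value>
>>>>>>     </property>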
>>>>>>
>>>>>> Here's the error and stack trace (table name obfuscated just because):
>>>>>>
>>>>>> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
>>>>>> org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
>>>>>> ss*****-1443136710408 table=******** type=FLUSH } had an error.
>>>>>> kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
>>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
>>>>>>     at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
>>>>>>     at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
>>>>>> timer-java.util.Timer@8ad0d5c:org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>>     at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>>     at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
>>>>>>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
>>>>>>     ... 6 more
>>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException:
>>>>>> Timeout elapsed! Source:Timeout caused Foreign Exception
>>>>>> Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>>>>>>     at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>>     at java.util.TimerThread.mainLoop(Timer.java:555)
>>>>>>     at java.util.TimerThread.run(Timer.java:505)
>>>>>>
>>>>>> We could increase the timeout again, but we'd like to solicit some
>>>>>> feedback before trying that. First, does the timeout necessarily mean
>>>>>> that the restore failed, or could it still be running asynchronously
>>>>>> and eventually complete? Second, what happens during a snapshot
>>>>>> restore that could help us pick an appropriate timeout value for this
>>>>>> operation?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Alex
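>>>>>>
>>>>>> P.S. In case it matters, the restore itself is just the standard
>>>>>> HBase shell sequence — a sketch, with placeholder table and snapshot
>>>>>> names:
>>>>>>
>>>>>>     hbase> list_snapshots
>>>>>>     hbase> disable 'the_table'
>>>>>>     hbase> restore_snapshot 'the_snapshot'
>>>>>>     hbase> enable 'the_table'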