0.94.6 is really old. There have been quite a few bug fixes and improvements
to the snapshot feature since its release.

The error happens when the SnapshotDescription corresponding to
kiji.prod.table.site.DI...
is not found by the ProcedureCoordinator.

bq. does the timeout necessarily mean that the restore failed or could it
be still running asynchronously

Can you check the region server logs around the time the TimeoutException
was thrown to see which server was the straggler?
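
To narrow down the window to look at, the Start/End values in the trace can be
converted to wall-clock time. A minimal sketch in Python, assuming those values
are epoch milliseconds (which the diff:600000 for a 10 minute timeout
suggests); the helper names are made up for illustration:

```python
import re
from datetime import datetime, timedelta, timezone

# Timing line copied from the stack trace quoted in this thread.
TIMING_LINE = "Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms"

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def parse_timing(line):
    """Extract the Start/End/diff/max fields (assumed to be ms) as ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)", line)}


fields = parse_timing(TIMING_LINE)
start = ms_to_utc(fields["Start"])
end = ms_to_utc(fields["End"])

print("timer started:", start.isoformat())
print("timer fired:  ", end.isoformat())
print("elapsed (s):  ", fields["diff"] / 1000)
```

The printed times are in UTC. Region server logs are usually written in the
server's local time zone, so shift accordingly before searching them.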

Thanks

On Thu, Sep 24, 2015 at 5:13 PM, Alexandre Normand <
alexandre.norm...@gmail.com> wrote:

> Hey,
> We're trying to restore a snapshot of a relatively big table (20TB) using
> HBase 0.94.6-cdh4.5.0 and we're getting timeouts doing so. We increased the
> timeout configurations (hbase.snapshot.master.timeoutMillis,
> hbase.snapshot.region.timeout, hbase.snapshot.master.timeout.millis) to 10
> minutes but we're still experiencing the timeouts. Here's the error and
> stack trace (table name obfuscated just because):
>
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot { ss*****-1443136710408 table=******** type=FLUSH } had an error. kiji.prod.table.site.DI-1019-1443136710408 not found in proclist []
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:360)
>         at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2075)
>         at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@8ad0d5c:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>         at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
>         ... 6 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1443136713121, End:1443137313121, diff:600000, max:600000 ms
>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
>
>
> We could increase the timeout again but we'd like to solicit some feedback
> before trying that. First, does the timeout necessarily mean that the
> restore failed or could it be still running asynchronously and eventually
> completing? What's involved in the snapshot restore that could be useful in
> informing what timeout value would be appropriate for this operation?
>
> Thanks!
>
> --
> Alex
>
