> My question is: is it safe to ignore these TimeoutExceptions? if the
> SnapshotRegionManifests are not being written due to a timeout does that
> mean we are losing data or getting inconsistencies?

I don't think ignoring the TimeoutException is a good idea. You need to find out why taking the snapshot timed out. I ran into a similar case on my cluster: the cluster has 30 RS, but the table has 1600 regions, which means each RS needs to flush and write the region manifests for 50+ regions, while the SnapshotSubprocedurePool has a default size of 3 threads. It is easy to time out in that case because the pool is too small.

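
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes regions are spread evenly across the region servers and that each region's snapshot task holds one pool thread until it finishes; the function name is made up for illustration and is not part of HBase.

```python
import math

def snapshot_waves(regions, region_servers, pool_threads):
    """Return (regions per RS, sequential rounds per RS), assuming an even spread."""
    per_rs = math.ceil(regions / region_servers)
    # Each round handles at most `pool_threads` regions in parallel.
    waves = math.ceil(per_rs / pool_threads)
    return per_rs, waves

# Numbers from the cluster described above: 1600 regions on 30 RS.
default_pool = snapshot_waves(1600, 30, 3)    # default pool size of 3
enlarged_pool = snapshot_waves(1600, 30, 20)  # hbase.snapshot.region.concurrentTasks=20

print(default_pool)   # (54, 18)
print(enlarged_pool)  # (54, 3)
```

With 3 threads each RS has to work through about 18 rounds of flush-and-write, versus about 3 rounds with 20 threads, so the same timeout is reached much sooner with the small pool.
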
So I increased the config keys below (note that increasing 'hbase.snapshot.master.timeout.millis' alone is not enough, because the RS can also time out) and did a rolling update of the cluster. It works pretty well for me now.

hbase.snapshot.master.timeout.millis=1200000
hbase.snapshot.region.timeout=1200000
hbase.snapshot.region.concurrentTasks=20

Hope this is helpful for you, Arwin.

On Tue, Jul 23, 2019 at 3:55 PM Arwin Tio <[email protected]> wrote:

> Hi all,
>
> I've been running into these issues after restoring from snapshots:
>
> https://issues.apache.org/jira/browse/HBASE-16464
> https://issues.apache.org/jira/browse/HBASE-17992
>
> Essentially, HRegion#addRegionToSnapshot has been timing out in
> TakeSnapshotHandler, resulting in some leftover tmp files. The leftover tmp
> files causes archivedHFileCleaner, which manifests in an extremely large
> archive folder that doesn't get cleaned up.
>
> HBASE-16464 solves the bloating archive folder by preventing the
> SnapshotRegionManifest from being written if the operation has timed out
> (see:
> https://github.com/apache/hbase/commit/ab011391ab392f1a62b6ea9bdca87fc950af42a9#diff-4ec74c1b12f2be4f52c33260fd8b73efR86
> )
>
> My question is: is it safe to ignore these TimeoutExceptions? if the
> SnapshotRegionManifests are not being written due to a timeout does that
> mean we are losing data or getting inconsistencies?
>
> If so, what are some potential remedies for this? I'm thinking we can just
> increase the timeout 'hbase.snapshot.master.timeout.millis' but is there a
> better way?
>
> Thanks
>
