+1 for this change. I tested this part on S3; it is as fast as Peter mentioned and cuts down a lot of chore time when a user is actively using a snapshot and the archive is growing very fast.
-Stephen

On Thu, Jan 26, 2023 at 8:36 AM Peter Somogyi <[email protected]> wrote:
>
> Hi,
>
> I was doing performance testing of the CleanerChores using an S3 root
> directory with HBase 2.4. When the archive directory was large, the HFile
> cleaner threads were blocked on acquiring the SnapshotWriteLock. The flame
> graph [1] showed that inside SnapshotFileCache.getUnreferencedFiles there
> were a lot of listing operations to S3. The lock was held for 45 seconds
> with a directory containing 1000 files.
> I've also seen one case where a snapshot creation failed (timed out). I
> assume that can also be related to the long locking.
>
> The locking time can be drastically reduced by changing the
> Iterable<FileStatus> parameter to List<FileStatus> for
> FileCleanerDelegate.getDeletableFiles. With this change, the lock was
> released in approximately 100 ms; however, the overall time for file
> listing, evaluation, and deletion stayed the same (45 s). Since the
> different cleaner threads are no longer blocked on the snapshot lock, the
> overall deletion speed can be increased significantly.
> CleanerChore.checkAndDeleteFiles already receives a List<FileStatus>
> parameter and later converts it to Iterable<FileStatus>, so I don't expect
> a drastic change in the cleaners' memory usage.
>
> I've done some testing with HDFS as well:
> Cleanup of 100k files took 63518 ms with the Iterable implementation; the
> lock was held for ~130 ms per 1000 files.
> Cleanup of 100k files took 64411 ms with the List implementation; the
> lock was held for ~2 ms per 1000 files.
>
> Do you have any concerns about making this change?
>
> Thanks,
> Peter
>
> [1] https://issues.apache.org/jira/browse/HBASE-27590
> [2] https://github.com/apache/hbase/pull/4995
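For anyone following along, here is a minimal standalone sketch of the locking pattern Peter describes (not HBase's actual classes; the lock, file names, and the simulated slow check are all made up for illustration). With a lazy Iterable, the slow per-file evaluation runs while the lock is still held; materializing the candidates into a List under the lock lets the lock be released before the slow work starts, even though the total work is the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: contrasts holding a lock across lazy
// evaluation vs. snapshotting into a List and releasing the lock first.
public class LockScopeSketch {
    static final ReentrantLock LOCK = new ReentrantLock();
    static final List<String> FILES = List.of("a.hfile", "b.hfile", "c.hfile");

    // Stand-in for a slow per-file check (e.g. a remote listing).
    static boolean isUnreferenced(String f) {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return true;
    }

    // Iterable-style: slow checks happen inside the critical section,
    // so the lock is held for the whole evaluation.
    public static long lazyUnderLock() {
        long start = System.nanoTime();
        LOCK.lock();
        try {
            for (String f : FILES) {
                isUnreferenced(f); // slow work while holding the lock
            }
        } finally {
            LOCK.unlock();
        }
        return (System.nanoTime() - start) / 1_000_000; // ms the lock was held
    }

    // List-style: copy the candidates under the lock (cheap), release it,
    // then run the slow evaluation outside the critical section.
    public static long snapshotThenEvaluate() {
        List<String> snapshot;
        long start = System.nanoTime();
        LOCK.lock();
        try {
            snapshot = new ArrayList<>(FILES); // cheap copy only
        } finally {
            LOCK.unlock();
        }
        long heldMs = (System.nanoTime() - start) / 1_000_000;
        for (String f : snapshot) {
            isUnreferenced(f); // slow work, lock already released
        }
        return heldMs;
    }

    public static void main(String[] args) {
        System.out.println("lock held, lazy (ms):     " + lazyUnderLock());
        System.out.println("lock held, snapshot (ms): " + snapshotThenEvaluate());
    }
}
```

This matches the measurements in the thread: the end-to-end cleanup time barely changes, but the lock hold time drops from the full evaluation cost to just the cost of the copy, so other cleaner threads stop queuing behind it.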
