After looking at the code, I think the direct cause is what Wei-Chiu
Chuang has already pointed out on the PR: Iterables.filter uses lazy
evaluation, so when we iterate over the files in
SnapshotFileCache.getUnreferencedFiles, we actually trigger all the
filter code while we are under the protection of the taking-snapshot
write lock.
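To illustrate the lazy-evaluation point: Guava's Iterables.filter returns a view whose predicate only runs when the result is iterated. The same behavior can be shown with plain JDK streams (a stand-in here to keep the example dependency-free); the counter shows that no filter work happens until iteration, which in the bug above is exactly the part that ran under the write lock:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyFilterDemo {
    public static void main(String[] args) {
        AtomicInteger predicateCalls = new AtomicInteger();
        List<String> files = Arrays.asList("a.hfile", "b.hfile", "c.hfile");

        // Like Guava's Iterables.filter, Stream.filter is lazy: building
        // the filtered view runs no predicate code yet.
        Stream<String> filtered = files.stream()
            .filter(f -> { predicateCalls.incrementAndGet(); return f.startsWith("a"); });
        System.out.println("after building filter: " + predicateCalls.get()); // 0

        // Only when the caller iterates (here: collects) does the filter
        // actually run -- in the bug above, that iteration happened while
        // holding the snapshot write lock.
        List<String> kept = filtered.collect(Collectors.toList());
        System.out.println("after iterating: " + predicateCalls.get()); // 3
        System.out.println(kept); // [a.hfile]
    }
}
```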

I think the root problem here is the write lock; as the XXX comment
says, it is inefficient. What we want to prevent is taking snapshots
at the same time, but we should allow accessing the cache
concurrently. Of course that is not easy, otherwise we would have it
in place already.
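For the record, one direction (just a sketch, not the actual SnapshotFileCache code, and the class and method names here are hypothetical) would be a ReadWriteLock: snapshot taking stays exclusive, while cache lookups take the shared read lock and can proceed in parallel:

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SnapshotLockSketch {
    public static void main(String[] args) {
        ReadWriteLock lock = new ReentrantReadWriteLock();

        // A "cleaner thread" holds the read lock for a cache lookup.
        lock.readLock().lock();

        // A second read acquisition still succeeds -- this is what would
        // let multiple cleaner threads query the cache concurrently.
        boolean secondReadOk = lock.readLock().tryLock();
        System.out.println("concurrent reads allowed: " + secondReadOk); // true

        // But the write lock (snapshot taking) cannot be acquired while
        // readers are active, so snapshots stay mutually exclusive with
        // in-flight cache reads.
        boolean writeOk = lock.writeLock().tryLock();
        System.out.println("write while reading: " + writeOk); // false
    }
}
```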

And on the fix here, I do not think we need to change all the
Iterables to Lists in our code; a simple one-line fix is enough. In
SnapshotHFileCleaner.getDeletableFiles, convert the files to a List
and then use that List to call
SnapshotFileCache.getUnreferencedFiles. I think this would fix the
problem here.
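The principle behind that one-line fix can be sketched dependency-free (the lazy Iterable below is a stand-in for the Iterables.filter view, not the real HBase types): materializing into a List forces all the lazy work before the lock is taken, so iterating under the lock touches only an in-memory list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class MaterializeDemo {
    public static void main(String[] args) {
        AtomicInteger lazyWork = new AtomicInteger();
        // Stand-in for the lazy filtered Iterable of files:
        Iterable<String> lazyFiles = () -> Stream.of("a", "b", "c")
            .peek(f -> lazyWork.incrementAndGet())
            .iterator();

        // The fix: copy into a List BEFORE taking the lock, so all the
        // lazy evaluation (counted by lazyWork) happens out here.
        List<String> materialized = new ArrayList<>();
        lazyFiles.forEach(materialized::add);
        int workBeforeLock = lazyWork.get();

        // Under the (simulated) lock, iterating the List triggers no
        // additional evaluation.
        for (String f : materialized) { /* check against snapshot cache */ }
        System.out.println("work before lock: " + workBeforeLock); // 3
        System.out.println("work under lock: " + (lazyWork.get() - workBeforeLock)); // 0
    }
}
```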

Thanks.

Tak Lon (Stephen) Wu <[email protected]> wrote on Fri, Jan 27, 2023 at 01:55:
>
> +1 for this change
>
> I tested the part on S3, it is as fast as peter mentioned and cuts
> down a lot of chore time if a user is actively using a snapshot and
> the archive is growing very fast.
>
> -Stephen
>
> On Thu, Jan 26, 2023 at 8:36 AM Peter Somogyi
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I was doing performance testing of the CleanerChores using S3 root
> > directory with HBase 2.4. When the archive directory was large the HFile
> > cleaner threads were blocked on acquiring the SnapshotWriteLock. The flame
> > graph [1] showed that inside SnapshotFileCache.getUnreferencedFiles there
> > were a lot of listing operations to S3. The lock was held for 45 seconds
> > with a directory containing 1000 files.
> > I've also seen one case when a snapshot creation failed (timed out). I
> > assume that can also be related to the long locking.
> >
> > The locking time can be drastically reduced by changing the
> > Iterable<FileStatus> parameter to List<FileStatus> for the
> > FileCleanerDelegate.getDeletableFiles. With this change, the lock was
> > released in approximately 100ms, however, the overall time for file
> > listing, evaluation, and deletion took the same time (45sec). Since the
> > different cleaner threads are not blocked on the snapshot lock the overall
> > deletion speed can be increased significantly.
> > CleanerChore.checkAndDeleteFiles already receives a List<FileStatus>
> > parameter, and later converts it to Iterable<FileStatus> so I don't expect
> > a drastic change in the Cleaners' memory usage.
> >
> > I've done some testing with HDFS as well.
> > Cleanup for 100k files took 63518ms with Iterable implementation, the lock
> > was held for ~130ms for every 1000 files.
> > Cleanup for 100k files took 64411ms with List implementation, the lock was
> > held for ~2ms for every 1000 files.
> >
> > Do you have any concerns about making this change?
> >
> > Thanks,
> > Peter
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-27590
> > [2] https://github.com/apache/hbase/pull/4995