On Wed, Feb 06, 2019 at 03:46:13PM -0500, Josef Bacik wrote:
> With my delayed refs rsv patches in place we started hitting issues in our
> build servers that do a lot of snapshot deletions.  Turns out there was a bug
> in btrfs_end_transaction_throttle() that caused it to basically always commit
> the transaction, which uncovered this particular bug.
> 
> The gory details are in the change logs for both patches, but generally
> speaking it's a problem with how we update our root_item->drop_progress key.
> We will sometimes skip updating it even though we have dropped references to
> blocks.  If we crash or unmount at these times we will start at a point
> earlier in our delete than we should and try to free blocks that we already
> freed, thus ending up with a transaction abort because we couldn't find the
> extent reference.
> 
> There are 2 patches, 1 patch to deal with already broken file systems, and 1
> patch to keep this problem from happening in the first place.
> 
> The steps to reproduce this easily are sort of tricky.  I had to add a couple
> of debug patches to the kernel in order to make it easy; basically I just
> needed to make sure we did actually commit the transaction every time we
> finished a walk_down_tree/walk_up_tree combo.
> 
> The reproducer
> 
> 1) Creates a base subvolume.
> 2) Creates 100k files in the subvolume.
> 3) Snapshots the base subvolume (snap1).
> 4) Touches files 5000-6000 in snap1.
> 5) Snapshots snap1 (snap2).
> 6) Deletes snap1.
> 
> I do this with dm-log-writes, and then replay to every FUA in the log and fsck
> the fs.  Without these patches this falls over pretty quickly.  With just the
> first patch we can mount the fs at the point that the fsck fails and it cleans
> everything up properly.  With both patches applied the fsck never fails and
> we're golden.  Thanks,

I copied the reproducer steps to the 2nd patch. 1 and 2 added to
misc-next.
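
For reference, the six reproducer steps quoted above could be scripted
roughly as below.  This is only a sketch under assumptions, not the script
that was actually used: the subvolume names (base, snap1, snap2), the
file naming scheme (file<N>), the inclusive 5000-6000 range and the helper
names are all made up for illustration.  It does not cover the debug patches,
the dm-log-writes setup or the replay/fsck loop.

```python
import subprocess
from pathlib import Path


def snapshot_commands(mnt):
    """Return the btrfs commands for steps 1, 3, 5 and 6, in order.

    The names base/snap1/snap2 are assumptions chosen for this sketch.
    """
    base, snap1, snap2 = (f"{mnt}/{name}" for name in ("base", "snap1", "snap2"))
    return [
        ["btrfs", "subvolume", "create", base],             # step 1
        ["btrfs", "subvolume", "snapshot", base, snap1],    # step 3
        ["btrfs", "subvolume", "snapshot", snap1, snap2],   # step 5
        ["btrfs", "subvolume", "delete", snap1],            # step 6
    ]


def touched_range(first=5000, last=6000):
    """File indices touched in step 4; treating the range as inclusive
    is an assumption."""
    return range(first, last + 1)


def run_reproducer(mnt, nfiles=100_000):
    """Run all six steps against a btrfs filesystem mounted at mnt.

    Needs root and a scratch filesystem; nothing here forces a
    transaction commit, which the original setup did via debug patches.
    """
    create, snap_1, snap_2, delete = snapshot_commands(mnt)
    subprocess.run(create, check=True)
    for i in range(nfiles):                      # step 2: create the files
        Path(mnt, "base", f"file{i}").touch()
    subprocess.run(snap_1, check=True)
    for i in touched_range():                    # step 4: touch files in snap1
        Path(mnt, "snap1", f"file{i}").touch()
    subprocess.run(snap_2, check=True)
    subprocess.run(delete, check=True)
```

Calling run_reproducer("/mnt/scratch") on a throwaway filesystem would walk
through the steps; the mount point is again just a placeholder.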
