On 5/3/18 11:52 AM, Nikolay Borisov wrote:
> 
> 
> On  3.05.2018 16:39, Jeff Mahoney wrote:
>> On 5/3/18 3:24 AM, Nikolay Borisov wrote:
>>>
>>>
>>> On  3.05.2018 00:11, je...@suse.com wrote:
>>>> From: Jeff Mahoney <je...@suse.com>
>>>>
>>>> Commit 8d9eddad194 (Btrfs: fix qgroup rescan worker initialization)
>>>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>>>> ended up reintroducing the hang-on-unmount bug that the commit it
>>>> intended to fix addressed.
>>>>
>>>> The race this time is between qgroup_rescan_init setting
>>>> ->qgroup_rescan_running = true and the worker starting.  There are
>>>> many scenarios where we initialize the worker and never start it.  The
>>>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>>>> This can happen even without involving error handling, since mounting
>>>> the file system read-only returns between initializing the worker and
>>>> queueing it.
>>>>
>>>> The right place to do it is when we're queuing the worker.  The flag
>>>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>>>> a completion.
>>>>
>>>> Since the BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is overloaded to
>>>> refer to both runtime behavior and on-disk state, we introduce a new
>>>> fs_info->qgroup_rescan_ready to indicate that we're initialized and
>>>> waiting to start.
>>>
>>> Am I correct in my understanding that this qgroup_rescan_ready flag is
>>> used to avoid qgroup_rescan_init being called AFTER it has already been
>>> called but BEFORE queue_rescan_worker ? Why wasn't the initial version
>>> of this patch without this flag sufficient?
>>
>> No, the race is between clearing the BTRFS_QGROUP_STATUS_FLAG_RESCAN
>> flag near the end of the worker and clearing the running flag.  The
>> rescan lock is dropped in between, so btrfs_rescan_init will let a new
>> rescan request in while we update the status item on disk.  We wouldn't
>> have queued another worker since that's what the warning catches, but if
>> there were already tasks waiting for completion, they wouldn't have been
>> woken since the wait queue list would be reinitialized.  There's no way
>> to reorder clearing the flag without changing how we handle
>> ->qgroup_flags.  I plan on doing that separately.  This was just meant
>> to be the simple fix.
> 
> Great, I think some of this information should go into the change log,
> in explaining what the symptoms of the race condition are.

You're right.  I was treating as a race that my patch introduced but it
didn't.  It just complained about it.

-Jeff

-- 
Jeff Mahoney
SUSE Labs

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to