Hi Miao,
I am seeing another issue. Your fix prevents from TRANS_START to get
in the way of a committing transaction. But it does not prevent from
TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
following loop:

do {
    // attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
                 (should_grow && cur_trans->num_joined != joined));

What I see is basically that new writers join the transaction, while
btrfs_commit_transaction() does this loop. I see
cur_trans->num_writers decreasing, but then it increases, then
decreases etc. This can go for several seconds during heavy IO load.
There is nothing to prevent new TRANS_JOIN writers coming and joining
a transaction over and over, thus delaying transaction commit. The IO
path uses TRANS_JOIN; for example run_delalloc_nocow() does that.

Do you observe such behavior? Do you believe it's problematic?

Thanks,
Alex.


On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <mi...@cn.fujitsu.com> wrote:
> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> can you please explain your solution a bit more.
>>
>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>> Now btrfs_commit_transaction() does this
>>>
>>> ret = btrfs_run_ordered_operations(root, 0)
>>>
>>> which async flushes all inodes on the ordered operations list, it introduced
>>> a deadlock that transaction-start task, transaction-commit task and the 
>>> flush
>>> workers waited for each other.
>>> (See the following URL to get the detail
>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>
>>> As we know, if ->in_commit is set, it means someone is committing the
>>> current transaction, we should not try to join it if we are not JOIN
>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>> the above problem. In this way, there is another benefit: there is no new
>>> transaction handle to block the transaction which is on the way of commit,
>>> once we set ->in_commit.
>>>
>>> Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index bc2f2d1..71b7e2e 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct 
>>> btrfs_root *root)
>>>         root->commit_root = btrfs_root_node(root);
>>>  }
>>>
>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>> +                                      int type)
>>> +{
>>> +       return !(trans->in_commit &&
>>> +                type != TRANS_JOIN &&
>>> +                type != TRANS_JOIN_NOLOCK);
>>> +}
>>> +
>>>  /*
>>>   * either allocate a new transaction or hop into the existing one
>>>   */
>>> @@ -86,6 +94,10 @@ loop:
>>>                         spin_unlock(&fs_info->trans_lock);
>>>                         return cur_trans->aborted;
>>>                 }
>>> +               if (!can_join_transaction(cur_trans, type)) {
>>> +                       spin_unlock(&fs_info->trans_lock);
>>> +                       return -EBUSY;
>>> +               }
>>>                 atomic_inc(&cur_trans->use_count);
>>>                 atomic_inc(&cur_trans->num_writers);
>>>                 cur_trans->num_joined++;
>>> @@ -360,8 +372,11 @@ again:
>>>
>>>         do {
>>>                 ret = join_transaction(root, type);
>>> -               if (ret == -EBUSY)
>>> +               if (ret == -EBUSY) {
>>>                         wait_current_trans(root);
>>> +                       if (unlikely(type == TRANS_ATTACH))
>>> +                               ret = -ENOENT;
>>> +               }
>>
>> So I understand that instead of incrementing num_writes and joining
>> the current transaction, you do not join and wait for the current
>> transaction to unblock.
>
> More specifically,TRANS_START、TRANS_USERSPACE and TRANS_ATTACH can not
> join and just wait for the current transaction to unblock if ->in_commit
> is set.
>
>> Which task in Josef's example
>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>> task 1, task 2 or task 3 is the one that will not join the
>> transaction, but instead wait?
>
> Task1 will not join the transaction, in this way, async inode flush
> won't run, and then task3 won't do anything.
>
> Before applying the patch:
> Start/Attach_Trans_Task                 Commit_Task                     
> Flush_Worker
> (Task1)                                 (Task2)                         
> (Task3)         -- the name in Josef's example
> btrfs_start_transaction()
>  |->may_wait_transaction()
>  |  (return 0)
>  |                                      btrfs_commit_transaction()
>  |                                       |->set ->in_commit and
>  |                                       |  blocked to 1
>  |                                       |->wait writers to be 1
>  |                                       |  (writers is 1)
>  |->join_transaction()                   |
>  |  (writers is 2)                       |
>  |->btrfs_commit_transaction()           |
>      |                                   |->set trans_no_join to 1
>      |                                   |  (close join transaction)
>      |->btrfs_run_ordered_operations     |
>         (Those ordered operations        |
>          are added when releasing        |
>          file)                           |
>          |->async inode flush()          |
>          |->wait_flush_comlete()         |
>                                          |                              
> work_loop()
>                                          |                               
> |->run_work()
>                                          |                                   
> |->btrfs_join_transaction()
>                                          |                                    
>    |->wait_current_trans()
>                                          |->wait writers to be 1
>
> This three tasks waited for each other.
>
> After applying this patch:
> Start/Attach_Trans_Task                 Commit_Task                     
> Flush_Worker
> (Task1)                                 (Task2)                         
> (Task3)
> btrfs_start_transaction()
>  |->may_wait_transaction()
>  |  (return 0)
>  |                                      btrfs_commit_transaction()
>  |                                       |->set ->in_commit and
>  |                                       |  blocked to 1
>  |                                       |->wait writers to be 1
>  |                                       |  (writers is 1)
>  |->join_transaction() fail              |
>  |  (return -EBUSY, writers is still 1)  |
>  |->wait_current_trans()                 |
>                                          |->set trans_no_join to 1
>                                          |  (close join transaction)
>                                          |->wait writers to be 1
>                                          |->continue committing
>                                                                         
> (Task3 does nothing)
>> Also, I think I don't fully understand Josef's example. What is
>> preventing from async flushing to complete?
>> Is task 3 waiting because trans_no_join is set?
>> Is task 3 the one that actually does the delalloc flush?
>
> See above.
>
> Thanks
> Miao
>
>>
>> Thanks,
>> Alex.
>>
>>
>>
>>
>>
>>
>>>         } while (ret == -EBUSY);
>>>
>>>         if (ret < 0) {
>>> --
>>> 1.6.5.2
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to