Hi Miao, On Thu, May 9, 2013 at 10:57 AM, Miao Xie <mi...@cn.fujitsu.com> wrote: > Hi, Alex > > Could you try the following patchset? > > git://github.com/miaoxie/linux-btrfs.git trans-commit-improve > > I think it can avoid the problem you said below. > > Note: this patchset is against chris's for-linus branch.
I reviewed the code starting from: 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during the transaction commit until 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction() It looks very good. Let me check if I understand the fix correctly: # When transaction starts to commit, we want to wait only for external writers (those that did ATTACH/START/USERSPACE) # We guarantee at this point that no new external writers will hop on the committing transaction, by setting ->blocked state, so we only wait for existing extwriters to detach from transaction # We do not care at this point for TRANS_JOIN etc, we let them hop on if they want # When all external writers have detached, we flush their delalloc and then we prevent all the others to join (TRANS_JOIN etc) # Previously, we had the do-while loop, that intended to do the same, but it used num_writers, which counts both external writers and also TRANS_JOIN. So the loop was racy because new joins prevented it from completing. Is my understanding correct? I have some questions: # Why was the do-while loop needed? Can we just delete the do-while loop as it was before, call flush_all_pending stuffs(), then set trans_no_join and wait for all writers to detach? Is there some correctness problem here? Or we need to wait for external writers to detach before calling flush_all_pending_stuffs() one last time? # Why TRANS_ATTACH is considered external writer? # Can I apply this fix to 3.8.x kernel (manually, of course)? Or some additional things are needed that are missing in this kernel? Thanks, Alex. > > Thanks > Miao > > On Wed, 10 Apr 2013 21:45:43 +0300, Alex Lyakas wrote: >> Hi Miao, >> I attempted to fix the issue by not joining a transaction that has >> trans->in_commit set. I did something similar to what >> wait_current_trans() does, but I did: >> >> smp_rmb(); >> if (cur_trans && cur_trans->in_commit) { >> ... >> wait_event(root->fs_info->transaction_wait, !cur_trans->blocked); >> ... >> >> I also had to change the order of setting in_commit and blocked in >> btrfs_commit_transaction: >> trans->transaction->blocked = 1; >> trans->transaction->in_commit = 1; >> smp_wmb(); >> to make sure that if in_commit is set, then blocked cannot be 0, >> because btrfs_commit_transaction haven't set it yet to 1. >> >> However, with this fix I observe two issues: >> # With large trees and heavy commits, join_transaction() is delayed >> sometimes by 1-3 seconds. This delays the host IO by too much. >> # With this fix, I think too many transactions happen. Basically with >> this fix, once transaction->in_commit is set, then I insist to open a >> new transaction and not to join the current one. It has some bad >> influence on host response times pattern, but I cannot exactly tell >> why is that. >> >> Did you have other fix in mind? >> >> Without the fix, I observe sometimes commits that take like 80 >> seconds, out of which like 50 seconds are spent in the do-while loop >> of btrfs_commit_transaction. >> >> Thanks, >> Alex. >> >> >> >> On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas >> <alex.bt...@zadarastorage.com> wrote: >>> Hi Miao, >>> >>> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <mi...@cn.fujitsu.com> wrote: >>>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote: >>>>> Hi Miao, >>>>> I am seeing another issue. Your fix prevents from TRANS_START to get >>>>> in the way of a committing transaction. But it does not prevent from >>>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the >>>>> following loop: >>>>> >>>>> do { >>>>> // attempt to do some useful stuff and/or sleep >>>>> } while (atomic_read(&cur_trans->num_writers) > 1 || >>>>> (should_grow && cur_trans->num_joined != joined)); >>>>> >>>>> What I see is basically that new writers join the transaction, while >>>>> btrfs_commit_transaction() does this loop. I see >>>>> cur_trans->num_writers decreasing, but then it increases, then >>>>> decreases etc. This can go for several seconds during heavy IO load. >>>>> There is nothing to prevent new TRANS_JOIN writers coming and joining >>>>> a transaction over and over, thus delaying transaction commit. The IO >>>>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that. >>>>> >>>>> Do you observe such behavior? Do you believe it's problematic? >>>> >>>> I know this behavior, there is no problem with it, the latter code >>>> will prevent from TRANS_JOIN. >>>> >>>> 1672 spin_lock(&root->fs_info->trans_lock); >>>> 1673 root->fs_info->trans_no_join = 1; >>>> 1674 spin_unlock(&root->fs_info->trans_lock); >>>> 1675 wait_event(cur_trans->writer_wait, >>>> 1676 atomic_read(&cur_trans->num_writers) == 1); >>>> >>> Yes, this code prevents anybody from joining, but before >>> btrfs_commit_transaction() gets to this code, it may spend sometimes >>> 10 seconds (in my tests) in the do-while loop, while new writers come >>> and go. Basically, it is not deterministic when the do-while loop will >>> exit, it depends on the IO pattern. >>> >>>> And if we block the TRANS_JOIN at the place you point out, the deadlock >>>> will happen because we need deal with the ordered operations which will >>>> use TRANS_JOIN here. >>>> >>>> (I am dealing with the problem you said above by adding a new type of >>>> TRANS_* now) >>> >>> Thanks. >>> Alex. >>> >>> >>>> >>>> Thanks >>>> Miao >>>> >>>>> Thanks, >>>>> Alex. >>>>> >>>>> >>>>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <mi...@cn.fujitsu.com> wrote: >>>>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote: >>>>>>> Hi Miao, >>>>>>> can you please explain your solution a bit more. >>>>>>> >>>>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <mi...@cn.fujitsu.com> wrote: >>>>>>>> Now btrfs_commit_transaction() does this >>>>>>>> >>>>>>>> ret = btrfs_run_ordered_operations(root, 0) >>>>>>>> >>>>>>>> which async flushes all inodes on the ordered operations list, it >>>>>>>> introduced >>>>>>>> a deadlock that transaction-start task, transaction-commit task and >>>>>>>> the flush >>>>>>>> workers waited for each other. >>>>>>>> (See the following URL to get the detail >>>>>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2) >>>>>>>> >>>>>>>> As we know, if ->in_commit is set, it means someone is committing the >>>>>>>> current transaction, we should not try to join it if we are not JOIN >>>>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can >>>>>>>> avoid >>>>>>>> the above problem. In this way, there is another benefit: there is no >>>>>>>> new >>>>>>>> transaction handle to block the transaction which is on the way of >>>>>>>> commit, >>>>>>>> once we set ->in_commit. >>>>>>>> >>>>>>>> Signed-off-by: Miao Xie <mi...@cn.fujitsu.com> >>>>>>>> --- >>>>>>>> fs/btrfs/transaction.c | 17 ++++++++++++++++- >>>>>>>> 1 files changed, 16 insertions(+), 1 deletions(-) >>>>>>>> >>>>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c >>>>>>>> index bc2f2d1..71b7e2e 100644 >>>>>>>> --- a/fs/btrfs/transaction.c >>>>>>>> +++ b/fs/btrfs/transaction.c >>>>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct >>>>>>>> btrfs_root *root) >>>>>>>> root->commit_root = btrfs_root_node(root); >>>>>>>> } >>>>>>>> >>>>>>>> +static inline int can_join_transaction(struct btrfs_transaction >>>>>>>> *trans, >>>>>>>> + int type) >>>>>>>> +{ >>>>>>>> + return !(trans->in_commit && >>>>>>>> + type != TRANS_JOIN && >>>>>>>> + type != TRANS_JOIN_NOLOCK); >>>>>>>> +} >>>>>>>> + >>>>>>>> /* >>>>>>>> * either allocate a new transaction or hop into the existing one >>>>>>>> */ >>>>>>>> @@ -86,6 +94,10 @@ loop: >>>>>>>> spin_unlock(&fs_info->trans_lock); >>>>>>>> return cur_trans->aborted; >>>>>>>> } >>>>>>>> + if (!can_join_transaction(cur_trans, type)) { >>>>>>>> + spin_unlock(&fs_info->trans_lock); >>>>>>>> + return -EBUSY; >>>>>>>> + } >>>>>>>> atomic_inc(&cur_trans->use_count); >>>>>>>> atomic_inc(&cur_trans->num_writers); >>>>>>>> cur_trans->num_joined++; >>>>>>>> @@ -360,8 +372,11 @@ again: >>>>>>>> >>>>>>>> do { >>>>>>>> ret = join_transaction(root, type); >>>>>>>> - if (ret == -EBUSY) >>>>>>>> + if (ret == -EBUSY) { >>>>>>>> wait_current_trans(root); >>>>>>>> + if (unlikely(type == TRANS_ATTACH)) >>>>>>>> + ret = -ENOENT; >>>>>>>> + } >>>>>>> >>>>>>> So I understand that instead of incrementing num_writes and joining >>>>>>> the current transaction, you do not join and wait for the current >>>>>>> transaction to unblock. >>>>>> >>>>>> More specifically,TRANS_START、TRANS_USERSPACE and TRANS_ATTACH can not >>>>>> join and just wait for the current transaction to unblock if ->in_commit >>>>>> is set. >>>>>> >>>>>>> Which task in Josef's example >>>>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2 >>>>>>> task 1, task 2 or task 3 is the one that will not join the >>>>>>> transaction, but instead wait? >>>>>> >>>>>> Task1 will not join the transaction, in this way, async inode flush >>>>>> won't run, and then task3 won't do anything. >>>>>> >>>>>> Before applying the patch: >>>>>> Start/Attach_Trans_Task Commit_Task >>>>>> Flush_Worker >>>>>> (Task1) (Task2) >>>>>> (Task3) -- the name in Josef's example >>>>>> btrfs_start_transaction() >>>>>> |->may_wait_transaction() >>>>>> | (return 0) >>>>>> | btrfs_commit_transaction() >>>>>> | |->set ->in_commit and >>>>>> | | blocked to 1 >>>>>> | |->wait writers to be 1 >>>>>> | | (writers is 1) >>>>>> |->join_transaction() | >>>>>> | (writers is 2) | >>>>>> |->btrfs_commit_transaction() | >>>>>> | |->set trans_no_join to 1 >>>>>> | | (close join transaction) >>>>>> |->btrfs_run_ordered_operations | >>>>>> (Those ordered operations | >>>>>> are added when releasing | >>>>>> file) | >>>>>> |->async inode flush() | >>>>>> |->wait_flush_comlete() | >>>>>> | >>>>>> work_loop() >>>>>> | >>>>>> |->run_work() >>>>>> | >>>>>> |->btrfs_join_transaction() >>>>>> | >>>>>> |->wait_current_trans() >>>>>> |->wait writers to be 1 >>>>>> >>>>>> This three tasks waited for each other. >>>>>> >>>>>> After applying this patch: >>>>>> Start/Attach_Trans_Task Commit_Task >>>>>> Flush_Worker >>>>>> (Task1) (Task2) >>>>>> (Task3) >>>>>> btrfs_start_transaction() >>>>>> |->may_wait_transaction() >>>>>> | (return 0) >>>>>> | btrfs_commit_transaction() >>>>>> | |->set ->in_commit and >>>>>> | | blocked to 1 >>>>>> | |->wait writers to be 1 >>>>>> | | (writers is 1) >>>>>> |->join_transaction() fail | >>>>>> | (return -EBUSY, writers is still 1) | >>>>>> |->wait_current_trans() | >>>>>> |->set trans_no_join to 1 >>>>>> | (close join transaction) >>>>>> |->wait writers to be 1 >>>>>> |->continue committing >>>>>> >>>>>> (Task3 does nothing) >>>>>>> Also, I think I don't fully understand Josef's example. What is >>>>>>> preventing from async flushing to complete? >>>>>>> Is task 3 waiting because trans_no_join is set? >>>>>>> Is task 3 the one that actually does the delalloc flush? >>>>>> >>>>>> See above. >>>>>> >>>>>> Thanks >>>>>> Miao >>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> Alex. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> } while (ret == -EBUSY); >>>>>>>> >>>>>>>> if (ret < 0) { >>>>>>>> -- >>>>>>>> 1.6.5.2 >>>>>>>> -- >>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >>>>>>>> in >>>>>>>> the body of a message to majord...@vger.kernel.org >>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >>>>>>> in >>>>>>> the body of a message to majord...@vger.kernel.org >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>> >>>>>> >>>>> >>>> >> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html