Re: [PATCH] xfs: Wake CIL push waiters more reliably
On 16.02.21 12:18, Brian Foster wrote: On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote: On 13.01.21 22:53, Dave Chinner wrote: [...] I agree that a throttling fix is needed, but I'm trying to understand the scope and breadth of the problem first instead of jumping the gun and making the wrong fix for the wrong reasons that just papers over the underlying problems that the throttling bug has made us aware of... Are you still working on this? If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied. This is a real world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown") which silently added a condition to the wakeup. The condition is based on a wrong assumption. Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system. Why is if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) wake_up_all(&cil->xc_push_wait); , which doesn't work reliably, preferable to if (waitqueue_active(&cil->xc_push_wait)) wake_up_all(&cil->xc_push_wait); which does? JFYI, Dave followed up with a patch a couple weeks or so ago: https://lore.kernel.org/linux-xfs/20210128044154.806715-5-da...@fromorbit.com/ Oh, great. I apologize for the unneeded reminder. Best Donald Brian Best Donald Cheers, Dave
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote: > On 13.01.21 22:53, Dave Chinner wrote: > > [...] > > I agree that a throttling fix is needed, but I'm trying to > > understand the scope and breadth of the problem first instead of > > jumping the gun and making the wrong fix for the wrong reasons that > > just papers over the underlying problems that the throttling bug has > > made us aware of... > > Are you still working on this? > > If it takes more time to understand the potential underlying problem, the fix > for the problem at hand should be applied. > > This is a real world problem, accidentally found in the wild. It appears very > rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , > 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix > use-after-free on CIL context on shutdown") which silently added a condition > to the wakeup. The condition is based on a wrong assumption. > > Why is this "papering over"? If a reminder was needed, there were better ways > than randomly hanging the system. > > Why is > > if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > wake_up_all(&cil->xc_push_wait); > > , which doesn't work reliably, preferable to > > if (waitqueue_active(&cil->xc_push_wait)) > wake_up_all(&cil->xc_push_wait); > > which does? > JFYI, Dave followed up with a patch a couple weeks or so ago: https://lore.kernel.org/linux-xfs/20210128044154.806715-5-da...@fromorbit.com/ Brian > Best > Donald > > > Cheers, > > > > Dave >
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On 13.01.21 22:53, Dave Chinner wrote: [...] I agree that a throttling fix is needed, but I'm trying to understand the scope and breadth of the problem first instead of jumping the gun and making the wrong fix for the wrong reasons that just papers over the underlying problems that the throttling bug has made us aware of... Are you still working on this? If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied. This is a real world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown") which silently added a condition to the wakeup. The condition is based on a wrong assumption. Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system. Why is if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) wake_up_all(&cil->xc_push_wait); , which doesn't work reliably, preferable to if (waitqueue_active(&cil->xc_push_wait)) wake_up_all(&cil->xc_push_wait); which does? Best Donald Cheers, Dave
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote: > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote: > > e.g. we run the first transaction into the CIL, it steals the sapce > > needed for the cil checkpoint headers for the transaciton. Then if > > the space returned by the item formatting is negative (because it is > > in the AIL and being relogged), the CIL checkpoint now doesn't have > > the space reserved it needs to run a checkpoint. That transaction is > > a sync transaction, so it forces the log, and now we push the CIL > > without sufficient reservation to write out the log headers and the > > items we just formatted > > > > Hmmm... that seems like an odd scenario because I'd expect the space > usage delta to reflect what might or might not have already been added > to the CIL context, not necessarily the AIL. IOW, shouldn't a negative > delta only occur for items being relogged while still CIL resident > (regardless of AIL residency)? > > From a code standpoint, the way a particular log item delta comes out > negative is from having a shadow lv size smaller than the ->li_lv size. > Thus, xlog_cil_insert_format_items() subtracts the currently formatted > lv size from the delta, formats the current state of the item, and > xfs_cil_prepare_item() adds the new (presumably smaller) size to the > delta. We reuse ->li_lv in this scenario so both it and the shadow > buffer remain, but a CIL push pulls ->li_lv from all log items and > chains them to the CIL context for writing, so I don't see how we could > have an item return a negative delta on an empty CIL. Hm? In theory, yes. But like I said, I've made a bunch of assumptions that this won't happen, and so without actually auditting the code I'm not actually sure that it won't. i.e. I need to go check what happens with items being marked stale, how shadow buffer reallocation interacts with items that end up being smaller than the existing buffer, etc. I've paged a lot of this detail out of my brain, so until I spend the time to go over it all again I'm not going to be able to answer definitively. > (I was also wondering whether repeated smaller relogs of an item could > be a vector for this to go wrong, but it looks like > xlog_cil_insert_format_items() always uses the formatted size of the > current buffer...). That's my nagging doubt about all this - that there's an interaction of this nature that I haven't considered due to the assumptions I've made that allows underflow to occur. That would be much worse than the current situation of hanging on a missing wakeup when the CIL is full and used_space goes backwards Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Mon, Jan 11, 2021 at 11:38:48AM -0500, Brian Foster wrote: > On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote: > > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote: > > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote: > > > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > > > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > > > > > If the value goes below the limit while some threads are > > > > > > already waiting but before the push worker gets to it, these > > > > > > threads are > > > > > > not woken. > > > > > > > > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > > > > > optimization. This is possible, because we hold the xc_push_lock > > > > > > spinlock, which prevents additions to the waitqueue. > > > > > > > > > > > > Signed-off-by: Donald Buczek > > > > > > --- > > > > > > fs/xfs/xfs_log_cil.c | 2 +- > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > > > > index b0ef071b3cb5..d620de8e217c 100644 > > > > > > --- a/fs/xfs/xfs_log_cil.c > > > > > > +++ b/fs/xfs/xfs_log_cil.c > > > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > > > > > /* > > > > > > * Wake up any background push waiters now this context is > > > > > > being pushed. > > > > > > */ > > > > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > > > > > + if (waitqueue_active(&cil->xc_push_wait)) > > > > > > wake_up_all(&cil->xc_push_wait); > > > > > > > > > > That just smells wrong to me. It *might* be correct, but this > > > > > condition should pair with the sleep condition, as space used by a > > > > > CIL context should never actually decrease > > > > > > > > > > > > > ... but I'm a little confused by this assertion. The shadow buffer > > > > allocation code refers to the possibility of shadow buffers falling out > > > > that are smaller than currently allocated buffers. Further, the > > > > _insert_format_items() code appears to explicitly optimize for this > > > > possibility by reusing the active buffer, subtracting the old size/count > > > > values from the diff variables and then reformatting the latest > > > > (presumably smaller) item to the lv. > > > > > > Individual items might shrink, but the overall transaction should > > > grow. Think of a extent to btree conversion of an inode fork. THe > > > data in the inode fork decreases from a list of extents to a btree > > > root block pointer, so the inode item shrinks. But then we add a new > > > btree root block that contains all the extents + the btree block > > > header, and it gets rounded up to ithe 128 byte buffer logging chunk > > > size. > > > > > > IOWs, while the inode item has decreased in size, the overall > > > space consumed by the transaction has gone up and so the CIL ctx > > > used_space should increase. Hence we can't just look at individual > > > log items and whether they have decreased in size - we have to look > > > at all the items in the transaction to understand how the space used > > > in that transaction has changed. i.e. it's the aggregation of all > > > items in the transaction that matter here, not so much the > > > individual items. > > > > > > > Ok, that makes more sense... > > > > > > Of course this could just be implementation detail. I haven't dug into > > > > the details in the remainder of this thread and I don't have specific > > > > examples off the top of my head, but perhaps based on the ability of > > > > various structures to change formats and the ability of log vectors to > > > > shrink in size, shouldn't we expect the possibility of a CIL context to > > > > shrink in size as well? Just from poking around the CIL it seems like > > > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0 > > > > for recalculating split res as well)... > > > > > > Yes, there may be situations where it decreases. It may be this is > > > fine, but the assumption *I've made* in lots of the CIL push code is > > > that ctx->used_space rarely, if ever, will go backwards. > > > > > > > ... and rarely seems a bit more pragmatic than never. > > > > FWIW, a cursory look at the inode size/format code (motivated by > Donald's recent log dump that appears to show inode log items changing > size) suggested that a simple local format size change might be enough > to cause this condition on an item. A subsequent test to create and > immediately remove a file from an otherwise empty directory triggers a > tracepoint I injected in xlog_cil_insert_items() to detect a negative > transaction delta. As expected, the absolute value of the delta does > seem to increase with a larger filename. This also produces a negative > iovec delta, fwiw. E.g.: > > # touch `for i in $(seq 0 63); do echo -n a; done` > # rm -f `for i in $(seq 0 63); do echo -n a; done` > # > > rm-9265[001]
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote: > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote: > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote: > > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > > > > If the value goes below the limit while some threads are > > > > > already waiting but before the push worker gets to it, these threads > > > > > are > > > > > not woken. > > > > > > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > > > > optimization. This is possible, because we hold the xc_push_lock > > > > > spinlock, which prevents additions to the waitqueue. > > > > > > > > > > Signed-off-by: Donald Buczek > > > > > --- > > > > > fs/xfs/xfs_log_cil.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > > > index b0ef071b3cb5..d620de8e217c 100644 > > > > > --- a/fs/xfs/xfs_log_cil.c > > > > > +++ b/fs/xfs/xfs_log_cil.c > > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > > > > /* > > > > >* Wake up any background push waiters now this context is > > > > > being pushed. > > > > >*/ > > > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > > > > + if (waitqueue_active(&cil->xc_push_wait)) > > > > > wake_up_all(&cil->xc_push_wait); > > > > > > > > That just smells wrong to me. It *might* be correct, but this > > > > condition should pair with the sleep condition, as space used by a > > > > CIL context should never actually decrease > > > > > > > > > > ... but I'm a little confused by this assertion. The shadow buffer > > > allocation code refers to the possibility of shadow buffers falling out > > > that are smaller than currently allocated buffers. Further, the > > > _insert_format_items() code appears to explicitly optimize for this > > > possibility by reusing the active buffer, subtracting the old size/count > > > values from the diff variables and then reformatting the latest > > > (presumably smaller) item to the lv. > > > > Individual items might shrink, but the overall transaction should > > grow. Think of a extent to btree conversion of an inode fork. THe > > data in the inode fork decreases from a list of extents to a btree > > root block pointer, so the inode item shrinks. But then we add a new > > btree root block that contains all the extents + the btree block > > header, and it gets rounded up to ithe 128 byte buffer logging chunk > > size. > > > > IOWs, while the inode item has decreased in size, the overall > > space consumed by the transaction has gone up and so the CIL ctx > > used_space should increase. Hence we can't just look at individual > > log items and whether they have decreased in size - we have to look > > at all the items in the transaction to understand how the space used > > in that transaction has changed. i.e. it's the aggregation of all > > items in the transaction that matter here, not so much the > > individual items. > > > > Ok, that makes more sense... > > > > Of course this could just be implementation detail. I haven't dug into > > > the details in the remainder of this thread and I don't have specific > > > examples off the top of my head, but perhaps based on the ability of > > > various structures to change formats and the ability of log vectors to > > > shrink in size, shouldn't we expect the possibility of a CIL context to > > > shrink in size as well? Just from poking around the CIL it seems like > > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0 > > > for recalculating split res as well)... > > > > Yes, there may be situations where it decreases. It may be this is > > fine, but the assumption *I've made* in lots of the CIL push code is > > that ctx->used_space rarely, if ever, will go backwards. > > > > ... and rarely seems a bit more pragmatic than never. > FWIW, a cursory look at the inode size/format code (motivated by Donald's recent log dump that appears to show inode log items changing size) suggested that a simple local format size change might be enough to cause this condition on an item. A subsequent test to create and immediately remove a file from an otherwise empty directory triggers a tracepoint I injected in xlog_cil_insert_items() to detect a negative transaction delta. As expected, the absolute value of the delta does seem to increase with a larger filename. This also produces a negative iovec delta, fwiw. E.g.: # touch `for i in $(seq 0 63); do echo -n a; done` # rm -f `for i in $(seq 0 63); do echo -n a; done` # rm-9265[001] 4660.177806: xfs_log_commit_cil: 409: len -72 diff_iovecs 0 rm-9265[001] .N.1 4660.177913: xfs_log_commit_cil: 419: len -72 diff_iovecs 0 rm-9265[001] 4660.178313: xfs_log_commit_cil: 409: len -52 diff_iovecs -1 rm-9265
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On 07.01.21 23:19, Dave Chinner wrote: On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote: On 02.01.21 23:44, Dave Chinner wrote: On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote: On 31.12.20 22:59, Dave Chinner wrote: On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote: On 30.12.20 23:16, Dave Chinner wrote: One could argue that, but one should also understand the design constraints for a particular algorithm are before suggesting that their solution is "robust". :) Yes, but an understanding to the extend required by the argument should be sufficient :-) And that's the fundamental conceit described by Dunning-Kruger. I.e. someone thinks they know enough to understand the argument, when in fact they don't understand enough about the subject matter to realise they don't understand what the expert at the other end is saying at all # seq 29 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil e374c6f1 ctx 4967d650 ctx->space_used=33554656 , push_seq=29, ctx->sequence=29 So, at 20:08:15 we get a push trigger and the work is queued. But... . 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wakecil e374c6f1 ctx 4967d650 ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29 It takes the best part of *50 seconds* before the push work actually runs? That's well and truly screwed up - the work should run on that CPU on the very next time it yeilds the CPU. If we're holding the CPU without yeilding it for that long, hangcheck and RCU warnings should be going off... No such warnings. But the load is probably I/O bound to the log: - creates `cp -a` copies of a directory tree with small files (linux source repository) - source tree probably completely cached. - two copies in parallel - slow log (on software raid6) Isn't it to be expected that sooner or later you need to wait for log writes when you write as fast as possible with lots of metadata updates and not so much data? No, incoming transactions waits for transaction reservation space, not log writes. Reservation space is freed up by metadata writeback. So if you have new transactions blocking in xfs_trans_reserve(), then we are blocking on metadata writeback. The CIL worker thread may be waiting for log IO completion before it issues more log IO, and in that case it is waiting on iclog buffer space (i.e. only waiting on internal log state, not metadata writeback). Can you find that kworker thread stack and paste it? If bound on log IO, it will be blocked in xlog_get_iclog_space(). Please paste the entire stack output, too, not just the bits you think are relevant or understand That would be a kworker on the "xfs-cil/%s" workqueue, correct? And you mean before the lockup, when the I/O is still active, correct? This is the usual stack output from that thread: # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 : # cat /proc/2080/task/2080/stack [<0>] md_flush_request+0x87/0x190 [<0>] raid5_make_request+0x61b/0xb30 [<0>] md_handle_request+0x127/0x1a0 [<0>] md_submit_bio+0xbd/0x100 [<0>] submit_bio_noacct+0x151/0x410 [<0>] submit_bio+0x4b/0x1a0 [<0>] xlog_state_release_iclog+0x87/0xb0 [<0>] xlog_write+0x452/0x6d0 [<0>] xlog_cil_push_work+0x2e0/0x4d0 [<0>] process_one_work+0x1dd/0x3e0 [<0>] worker_thread+0x23f/0x3b0 [<0>] kthread+0x118/0x130 [<0>] ret_from_fork+0x22/0x30 sampled three times with a few seconds in between, stack identical. Yeah, so that is MD blocking waiting for a running device cache flush to finish before submitting the iclog IO. IOWs, it's waiting for hardware to flush volatile caches to stable storage. Every log IO has a cache flush preceding it, and if the device doesn't support FUA, then there is a post-IO cache flush as well. These flushes are necessary for correct IO ordering between the log and metadata/data writeback. Essentially, you're waiting for MD to flush all it's dirty stripe cache to the backing devices and then flush the backing device caches as well. Nothing can be done at the filesystem level to make this faster Also, cp -a of a linux source tree is just as data intensive as it is metadata intensive. There's probably more data IO than metadata IO, so that's more likely to be what is slowing the disks down as metadata writeback is... I'm a bit concerned, though, that there seem to be a rather unlimited (~ 1000) number of kernel worker threads waiting for the cil push and indirectly for log writes. That's normal - XFS is highly asynchronous and defers a lot of work to completion threads. So it dropped by 16 bytes (seems to be common) which is unexpected. I wonder if it filled a hole in a buffer and so needed one less xlog_op_header()? But then the size would have gone up by at least 128 bytes for the hole that was filled, so it still shouldn't go down in size. I think you need to instrument xlog_cil_insert_items() and c
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote: > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote: > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > > > If the value goes below the limit while some threads are > > > > already waiting but before the push worker gets to it, these threads are > > > > not woken. > > > > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > > > optimization. This is possible, because we hold the xc_push_lock > > > > spinlock, which prevents additions to the waitqueue. > > > > > > > > Signed-off-by: Donald Buczek > > > > --- > > > > fs/xfs/xfs_log_cil.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > > index b0ef071b3cb5..d620de8e217c 100644 > > > > --- a/fs/xfs/xfs_log_cil.c > > > > +++ b/fs/xfs/xfs_log_cil.c > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > > > /* > > > > * Wake up any background push waiters now this context is > > > > being pushed. > > > > */ > > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > > > + if (waitqueue_active(&cil->xc_push_wait)) > > > > wake_up_all(&cil->xc_push_wait); > > > > > > That just smells wrong to me. It *might* be correct, but this > > > condition should pair with the sleep condition, as space used by a > > > CIL context should never actually decrease > > > > > > > ... but I'm a little confused by this assertion. The shadow buffer > > allocation code refers to the possibility of shadow buffers falling out > > that are smaller than currently allocated buffers. Further, the > > _insert_format_items() code appears to explicitly optimize for this > > possibility by reusing the active buffer, subtracting the old size/count > > values from the diff variables and then reformatting the latest > > (presumably smaller) item to the lv. > > Individual items might shrink, but the overall transaction should > grow. Think of a extent to btree conversion of an inode fork. THe > data in the inode fork decreases from a list of extents to a btree > root block pointer, so the inode item shrinks. But then we add a new > btree root block that contains all the extents + the btree block > header, and it gets rounded up to ithe 128 byte buffer logging chunk > size. > > IOWs, while the inode item has decreased in size, the overall > space consumed by the transaction has gone up and so the CIL ctx > used_space should increase. Hence we can't just look at individual > log items and whether they have decreased in size - we have to look > at all the items in the transaction to understand how the space used > in that transaction has changed. i.e. it's the aggregation of all > items in the transaction that matter here, not so much the > individual items. > Ok, that makes more sense... > > Of course this could just be implementation detail. I haven't dug into > > the details in the remainder of this thread and I don't have specific > > examples off the top of my head, but perhaps based on the ability of > > various structures to change formats and the ability of log vectors to > > shrink in size, shouldn't we expect the possibility of a CIL context to > > shrink in size as well? Just from poking around the CIL it seems like > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0 > > for recalculating split res as well)... > > Yes, there may be situations where it decreases. It may be this is > fine, but the assumption *I've made* in lots of the CIL push code is > that ctx->used_space rarely, if ever, will go backwards. > ... and rarely seems a bit more pragmatic than never. > e.g. we run the first transaction into the CIL, it steals the sapce > needed for the cil checkpoint headers for the transaciton. Then if > the space returned by the item formatting is negative (because it is > in the AIL and being relogged), the CIL checkpoint now doesn't have > the space reserved it needs to run a checkpoint. That transaction is > a sync transaction, so it forces the log, and now we push the CIL > without sufficient reservation to write out the log headers and the > items we just formatted > Hmmm... that seems like an odd scenario because I'd expect the space usage delta to reflect what might or might not have already been added to the CIL context, not necessarily the AIL. IOW, shouldn't a negative delta only occur for items being relogged while still CIL resident (regardless of AIL residency)? >From a code standpoint, the way a particular log item delta comes out negative is from having a shadow lv size smaller than the ->li_lv size. Thus, xlog_cil_insert_format_items() subtracts the currently formatted lv size from the delta, formats the current state of the item, and xfs_cil_prepare_item() adds the new (presumably smalle
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote: > On 02.01.21 23:44, Dave Chinner wrote: > > On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote: > > > On 31.12.20 22:59, Dave Chinner wrote: > > > > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote: > > > > > On 30.12.20 23:16, Dave Chinner wrote: > > > > > > One could argue that, but one should also understand the design > > > > constraints for a particular algorithm are before suggesting > > > > that their solution is "robust". :) > > > > > > Yes, but an understanding to the extend required by the > > > argument should be sufficient :-) > > > > And that's the fundamental conceit described by Dunning-Kruger. > > > > I.e. someone thinks they know enough to understand the argument, > > when in fact they don't understand enough about the subject matter > > to realise they don't understand what the expert at the other end is > > saying at all > > > > > > > # seq 29 > > > > > > > > > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX > > > > > trigger cil e374c6f1 ctx 4967d650 > > > > > ctx->space_used=33554656 , push_seq=29, ctx->sequence=29 > > > > > > > > So, at 20:08:15 we get a push trigger and the work is queued. But... > > > > > > > > . > > > > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX > > > > > wakecil e374c6f1 ctx 4967d650 > > > > > ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29 > > > > > > > > It takes the best part of *50 seconds* before the push work actually > > > > runs? > > > > > > > > That's well and truly screwed up - the work should run on that > > > > CPU on the very next time it yeilds the CPU. If we're holding the > > > > CPU without yeilding it for that long, hangcheck and RCU warnings > > > > should be going off... > > > > > > No such warnings. > > > > > > But the load is probably I/O bound to the log: > > > > > > - creates `cp -a` copies of a directory tree with small files (linux > > > source repository) > > > - source tree probably completely cached. > > > - two copies in parallel > > > - slow log (on software raid6) > > > > > > Isn't it to be expected that sooner or later you need to wait for > > > log writes when you write as fast as possible with lots of > > > metadata updates and not so much data? > > > > No, incoming transactions waits for transaction reservation space, > > not log writes. Reservation space is freed up by metadata writeback. > > So if you have new transactions blocking in xfs_trans_reserve(), > > then we are blocking on metadata writeback. > > > > The CIL worker thread may be waiting for log IO completion before it > > issues more log IO, and in that case it is waiting on iclog buffer > > space (i.e. only waiting on internal log state, not metadata > > writeback). Can you find that kworker thread stack and paste it? If > > bound on log IO, it will be blocked in xlog_get_iclog_space(). > > > > Please paste the entire stack output, too, not just the bits you > > think are relevant or understand > > That would be a kworker on the "xfs-cil/%s" workqueue, correct? > And you mean before the lockup, when the I/O is still active, correct? > > This is the usual stack output from that thread: > > # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 : > # cat /proc/2080/task/2080/stack > > [<0>] md_flush_request+0x87/0x190 > [<0>] raid5_make_request+0x61b/0xb30 > [<0>] md_handle_request+0x127/0x1a0 > [<0>] md_submit_bio+0xbd/0x100 > [<0>] submit_bio_noacct+0x151/0x410 > [<0>] submit_bio+0x4b/0x1a0 > [<0>] xlog_state_release_iclog+0x87/0xb0 > [<0>] xlog_write+0x452/0x6d0 > [<0>] xlog_cil_push_work+0x2e0/0x4d0 > [<0>] process_one_work+0x1dd/0x3e0 > [<0>] worker_thread+0x23f/0x3b0 > [<0>] kthread+0x118/0x130 > [<0>] ret_from_fork+0x22/0x30 > > sampled three times with a few seconds in between, stack identical. Yeah, so that is MD blocking waiting for a running device cache flush to finish before submitting the iclog IO. IOWs, it's waiting for hardware to flush volatile caches to stable storage. Every log IO has a cache flush preceding it, and if the device doesn't support FUA, then there is a post-IO cache flush as well. These flushes are necessary for correct IO ordering between the log and metadata/data writeback. Essentially, you're waiting for MD to flush all it's dirty stripe cache to the backing devices and then flush the backing device caches as well. Nothing can be done at the filesystem level to make this faster > > Also, cp -a of a linux source tree is just as data intensive as it > > is metadata intensive. There's probably more data IO than metadata > > IO, so that's more likely to be what is slowing the disks down as > > metadata writeback is... > > > > > I'm a bit concerned, though, that there seem to be a rather > > > unlimited (~ 1000) number of kernel worker threads waiting for the > > > cil push and indirectly f
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote: > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > > If the value goes below the limit while some threads are > > > already waiting but before the push worker gets to it, these threads are > > > not woken. > > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > > optimization. This is possible, because we hold the xc_push_lock > > > spinlock, which prevents additions to the waitqueue. > > > > > > Signed-off-by: Donald Buczek > > > --- > > > fs/xfs/xfs_log_cil.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > index b0ef071b3cb5..d620de8e217c 100644 > > > --- a/fs/xfs/xfs_log_cil.c > > > +++ b/fs/xfs/xfs_log_cil.c > > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > > /* > > >* Wake up any background push waiters now this context is being pushed. > > >*/ > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > > + if (waitqueue_active(&cil->xc_push_wait)) > > > wake_up_all(&cil->xc_push_wait); > > > > That just smells wrong to me. It *might* be correct, but this > > condition should pair with the sleep condition, as space used by a > > CIL context should never actually decrease > > > > ... but I'm a little confused by this assertion. The shadow buffer > allocation code refers to the possibility of shadow buffers falling out > that are smaller than currently allocated buffers. Further, the > _insert_format_items() code appears to explicitly optimize for this > possibility by reusing the active buffer, subtracting the old size/count > values from the diff variables and then reformatting the latest > (presumably smaller) item to the lv. Individual items might shrink, but the overall transaction should grow. Think of a extent to btree conversion of an inode fork. THe data in the inode fork decreases from a list of extents to a btree root block pointer, so the inode item shrinks. But then we add a new btree root block that contains all the extents + the btree block header, and it gets rounded up to ithe 128 byte buffer logging chunk size. IOWs, while the inode item has decreased in size, the overall space consumed by the transaction has gone up and so the CIL ctx used_space should increase. Hence we can't just look at individual log items and whether they have decreased in size - we have to look at all the items in the transaction to understand how the space used in that transaction has changed. i.e. it's the aggregation of all items in the transaction that matter here, not so much the individual items. > Of course this could just be implementation detail. I haven't dug into > the details in the remainder of this thread and I don't have specific > examples off the top of my head, but perhaps based on the ability of > various structures to change formats and the ability of log vectors to > shrink in size, shouldn't we expect the possibility of a CIL context to > shrink in size as well? Just from poking around the CIL it seems like > the surrounding code supports it (xlog_cil_insert_items() checks len > 0 > for recalculating split res as well)... Yes, there may be situations where it decreases. It may be this is fine, but the assumption *I've made* in lots of the CIL push code is that ctx->used_space rarely, if ever, will go backwards. e.g. we run the first transaction into the CIL, it steals the sapce needed for the cil checkpoint headers for the transaciton. Then if the space returned by the item formatting is negative (because it is in the AIL and being relogged), the CIL checkpoint now doesn't have the space reserved it needs to run a checkpoint. That transaction is a sync transaction, so it forces the log, and now we push the CIL without sufficient reservation to write out the log headers and the items we just formatted So, yeah, shrinking transaction space usage definitely violates some of the assumptions the code makes about how relogging works. It's entirely possible the assumptions I've made are not entirely correct in some corner cases - those particular cases are what we need to ferret out here, and then decide if they are correct or not and deal with it from there... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > Threads, which committed items to the CIL, wait in the xc_push_wait > > waitqueue when used_space in the push context goes over a limit. These > > threads need to be woken when the CIL is pushed. > > > > The CIL push worker tries to avoid the overhead of calling wake_all() > > when there are no waiters waiting. It does so by checking the same > > condition which caused the waits to happen. This, however, is > > unreliable, because ctx->space_used can actually decrease when items are > > recommitted. > > When does this happen? > > Do you have tracing showing the operation where the relogged item > has actually gotten smaller? By definition, relogging in the CIL > should only grow the size of the object in the CIL because it must > relog all the existing changes on top of the new changed being made > to the object. Hence the CIL reservation should only ever grow. > > IOWs, returning negative lengths from the formatting code is > unexpected and probably a bug and requires further investigation, > not papering over the occurrence with broadcast wakeups... > I agree that this warrants a bit more explanation and analysis before changing the current code... > > If the value goes below the limit while some threads are > > already waiting but before the push worker gets to it, these threads are > > not woken. > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > optimization. This is possible, because we hold the xc_push_lock > > spinlock, which prevents additions to the waitqueue. > > > > Signed-off-by: Donald Buczek > > --- > > fs/xfs/xfs_log_cil.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > index b0ef071b3cb5..d620de8e217c 100644 > > --- a/fs/xfs/xfs_log_cil.c > > +++ b/fs/xfs/xfs_log_cil.c > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > /* > > * Wake up any background push waiters now this context is being pushed. > > */ > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > + if (waitqueue_active(&cil->xc_push_wait)) > > wake_up_all(&cil->xc_push_wait); > > That just smells wrong to me. It *might* be correct, but this > condition should pair with the sleep condition, as space used by a > CIL context should never actually decrease > ... but I'm a little confused by this assertion. The shadow buffer allocation code refers to the possibility of shadow buffers falling out that are smaller than currently allocated buffers. Further, the _insert_format_items() code appears to explicitly optimize for this possibility by reusing the active buffer, subtracting the old size/count values from the diff variables and then reformatting the latest (presumably smaller) item to the lv. Of course this could just be implementation detail. I haven't dug into the details in the remainder of this thread and I don't have specific examples off the top of my head, but perhaps based on the ability of various structures to change formats and the ability of log vectors to shrink in size, shouldn't we expect the possibility of a CIL context to shrink in size as well? Just from poking around the CIL it seems like the surrounding code supports it (xlog_cil_insert_items() checks len > 0 for recalculating split res as well)... Brian > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com >
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On 02.01.21 23:44, Dave Chinner wrote: On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote: On 31.12.20 22:59, Dave Chinner wrote: On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote: On 30.12.20 23:16, Dave Chinner wrote: One could argue that, but one should also understand the design constraints for a particular algorithm are before suggesting that their solution is "robust". :) Yes, but an understanding to the extend required by the argument should be sufficient :-) And that's the fundamental conceit described by Dunning-Kruger. I.e. someone thinks they know enough to understand the argument, when in fact they don't understand enough about the subject matter to realise they don't understand what the expert at the other end is saying at all # seq 29 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil e374c6f1 ctx 4967d650 ctx->space_used=33554656 , push_seq=29, ctx->sequence=29 So, at 20:08:15 we get a push trigger and the work is queued. But... . 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wakecil e374c6f1 ctx 4967d650 ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29 It takes the best part of *50 seconds* before the push work actually runs? That's well and truly screwed up - the work should run on that CPU on the very next time it yeilds the CPU. If we're holding the CPU without yeilding it for that long, hangcheck and RCU warnings should be going off... No such warnings. But the load is probably I/O bound to the log: - creates `cp -a` copies of a directory tree with small files (linux source repository) - source tree probably completely cached. - two copies in parallel - slow log (on software raid6) Isn't it to be expected that sooner or later you need to wait for log writes when you write as fast as possible with lots of metadata updates and not so much data? No, incoming transactions waits for transaction reservation space, not log writes. Reservation space is freed up by metadata writeback. So if you have new transactions blocking in xfs_trans_reserve(), then we are blocking on metadata writeback. The CIL worker thread may be waiting for log IO completion before it issues more log IO, and in that case it is waiting on iclog buffer space (i.e. only waiting on internal log state, not metadata writeback). Can you find that kworker thread stack and paste it? If bound on log IO, it will be blocked in xlog_get_iclog_space(). Please paste the entire stack output, too, not just the bits you think are relevant or understand That would be a kworker on the "xfs-cil/%s" workqueue, correct? And you mean before the lockup, when the I/O is still active, correct? This is the usual stack output from that thread: # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 : # cat /proc/2080/task/2080/stack [<0>] md_flush_request+0x87/0x190 [<0>] raid5_make_request+0x61b/0xb30 [<0>] md_handle_request+0x127/0x1a0 [<0>] md_submit_bio+0xbd/0x100 [<0>] submit_bio_noacct+0x151/0x410 [<0>] submit_bio+0x4b/0x1a0 [<0>] xlog_state_release_iclog+0x87/0xb0 [<0>] xlog_write+0x452/0x6d0 [<0>] xlog_cil_push_work+0x2e0/0x4d0 [<0>] process_one_work+0x1dd/0x3e0 [<0>] worker_thread+0x23f/0x3b0 [<0>] kthread+0x118/0x130 [<0>] ret_from_fork+0x22/0x30 sampled three times with a few seconds in between, stack identical. Also, cp -a of a linux source tree is just as data intensive as it is metadata intensive. There's probably more data IO than metadata IO, so that's more likely to be what is slowing the disks down as metadata writeback is... I'm a bit concerned, though, that there seem to be a rather unlimited (~ 1000) number of kernel worker threads waiting for the cil push and indirectly for log writes. That's normal - XFS is highly asynchronous and defers a lot of work to completion threads. So it dropped by 16 bytes (seems to be common) which is unexpected. I wonder if it filled a hole in a buffer and so needed one less xlog_op_header()? But then the size would have gone up by at least 128 bytes for the hole that was filled, so it still shouldn't go down in size. I think you need to instrument xlog_cil_insert_items() and catch a negative length here: /* account for space used by new iovec headers */ iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len += iovhdr_res; ctx->nvecs += diff_iovecs; (diff_iovecs will be negative if the number of xlog_op_header structures goes down) And if this happens, then dump the transaction ticket via xlog_print_trans(tp) so we can see all the log items types and vectors that the transaction has formatted... I tried that, but the output was difficult to understand, because at that point you can only log the complete transaction with the items already updated. And a shrinking item is not switched to the shadow vector, so the formatted content is already overwritten and not avail
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote: > On 31.12.20 22:59, Dave Chinner wrote: > > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote: > > > On 30.12.20 23:16, Dave Chinner wrote: > > One could argue that, but one should also understand the design > > constraints for a particular algorithm are before suggesting > > that their solution is "robust". :) > > Yes, but an understanding to the extend required by the > argument should be sufficient :-) And that's the fundamental conceit described by Dunning-Kruger. I.e. someone thinks they know enough to understand the argument, when in fact they don't understand enough about the subject matter to realise they don't understand what the expert at the other end is saying at all > > > # seq 29 > > > > > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX > > > trigger cil e374c6f1 ctx 4967d650 > > > ctx->space_used=33554656 , push_seq=29, ctx->sequence=29 > > > > So, at 20:08:15 we get a push trigger and the work is queued. But... > > > > . > > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake > > >cil e374c6f1 ctx 4967d650 ctx->space_used=67109136 >= > > > 67108864, push_seq=29, ctx->sequence=29 > > > > It takes the best part of *50 seconds* before the push work actually > > runs? > > > > That's well and truly screwed up - the work should run on that > > CPU on the very next time it yeilds the CPU. If we're holding the > > CPU without yeilding it for that long, hangcheck and RCU warnings > > should be going off... > > No such warnings. > > But the load is probably I/O bound to the log: > > - creates `cp -a` copies of a directory tree with small files (linux source > repository) > - source tree probably completely cached. > - two copies in parallel > - slow log (on software raid6) > > Isn't it to be expected that sooner or later you need to wait for > log writes when you write as fast as possible with lots of > metadata updates and not so much data? No, incoming transactions waits for transaction reservation space, not log writes. Reservation space is freed up by metadata writeback. So if you have new transactions blocking in xfs_trans_reserve(), then we are blocking on metadata writeback. The CIL worker thread may be waiting for log IO completion before it issues more log IO, and in that case it is waiting on iclog buffer space (i.e. only waiting on internal log state, not metadata writeback). Can you find that kworker thread stack and paste it? If bound on log IO, it will be blocked in xlog_get_iclog_space(). Please paste the entire stack output, too, not just the bits you think are relevant or understand Also, cp -a of a linux source tree is just as data intensive as it is metadata intensive. There's probably more data IO than metadata IO, so that's more likely to be what is slowing the disks down as metadata writeback is... > I'm a bit concerned, though, that there seem to be a rather > unlimited (~ 1000) number of kernel worker threads waiting for the > cil push and indirectly for log writes. That's normal - XFS is highly asynchronous and defers a lot of work to completion threads. > > So it dropped by 16 bytes (seems to be common) which is > > unexpected. I wonder if it filled a hole in a buffer and so > > needed one less xlog_op_header()? But then the size would have > > gone up by at least 128 bytes for the hole that was filled, so > > it still shouldn't go down in size. > > > > I think you need to instrument xlog_cil_insert_items() and catch > > a negative length here: > > > > /* account for space used by new iovec headers */ > > iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len += > > iovhdr_res; ctx->nvecs += diff_iovecs; > > > > (diff_iovecs will be negative if the number of xlog_op_header > > structures goes down) > > > > And if this happens, then dump the transaction ticket via > > xlog_print_trans(tp) so we can see all the log items types and > > vectors that the transaction has formatted... > > I tried that, but the output was difficult to understand, because > at that point you can only log the complete transaction with the > items already updated. And a shrinking item is not switched to the > shadow vector, so the formatted content is already overwritten and > not available for analysis. Yes, that's exactly the information I need to see. But the fact you think this is something I don't need to know about is classic Dunning-Kruger in action. You don't understand why I asked for this information, and found the information "difficult to understand", so you decided I didn't need it either, despite the fact I asked explicitly for it. What I first need to know is what operations are being performed by the transaciton that shrunk and what all the items in it are, not which specific items shrunk and by how much. There can be tens to hundreds of items in a single t
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On 31.12.20 22:59, Dave Chinner wrote: Hey, funny, your email could celebrate New Year a second time :-) On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote: On 30.12.20 23:16, Dave Chinner wrote: On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: Threads, which committed items to the CIL, wait in the xc_push_wait waitqueue when used_space in the push context goes over a limit. These threads need to be woken when the CIL is pushed. The CIL push worker tries to avoid the overhead of calling wake_all() when there are no waiters waiting. It does so by checking the same condition which caused the waits to happen. This, however, is unreliable, because ctx->space_used can actually decrease when items are recommitted. When does this happen? Do you have tracing showing the operation where the relogged item has actually gotten smaller? By definition, relogging in the CIL should only grow the size of the object in the CIL because it must relog all the existing changes on top of the new changed being made to the object. Hence the CIL reservation should only ever grow. I have (very ugly printk based) log (see below), but it only shows, that it happened (space_used decreasing), not what caused it. I only browsed the ( xfs_*_item.c ) code and got the impression that the size of a log item is rather dynamic (e.g. number of extends in an inode, extended attributes in an inode, continuity of chunks in a buffer) and wasn't surprised that a relogged item might need less space from time to time. IOWs, returning negative lengths from the formatting code is unexpected and probably a bug and requires further investigation, not papering over the occurrence with broadcast wakeups... One could argue that the code is more robust after the change, because it wakes up every thread which is waiting on the next push to happen when the next push is happening without making assumption of why these threads are waiting by duplicating code from that waiters side. The proposed waitqueue_active() is inlined to two instructions and avoids the call overhead if there are no waiters as well. One could argue that, but one should also understand the design constraints for a particular algorithm are before suggesting that their solution is "robust". :) Yes, but an understanding to the extend required by the argument should be sufficient :-) # seq 29 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil e374c6f1 ctx 4967d650 ctx->space_used=33554656 , push_seq=29, ctx->sequence=29 So, at 20:08:15 we get a push trigger and the work is queued. But... . 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wakecil e374c6f1 ctx 4967d650 ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29 It takes the best part of *50 seconds* before the push work actually runs? That's well and truly screwed up - the work should run on that CPU on the very next time it yeilds the CPU. If we're holding the CPU without yeilding it for that long, hangcheck and RCU warnings should be going off... No such warnings. But the load is probably I/O bound to the log: - creates `cp -a` copies of a directory tree with small files (linux source repository) - source tree probably completely cached. - two copies in parallel - slow log (on software raid6) Isn't it to be expected that sooner or later you need to wait for log writes when you write as fast as possible with lots of metadata updates and not so much data? I'm a bit concerned, though, that there seem to be a rather unlimited (~ 1000) number of kernel worker threads waiting for the cil push and indirectly for log writes. # seq 30 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil e374c6f1 ctx c46ab121 ctx->space_used=33554480 , push_seq=30, ctx->sequence=30 20:09:39 for the next trigger, 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote: > On 30.12.20 23:16, Dave Chinner wrote: > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > > Threads, which committed items to the CIL, wait in the > > > xc_push_wait waitqueue when used_space in the push context > > > goes over a limit. These threads need to be woken when the CIL > > > is pushed. > > > > > > The CIL push worker tries to avoid the overhead of calling > > > wake_all() when there are no waiters waiting. It does so by > > > checking the same condition which caused the waits to happen. > > > This, however, is unreliable, because ctx->space_used can > > > actually decrease when items are recommitted. > > > > When does this happen? > > > > Do you have tracing showing the operation where the relogged > > item has actually gotten smaller? By definition, relogging in > > the CIL should only grow the size of the object in the CIL > > because it must relog all the existing changes on top of the new > > changed being made to the object. Hence the CIL reservation > > should only ever grow. > > I have (very ugly printk based) log (see below), but it only > shows, that it happened (space_used decreasing), not what caused > it. > > I only browsed the ( xfs_*_item.c ) code and got the impression > that the size of a log item is rather dynamic (e.g. number of > extends in an inode, extended attributes in an inode, continuity > of chunks in a buffer) and wasn't surprised that a relogged item > might need less space from time to time. > > > IOWs, returning negative lengths from the formatting code is > > unexpected and probably a bug and requires further > > investigation, not papering over the occurrence with broadcast > > wakeups... > > One could argue that the code is more robust after the change, > because it wakes up every thread which is waiting on the next push > to happen when the next push is happening without making > assumption of why these threads are waiting by duplicating code > from that waiters side. The proposed waitqueue_active() is inlined > to two instructions and avoids the call overhead if there are no > waiters as well. One could argue that, but one should also understand the design constraints for a particular algorithm are before suggesting that their solution is "robust". :) > > # seq 29 > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger > cil e374c6f1 ctx 4967d650 ctx->space_used=33554656 , > push_seq=29, ctx->sequence=29 So, at 20:08:15 we get a push trigger and the work is queued. But... . > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake > cil e374c6f1 ctx 4967d650 ctx->space_used=67109136 >= > 67108864, push_seq=29, ctx->sequence=29 It takes the best part of *50 seconds* before the push work actually runs? That's well and truly screwed up - the work should run on that CPU on the very next time it yeilds the CPU. If we're holding the CPU without yeilding it for that long, hangcheck and RCU warnings should be going off... > # seq 30 > > 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger > cil e374c6f1 ctx c46ab121 ctx->space_used=33554480 , > push_seq=30, ctx->sequence=30 20:09:39 for the next trigger, > 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw > cil e374c6f1 ctx c46ab121 ctx->space_used=67108924 >= > 67108864, push_seq=30, ctx->sequence=30 > 2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw > c
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On 30.12.20 23:16, Dave Chinner wrote: On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: Threads, which committed items to the CIL, wait in the xc_push_wait waitqueue when used_space in the push context goes over a limit. These threads need to be woken when the CIL is pushed. The CIL push worker tries to avoid the overhead of calling wake_all() when there are no waiters waiting. It does so by checking the same condition which caused the waits to happen. This, however, is unreliable, because ctx->space_used can actually decrease when items are recommitted. When does this happen? Do you have tracing showing the operation where the relogged item has actually gotten smaller? By definition, relogging in the CIL should only grow the size of the object in the CIL because it must relog all the existing changes on top of the new changed being made to the object. Hence the CIL reservation should only ever grow. I have (very ugly printk based) log (see below), but it only shows, that it happened (space_used decreasing), not what caused it. I only browsed the ( xfs_*_item.c ) code and got the impression that the size of a log item is rather dynamic (e.g. number of extends in an inode, extended attributes in an inode, continuity of chunks in a buffer) and wasn't surprised that a relogged item might need less space from time to time. IOWs, returning negative lengths from the formatting code is unexpected and probably a bug and requires further investigation, not papering over the occurrence with broadcast wakeups... One could argue that the code is more robust after the change, because it wakes up every thread which is waiting on the next push to happen when the next push is happening without making assumption of why these threads are waiting by duplicating code from that waiters side. The proposed waitqueue_active() is inlined to two instructions and avoids the call overhead if there are no waiters as well. But, of course, if the size is not expected to decrease, this discrepancy must be clarified anyway. I am sure, I can find out the exact circumstances, this does happen. I will follow up on this. Happy New Year! Donald Log extract following +++ b/fs/xfs/xfs_log_cil.c @@ -17,6 +17,8 @@ #include "xfs_log_priv.h" #include "xfs_trace.h" +#include + struct workqueue_struct *xfs_discard_wq; /* @@ -670,8 +672,25 @@ xlog_cil_push_work( /* * Wake up any background push waiters now this context is being pushed. */ - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) + if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { wake_up_all(&cil->xc_push_wait); + pr_warn("XXX wakecil %p ctx %p ctx->space_used=%d >= %d, push_seq=%lld, ctx->sequence=%lld\n", + cil, ctx, + ctx->space_used, + XLOG_CIL_BLOCKING_SPACE_LIMIT(log), + cil->xc_push_seq, + ctx->sequence); + } else { + pr_warn("XXX no wake cil %p ctx %p ctx->space_used=%d < %d, push_seq=%lld, ctx->sequence=%lld\n", + cil, ctx, + ctx->space_used, + XLOG_CIL_BLOCKING_SPACE_LIMIT(log), + cil->xc_push_seq, + ctx->sequence); + if (waitqueue_active(&cil->xc_push_wait)) { + pr_warn("XXX xc_push_wait ACTIVE!\n"); + } + } /* * Check if we've anything to push. If there is nothing, then we don't @@ -918,6 +937,11 @@ xlog_cil_push_background( spin_lock(&cil->xc_push_lock); if (cil->xc_push_seq < cil->xc_current_sequence) { cil->xc_push_seq = cil->xc_current_sequence; + pr_warn("XXX trigger cil %p ctx %p ctx->space_used=%d , push_seq=%lld, ctx->sequence=%lld\n", + cil, cil->xc_ctx, + cil->xc_ctx->space_used, + cil->xc_push_seq, + cil->xc_ctx->sequence); queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); } @@ -936,6 +960,12 @@ xlog_cil_push_background( if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); ASSERT(cil->xc_ctx->space_used < log->l_logsize); + pr_warn("XXX pushw cil %p ctx %p ctx->space_used=%d >= %d, push_seq=%lld, ctx->sequence=%lld\n", + cil, cil->xc_ctx, + cil->xc_ctx->space_used, + XLOG_CIL_BLOCKING_SPACE_LIMIT(log), + cil->xc_push_seq, + cil->xc_ctx->sequence); xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock); return; } = 14606 "XXX" lines logged before deadlo
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > Threads, which committed items to the CIL, wait in the xc_push_wait > waitqueue when used_space in the push context goes over a limit. These > threads need to be woken when the CIL is pushed. > > The CIL push worker tries to avoid the overhead of calling wake_all() > when there are no waiters waiting. It does so by checking the same > condition which caused the waits to happen. This, however, is > unreliable, because ctx->space_used can actually decrease when items are > recommitted. When does this happen? Do you have tracing showing the operation where the relogged item has actually gotten smaller? By definition, relogging in the CIL should only grow the size of the object in the CIL because it must relog all the existing changes on top of the new changed being made to the object. Hence the CIL reservation should only ever grow. IOWs, returning negative lengths from the formatting code is unexpected and probably a bug and requires further investigation, not papering over the occurrence with broadcast wakeups... > If the value goes below the limit while some threads are > already waiting but before the push worker gets to it, these threads are > not woken. > > Always wake all CIL push waiters. Test with waitqueue_active() as an > optimization. This is possible, because we hold the xc_push_lock > spinlock, which prevents additions to the waitqueue. > > Signed-off-by: Donald Buczek > --- > fs/xfs/xfs_log_cil.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index b0ef071b3cb5..d620de8e217c 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -670,7 +670,7 @@ xlog_cil_push_work( > /* >* Wake up any background push waiters now this context is being pushed. >*/ > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > + if (waitqueue_active(&cil->xc_push_wait)) > wake_up_all(&cil->xc_push_wait); That just smells wrong to me. It *might* be correct, but this condition should pair with the sleep condition, as space used by a CIL context should never actually decrease Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On 30.12.20 03:46, Hillf Danton wrote: On Wed, 30 Dec 2020 00:56:27 +0100 Threads, which committed items to the CIL, wait in the xc_push_wait waitqueue when used_space in the push context goes over a limit. These threads need to be woken when the CIL is pushed. The CIL push worker tries to avoid the overhead of calling wake_all() when there are no waiters waiting. It does so by checking the same condition which caused the waits to happen. This, however, is unreliable, because ctx->space_used can actually decrease when items are recommitted. If the value goes below the limit while some threads are already waiting but before the push worker gets to it, these threads are not woken. Looks like you are fixing a typo in c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown") in mainline, and it may mean /* * Wake up any background push waiters now this context is being pushed * if we are no longer over the space limit */ given waiters throttled for comsuming more space than limit in xlog_cil_push_background(). I'm not sure, I understand you correctly. Do you suggest to update the comment to "...if we are no longer over the space limit" and change the code to `if (ctx->space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log))` ? I don't think, that would be correct. The current push context is most probably still over the limit if it ever was. It is exceptional, that the few bytes, which might be returned to ctx->space_used, bring us back below the limit. The new context, on the other hand, will have space_used=0. We need to resume any thread which is waiting for the push. Best Donald Always wake all CIL push waiters. Test with waitqueue_active() as an optimization. This is possible, because we hold the xc_push_lock spinlock, which prevents additions to the waitqueue. Signed-off-by: Donald Buczek --- fs/xfs/xfs_log_cil.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b0ef071b3cb5..d620de8e217c 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -670,7 +670,7 @@ xlog_cil_push_work( /* * Wake up any background push waiters now this context is being pushed. */ - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) + if (waitqueue_active(&cil->xc_push_wait)) wake_up_all(&cil->xc_push_wait); /* -- 2.26.2
[PATCH] xfs: Wake CIL push waiters more reliably
Threads, which committed items to the CIL, wait in the xc_push_wait waitqueue when used_space in the push context goes over a limit. These threads need to be woken when the CIL is pushed. The CIL push worker tries to avoid the overhead of calling wake_all() when there are no waiters waiting. It does so by checking the same condition which caused the waits to happen. This, however, is unreliable, because ctx->space_used can actually decrease when items are recommitted. If the value goes below the limit while some threads are already waiting but before the push worker gets to it, these threads are not woken. Always wake all CIL push waiters. Test with waitqueue_active() as an optimization. This is possible, because we hold the xc_push_lock spinlock, which prevents additions to the waitqueue. Signed-off-by: Donald Buczek --- fs/xfs/xfs_log_cil.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b0ef071b3cb5..d620de8e217c 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -670,7 +670,7 @@ xlog_cil_push_work( /* * Wake up any background push waiters now this context is being pushed. */ - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) + if (waitqueue_active(&cil->xc_push_wait)) wake_up_all(&cil->xc_push_wait); /* -- 2.26.2