Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-02-16 Thread Donald Buczek

On 16.02.21 12:18, Brian Foster wrote:

On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:

On 13.01.21 22:53, Dave Chinner wrote:

[...]
I agree that a throttling fix is needed, but I'm trying to
understand the scope and breadth of the problem first instead of
jumping the gun and making the wrong fix for the wrong reasons that
just papers over the underlying problems that the throttling bug has
made us aware of...


Are you still working on this?

If it takes more time to understand the potential underlying problem, the fix 
for the problem at hand should be applied.

This is a real world problem, accidentally found in the wild. It appears very rarely, but 
it freezes a filesystem or the whole system. It exists in 5.7 , 5.8 , 5.9 , 5.10 and 5.11 
and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on 
shutdown") which silently added a condition to the wakeup. The condition is based on 
a wrong assumption.

Why is this "papering over"? If a reminder was needed, there were better ways 
than randomly hanging the system.

Why is

 if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
 wake_up_all(&cil->xc_push_wait);

, which doesn't work reliably, preferable to

 if (waitqueue_active(&cil->xc_push_wait))
 wake_up_all(&cil->xc_push_wait);

which does?



JFYI, Dave followed up with a patch a couple weeks or so ago:

https://lore.kernel.org/linux-xfs/20210128044154.806715-5-da...@fromorbit.com/


Oh, great. I apologize for the unneeded reminder.

Best

  Donald



Brian


Best
   Donald


Cheers,

Dave






Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-02-16 Thread Brian Foster
On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:
> On 13.01.21 22:53, Dave Chinner wrote:
> > [...]
> > I agree that a throttling fix is needed, but I'm trying to
> > understand the scope and breadth of the problem first instead of
> > jumping the gun and making the wrong fix for the wrong reasons that
> > just papers over the underlying problems that the throttling bug has
> > made us aware of...
> 
> Are you still working on this?
> 
> If it takes more time to understand the potential underlying problem, the fix 
> for the problem at hand should be applied.
> 
> This is a real world problem, accidentally found in the wild. It appears very 
> rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , 
> 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix 
> use-after-free on CIL context on shutdown") which silently added a condition 
> to the wakeup. The condition is based on a wrong assumption.
> 
> Why is this "papering over"? If a reminder was needed, there were better ways 
> than randomly hanging the system.
> 
> Why is
> 
> if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> wake_up_all(&cil->xc_push_wait);
> 
> , which doesn't work reliably, preferable to
> 
> if (waitqueue_active(&cil->xc_push_wait))
> wake_up_all(&cil->xc_push_wait);
> 
> which does?
> 

JFYI, Dave followed up with a patch a couple weeks or so ago:

https://lore.kernel.org/linux-xfs/20210128044154.806715-5-da...@fromorbit.com/

Brian

> Best
>   Donald
> 
> > Cheers,
> > 
> > Dave
> 



Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-02-15 Thread Donald Buczek

On 13.01.21 22:53, Dave Chinner wrote:

[...]
I agree that a throttling fix is needed, but I'm trying to
understand the scope and breadth of the problem first instead of
jumping the gun and making the wrong fix for the wrong reasons that
just papers over the underlying problems that the throttling bug has
made us aware of...


Are you still working on this?

If it takes more time to understand the potential underlying problem, the fix 
for the problem at hand should be applied.

This is a real world problem, accidentally found in the wild. It appears very rarely, but 
it freezes a filesystem or the whole system. It exists in 5.7 , 5.8 , 5.9 , 5.10 and 5.11 
and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on 
shutdown") which silently added a condition to the wakeup. The condition is based on 
a wrong assumption.

Why is this "papering over"? If a reminder was needed, there were better ways 
than randomly hanging the system.

Why is

if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
wake_up_all(&cil->xc_push_wait);

, which doesn't work reliably, preferable to

if (waitqueue_active(&cil->xc_push_wait))
wake_up_all(&cil->xc_push_wait);

which does?

Best
  Donald


Cheers,

Dave


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-13 Thread Dave Chinner
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > e.g. we run the first transaction into the CIL, it steals the sapce
> > needed for the cil checkpoint headers for the transaciton. Then if
> > the space returned by the item formatting is negative (because it is
> > in the AIL and being relogged), the CIL checkpoint now doesn't have
> > the space reserved it needs to run a checkpoint. That transaction is
> > a sync transaction, so it forces the log, and now we push the CIL
> > without sufficient reservation to write out the log headers and the
> > items we just formatted
> > 
> 
> Hmmm... that seems like an odd scenario because I'd expect the space
> usage delta to reflect what might or might not have already been added
> to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
> delta only occur for items being relogged while still CIL resident
> (regardless of AIL residency)?
>
> From a code standpoint, the way a particular log item delta comes out
> negative is from having a shadow lv size smaller than the ->li_lv size.
> Thus, xlog_cil_insert_format_items() subtracts the currently formatted
> lv size from the delta, formats the current state of the item, and
> xfs_cil_prepare_item() adds the new (presumably smaller) size to the
> delta. We reuse ->li_lv in this scenario so both it and the shadow
> buffer remain, but a CIL push pulls ->li_lv from all log items and
> chains them to the CIL context for writing, so I don't see how we could
> have an item return a negative delta on an empty CIL. Hm?

In theory, yes. But like I said, I've made a bunch of assumptions
that this won't happen, and so without actually auditting the code
I'm not actually sure that it won't. i.e. I need to go check what
happens with items being marked stale, how shadow buffer
reallocation interacts with items that end up being smaller than the
existing buffer, etc. I've paged a lot of this detail out of my
brain, so until I spend the time to go over it all again I'm not
going to be able to answer definitively.

> (I was also wondering whether repeated smaller relogs of an item could
> be a vector for this to go wrong, but it looks like
> xlog_cil_insert_format_items() always uses the formatted size of the
> current buffer...).

That's my nagging doubt about all this - that there's an interaction
of this nature that I haven't considered due to the assumptions I've
made that allows underflow to occur. That would be much worse than
the current situation of hanging on a missing wakeup when the CIL is
full and used_space goes backwards

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-13 Thread Dave Chinner
On Mon, Jan 11, 2021 at 11:38:48AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > > If the value goes below the limit while some threads are
> > > > > > already waiting but before the push worker gets to it, these 
> > > > > > threads are
> > > > > > not woken.
> > > > > > 
> > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > > spinlock, which prevents additions to the waitqueue.
> > > > > > 
> > > > > > Signed-off-by: Donald Buczek 
> > > > > > ---
> > > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > > > /*
> > > > > >  * Wake up any background push waiters now this context is 
> > > > > > being pushed.
> > > > > >  */
> > > > > > -   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > > +   if (waitqueue_active(&cil->xc_push_wait))
> > > > > > wake_up_all(&cil->xc_push_wait);
> > > > > 
> > > > > That just smells wrong to me. It *might* be correct, but this
> > > > > condition should pair with the sleep condition, as space used by a
> > > > > CIL context should never actually decrease
> > > > > 
> > > > 
> > > > ... but I'm a little confused by this assertion. The shadow buffer
> > > > allocation code refers to the possibility of shadow buffers falling out
> > > > that are smaller than currently allocated buffers. Further, the
> > > > _insert_format_items() code appears to explicitly optimize for this
> > > > possibility by reusing the active buffer, subtracting the old size/count
> > > > values from the diff variables and then reformatting the latest
> > > > (presumably smaller) item to the lv.
> > > 
> > > Individual items might shrink, but the overall transaction should
> > > grow. Think of a extent to btree conversion of an inode fork. THe
> > > data in the inode fork decreases from a list of extents to a btree
> > > root block pointer, so the inode item shrinks. But then we add a new
> > > btree root block that contains all the extents + the btree block
> > > header, and it gets rounded up to ithe 128 byte buffer logging chunk
> > > size.
> > > 
> > > IOWs, while the inode item has decreased in size, the overall
> > > space consumed by the transaction has gone up and so the CIL ctx
> > > used_space should increase. Hence we can't just look at individual
> > > log items and whether they have decreased in size - we have to look
> > > at all the items in the transaction to understand how the space used
> > > in that transaction has changed. i.e. it's the aggregation of all
> > > items in the transaction that matter here, not so much the
> > > individual items.
> > > 
> > 
> > Ok, that makes more sense...
> > 
> > > > Of course this could just be implementation detail. I haven't dug into
> > > > the details in the remainder of this thread and I don't have specific
> > > > examples off the top of my head, but perhaps based on the ability of
> > > > various structures to change formats and the ability of log vectors to
> > > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > > shrink in size as well? Just from poking around the CIL it seems like
> > > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > > for recalculating split res as well)...
> > > 
> > > Yes, there may be situations where it decreases. It may be this is
> > > fine, but the assumption *I've made* in lots of the CIL push code is
> > > that ctx->used_space rarely, if ever, will go backwards.
> > > 
> > 
> > ... and rarely seems a bit more pragmatic than never.
> > 
> 
> FWIW, a cursory look at the inode size/format code (motivated by
> Donald's recent log dump that appears to show inode log items changing
> size) suggested that a simple local format size change might be enough
> to cause this condition on an item. A subsequent test to create and
> immediately remove a file from an otherwise empty directory triggers a
> tracepoint I injected in xlog_cil_insert_items() to detect a negative
> transaction delta. As expected, the absolute value of the delta does
> seem to increase with a larger filename. This also produces a negative
> iovec delta, fwiw. E.g.:
> 
> # touch `for i in $(seq 0 63); do echo -n a; done`
> # rm -f `for i in $(seq 0 63); do echo -n a; done`
> #
> 
> rm-9265[001] 

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-11 Thread Brian Foster
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > If the value goes below the limit while some threads are
> > > > > already waiting but before the push worker gets to it, these threads 
> > > > > are
> > > > > not woken.
> > > > > 
> > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > spinlock, which prevents additions to the waitqueue.
> > > > > 
> > > > > Signed-off-by: Donald Buczek 
> > > > > ---
> > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > >   /*
> > > > >* Wake up any background push waiters now this context is 
> > > > > being pushed.
> > > > >*/
> > > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > + if (waitqueue_active(&cil->xc_push_wait))
> > > > >   wake_up_all(&cil->xc_push_wait);
> > > > 
> > > > That just smells wrong to me. It *might* be correct, but this
> > > > condition should pair with the sleep condition, as space used by a
> > > > CIL context should never actually decrease
> > > > 
> > > 
> > > ... but I'm a little confused by this assertion. The shadow buffer
> > > allocation code refers to the possibility of shadow buffers falling out
> > > that are smaller than currently allocated buffers. Further, the
> > > _insert_format_items() code appears to explicitly optimize for this
> > > possibility by reusing the active buffer, subtracting the old size/count
> > > values from the diff variables and then reformatting the latest
> > > (presumably smaller) item to the lv.
> > 
> > Individual items might shrink, but the overall transaction should
> > grow. Think of a extent to btree conversion of an inode fork. THe
> > data in the inode fork decreases from a list of extents to a btree
> > root block pointer, so the inode item shrinks. But then we add a new
> > btree root block that contains all the extents + the btree block
> > header, and it gets rounded up to ithe 128 byte buffer logging chunk
> > size.
> > 
> > IOWs, while the inode item has decreased in size, the overall
> > space consumed by the transaction has gone up and so the CIL ctx
> > used_space should increase. Hence we can't just look at individual
> > log items and whether they have decreased in size - we have to look
> > at all the items in the transaction to understand how the space used
> > in that transaction has changed. i.e. it's the aggregation of all
> > items in the transaction that matter here, not so much the
> > individual items.
> > 
> 
> Ok, that makes more sense...
> 
> > > Of course this could just be implementation detail. I haven't dug into
> > > the details in the remainder of this thread and I don't have specific
> > > examples off the top of my head, but perhaps based on the ability of
> > > various structures to change formats and the ability of log vectors to
> > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > shrink in size as well? Just from poking around the CIL it seems like
> > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > for recalculating split res as well)...
> > 
> > Yes, there may be situations where it decreases. It may be this is
> > fine, but the assumption *I've made* in lots of the CIL push code is
> > that ctx->used_space rarely, if ever, will go backwards.
> > 
> 
> ... and rarely seems a bit more pragmatic than never.
> 

FWIW, a cursory look at the inode size/format code (motivated by
Donald's recent log dump that appears to show inode log items changing
size) suggested that a simple local format size change might be enough
to cause this condition on an item. A subsequent test to create and
immediately remove a file from an otherwise empty directory triggers a
tracepoint I injected in xlog_cil_insert_items() to detect a negative
transaction delta. As expected, the absolute value of the delta does
seem to increase with a larger filename. This also produces a negative
iovec delta, fwiw. E.g.:

# touch `for i in $(seq 0 63); do echo -n a; done`
# rm -f `for i in $(seq 0 63); do echo -n a; done`
#

rm-9265[001]   4660.177806: xfs_log_commit_cil: 409: len -72 
diff_iovecs 0
rm-9265[001] .N.1  4660.177913: xfs_log_commit_cil: 419: len -72 
diff_iovecs 0
rm-9265[001]   4660.178313: xfs_log_commit_cil: 409: len -52 
diff_iovecs -1
rm-9265 

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-09 Thread Donald Buczek

On 07.01.21 23:19, Dave Chinner wrote:

On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:

On 02.01.21 23:44, Dave Chinner wrote:

On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:

On 31.12.20 22:59, Dave Chinner wrote:

On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:

On 30.12.20 23:16, Dave Chinner wrote:



One could argue that, but one should also understand the design
constraints for a particular algorithm are before suggesting
that their solution is "robust". :)


Yes, but an understanding to the extend required by the
argument should be sufficient :-)


And that's the fundamental conceit described by Dunning-Kruger.

I.e. someone thinks they know enough to understand the argument,
when in fact they don't understand enough about the subject matter
to realise they don't understand what the expert at the other end is
saying at all


# seq 29

2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 
e374c6f1 ctx 4967d650  ctx->space_used=33554656  , 
push_seq=29, ctx->sequence=29


So, at 20:08:15 we get a push trigger and the work is queued. But...

.

2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wakecil 
e374c6f1 ctx 4967d650  ctx->space_used=67109136 >= 67108864, 
push_seq=29, ctx->sequence=29


It takes the best part of *50 seconds* before the push work actually
runs?

That's  well and truly screwed up - the work should run on that
CPU on the very next time it yeilds the CPU. If we're holding the
CPU without yeilding it for that long, hangcheck and RCU warnings
should be going off...


No such warnings.

But the load is probably I/O bound to the log:

- creates `cp -a` copies of a directory tree with small files (linux source 
repository)
- source tree probably completely cached.
- two copies in parallel
- slow log (on software raid6)

Isn't it to be expected that sooner or later you need to wait for
log writes when you write as fast as possible with lots of
metadata updates and not so much data?


No, incoming transactions waits for transaction reservation space,
not log writes. Reservation space is freed up by metadata writeback.
So if you have new transactions blocking in xfs_trans_reserve(),
then we are blocking on metadata writeback.

The CIL worker thread may be waiting for log IO completion before it
issues more log IO, and in that case it is waiting on iclog buffer
space (i.e. only waiting on internal log state, not metadata
writeback).  Can you find that kworker thread stack and paste it? If
bound on log IO, it will be blocked in xlog_get_iclog_space().

Please paste the entire stack output, too, not just the bits you
think are relevant or understand


That would be a kworker on the "xfs-cil/%s" workqueue, correct?
And you mean before the lockup, when the I/O is still active, correct?

This is the usual stack output from that thread:

# # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
# cat /proc/2080/task/2080/stack

[<0>] md_flush_request+0x87/0x190
[<0>] raid5_make_request+0x61b/0xb30
[<0>] md_handle_request+0x127/0x1a0
[<0>] md_submit_bio+0xbd/0x100
[<0>] submit_bio_noacct+0x151/0x410
[<0>] submit_bio+0x4b/0x1a0
[<0>] xlog_state_release_iclog+0x87/0xb0
[<0>] xlog_write+0x452/0x6d0
[<0>] xlog_cil_push_work+0x2e0/0x4d0
[<0>] process_one_work+0x1dd/0x3e0
[<0>] worker_thread+0x23f/0x3b0
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

sampled three times with a few seconds in between, stack identical.


Yeah, so that is MD blocking waiting for a running device cache
flush to finish before submitting the iclog IO. IOWs, it's waiting
for hardware to flush volatile caches to stable storage. Every log
IO has a cache flush preceding it, and if the device doesn't support
FUA, then there is a post-IO cache flush as well. These flushes are
necessary for correct IO ordering between the log and metadata/data
writeback.

Essentially, you're waiting for MD to flush all it's dirty stripe
cache to the backing devices and then flush the backing device
caches as well. Nothing can be done at the filesystem level to make
this faster


Also, cp -a of a linux source tree is just as data intensive as it
is metadata intensive. There's probably more data IO than metadata
IO, so that's more likely to be what is slowing the disks down as
metadata writeback is...


I'm a bit concerned, though, that there seem to be a rather
unlimited (~ 1000) number of kernel worker threads waiting for the
cil push and indirectly for log writes.


That's normal - XFS is highly asynchronous and defers a lot of work
to completion threads.


So it dropped by 16 bytes (seems to be common) which is
unexpected.  I wonder if it filled a hole in a buffer and so
needed one less xlog_op_header()? But then the size would have
gone up by at least 128 bytes for the hole that was filled, so
it still shouldn't go down in size.

I think you need to instrument xlog_cil_insert_items() and c

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-08 Thread Brian Foster
On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > If the value goes below the limit while some threads are
> > > > already waiting but before the push worker gets to it, these threads are
> > > > not woken.
> > > > 
> > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > optimization. This is possible, because we hold the xc_push_lock
> > > > spinlock, which prevents additions to the waitqueue.
> > > > 
> > > > Signed-off-by: Donald Buczek 
> > > > ---
> > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > --- a/fs/xfs/xfs_log_cil.c
> > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > /*
> > > >  * Wake up any background push waiters now this context is 
> > > > being pushed.
> > > >  */
> > > > -   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > +   if (waitqueue_active(&cil->xc_push_wait))
> > > > wake_up_all(&cil->xc_push_wait);
> > > 
> > > That just smells wrong to me. It *might* be correct, but this
> > > condition should pair with the sleep condition, as space used by a
> > > CIL context should never actually decrease
> > > 
> > 
> > ... but I'm a little confused by this assertion. The shadow buffer
> > allocation code refers to the possibility of shadow buffers falling out
> > that are smaller than currently allocated buffers. Further, the
> > _insert_format_items() code appears to explicitly optimize for this
> > possibility by reusing the active buffer, subtracting the old size/count
> > values from the diff variables and then reformatting the latest
> > (presumably smaller) item to the lv.
> 
> Individual items might shrink, but the overall transaction should
> grow. Think of a extent to btree conversion of an inode fork. THe
> data in the inode fork decreases from a list of extents to a btree
> root block pointer, so the inode item shrinks. But then we add a new
> btree root block that contains all the extents + the btree block
> header, and it gets rounded up to ithe 128 byte buffer logging chunk
> size.
> 
> IOWs, while the inode item has decreased in size, the overall
> space consumed by the transaction has gone up and so the CIL ctx
> used_space should increase. Hence we can't just look at individual
> log items and whether they have decreased in size - we have to look
> at all the items in the transaction to understand how the space used
> in that transaction has changed. i.e. it's the aggregation of all
> items in the transaction that matter here, not so much the
> individual items.
> 

Ok, that makes more sense...

> > Of course this could just be implementation detail. I haven't dug into
> > the details in the remainder of this thread and I don't have specific
> > examples off the top of my head, but perhaps based on the ability of
> > various structures to change formats and the ability of log vectors to
> > shrink in size, shouldn't we expect the possibility of a CIL context to
> > shrink in size as well? Just from poking around the CIL it seems like
> > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > for recalculating split res as well)...
> 
> Yes, there may be situations where it decreases. It may be this is
> fine, but the assumption *I've made* in lots of the CIL push code is
> that ctx->used_space rarely, if ever, will go backwards.
> 

... and rarely seems a bit more pragmatic than never.

> e.g. we run the first transaction into the CIL, it steals the sapce
> needed for the cil checkpoint headers for the transaciton. Then if
> the space returned by the item formatting is negative (because it is
> in the AIL and being relogged), the CIL checkpoint now doesn't have
> the space reserved it needs to run a checkpoint. That transaction is
> a sync transaction, so it forces the log, and now we push the CIL
> without sufficient reservation to write out the log headers and the
> items we just formatted
> 

Hmmm... that seems like an odd scenario because I'd expect the space
usage delta to reflect what might or might not have already been added
to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
delta only occur for items being relogged while still CIL resident
(regardless of AIL residency)?

>From a code standpoint, the way a particular log item delta comes out
negative is from having a shadow lv size smaller than the ->li_lv size.
Thus, xlog_cil_insert_format_items() subtracts the currently formatted
lv size from the delta, formats the current state of the item, and
xfs_cil_prepare_item() adds the new (presumably smalle

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-07 Thread Dave Chinner
On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
> On 02.01.21 23:44, Dave Chinner wrote:
> > On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> > > On 31.12.20 22:59, Dave Chinner wrote:
> > > > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > > > On 30.12.20 23:16, Dave Chinner wrote:
> > 
> > > > One could argue that, but one should also understand the design
> > > > constraints for a particular algorithm are before suggesting
> > > > that their solution is "robust". :)
> > > 
> > > Yes, but an understanding to the extend required by the
> > > argument should be sufficient :-)
> > 
> > And that's the fundamental conceit described by Dunning-Kruger.
> > 
> > I.e. someone thinks they know enough to understand the argument,
> > when in fact they don't understand enough about the subject matter
> > to realise they don't understand what the expert at the other end is
> > saying at all
> > 
> > > > > # seq 29
> > > > > 
> > > > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX 
> > > > > trigger cil e374c6f1 ctx 4967d650  
> > > > > ctx->space_used=33554656  , push_seq=29, ctx->sequence=29
> > > > 
> > > > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > > > 
> > > > .
> > > > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX 
> > > > > wakecil e374c6f1 ctx 4967d650  
> > > > > ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> > > > 
> > > > It takes the best part of *50 seconds* before the push work actually
> > > > runs?
> > > > 
> > > > That's  well and truly screwed up - the work should run on that
> > > > CPU on the very next time it yeilds the CPU. If we're holding the
> > > > CPU without yeilding it for that long, hangcheck and RCU warnings
> > > > should be going off...
> > > 
> > > No such warnings.
> > > 
> > > But the load is probably I/O bound to the log:
> > > 
> > > - creates `cp -a` copies of a directory tree with small files (linux 
> > > source repository)
> > > - source tree probably completely cached.
> > > - two copies in parallel
> > > - slow log (on software raid6)
> > > 
> > > Isn't it to be expected that sooner or later you need to wait for
> > > log writes when you write as fast as possible with lots of
> > > metadata updates and not so much data?
> > 
> > No, incoming transactions waits for transaction reservation space,
> > not log writes. Reservation space is freed up by metadata writeback.
> > So if you have new transactions blocking in xfs_trans_reserve(),
> > then we are blocking on metadata writeback.
> > 
> > The CIL worker thread may be waiting for log IO completion before it
> > issues more log IO, and in that case it is waiting on iclog buffer
> > space (i.e. only waiting on internal log state, not metadata
> > writeback).  Can you find that kworker thread stack and paste it? If
> > bound on log IO, it will be blocked in xlog_get_iclog_space().
> > 
> > Please paste the entire stack output, too, not just the bits you
> > think are relevant or understand
> 
> That would be a kworker on the "xfs-cil/%s" workqueue, correct?
> And you mean before the lockup, when the I/O is still active, correct?
> 
> This is the usual stack output from that thread:
> 
> # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
> # cat /proc/2080/task/2080/stack
> 
> [<0>] md_flush_request+0x87/0x190
> [<0>] raid5_make_request+0x61b/0xb30
> [<0>] md_handle_request+0x127/0x1a0
> [<0>] md_submit_bio+0xbd/0x100
> [<0>] submit_bio_noacct+0x151/0x410
> [<0>] submit_bio+0x4b/0x1a0
> [<0>] xlog_state_release_iclog+0x87/0xb0
> [<0>] xlog_write+0x452/0x6d0
> [<0>] xlog_cil_push_work+0x2e0/0x4d0
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x23f/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> sampled three times with a few seconds in between, stack identical.

Yeah, so that is MD blocking waiting for a running device cache
flush to finish before submitting the iclog IO. IOWs, it's waiting
for hardware to flush volatile caches to stable storage. Every log
IO has a cache flush preceding it, and if the device doesn't support
FUA, then there is a post-IO cache flush as well. These flushes are
necessary for correct IO ordering between the log and metadata/data
writeback.

Essentially, you're waiting for MD to flush all it's dirty stripe
cache to the backing devices and then flush the backing device
caches as well. Nothing can be done at the filesystem level to make
this faster

> > Also, cp -a of a linux source tree is just as data intensive as it
> > is metadata intensive. There's probably more data IO than metadata
> > IO, so that's more likely to be what is slowing the disks down as
> > metadata writeback is...
> > 
> > > I'm a bit concerned, though, that there seem to be a rather
> > > unlimited (~ 1000) number of kernel worker threads waiting for the
> > > cil push and indirectly f

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-07 Thread Dave Chinner
On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > If the value goes below the limit while some threads are
> > > already waiting but before the push worker gets to it, these threads are
> > > not woken.
> > > 
> > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > optimization. This is possible, because we hold the xc_push_lock
> > > spinlock, which prevents additions to the waitqueue.
> > > 
> > > Signed-off-by: Donald Buczek 
> > > ---
> > >  fs/xfs/xfs_log_cil.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index b0ef071b3cb5..d620de8e217c 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > >   /*
> > >* Wake up any background push waiters now this context is being pushed.
> > >*/
> > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > + if (waitqueue_active(&cil->xc_push_wait))
> > >   wake_up_all(&cil->xc_push_wait);
> > 
> > That just smells wrong to me. It *might* be correct, but this
> > condition should pair with the sleep condition, as space used by a
> > CIL context should never actually decrease
> > 
> 
> ... but I'm a little confused by this assertion. The shadow buffer
> allocation code refers to the possibility of shadow buffers falling out
> that are smaller than currently allocated buffers. Further, the
> _insert_format_items() code appears to explicitly optimize for this
> possibility by reusing the active buffer, subtracting the old size/count
> values from the diff variables and then reformatting the latest
> (presumably smaller) item to the lv.

Individual items might shrink, but the overall transaction should
grow. Think of a extent to btree conversion of an inode fork. THe
data in the inode fork decreases from a list of extents to a btree
root block pointer, so the inode item shrinks. But then we add a new
btree root block that contains all the extents + the btree block
header, and it gets rounded up to ithe 128 byte buffer logging chunk
size.

IOWs, while the inode item has decreased in size, the overall
space consumed by the transaction has gone up and so the CIL ctx
used_space should increase. Hence we can't just look at individual
log items and whether they have decreased in size - we have to look
at all the items in the transaction to understand how the space used
in that transaction has changed. i.e. it's the aggregation of all
items in the transaction that matter here, not so much the
individual items.

> Of course this could just be implementation detail. I haven't dug into
> the details in the remainder of this thread and I don't have specific
> examples off the top of my head, but perhaps based on the ability of
> various structures to change formats and the ability of log vectors to
> shrink in size, shouldn't we expect the possibility of a CIL context to
> shrink in size as well? Just from poking around the CIL it seems like
> the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> for recalculating split res as well)...

Yes, there may be situations where it decreases. It may be this is
fine, but the assumption *I've made* in lots of the CIL push code is
that ctx->used_space rarely, if ever, will go backwards.

e.g. we run the first transaction into the CIL, it steals the sapce
needed for the cil checkpoint headers for the transaciton. Then if
the space returned by the item formatting is negative (because it is
in the AIL and being relogged), the CIL checkpoint now doesn't have
the space reserved it needs to run a checkpoint. That transaction is
a sync transaction, so it forces the log, and now we push the CIL
without sufficient reservation to write out the log headers and the
items we just formatted

So, yeah, shrinking transaction space usage definitely violates some
of the assumptions the code makes about how relogging works. It's
entirely possible the assumptions I've made are not entirely correct
in some corner cases - those particular cases are what we need to
ferret out here, and then decide if they are correct or not and deal
with it from there...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-04 Thread Brian Foster
On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > Threads, which committed items to the CIL, wait in the xc_push_wait
> > waitqueue when used_space in the push context goes over a limit. These
> > threads need to be woken when the CIL is pushed.
> > 
> > The CIL push worker tries to avoid the overhead of calling wake_all()
> > when there are no waiters waiting. It does so by checking the same
> > condition which caused the waits to happen. This, however, is
> > unreliable, because ctx->space_used can actually decrease when items are
> > recommitted.
> 
> When does this happen?
> 
> Do you have tracing showing the operation where the relogged item
> has actually gotten smaller? By definition, relogging in the CIL
> should only grow the size of the object in the CIL because it must
> relog all the existing changes on top of the new changed being made
> to the object. Hence the CIL reservation should only ever grow.
> 
> IOWs, returning negative lengths from the formatting code is
> unexpected and probably a bug and requires further investigation,
> not papering over the occurrence with broadcast wakeups...
> 

I agree that this warrants a bit more explanation and analysis before
changing the current code...

> > If the value goes below the limit while some threads are
> > already waiting but before the push worker gets to it, these threads are
> > not woken.
> > 
> > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > optimization. This is possible, because we hold the xc_push_lock
> > spinlock, which prevents additions to the waitqueue.
> > 
> > Signed-off-by: Donald Buczek 
> > ---
> >  fs/xfs/xfs_log_cil.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index b0ef071b3cb5..d620de8e217c 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > /*
> >  * Wake up any background push waiters now this context is being pushed.
> >  */
> > -   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > +   if (waitqueue_active(&cil->xc_push_wait))
> > wake_up_all(&cil->xc_push_wait);
> 
> That just smells wrong to me. It *might* be correct, but this
> condition should pair with the sleep condition, as space used by a
> CIL context should never actually decrease
> 

... but I'm a little confused by this assertion. The shadow buffer
allocation code refers to the possibility of shadow buffers falling out
that are smaller than currently allocated buffers. Further, the
_insert_format_items() code appears to explicitly optimize for this
possibility by reusing the active buffer, subtracting the old size/count
values from the diff variables and then reformatting the latest
(presumably smaller) item to the lv.

Of course this could just be implementation detail. I haven't dug into
the details in the remainder of this thread and I don't have specific
examples off the top of my head, but perhaps based on the ability of
various structures to change formats and the ability of log vectors to
shrink in size, shouldn't we expect the possibility of a CIL context to
shrink in size as well? Just from poking around the CIL it seems like
the surrounding code supports it (xlog_cil_insert_items() checks len > 0
for recalculating split res as well)...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 



Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-03 Thread Donald Buczek

On 02.01.21 23:44, Dave Chinner wrote:

On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:

On 31.12.20 22:59, Dave Chinner wrote:

On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:

On 30.12.20 23:16, Dave Chinner wrote:



One could argue that, but one should also understand the design
constraints for a particular algorithm are before suggesting
that their solution is "robust". :)


Yes, but an understanding to the extend required by the
argument should be sufficient :-)


And that's the fundamental conceit described by Dunning-Kruger.

I.e. someone thinks they know enough to understand the argument,
when in fact they don't understand enough about the subject matter
to realise they don't understand what the expert at the other end is
saying at all


# seq 29

2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 
e374c6f1 ctx 4967d650  ctx->space_used=33554656  , 
push_seq=29, ctx->sequence=29


So, at 20:08:15 we get a push trigger and the work is queued. But...

.

2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wakecil 
e374c6f1 ctx 4967d650  ctx->space_used=67109136 >= 67108864, 
push_seq=29, ctx->sequence=29


It takes the best part of *50 seconds* before the push work actually
runs?

That's  well and truly screwed up - the work should run on that
CPU on the very next time it yeilds the CPU. If we're holding the
CPU without yeilding it for that long, hangcheck and RCU warnings
should be going off...


No such warnings.

But the load is probably I/O bound to the log:

- creates `cp -a` copies of a directory tree with small files (linux source 
repository)
- source tree probably completely cached.
- two copies in parallel
- slow log (on software raid6)

Isn't it to be expected that sooner or later you need to wait for
log writes when you write as fast as possible with lots of
metadata updates and not so much data?


No, incoming transactions waits for transaction reservation space,
not log writes. Reservation space is freed up by metadata writeback.
So if you have new transactions blocking in xfs_trans_reserve(),
then we are blocking on metadata writeback.

The CIL worker thread may be waiting for log IO completion before it
issues more log IO, and in that case it is waiting on iclog buffer
space (i.e. only waiting on internal log state, not metadata
writeback).  Can you find that kworker thread stack and paste it? If
bound on log IO, it will be blocked in xlog_get_iclog_space().

Please paste the entire stack output, too, not just the bits you
think are relevant or understand


That would be a kworker on the "xfs-cil/%s" workqueue, correct?
And you mean before the lockup, when the I/O is still active, correct?

This is the usual stack output from that thread:

# # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
# cat /proc/2080/task/2080/stack

[<0>] md_flush_request+0x87/0x190
[<0>] raid5_make_request+0x61b/0xb30
[<0>] md_handle_request+0x127/0x1a0
[<0>] md_submit_bio+0xbd/0x100
[<0>] submit_bio_noacct+0x151/0x410
[<0>] submit_bio+0x4b/0x1a0
[<0>] xlog_state_release_iclog+0x87/0xb0
[<0>] xlog_write+0x452/0x6d0
[<0>] xlog_cil_push_work+0x2e0/0x4d0
[<0>] process_one_work+0x1dd/0x3e0
[<0>] worker_thread+0x23f/0x3b0
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

sampled three times with a few seconds in between, stack identical.


Also, cp -a of a linux source tree is just as data intensive as it
is metadata intensive. There's probably more data IO than metadata
IO, so that's more likely to be what is slowing the disks down as
metadata writeback is...


I'm a bit concerned, though, that there seem to be a rather
unlimited (~ 1000) number of kernel worker threads waiting for the
cil push and indirectly for log writes.


That's normal - XFS is highly asynchronous and defers a lot of work
to completion threads.


So it dropped by 16 bytes (seems to be common) which is
unexpected.  I wonder if it filled a hole in a buffer and so
needed one less xlog_op_header()? But then the size would have
gone up by at least 128 bytes for the hole that was filled, so
it still shouldn't go down in size.

I think you need to instrument xlog_cil_insert_items() and catch
a negative length here:

/* account for space used by new iovec headers  */
iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
iovhdr_res; ctx->nvecs += diff_iovecs;

(diff_iovecs will be negative if the number of xlog_op_header
structures goes down)

And if this happens, then dump the transaction ticket via
xlog_print_trans(tp) so we can see all the log items types and
vectors that the transaction has formatted...


I tried that, but the output was difficult to understand, because
at that point you can only log the complete transaction with the
items already updated.  And a shrinking item is not switched to the
shadow vector, so the formatted content is already overwritten and
not avail

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-02 Thread Dave Chinner
On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> On 31.12.20 22:59, Dave Chinner wrote:
> > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > On 30.12.20 23:16, Dave Chinner wrote:

> > One could argue that, but one should also understand the design
> > constraints for a particular algorithm are before suggesting
> > that their solution is "robust". :)
>
> Yes, but an understanding to the extend required by the
> argument should be sufficient :-)

And that's the fundamental conceit described by Dunning-Kruger.

I.e. someone thinks they know enough to understand the argument,
when in fact they don't understand enough about the subject matter
to realise they don't understand what the expert at the other end is
saying at all

> > > # seq 29
> > > 
> > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX 
> > > trigger cil e374c6f1 ctx 4967d650  
> > > ctx->space_used=33554656  , push_seq=29, ctx->sequence=29
> > 
> > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > 
> > .
> > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake 
> > >cil e374c6f1 ctx 4967d650  ctx->space_used=67109136 >= 
> > > 67108864, push_seq=29, ctx->sequence=29
> > 
> > It takes the best part of *50 seconds* before the push work actually
> > runs?
> > 
> > That's  well and truly screwed up - the work should run on that
> > CPU on the very next time it yeilds the CPU. If we're holding the
> > CPU without yeilding it for that long, hangcheck and RCU warnings
> > should be going off...
> 
> No such warnings.
> 
> But the load is probably I/O bound to the log:
> 
> - creates `cp -a` copies of a directory tree with small files (linux source 
> repository)
> - source tree probably completely cached.
> - two copies in parallel
> - slow log (on software raid6)
> 
> Isn't it to be expected that sooner or later you need to wait for
> log writes when you write as fast as possible with lots of
> metadata updates and not so much data?

No, incoming transactions waits for transaction reservation space,
not log writes. Reservation space is freed up by metadata writeback.
So if you have new transactions blocking in xfs_trans_reserve(),
then we are blocking on metadata writeback.

The CIL worker thread may be waiting for log IO completion before it
issues more log IO, and in that case it is waiting on iclog buffer
space (i.e. only waiting on internal log state, not metadata
writeback).  Can you find that kworker thread stack and paste it? If
bound on log IO, it will be blocked in xlog_get_iclog_space().

Please paste the entire stack output, too, not just the bits you
think are relevant or understand

Also, cp -a of a linux source tree is just as data intensive as it
is metadata intensive. There's probably more data IO than metadata
IO, so that's more likely to be what is slowing the disks down as
metadata writeback is...

> I'm a bit concerned, though, that there seem to be a rather
> unlimited (~ 1000) number of kernel worker threads waiting for the
> cil push and indirectly for log writes.

That's normal - XFS is highly asynchronous and defers a lot of work
to completion threads.

> > So it dropped by 16 bytes (seems to be common) which is
> > unexpected.  I wonder if it filled a hole in a buffer and so
> > needed one less xlog_op_header()? But then the size would have
> > gone up by at least 128 bytes for the hole that was filled, so
> > it still shouldn't go down in size.
> > 
> > I think you need to instrument xlog_cil_insert_items() and catch
> > a negative length here:
> > 
> > /* account for space used by new iovec headers  */
> > iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
> > iovhdr_res; ctx->nvecs += diff_iovecs;
> > 
> > (diff_iovecs will be negative if the number of xlog_op_header
> > structures goes down)
> > 
> > And if this happens, then dump the transaction ticket via
> > xlog_print_trans(tp) so we can see all the log items types and
> > vectors that the transaction has formatted...
> 
> I tried that, but the output was difficult to understand, because
> at that point you can only log the complete transaction with the
> items already updated.  And a shrinking item is not switched to the
> shadow vector, so the formatted content is already overwritten and
> not available for analysis.

Yes, that's exactly the information I need to see.

But the fact you think this is something I don't need to know about
is classic Dunning-Kruger in action. You don't understand why I
asked for this information, and found the information "difficult to
understand", so you decided I didn't need it either, despite the
fact I asked explicitly for it.

What I first need to know is what operations are being performed by
the transaciton that shrunk and what all the items in it are, not
which specific items shrunk and by how much. There can be tens to
hundreds of items in a single t

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-02 Thread Donald Buczek

On 31.12.20 22:59, Dave Chinner wrote:

Hey, funny, your email could celebrate New Year a second time :-)


On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:

On 30.12.20 23:16, Dave Chinner wrote:

On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:

Threads, which committed items to the CIL, wait in the
xc_push_wait waitqueue when used_space in the push context
goes over a limit. These threads need to be woken when the CIL
is pushed.

The CIL push worker tries to avoid the overhead of calling
wake_all() when there are no waiters waiting. It does so by
checking the same condition which caused the waits to happen.
This, however, is unreliable, because ctx->space_used can
actually decrease when items are recommitted.


When does this happen?

Do you have tracing showing the operation where the relogged
item has actually gotten smaller? By definition, relogging in
the CIL should only grow the size of the object in the CIL
because it must relog all the existing changes on top of the new
changed being made to the object. Hence the CIL reservation
should only ever grow.


I have (very ugly printk based) log (see below), but it only
shows, that it happened (space_used decreasing), not what caused
it.

I only browsed the ( xfs_*_item.c ) code and got the impression
that the size of a log item is rather dynamic (e.g. number of
extends in an inode, extended attributes in an inode, continuity
of chunks in a buffer) and wasn't surprised that a relogged item
might need less space from time to time.


IOWs, returning negative lengths from the formatting code is
unexpected and probably a bug and requires further
investigation, not papering over the occurrence with broadcast
wakeups...


One could argue that the code is more robust after the change,
because it wakes up every thread which is waiting on the next push
to happen when the next push is happening without making
assumption of why these threads are waiting by duplicating code
from that waiters side. The proposed waitqueue_active() is inlined
to two instructions and avoids the call overhead if there are no
waiters as well.


One could argue that, but one should also understand the design
constraints for a particular algorithm are before suggesting that
their solution is "robust". :)


Yes, but an understanding to the extend required by the argument should be 
sufficient :-)


# seq 29

2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 
e374c6f1 ctx 4967d650  ctx->space_used=33554656  , 
push_seq=29, ctx->sequence=29


So, at 20:08:15 we get a push trigger and the work is queued. But...

.

2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wakecil 
e374c6f1 ctx 4967d650  ctx->space_used=67109136 >= 67108864, 
push_seq=29, ctx->sequence=29


It takes the best part of *50 seconds* before the push work actually
runs?

That's  well and truly screwed up - the work should run on that
CPU on the very next time it yeilds the CPU. If we're holding the
CPU without yeilding it for that long, hangcheck and RCU warnings
should be going off...


No such warnings.

But the load is probably I/O bound to the log:

- creates `cp -a` copies of a directory tree with small files (linux source 
repository)
- source tree probably completely cached.
- two copies in parallel
- slow log (on software raid6)

Isn't it to be expected that sooner or later you need to wait for log writes 
when you write as fast as possible with lots of metadata updates and not so 
much data?

I'm a bit concerned, though, that there seem to be a rather unlimited (~ 1000) 
number of kernel worker threads waiting for the cil push and indirectly for log 
writes.


# seq 30

2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 
e374c6f1 ctx c46ab121  ctx->space_used=33554480  , 
push_seq=30, ctx->sequence=30


20:09:39 for the next trigger,


2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   cil 
e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 67108864, 
push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   cil 
e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 67108864, 
push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   cil 
e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 67108864, 
push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   cil 
e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 67108864, 
push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   cil 
e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 67108864, 
push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw  

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2020-12-31 Thread Dave Chinner
On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> On 30.12.20 23:16, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > Threads, which committed items to the CIL, wait in the
> > > xc_push_wait waitqueue when used_space in the push context
> > > goes over a limit. These threads need to be woken when the CIL
> > > is pushed.
> > > 
> > > The CIL push worker tries to avoid the overhead of calling
> > > wake_all() when there are no waiters waiting. It does so by
> > > checking the same condition which caused the waits to happen.
> > > This, however, is unreliable, because ctx->space_used can
> > > actually decrease when items are recommitted.
> > 
> > When does this happen?
> > 
> > Do you have tracing showing the operation where the relogged
> > item has actually gotten smaller? By definition, relogging in
> > the CIL should only grow the size of the object in the CIL
> > because it must relog all the existing changes on top of the new
> > changed being made to the object. Hence the CIL reservation
> > should only ever grow.
> 
> I have (very ugly printk based) log (see below), but it only
> shows, that it happened (space_used decreasing), not what caused
> it.
> 
> I only browsed the ( xfs_*_item.c ) code and got the impression
> that the size of a log item is rather dynamic (e.g. number of
> extends in an inode, extended attributes in an inode, continuity
> of chunks in a buffer) and wasn't surprised that a relogged item
> might need less space from time to time.
> 
> > IOWs, returning negative lengths from the formatting code is
> > unexpected and probably a bug and requires further
> > investigation, not papering over the occurrence with broadcast
> > wakeups...
> 
> One could argue that the code is more robust after the change,
> because it wakes up every thread which is waiting on the next push
> to happen when the next push is happening without making
> assumption of why these threads are waiting by duplicating code
> from that waiters side. The proposed waitqueue_active() is inlined
> to two instructions and avoids the call overhead if there are no
> waiters as well.

One could argue that, but one should also understand the design
constraints for a particular algorithm are before suggesting that
their solution is "robust". :)

> 
> # seq 29
> 
> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger 
> cil e374c6f1 ctx 4967d650  ctx->space_used=33554656  , 
> push_seq=29, ctx->sequence=29

So, at 20:08:15 we get a push trigger and the work is queued. But...

.
> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake
> cil e374c6f1 ctx 4967d650  ctx->space_used=67109136 >= 
> 67108864, push_seq=29, ctx->sequence=29

It takes the best part of *50 seconds* before the push work actually
runs?

That's  well and truly screwed up - the work should run on that
CPU on the very next time it yeilds the CPU. If we're holding the
CPU without yeilding it for that long, hangcheck and RCU warnings
should be going off...

> # seq 30
> 
> 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger 
> cil e374c6f1 ctx c46ab121  ctx->space_used=33554480  , 
> push_seq=30, ctx->sequence=30

20:09:39 for the next trigger,

> 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw   
> cil e374c6f1 ctx c46ab121  ctx->space_used=67108924 >= 
> 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw   
> c

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2020-12-31 Thread Donald Buczek

On 30.12.20 23:16, Dave Chinner wrote:

On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:

Threads, which committed items to the CIL, wait in the xc_push_wait
waitqueue when used_space in the push context goes over a limit. These
threads need to be woken when the CIL is pushed.

The CIL push worker tries to avoid the overhead of calling wake_all()
when there are no waiters waiting. It does so by checking the same
condition which caused the waits to happen. This, however, is
unreliable, because ctx->space_used can actually decrease when items are
recommitted.


When does this happen?

Do you have tracing showing the operation where the relogged item
has actually gotten smaller? By definition, relogging in the CIL
should only grow the size of the object in the CIL because it must
relog all the existing changes on top of the new changed being made
to the object. Hence the CIL reservation should only ever grow.


I have (very ugly printk based) log (see below), but it only shows, that it 
happened (space_used decreasing), not what caused it.

I only browsed the ( xfs_*_item.c ) code and got the impression that the size 
of a log item is rather dynamic (e.g. number of extends in an inode, extended 
attributes in an inode, continuity of chunks in a buffer) and wasn't surprised 
that a relogged item might need less space from time to time.


IOWs, returning negative lengths from the formatting code is
unexpected and probably a bug and requires further investigation,
not papering over the occurrence with broadcast wakeups...


One could argue that the code is more robust after the change, because it wakes 
up every thread which is waiting on the next push to happen when the next push 
is happening without making assumption of why these threads are waiting by 
duplicating code from that waiters side. The proposed waitqueue_active() is 
inlined to two instructions and avoids the call overhead if there are no 
waiters as well.

But, of course, if the size is not expected to decrease, this discrepancy must 
be clarified anyway. I am sure, I can find out the exact circumstances, this 
does happen. I will follow up on this.

Happy New Year!

  Donald

Log extract following



+++ b/fs/xfs/xfs_log_cil.c
@@ -17,6 +17,8 @@
 #include "xfs_log_priv.h"
 #include "xfs_trace.h"
 
+#include 

+
 struct workqueue_struct *xfs_discard_wq;
 
 /*

@@ -670,8 +672,25 @@ xlog_cil_push_work(
/*
 * Wake up any background push waiters now this context is being pushed.
 */
-   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
wake_up_all(&cil->xc_push_wait);
+   pr_warn("XXX wakecil %p ctx %p  ctx->space_used=%d >= %d, 
push_seq=%lld, ctx->sequence=%lld\n",
+   cil, ctx,
+   ctx->space_used,
+   XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+   cil->xc_push_seq,
+   ctx->sequence);
+   } else {
+   pr_warn("XXX no wake cil %p ctx %p ctx->space_used=%d < %d, 
push_seq=%lld, ctx->sequence=%lld\n",
+   cil, ctx,
+   ctx->space_used,
+   XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+   cil->xc_push_seq,
+   ctx->sequence);
+   if (waitqueue_active(&cil->xc_push_wait)) {
+   pr_warn("XXX xc_push_wait ACTIVE!\n");
+   }
+   }
 
/*

 * Check if we've anything to push. If there is nothing, then we don't
@@ -918,6 +937,11 @@ xlog_cil_push_background(
spin_lock(&cil->xc_push_lock);
if (cil->xc_push_seq < cil->xc_current_sequence) {
cil->xc_push_seq = cil->xc_current_sequence;
+   pr_warn("XXX trigger cil %p ctx %p  ctx->space_used=%d  , 
push_seq=%lld, ctx->sequence=%lld\n",
+   cil, cil->xc_ctx,
+   cil->xc_ctx->space_used,
+   cil->xc_push_seq,
+   cil->xc_ctx->sequence);
queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
}
 
@@ -936,6 +960,12 @@ xlog_cil_push_background(

if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+   pr_warn("XXX pushw   cil %p ctx %p  ctx->space_used=%d >= %d, 
push_seq=%lld, ctx->sequence=%lld\n",
+   cil, cil->xc_ctx,
+   cil->xc_ctx->space_used,
+   XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+   cil->xc_push_seq,
+   cil->xc_ctx->sequence);
xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
return;
}

=

14606 "XXX" lines logged before deadlo

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2020-12-30 Thread Dave Chinner
On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> Threads, which committed items to the CIL, wait in the xc_push_wait
> waitqueue when used_space in the push context goes over a limit. These
> threads need to be woken when the CIL is pushed.
> 
> The CIL push worker tries to avoid the overhead of calling wake_all()
> when there are no waiters waiting. It does so by checking the same
> condition which caused the waits to happen. This, however, is
> unreliable, because ctx->space_used can actually decrease when items are
> recommitted.

When does this happen?

Do you have tracing showing the operation where the relogged item
has actually gotten smaller? By definition, relogging in the CIL
should only grow the size of the object in the CIL because it must
relog all the existing changes on top of the new changed being made
to the object. Hence the CIL reservation should only ever grow.

IOWs, returning negative lengths from the formatting code is
unexpected and probably a bug and requires further investigation,
not papering over the occurrence with broadcast wakeups...

> If the value goes below the limit while some threads are
> already waiting but before the push worker gets to it, these threads are
> not woken.
> 
> Always wake all CIL push waiters. Test with waitqueue_active() as an
> optimization. This is possible, because we hold the xc_push_lock
> spinlock, which prevents additions to the waitqueue.
> 
> Signed-off-by: Donald Buczek 
> ---
>  fs/xfs/xfs_log_cil.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..d620de8e217c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -670,7 +670,7 @@ xlog_cil_push_work(
>   /*
>* Wake up any background push waiters now this context is being pushed.
>*/
> - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> + if (waitqueue_active(&cil->xc_push_wait))
>   wake_up_all(&cil->xc_push_wait);

That just smells wrong to me. It *might* be correct, but this
condition should pair with the sleep condition, as space used by a
CIL context should never actually decrease

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2020-12-30 Thread Donald Buczek

On 30.12.20 03:46, Hillf Danton wrote:

On Wed, 30 Dec 2020 00:56:27 +0100

Threads, which committed items to the CIL, wait in the xc_push_wait
waitqueue when used_space in the push context goes over a limit. These
threads need to be woken when the CIL is pushed.

The CIL push worker tries to avoid the overhead of calling wake_all()
when there are no waiters waiting. It does so by checking the same
condition which caused the waits to happen. This, however, is
unreliable, because ctx->space_used can actually decrease when items are
recommitted. If the value goes below the limit while some threads are
already waiting but before the push worker gets to it, these threads are
not woken.


Looks like you are fixing a typo in c7f87f3984cf ("xfs: fix
use-after-free on CIL context on shutdown") in mainline, and
it may mean

/*
 * Wake up any background push waiters now this context is being pushed
 * if we are no longer over the space limit
 */

given waiters throttled for comsuming more space than limit in
xlog_cil_push_background().


I'm not sure, I understand you correctly. Do you suggest to update the comment to "...if 
we are no longer over the space limit"  and change the code to `if (ctx->space_used 
< XLOG_CIL_BLOCKING_SPACE_LIMIT(log))` ?

I don't think, that would be correct.

The current push context is most probably still over the limit if it ever was. It 
is exceptional, that the few bytes, which might be returned to ctx->space_used, 
bring us back below the limit. The new context, on the other hand, will have 
space_used=0.

We need to resume any thread which is waiting for the push.

Best
  Donald


Always wake all CIL push waiters. Test with waitqueue_active() as an
optimization. This is possible, because we hold the xc_push_lock
spinlock, which prevents additions to the waitqueue.

Signed-off-by: Donald Buczek 
---
  fs/xfs/xfs_log_cil.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..d620de8e217c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -670,7 +670,7 @@ xlog_cil_push_work(
/*
 * Wake up any background push waiters now this context is being pushed.
 */
-   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+   if (waitqueue_active(&cil->xc_push_wait))
wake_up_all(&cil->xc_push_wait);
  
  	/*

--
2.26.2