Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

2019-09-27 Thread Nikos Tsironis
On 9/27/19 4:19 PM, Guruswamy Basavaiah wrote:
> Hello,
>  We have a DRBD partition on top of an LVM partition. When the node
> holding the secondary DRBD partition comes up, a large amount of data
> is synced from the primary to the secondary DRBD partition.
> 
> During this time, we see the DRBD sync (resync) stop at some point.
> After 120 seconds we see hung-task-timeout warnings in the logs (see
> the end of this email).
> 
> If I increase the cow_count semaphore value from 2048 to 8192, or
> remove the patch below, the DRBD sync works seamlessly.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=721b1d98fb517ae99ab3b757021cf81db41e67be
> 
> I am not familiar with the dm code, but from the hung-task back traces
> what I understand is: one thread is trying to queue work to kcopyd
> while holding "&_origins_lock" and is blocked on the cow_count
> semaphore, while jobs running from kcopyd are trying to queue work to
> the same kcopyd and are blocked on "&_origins_lock", resulting in a
> deadlock.
> 

Hello Guruswamy,

I am Cc-ing the maintainers, so they can be in the loop.

I examined the attached logs and I believe the following happens:

1. DRBD issues a number of writes to the snapshot origin device. These
   writes cause COW, which is performed by kcopyd.

2. At some point DRBD reaches the cow_count semaphore limit (2048) and
   blocks in down(&s->cow_count), holding a read lock on _origins_lock.

3. Someone tries to create a new snapshot. This involves taking a write
   lock on _origins_lock, which blocks because DRBD at step (2) already
   holds a read lock on it. That's the blocked lvcreate at the end of
   the trace.

4. A COW operation, issued by step (1), completes and kcopyd runs
   dm-snapshot's completion callback, which tries to take a read lock on
   _origins_lock before signaling the cow_count semaphore. The read
   lock blocks, the semaphore is never signaled, and we have the
   deadlock you experienced.

At first glance this seemed strange, because DRBD at step (2) holds a
read lock on _origins_lock, so taking another read lock should be
possible.

But, unless I am missing something, the read-write semaphore
implementation gives priority to writers: as soon as a writer (the
lvcreate in our case) tries to enter the critical section, no new
readers are allowed in until all writers have completed their work.

That's what I believe is causing the deadlock you are experiencing.
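
To make the ordering concrete, here is a rough sketch of the three code
paths involved.  The function names and bodies are illustrative only,
not the actual dm-snapshot code; cow_count is the semaphore added by
the patch above:

  static DECLARE_RWSEM(_origins_lock);

  /* Step 2: the origin write path holds a read lock while it is
   * throttled on cow_count. */
  static void origin_write_path(struct dm_snapshot *s)
  {
          down_read(&_origins_lock);
          down(&s->cow_count);    /* blocks once 2048 COW jobs are in flight */
          /* ... hand the COW job to kcopyd ... */
          up_read(&_origins_lock);
  }

  /* Step 3: lvcreate registering a new snapshot needs the write lock. */
  static void snapshot_create_path(void)
  {
          down_write(&_origins_lock);     /* blocks behind the reader above */
          /* ... register the snapshot ... */
          up_write(&_origins_lock);
  }

  /* Step 4: the kcopyd completion callback asks for a new read lock
   * before it releases cow_count.  Because a writer is already waiting,
   * the rwsem blocks new readers, so up(&s->cow_count) is never reached
   * and nobody makes progress. */
  static void cow_completion_path(struct dm_snapshot *s)
  {
          down_read(&_origins_lock);      /* blocks behind the waiting writer */
          /* ... complete the pending exception ... */
          up_read(&_origins_lock);
          up(&s->cow_count);              /* never reached */
  }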

I will send a patch fixing this and will let you know.

Thanks,
Nikos

> Below is the hung task back traces.
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.991760] INFO: task kworker/1:1:170 blocked for more than 120 seconds.
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.998569] Tainted: P   O    4.4.184-octeon-distro.git-v2.96-4-rc-wnd #1
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  280.006593] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014435] kworker/1:1   D 80e1db78 0   170  2 0x0010
> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014482] Workqueue: kcopyd do_work [dm_mod]
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] Stack : 0001 00030003 8007fde8bac8
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] 8007fe759b00 0002 c0285294 8007f8d1ca00
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] c027eda8 0001 80b3 0100
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] 800784c098c8 80e1db78 8007fe759b00 80e204b8
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] 800788ef79c0 80078505ba70 8007fe759b00 0001852b4620
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] c028 8007852b4620 8007eebf5758 c027edec
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] 8007852b4620 8007835d8e80 c027f38c
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] 800787ac0580 0001 8007f8d1ca60 800785aeb080
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] 0200 c0282488
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] 0200 8007f8d1ca00 c028 c027db90
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487]   ...
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014558] Call Trace:
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014570] [] __schedule+0x3c0/0xa58
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014580] [] schedule+0x38/0x98
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014590] [] __down_read+0xa8/0xf0
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014609] [] do_origin.isra.13+0x44/0x110 [dm_snapshot]
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014625]
>

Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-27 Thread Eric Wheeler
On Fri, 27 Sep 2019, Joe Thornber wrote:

> Hi Eric,
> 
> On Thu, Sep 26, 2019 at 06:27:09PM +, Eric Wheeler wrote:
> > I pvmoved the tmeta to an SSD logical volume (dm-linear) on a non-bcache 
> > volume and we got the same trace this morning, so while the tdata still 
> > passes through bcache, all meta operations are direct to an SSD. This is 
> > still using multi-queue scsi, but dm_mod.use_blk_mq=N.
> > 
> > Since bcache is no longer involved with metadata operations, and since 
> > this appears to be a metadata issue, are there any other reasons to 
> > suspect bcache?
> 
> Did you recreate the pool, or are you just using the existing pool but with
> a different IO path?  If it's the latter then there could still be something
> wrong with the metadata, introduced while bcache was in the stack.

We did not recreate the pool after the initial problem, though the pool
was newly created just before the problem occurred.
 
> Would it be possible to send me a copy of the metadata device please so
> I can double check the space maps (I presume you've run thin_check on it)?

~]# /usr/local/bin/thin_check /dev/mapper/data-data--pool_tmeta 
examining superblock
TRANSACTION_ID=2347
METADATA_FREE_BLOCKS=4145151
examining devices tree
examining mapping tree
checking space map counts

~]# echo $?
0

~]# /usr/local/bin/thin_check -V
0.8.5

> [Assuming you're using the existing pool] Another useful experiment
> would be to thin_dump and then thin_restore the metadata, which will
> create totally fresh metadata, and see if you can still reproduce the
> issue.

It didn't lock up last night, but I'll keep working to reproduce the
problem and let you know what we find.

Mike said it could be a race:

> The stack shows the call to sm_disk_new_block() is due to
> dm_pool_alloc_data_block().
> 
> sm_disk_new_block()'s BUG_ON(ev != SM_ALLOC) indicates that somehow it is
> getting called without the passed 'ev' being set to SM_ALLOC.  Only
> drivers/md/persistent-data/dm-space-map-common.c:sm_ll_mutate() sets
> SM_ALLOC. sm_disk_new_block() indirectly calls sm_ll_mutate().
> 
> sm_ll_mutate() will only return 0 if ll->save_ie() does; here the ll is
> the disk space map's ll_disk, so disk_ll_save_ie()'s call to
> dm_btree_insert() returned 0 -- which simply means success.  And on
> success, sm_disk_new_block() assumes ev was set to SM_ALLOC (by
> sm_ll_mutate()).
> 
> sm_ll_mutate() decided to _not_ set SM_ALLOC because either:
> 1) ref_count wasn't set
> or
> 2) old was identified
> 
> So all said: somehow a new data block was found to already be in use.
> _WHY_ that is the case isn't clear from this stack...
>
> But it does speak to the possibility of data block allocation racing
> with other operations to the same block.  Which implies missing locking.
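
To make the failure mode concrete, here is a tiny userspace model of
the logic described above.  The names loosely mirror the kernel's enum
and helpers, but this is emphatically not the kernel code, just an
illustration of what the BUG_ON is guarding:

  /* Toy model: the assert plays the role of the
   * BUG_ON(ev != SM_ALLOC) at dm-space-map-disk.c:178. */
  #include <assert.h>
  #include <stdint.h>

  enum allocation_event { SM_NONE, SM_ALLOC, SM_FREE };

  static uint32_t ref_count[16];          /* stand-in for the space map */

  /* Mirrors the sm_ll_mutate() decision: only a 0 -> nonzero
   * transition is reported as SM_ALLOC. */
  static int inc_block(unsigned b, enum allocation_event *ev)
  {
          uint32_t old = ref_count[b];

          ref_count[b] = old + 1;
          *ev = (old == 0) ? SM_ALLOC : SM_NONE;
          return 0;
  }

  /* Mirrors sm_disk_new_block(): trusts that 'b' was really free. */
  static int new_block(unsigned b)
  {
          enum allocation_event ev;
          int r = inc_block(b, &ev);

          if (!r)
                  assert(ev == SM_ALLOC);
          return r;
  }

  int main(void)
  {
          new_block(3);   /* fine: block 3 had a ref count of 0 */
          new_block(3);   /* the "new" block is already in use, so the
                             assert fires -- the same condition the
                             BUG_ON reports */
          return 0;
  }

The interesting question, as Mike says, is how the free-block search can
hand back a block whose reference count is already nonzero in the first
place.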

Would a spinlock on the block solve the issue?

Where might such a spinlock be added?


--
Eric Wheeler


> 
> Thanks,
> 
> - Joe
> 


Re: [dm-devel] [PATCH] multipathd: warn when configuration has been changed.

2019-09-27 Thread Martin Wilck
On Mon, 2019-09-23 at 14:29 -0500, Benjamin Marzinski wrote:
> It would be helpful if multipathd could log a message when
> multipath.conf or files in the config_dir have been written to, both
> so that it can be used to send a notification to users, and to help
> with determining after the fact if multipathd was running with an
> older config, when the logs of multipathd's behaviour don't match
> with the current multipath.conf.
> 
> To do this, the multipathd uxlsnr thread now sets up inotify watches
> on both /etc/multipath.conf and the config_dir to watch if the files
> are deleted or closed after being opened for writing.  In order to
> keep uxlsnr from polling repeatedly if the multipath.conf or the
> config_dir aren't present, it will only set up the watches once per
> reconfigure.  However, since multipath.conf is far more likely to be
> replaced by a text editor than modified in place, if it gets removed,
> multipathd will immediately try to restart the watch on it (which
> will succeed if the file was simply replaced by a new copy).  This
> does mean that if multipath.conf or the config_dir are actually
> removed and then later re-added, multipathd won't log any more
> messages for changes until the next reconfigure.  But that seems like
> a fair trade-off to avoid repeatedly polling for files that aren't
> likely to appear.
> 
> Signed-off-by: Benjamin Marzinski 
> ---
>  libmultipath/config.h |   1 +
>  multipathd/main.c |   1 +
>  multipathd/uxlsnr.c   | 134
> --
>  3 files changed, 130 insertions(+), 6 deletions(-)
> 
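
For context, the watch setup the description refers to is just the
standard inotify pattern from <sys/inotify.h>, something along these
lines (illustrative only, not the actual hunk; the config_dir path is
just the usual default):

  int fd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);

  /* the config_dir: new or rewritten snippets show up as
   * close-after-write, moves into the directory, or deletes */
  wds[0] = inotify_add_watch(fd, "/etc/multipath/conf.d",
                             IN_CLOSE_WRITE | IN_DELETE | IN_MOVED_TO);

  /* multipath.conf itself: in-place writes, or the file being
   * removed or replaced by an editor */
  wds[1] = inotify_add_watch(fd, "/etc/multipath.conf",
                             IN_CLOSE_WRITE | IN_DELETE_SELF | IN_MOVE_SELF);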

So, next, after we get a notification, we wait a few seconds and call
reconfigure() automatically? Well I guess before we do that we should
implement a dry-run with a syntax check...

I found one minor issue, see below. Otherwise, ACK.

Thanks,
Martin

> +void handle_inotify(int fd, int *wds)
> +{
> +	char buff[1024]
> +		__attribute__ ((aligned(__alignof__(struct inotify_event))));
> +	const struct inotify_event *event;
> +	ssize_t len;
> +	char *ptr;
> +	int i, got_notify = 0;
> +
> +	for (;;) {
> +		len = read(fd, buff, sizeof(buff));
> +		if (len <= 0) {
> +			if (len < 0 && errno != EAGAIN) {
> +				condlog(3, "error reading from inotify_fd");
> +				for (i = 0; i < 2; i++) {
> +					if (wds[i] != -1) {
> +						inotify_rm_watch(fd, wds[0]);

Should this be wds[i] instead?
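
In other words, I'd expect the cleanup to look roughly like this (the
reset of the slot afterwards is just my assumption about the intent,
since the hunk is cut off here):

	for (i = 0; i < 2; i++) {
		if (wds[i] != -1) {
			inotify_rm_watch(fd, wds[i]);
			wds[i] = -1;	/* assumed: mark this watch as gone */
		}
	}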


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-27 Thread Joe Thornber
Hi Eric,

On Thu, Sep 26, 2019 at 06:27:09PM +, Eric Wheeler wrote:
> I pvmoved the tmeta to an SSD logical volume (dm-linear) on a non-bcache 
> volume and we got the same trace this morning, so while the tdata still 
> passes through bcache, all meta operations are direct to an SSD. This is 
> still using multi-queue scsi, but dm_mod.use_blk_mq=N.
> 
> Since bcache is no longer involved with metadata operations, and since 
> this appears to be a metadata issue, are there any other reasons to 
> suspect bcache?

Did you recreate the pool, or are you just using the existing pool but with
a different IO path?  If it's the latter then there could still be something
wrong with the metadata, introduced while bcache was in the stack.

Would it be possible to send me a copy of the metadata device please so
I can double check the space maps (I presume you've run thin_check on it)?

[Assuming you're using the existing pool] Another useful experiment would
be to thin_dump and then thin_restore the metadata, which will create
totally fresh metadata, and see if you can still reproduce the issue.
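
Something along these lines, with the pool torn down and the metadata
device accessible on its own (device and file names below are just
placeholders, and keep a copy of the original metadata around before
restoring over it):

  thin_dump /dev/mapper/<pool>_tmeta > /tmp/tmeta.xml
  thin_restore -i /tmp/tmeta.xml -o /dev/mapper/<pool>_tmeta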

Thanks,

- Joe