Re: [dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes deadlock
On 9/27/19 4:19 PM, Guruswamy Basavaiah wrote:
> Hello,
> We have a drbd partition on top of an lvm partition. When the node
> holding the secondary drbd partition is coming up, a large amount of
> data is synced from the primary to the secondary drbd partition.
>
> During this time, we see the drbd sync (resync) stop at some point.
> After 120 seconds we see hung-task-timeout warnings in the logs (see
> the end of this email).
>
> If I increase the cow_count semaphore value from 2048 to 8192, or
> remove the patch below, the drbd sync works seamlessly.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=721b1d98fb517ae99ab3b757021cf81db41e67be
>
> I am not familiar with the dm code. From the hung-task back traces,
> what I understand is: a thread trying to queue work to kcopyd holds
> "&_origins_lock" and blocks on the cow_count lock, while a kcopyd job
> trying to queue work to the same kcopyd blocks on "&_origins_lock" --
> a deadlock.

Hello Guruswamy,

I am Cc-ing the maintainers, so they can be in the loop.

I examined the attached logs and I believe the following happens:

1. DRBD issues a number of writes to the snapshot origin device. These
   writes cause COW, which is performed by kcopyd.

2. At some point DRBD reaches the cow_count semaphore limit (2048) and
   blocks in down(&s->cow_count), holding a read lock on _origins_lock.

3. Someone tries to create a new snapshot. This involves taking a write
   lock on _origins_lock, which blocks because DRBD at step (2) already
   holds a read lock on it. That's the blocked lvcreate at the end of
   the trace.

4. A COW operation, issued by step (1), completes and kcopyd runs
   dm-snapshot's completion callback, which tries to take a read lock
   on _origins_lock before signaling the cow_count semaphore. This read
   lock blocks, the semaphore is never signaled, and we have the
   deadlock you experienced.
At first glance this seemed strange, because DRBD at step (2) holds a
read lock on _origins_lock, so taking another read lock should be
possible. But, if I am not missing something, the read-write semaphore
implementation gives priority to writers: as soon as a writer (the
lvcreate in our case) tries to enter the critical section, no readers
are allowed in until all writers have completed their work.

That's what I believe is causing the deadlock you are experiencing. I
will send a patch fixing this and I will let you know.

Thanks,
Nikos

> Below are the hung-task back traces.
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.991760] INFO: task kworker/1:1:170 blocked for more than 120 seconds.
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  279.998569]       Tainted: P O 4.4.184-octeon-distro.git-v2.96-4-rc-wnd #1
> Sep 24 12:08:48.974658 err CFPU-1 kernel: [  280.006593] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014435] kworker/1:1  D 80e1db78  0  170  2 0x0010
> Sep 24 12:08:48.974658 info CFPU-1 kernel: [  280.014482] Workqueue: kcopyd do_work [dm_mod]
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014487] Stack : [raw stack dump garbled in transit, elided]
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014558] Call Trace:
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014570] [] __schedule+0x3c0/0xa58
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014580] [] schedule+0x38/0x98
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014590] [] __down_read+0xa8/0xf0
> Sep 24 12:08:48.974658 warn CFPU-1 kernel: [  280.014609] [] do_origin.isra.13+0x44/0x110 [dm_snapshot]
> [remainder of trace truncated]
Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y
On Fri, 27 Sep 2019, Joe Thornber wrote:
> Hi Eric,
>
> On Thu, Sep 26, 2019 at 06:27:09PM +, Eric Wheeler wrote:
> > I pvmoved the tmeta to an SSD logical volume (dm-linear) on a
> > non-bcache volume and we got the same trace this morning, so while
> > the tdata still passes through bcache, all meta operations are
> > direct to an SSD. This is still using multi-queue scsi, but
> > dm_mod.use_blk_mq=N.
> >
> > Since bcache is no longer involved with metadata operations, and
> > since this appears to be a metadata issue, are there any other
> > reasons to suspect bcache?
>
> Did you recreate the pool, or are you just using the existing pool but
> with a different IO path? If it's the latter then there could still be
> something wrong with the metadata, introduced while bcache was in the
> stack.

We did not recreate the pool after the initial problem, though the pool
was new just before the problem occurred.

> Would it be possible to send me a copy of the metadata device please
> so I can double check the space maps (I presume you've run thin_check
> on it)?

~]# /usr/local/bin/thin_check /dev/mapper/data-data--pool_tmeta
examining superblock
TRANSACTION_ID=2347
METADATA_FREE_BLOCKS=4145151
examining devices tree
examining mapping tree
checking space map counts

~]# echo $?
0

~]# /usr/local/bin/thin_check -V
0.8.5

> [Assuming you're using the existing pool] Another useful experiment
> would be to thin_dump and then thin_restore the metadata, which will
> create totally fresh metadata, and see if you can still reproduce the
> issue.

It didn't lock up last night, but I'll keep working to reproduce the
problem and let you know what we find.

Mike said it could be a race:

> The stack shows the call to sm_disk_new_block() is due to
> dm_pool_alloc_data_block().
>
> sm_disk_new_block()'s BUG_ON(ev != SM_ALLOC) indicates that somehow it
> is getting called without the passed 'ev' being set to SM_ALLOC. Only
> drivers/md/persistent-data/dm-space-map-common.c:sm_ll_mutate() sets
> SM_ALLOC, and sm_disk_new_block() indirectly calls sm_ll_mutate().
>
> sm_ll_mutate() will only return 0 if ll->save_ie() does; the ll_disk
> *ll should be ll_disk, and so disk_ll_save_ie()'s call to
> dm_btree_insert() returns 0 -- which simply means success. And on
> success sm_disk_new_block() assumes ev was set to SM_ALLOC (by
> sm_ll_mutate()).
>
> sm_ll_mutate() decided _not_ to set SM_ALLOC because either:
> 1) ref_count wasn't set, or
> 2) old was identified.
>
> So, all said: somehow a new data block was found to already be in use.
> _WHY_ that is the case isn't clear from this stack...
>
> But it does speak to the possibility of data block allocation racing
> with other operations on the same block. Which implies missing
> locking.

Would a spinlock on the block solve the issue? Where might such a
spinlock be added?

--
Eric Wheeler

> Thanks,
>
> - Joe
Re: [dm-devel] [PATCH] multipathd: warn when configuration has been changed.
On Mon, 2019-09-23 at 14:29 -0500, Benjamin Marzinski wrote:
> It would be helpful if multipathd could log a message when
> multipath.conf or files in the config_dir have been written to, both
> so that it can be used to send a notification to users, and to help
> with determining after the fact whether multipathd was running with an
> older config, when the logs of multipathd's behaviour don't match the
> current multipath.conf.
>
> To do this, the multipathd uxlsnr thread now sets up inotify watches
> on both /etc/multipath.conf and the config_dir to watch if the files
> are deleted or closed after being opened for writing. In order to keep
> uxlsnr from polling repeatedly if multipath.conf or the config_dir
> aren't present, it will only set up the watches once per reconfigure.
> However, since multipath.conf is far more likely to be replaced by a
> text editor than modified in place, if it gets removed, multipathd
> will immediately try to restart the watch on it (which will succeed if
> the file was simply replaced by a new copy). This does mean that if
> multipath.conf or the config_dir are actually removed and then later
> re-added, multipathd won't log any more messages for changes until the
> next reconfigure. But that seems like a fair trade-off to avoid
> repeatedly polling for files that aren't likely to appear.
>
> Signed-off-by: Benjamin Marzinski
> ---
>  libmultipath/config.h |   1 +
>  multipathd/main.c     |   1 +
>  multipathd/uxlsnr.c   | 134 ++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 130 insertions(+), 6 deletions(-)

So, next, after we get a notification, we wait a few seconds and call
reconfigure() automatically? Well, I guess before we do that we should
implement a dry run with a syntax check...

I found one minor issue, see below. Otherwise, ACK.
Thanks,
Martin

> +void handle_inotify(int fd, int *wds)
> +{
> +	char buff[1024]
> +		__attribute__ ((aligned(__alignof__(struct inotify_event))));
> +	const struct inotify_event *event;
> +	ssize_t len;
> +	char *ptr;
> +	int i, got_notify = 0;
> +
> +	for (;;) {
> +		len = read(fd, buff, sizeof(buff));
> +		if (len <= 0) {
> +			if (len < 0 && errno != EAGAIN) {
> +				condlog(3, "error reading from inotify_fd");
> +				for (i = 0; i < 2; i++) {
> +					if (wds[i] != -1) {
> +						inotify_rm_watch(fd, wds[0]);

Should this be wds[i] instead?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y
Hi Eric,

On Thu, Sep 26, 2019 at 06:27:09PM +, Eric Wheeler wrote:
> I pvmoved the tmeta to an SSD logical volume (dm-linear) on a
> non-bcache volume and we got the same trace this morning, so while the
> tdata still passes through bcache, all meta operations are direct to
> an SSD. This is still using multi-queue scsi, but dm_mod.use_blk_mq=N.
>
> Since bcache is no longer involved with metadata operations, and since
> this appears to be a metadata issue, are there any other reasons to
> suspect bcache?

Did you recreate the pool, or are you just using the existing pool but
with a different IO path? If it's the latter then there could still be
something wrong with the metadata, introduced while bcache was in the
stack.

Would it be possible to send me a copy of the metadata device please,
so I can double check the space maps (I presume you've run thin_check
on it)?

[Assuming you're using the existing pool] Another useful experiment
would be to thin_dump and then thin_restore the metadata, which will
create totally fresh metadata, and see if you can still reproduce the
issue.

Thanks,

- Joe