Re: [dm-devel] kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-12-20 Thread Eric Wheeler
On Wed, 25 Sep 2019, Eric Wheeler wrote:
> We are using the 4.19.75 stable tree with dm-thin and multi-queue scsi.  
> We have been using the 4.19 branch for months without issue; we just 
> switched to MQ and we seem to have hit this BUG_ON.  Whether or not MQ is 
> related to the issue, I don't know, maybe coincidence:
> 
>   static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
>   {
>   int r;
>   enum allocation_event ev;
>   struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
> 
>   /* FIXME: we should loop round a couple of times */
>   r = sm_ll_find_free_block(&smd->old_ll, smd->begin, 
> smd->old_ll.nr_blocks, b);
>   if (r)
>   return r;
> 
>   smd->begin = *b + 1;
>   r = sm_ll_inc(&smd->ll, *b, &ev);
>   if (!r) {
>   BUG_ON(ev != SM_ALLOC); <
>   smd->nr_allocated_this_transaction++;
>   }

Hello all,

We hit this BUG_ON again, this time on 4.19.86 with 
scsi_mod.use_blk_mq=y, and the bug is known to be present as of 5.1.2, as 
additionally reported by Markus Schade:

  https://www.redhat.com/archives/dm-devel/2019-June/msg00116.html
 and
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777398

In our case, the latest trace (below) is from a different system that
has been stable for years on Linux 4.1 with tmeta direct on the SSD.
We updated to 4.19.86 a few weeks ago and just hit what Mike
Snitzer explains to be an allocator race:

On Wed, 25 Sep 2019, Mike Snitzer wrote:
> > [199391.691714]  dm_pool_alloc_data_block+0x3f/0x60 [dm_thin_pool]
> > [199391.692411]  alloc_data_block.isra.52+0x6d/0x1e0 [dm_thin_pool]
> > [199391.693142]  process_cell+0x2a3/0x550 [dm_thin_pool]
> > [199391.693852]  ? sort+0x17b/0x270
> > [199391.694527]  ? u32_swap+0x10/0x10
> > [199391.695192]  do_worker+0x268/0x9a0 [dm_thin_pool]
> > [199391.695890]  process_one_work+0x171/0x370
> > [199391.696640]  worker_thread+0x49/0x3f0
> > [199391.697332]  kthread+0xf8/0x130
> > [199391.697988]  ? max_active_store+0x80/0x80
> > [199391.698659]  ? kthread_bind+0x10/0x10
> > [199391.699281]  ret_from_fork+0x1f/0x40
> 
> The stack shows the call to sm_disk_new_block() is due to
> dm_pool_alloc_data_block().
> 
> sm_disk_new_block()'s BUG_ON(ev != SM_ALLOC) indicates that somehow it is
> getting called without the passed 'ev' being set to SM_ALLOC.  Only
> drivers/md/persistent-data/dm-space-map-common.c:sm_ll_mutate() sets
> SM_ALLOC. sm_disk_new_block() is indirectly calling sm_ll_mutate().
> 
> sm_ll_mutate() will only return 0 if ll->save_ie() does, the ll_disk *ll
> should be ll_disk, and so disk_ll_save_ie()'s call to dm_btree_insert()
> returns 0 -- which simply means success.  And on success
> sm_disk_new_block() assumes ev was set to SM_ALLOC (by sm_ll_mutate).
> 
> sm_ll_mutate() decided to _not_ set SM_ALLOC because either:
> 1) ref_count wasn't set
> or
> 2) old was identified
> 
> So all said: somehow a new data block was found to already be in use.
> _WHY_ that is the case isn't clear from this stack...
> 
> But it does speak to the possibility of data block allocation racing
> with other operations to the same block.  Which implies missing locking.

Where do you think the locking should be added?
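
For context, this is how I read the tail of sm_ll_mutate() in 4.19 
(paraphrased and trimmed from dm-space-map-common.c; `old` is the previous 
reference count, `ref_count` the new one):

	/* paraphrased from sm_ll_mutate(), dm-space-map-common.c (4.19) */
	if (ref_count && !old)
		*ev = SM_ALLOC;		/* refcount went 0 -> nonzero */
	else if (old && !ref_count)
		*ev = SM_FREE;		/* refcount went nonzero -> 0 */
	/* otherwise *ev is left untouched -- the case that trips the BUG_ON */

	return ll->save_ie(ll, index, &ie_disk);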

> But that's all I've got so far... I'll review past dm-thinp changes with
> all this in mind and see what turns up.  But Joe Thornber (ejt) likely
> needs to have a look at this too.
> 
> But could it be that bcache is the source of the data device race (same
> block used concurrently)?  And DM thinp is acting as the canary in the
> coal mine?

As Markus has shown, this bug triggers without bcache.


Other questions:

1) Can sm_disk_new_block() fail more gracefully than BUG_ON?  For example:

+   spin_lock(&lock); /* a new lock protecting smd->begin */
    smd->begin = *b + 1;
    r = sm_ll_inc(&smd->ll, *b, &ev);
    if (!r) {
-       BUG_ON(ev != SM_ALLOC);
+       if (ev != SM_ALLOC) {
+           /* the "free" block was already in use: back out */
+           smd->begin = *b;
+           r = -ENOSPC;
+       } else
            smd->nr_allocated_this_transaction++;
    }
+   spin_unlock(&lock);

The lock might protect smd->begin, but I'm not sure what state sm_ll_inc() 
may already have left in &smd->ll by the time the mismatch is detected.  
However, since sm_ll_mutate() never raised SM_ALLOC for this block, perhaps 
backing out is safe.  What do you think?

Putting the pool into PM_OUT_OF_DATA_SPACE isn't ideal since it isn't out 
of space, but I would take it over a BUG_ON.
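
For reference, my (possibly wrong) reading of the error path in 
alloc_data_block() in 4.19's dm-thin.c is that an -ENOSPC return already 
takes the pool out-of-data-space, roughly:

	/* paraphrase of alloc_data_block()'s error path, dm-thin.c (4.19) */
	r = dm_pool_alloc_data_block(pool->pmd, result);
	if (r) {
		if (r == -ENOSPC)
			set_pool_mode(pool, PM_OUT_OF_DATA_SPACE);
		else
			metadata_operation_failed(pool, "dm_pool_alloc_data_block", r);
		return r;
	}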

2) If example #1 above returned -EAGAIN, how might alloc_data_block() be 
   taught to retry?  This bug shows up weeks or months apart, even on heavily 
   loaded systems with ~100 live thin volumes, so retrying would be fine 
   IMHO.  (A sketch follows below.)
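
For illustration only, here is an untested sketch of a bounded retry folded 
into sm_disk_new_block() itself, along the lines of its existing "FIXME: we 
should loop round a couple of times" comment; the retry count and the 
-EAGAIN fallback are my invention:

	static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
	{
		int r, retries = 3;
		enum allocation_event ev;
		struct sm_disk *smd = container_of(sm, struct sm_disk, sm);

		while (retries--) {
			r = sm_ll_find_free_block(&smd->old_ll, smd->begin,
						  smd->old_ll.nr_blocks, b);
			if (r)
				return r;

			smd->begin = *b + 1;
			r = sm_ll_inc(&smd->ll, *b, &ev);
			if (r)
				return r;

			if (ev == SM_ALLOC) {
				smd->nr_allocated_this_transaction++;
				return 0;
			}
			/*
			 * ev != SM_ALLOC: the "free" block was already in
			 * use (the race discussed above); skip it and
			 * search on from smd->begin rather than BUG().
			 */
		}
		return -EAGAIN;
	}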

3) In the thread from June, Markus says:
"Judging from the call trace, my guess is that there is somewhere 
a race condition, when a new block needs to be allocated which has 
still to be discarded."

Is t

Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-27 Thread Eric Wheeler
On Fri, 27 Sep 2019, Joe Thornber wrote:

> Hi Eric,
> 
> On Thu, Sep 26, 2019 at 06:27:09PM +, Eric Wheeler wrote:
> > I pvmoved the tmeta to an SSD logical volume (dm-linear) on a non-bcache 
> > volume and we got the same trace this morning, so while the tdata still 
> > passes through bcache, all meta operations are direct to an SSD. This is 
> > still using multi-queue scsi, but dm_mod.use_blk_mq=N.
> > 
> > Since bcache is no longer involved with metadata operations, and since 
> > this appears to be a metadata issue, are there any other reasons to 
> > suspect bcache?
> 
> Did you recreate the pool, or are you just using the existing pool but with
> a different IO path?  If it's the latter then there could still be something
> wrong with the metadata, introduced while bcache was in the stack.

We did not recreate the pool after the initial problem, though the pool was 
new just before the problem first occurred. 
 
> Would it be possible to send me a copy of the metadata device please so
> I can double check the space maps (I presume you've run thin_check on it)?

~]# /usr/local/bin/thin_check /dev/mapper/data-data--pool_tmeta 
examining superblock
TRANSACTION_ID=2347
METADATA_FREE_BLOCKS=4145151
examining devices tree
examining mapping tree
checking space map counts

~]# echo $?
0

~]# /usr/local/bin/thin_check -V
0.8.5

> [Assuming you're using the existing pool] Another useful experiment would be
> to thin_dump and then thin_restore the metadata, which will create totally
> fresh metadata; then see if you can still reproduce the issue.

It didn't lockup last night, but I'll keep working to reproduce the 
problem and let you know what we find.
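
For reference, I expect the round trip to look something like this 
(thin-provisioning-tools 0.8.5 syntax; the paths are just examples, and the 
pool will be offline while I do it):

~]# thin_dump /dev/mapper/data-data--pool_tmeta -o /root/tmeta.xml
~]# thin_restore -i /root/tmeta.xml -o /dev/mapper/data-data--pool_tmeta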

Mike said it could be a race:

> The stack shows the call to sm_disk_new_block() is due to
> dm_pool_alloc_data_block().
> 
> sm_disk_new_block()'s BUG_ON(ev != SM_ALLOC) indicates that somehow it is
> getting called without the passed 'ev' being set to SM_ALLOC.  Only
> drivers/md/persistent-data/dm-space-map-common.c:sm_ll_mutate() sets
> SM_ALLOC. sm_disk_new_block() is indirectly calling sm_ll_mutate().
> 
> sm_ll_mutate() will only return 0 if ll->save_ie() does, the ll_disk *ll
> should be ll_disk, and so disk_ll_save_ie()'s call to dm_btree_insert()
> returns 0 -- which simply means success.  And on success
> sm_disk_new_block() assumes ev was set to SM_ALLOC (by sm_ll_mutate).
> 
> sm_ll_mutate() decided to _not_ set SM_ALLOC because either:
> 1) ref_count wasn't set
> or
> 2) old was identified
> 
> So all said: somehow a new data block was found to already be in use.
> _WHY_ that is the case isn't clear from this stack...
>
> But it does speak to the possibility of data block allocation racing
> with other operations to the same block.  Which implies missing locking.

Would a spinlock on the block solve the issue?

Where might such a spinlock be added?
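
For example, something like this?  The lock field is hypothetical; the rest 
of the struct is as I read it in 4.19's dm-space-map-disk.c:

	struct sm_disk {
		struct dm_space_map sm;
		struct ll_disk ll;
		struct ll_disk old_ll;
		dm_block_t begin;
		dm_block_t nr_allocated_this_transaction;
		spinlock_t lock;	/* hypothetical: would guard begin and
					 * the allocation bookkeeping above */
	};

sm_disk_new_block() would then take smd->lock around the whole find/inc 
sequence, though I don't know whether that is enough if the race is higher 
up in dm-thin.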


--
Eric Wheeler


> 
> Thanks,
> 
> - Joe
> 


Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-27 Thread Joe Thornber
Hi Eric,

On Thu, Sep 26, 2019 at 06:27:09PM +, Eric Wheeler wrote:
> I pvmoved the tmeta to an SSD logical volume (dm-linear) on a non-bcache 
> volume and we got the same trace this morning, so while the tdata still 
> passes through bcache, all meta operations are direct to an SSD. This is 
> still using multi-queue scsi, but dm_mod.use_blk_mq=N.
> 
> Since bcache is no longer involved with metadata operations, and since 
> this appears to be a metadata issue, are there any other reasons to 
> suspect bcache?

Did you recreate the pool, or are you just using the existing pool but with
a different IO path?  If it's the latter then there could still be something
wrong with the metadata, introduced while bcache was in the stack.

Would it be possible to send me a copy of the metadata device please so
I can double check the space maps (I presume you've run thin_check on it)?

[Assuming you're using the existing pool] Another useful experiment would be to
thin_dump and then thin_restore the metadata, which will create totally fresh
metadata; then see if you can still reproduce the issue.

Thanks,

- Joe


Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-26 Thread Eric Wheeler
On Wed, 25 Sep 2019, Mike Snitzer wrote:
> On Wed, Sep 25 2019 at  2:40pm -0400,
> Eric Wheeler  wrote:
> 
> > Hello,
> > 
> > We are using the 4.19.75 stable tree with dm-thin and multi-queue scsi.  
> > We have been using the 4.19 branch for months without issue; we just 
> > switched to MQ and we seem to have hit this BUG_ON.  Whether or not MQ is 
> > related to the issue, I don't know, maybe coincidence:
> > 
> > static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
> > {
> > int r;
> > enum allocation_event ev;
> > struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
> > 
> > /* FIXME: we should loop round a couple of times */
> > r = sm_ll_find_free_block(&smd->old_ll, smd->begin, 
> > smd->old_ll.nr_blocks, b);
> > if (r)
> > return r;
> > 
> > smd->begin = *b + 1;
> > r = sm_ll_inc(&smd->ll, *b, &ev);
> > if (!r) {
> > BUG_ON(ev != SM_ALLOC); <
> > smd->nr_allocated_this_transaction++;
> > }
> > 
> > return r;
> > }
> > 
> > This is a brand-new thin pool created about 12 hours ago:
> > 
> >   lvcreate -c 64k -L 12t --type thin-pool --thinpool data-pool 
> > --poolmetadatasize 16G data /dev/bcache0
> > 
> > We are using bcache, but I don't see any bcache code in the backtraces.  
> > The metadata is also on the bcache volume.
> 
So bcache is being used for both data and metadata.

Hi Mike, 

I pvmoved the tmeta to an SSD logical volume (dm-linear) on a non-bcache 
volume and we got the same trace this morning, so while the tdata still 
passes through bcache, all meta operations are direct to an SSD. This is 
still using multi-queue scsi, but dm_mod.use_blk_mq=N.

Since bcache is no longer involved with metadata operations, and since 
this appears to be a metadata issue, are there any other reasons to 
suspect bcache?

Since we seem to hit this every night, I can try any patches that you 
would like for testing. I appreciate your help, hopefully we can solve 
this quickly. 


-Eric
  
> > We were transferring data to the new thin volumes and it ran for about 12 
> > hours and then gave the trace below.  So far it has only happened once 
> > and I don't have a way to reproduce it.
> > 
> > Any idea what this BUG_ON would indicate and how we might contrive a fix?
> >
> > [199391.677689] [ cut here ]
> > [199391.678437] kernel BUG at 
> > drivers/md/persistent-data/dm-space-map-disk.c:178!
> > [199391.679183] invalid opcode:  [#1] SMP NOPTI
> > [199391.679941] CPU: 4 PID: 31359 Comm: kworker/u16:4 Not tainted 4.19.75 #1
> > [199391.680683] Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.2 
> > 02/20/2015
> > [199391.681446] Workqueue: dm-thin do_worker [dm_thin_pool]
> > [199391.682187] RIP: 0010:sm_disk_new_block+0xa0/0xb0 [dm_persistent_data]
> > [199391.682929] Code: 22 00 00 49 8b 34 24 e8 4e f9 ff ff 85 c0 75 11 83 7c 
> > 24 04 01 75 13 48 83 83 28 22 00 00 01 eb af 89 c5 eb ab e8 e0 95 9b e0 
> > <0f> 0b 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
> > [199391.684432] RSP: 0018:c9000a147c88 EFLAGS: 00010297
> > [199391.685186] RAX:  RBX: 8887ceed8000 RCX: 
> > 
> > [199391.685936] RDX: 8887d093ac80 RSI: 0246 RDI: 
> > 8887faab0a00
> > [199391.686659] RBP:  R08:  R09: 
> > 8887faab0a98
> > [199391.687379] R10: 810f8077 R11:  R12: 
> > c9000a147d58
> > [199391.688120] R13: c9000a147d58 R14: ffc3 R15: 
> > 0014bbc0
> > [199391.688843] FS:  () GS:0fb0() 
> > knlGS:
> > [199391.689571] CS:  0010 DS:  ES:  CR0: 80050033
> > [199391.690253] CR2: 7f5ae49a1000 CR3: 0200a003 CR4: 
> > 001626e0
> > [199391.690984] Call Trace:
> > [199391.691714]  dm_pool_alloc_data_block+0x3f/0x60 [dm_thin_pool]
> > [199391.692411]  alloc_data_block.isra.52+0x6d/0x1e0 [dm_thin_pool]
> > [199391.693142]  process_cell+0x2a3/0x550 [dm_thin_pool]
> > [199391.693852]  ? sort+0x17b/0x270
> > [199391.694527]  ? u32_swap+0x10/0x10
> > [199391.695192]  do_worker+0x268/0x9a0 [dm_thin_pool]
> > [199391.695890]  process_one_work+0x171/0x370
> > [199391.696640]  worker_thread+0x49/0x3f0
> > [199391.697332]  kthread+0xf8/0x130
> > [199391.697988]  ? max_active_store+0x80/0x80
> > [199391.698659]  ? kthread_bind+0x10/0x10
> > [199391.699281]  ret_from_fork+0x1f/0x40
> 
> The stack shows the call to sm_disk_new_block() is due to
> dm_pool_alloc_data_block().
> 
> sm_disk_new_block()'s BUG_ON(ev != SM_ALLOC) indicates that somehow it is
> getting called without the passed 'ev' being set to SM_ALLOC.  Only
> > drivers/md/persistent-data/dm-space-map-common.c:sm_ll_mutate() sets
> > SM_ALLOC. sm_disk_new_block() is indirectly calling sm_ll_mutate().
> 
> sm_ll_mutate() will only return 0 if ll->save_ie() does, the ll_disk *ll
> should be ll_disk, and so disk_ll_save_ie()'s call to dm_btree_insert()
> returns 0 -- which simply means success.

Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-25 Thread Eric Wheeler
On Wed, 25 Sep 2019, Mike Snitzer wrote:
> On Wed, Sep 25 2019 at  2:40pm -0400, Eric Wheeler 
>  wrote:
> 
> > Hello,
> > 
> > We are using the 4.19.75 stable tree with dm-thin and multi-queue scsi.  
> > We have been using the 4.19 branch for months without issue; we just 
> > switched to MQ and we seem to have hit this BUG_ON.  Whether or not MQ is 
> > related to the issue, I don't know, maybe coincidence:
> > 
> > static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
> > {
> > int r;
> > enum allocation_event ev;
> > struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
> > 
> > /* FIXME: we should loop round a couple of times */
> > r = sm_ll_find_free_block(&smd->old_ll, smd->begin, 
> > smd->old_ll.nr_blocks, b);
> > if (r)
> > return r;
> > 
> > smd->begin = *b + 1;
> > r = sm_ll_inc(&smd->ll, *b, &ev);
> > if (!r) {
> > BUG_ON(ev != SM_ALLOC); <
> > smd->nr_allocated_this_transaction++;
> > }
> > 
> > return r;
> > }
> > 
> > This is a brand-new thin pool created about 12 hours ago:
> > 
> >   lvcreate -c 64k -L 12t --type thin-pool --thinpool data-pool 
> > --poolmetadatasize 16G data /dev/bcache0
> > 
> > We are using bcache, but I don't see any bcache code in the backtraces.  
> > The metadata is also on the bcache volume.
> 
> So bcache is being used for both data and metadata.

Correct.
  
> > We were transferring data to the new thin volumes and it ran for about 12 
> > hours and then gave the trace below.  So far it has only happened once 
> > and I don't have a way to reproduce it.
> > 
> > Any idea what this BUG_ON would indicate and how we might contrive a fix?
> >
> > [199391.677689] [ cut here ]
> > [199391.678437] kernel BUG at 
> > drivers/md/persistent-data/dm-space-map-disk.c:178!
> > [199391.679183] invalid opcode:  [#1] SMP NOPTI
> > [199391.679941] CPU: 4 PID: 31359 Comm: kworker/u16:4 Not tainted 4.19.75 #1
> > [199391.680683] Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.2 
> > 02/20/2015
> > [199391.681446] Workqueue: dm-thin do_worker [dm_thin_pool]
> > [199391.682187] RIP: 0010:sm_disk_new_block+0xa0/0xb0 [dm_persistent_data]
> > [199391.682929] Code: 22 00 00 49 8b 34 24 e8 4e f9 ff ff 85 c0 75 11 83 7c 
> > 24 04 01 75 13 48 83 83 28 22 00 00 01 eb af 89 c5 eb ab e8 e0 95 9b e0 
> > <0f> 0b 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
> > [199391.684432] RSP: 0018:c9000a147c88 EFLAGS: 00010297
> > [199391.685186] RAX:  RBX: 8887ceed8000 RCX: 
> > 
> > [199391.685936] RDX: 8887d093ac80 RSI: 0246 RDI: 
> > 8887faab0a00
> > [199391.686659] RBP:  R08:  R09: 
> > 8887faab0a98
> > [199391.687379] R10: 810f8077 R11:  R12: 
> > c9000a147d58
> > [199391.688120] R13: c9000a147d58 R14: ffc3 R15: 
> > 0014bbc0
> > [199391.688843] FS:  () GS:0fb0() 
> > knlGS:
> > [199391.689571] CS:  0010 DS:  ES:  CR0: 80050033
> > [199391.690253] CR2: 7f5ae49a1000 CR3: 0200a003 CR4: 
> > 001626e0
> > [199391.690984] Call Trace:
> > [199391.691714]  dm_pool_alloc_data_block+0x3f/0x60 [dm_thin_pool]
> > [199391.692411]  alloc_data_block.isra.52+0x6d/0x1e0 [dm_thin_pool]
> > [199391.693142]  process_cell+0x2a3/0x550 [dm_thin_pool]
> > [199391.693852]  ? sort+0x17b/0x270
> > [199391.694527]  ? u32_swap+0x10/0x10
> > [199391.695192]  do_worker+0x268/0x9a0 [dm_thin_pool]
> > [199391.695890]  process_one_work+0x171/0x370
> > [199391.696640]  worker_thread+0x49/0x3f0
> > [199391.697332]  kthread+0xf8/0x130
> > [199391.697988]  ? max_active_store+0x80/0x80
> > [199391.698659]  ? kthread_bind+0x10/0x10
> > [199391.699281]  ret_from_fork+0x1f/0x40
> 
> The stack shows the call to sm_disk_new_block() is due to
> dm_pool_alloc_data_block().
> 
> sm_disk_new_block()'s BUG_ON(ev != SM_ALLOC) indicates that somehow it is
> getting called without the passed 'ev' being set to SM_ALLOC.  Only
> drivers/md/persistent-data/dm-space-map-common.c:sm_ll_mutate() sets
> SM_ALLOC. sm_disk_new_block() is indirectly calling sm_ll_mutate().
> 
> sm_ll_mutate() will only return 0 if ll->save_ie() does, the ll_disk *ll
> should be ll_disk, and so disk_ll_save_ie()'s call to dm_btree_insert()
> returns 0 -- which simply means success.  And on success
> sm_disk_new_block() assumes ev was set to SM_ALLOC (by sm_ll_mutate).
> 
> sm_ll_mutate() decided to _not_ set SM_ALLOC because either:
> 1) ref_count wasn't set
> or
> 2) old was identified
> 
> So all said: somehow a new data block was found to already be in use.
> _WHY_ that is the case isn't clear from this stack...
> 
> But it does speak to the possibility of data block allocation racing
> with other operations to the same block.  Which implies missing locking.

Re: [dm-devel] kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-25 Thread Mike Snitzer
On Wed, Sep 25 2019 at  2:40pm -0400,
Eric Wheeler  wrote:

> Hello,
> 
> We are using the 4.19.75 stable tree with dm-thin and multi-queue scsi.  
> We have been using the 4.19 branch for months without issue; we just 
> switched to MQ and we seem to have hit this BUG_ON.  Whether or not MQ is 
> related to the issue, I don't know, maybe coincidence:
> 
>   static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
>   {
>   int r;
>   enum allocation_event ev;
>   struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
> 
>   /* FIXME: we should loop round a couple of times */
>   r = sm_ll_find_free_block(&smd->old_ll, smd->begin, 
> smd->old_ll.nr_blocks, b);
>   if (r)
>   return r;
> 
>   smd->begin = *b + 1;
>   r = sm_ll_inc(&smd->ll, *b, &ev);
>   if (!r) {
>   BUG_ON(ev != SM_ALLOC); <
>   smd->nr_allocated_this_transaction++;
>   }
> 
>   return r;
>   }
> 
> This is a brand-new thin pool created about 12 hours ago:
> 
>   lvcreate -c 64k -L 12t --type thin-pool --thinpool data-pool 
> --poolmetadatasize 16G data /dev/bcache0
> 
> We are using bcache, but I don't see any bcache code in the backtraces.  
> The metadata is also on the bcache volume.

So bcache is being used for both data and metadata.
 
> We were transferring data to the new thin volumes and it ran for about 12 
> hours and then gave the trace below.  So far it has only happened once 
> and I don't have a way to reproduce it.
> 
> Any idea what this BUG_ON would indicate and how we might contrive a fix?
>
> [199391.677689] [ cut here ]
> [199391.678437] kernel BUG at 
> drivers/md/persistent-data/dm-space-map-disk.c:178!
> [199391.679183] invalid opcode:  [#1] SMP NOPTI
> [199391.679941] CPU: 4 PID: 31359 Comm: kworker/u16:4 Not tainted 4.19.75 #1
> [199391.680683] Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.2 
> 02/20/2015
> [199391.681446] Workqueue: dm-thin do_worker [dm_thin_pool]
> [199391.682187] RIP: 0010:sm_disk_new_block+0xa0/0xb0 [dm_persistent_data]
> [199391.682929] Code: 22 00 00 49 8b 34 24 e8 4e f9 ff ff 85 c0 75 11 83 7c 
> 24 04 01 75 13 48 83 83 28 22 00 00 01 eb af 89 c5 eb ab e8 e0 95 9b e0 <0f> 
> 0b 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
> [199391.684432] RSP: 0018:c9000a147c88 EFLAGS: 00010297
> [199391.685186] RAX:  RBX: 8887ceed8000 RCX: 
> 
> [199391.685936] RDX: 8887d093ac80 RSI: 0246 RDI: 
> 8887faab0a00
> [199391.686659] RBP:  R08:  R09: 
> 8887faab0a98
> [199391.687379] R10: 810f8077 R11:  R12: 
> c9000a147d58
> [199391.688120] R13: c9000a147d58 R14: ffc3 R15: 
> 0014bbc0
> [199391.688843] FS:  () GS:0fb0() 
> knlGS:
> [199391.689571] CS:  0010 DS:  ES:  CR0: 80050033
> [199391.690253] CR2: 7f5ae49a1000 CR3: 0200a003 CR4: 
> 001626e0
> [199391.690984] Call Trace:
> [199391.691714]  dm_pool_alloc_data_block+0x3f/0x60 [dm_thin_pool]
> [199391.692411]  alloc_data_block.isra.52+0x6d/0x1e0 [dm_thin_pool]
> [199391.693142]  process_cell+0x2a3/0x550 [dm_thin_pool]
> [199391.693852]  ? sort+0x17b/0x270
> [199391.694527]  ? u32_swap+0x10/0x10
> [199391.695192]  do_worker+0x268/0x9a0 [dm_thin_pool]
> [199391.695890]  process_one_work+0x171/0x370
> [199391.696640]  worker_thread+0x49/0x3f0
> [199391.697332]  kthread+0xf8/0x130
> [199391.697988]  ? max_active_store+0x80/0x80
> [199391.698659]  ? kthread_bind+0x10/0x10
> [199391.699281]  ret_from_fork+0x1f/0x40

The stack shows the call to sm_disk_new_block() is due to
dm_pool_alloc_data_block().

sm_disk_new_block()'s BUG_ON(ev != SM_ALLOC) indicates that somehow it is
getting called without the passed 'ev' being set to SM_ALLOC.  Only
drivers/md/persistent-data/dm-space-map-common.c:sm_ll_mutate() sets
SM_ALLOC. sm_disk_new_block() is indirectly calling sm_ll_mutate().

sm_ll_mutate() will only return 0 if ll->save_ie() does, the ll_disk *ll
should be ll_disk, and so disk_ll_save_ie()'s call to dm_btree_insert()
returns 0 -- which simply means success.  And on success
sm_disk_new_block() assumes ev was set to SM_ALLOC (by sm_ll_mutate).

sm_ll_mutate() decided to _not_ set SM_ALLOC because either:
1) ref_count wasn't set
or
2) old was identified

So all said: somehow a new data block was found to already be in use.
_WHY_ that is the case isn't clear from this stack...

But it does speak to the possibility of data block allocation racing
with other operations to the same block.  Which implies missing locking.

But that's all I've got so far... I'll review past dm-thinp changes with
all this in mind and see what turns up.  But Joe Thornber (ejt) likely
needs to have a look at this too.

But could it be that bcache is the source of the data device race (same
block used concurrently)?  And DM thinp is acting as the canary in the
coal mine?

kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-25 Thread Eric Wheeler
Hello,

We are using the 4.19.75 stable tree with dm-thin and multi-queue scsi.  
We have been using the 4.19 branch for months without issue; we just 
switched to MQ and we seem to have hit this BUG_ON.  Whether or not MQ is 
related to the issue, I don't know, maybe coincidence:

static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
{
int r;
enum allocation_event ev;
struct sm_disk *smd = container_of(sm, struct sm_disk, sm);

/* FIXME: we should loop round a couple of times */
r = sm_ll_find_free_block(&smd->old_ll, smd->begin, 
smd->old_ll.nr_blocks, b);
if (r)
return r;

smd->begin = *b + 1;
r = sm_ll_inc(&smd->ll, *b, &ev);
if (!r) {
BUG_ON(ev != SM_ALLOC); <
smd->nr_allocated_this_transaction++;
}

return r;
}

This is a brand-new thin pool created about 12 hours ago:

  lvcreate -c 64k -L 12t --type thin-pool --thinpool data-pool 
--poolmetadatasize 16G data /dev/bcache0

We are using bcache, but I don't see any bcache code in the backtraces.  
The metadata is also on the bcache volume.

We were transferring data to the new thin volumes and it ran for about 12 
hours and then gave the trace below.  So far it has only happened once 
and I don't have a way to reproduce it.

Any idea what this BUG_ON would indicate and how we might contrive a fix?

-Eric



[199391.677689] [ cut here ]
[199391.678437] kernel BUG at 
drivers/md/persistent-data/dm-space-map-disk.c:178!
[199391.679183] invalid opcode:  [#1] SMP NOPTI
[199391.679941] CPU: 4 PID: 31359 Comm: kworker/u16:4 Not tainted 4.19.75 #1
[199391.680683] Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.2 
02/20/2015
[199391.681446] Workqueue: dm-thin do_worker [dm_thin_pool]
[199391.682187] RIP: 0010:sm_disk_new_block+0xa0/0xb0 [dm_persistent_data]
[199391.682929] Code: 22 00 00 49 8b 34 24 e8 4e f9 ff ff 85 c0 75 11 83 7c 24 
04 01 75 13 48 83 83 28 22 00 00 01 eb af 89 c5 eb ab e8 e0 95 9b e0 <0f> 0b 0f 
1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[199391.684432] RSP: 0018:c9000a147c88 EFLAGS: 00010297
[199391.685186] RAX:  RBX: 8887ceed8000 RCX: 

[199391.685936] RDX: 8887d093ac80 RSI: 0246 RDI: 
8887faab0a00
[199391.686659] RBP:  R08:  R09: 
8887faab0a98
[199391.687379] R10: 810f8077 R11:  R12: 
c9000a147d58
[199391.688120] R13: c9000a147d58 R14: ffc3 R15: 
0014bbc0
[199391.688843] FS:  () GS:0fb0() 
knlGS:
[199391.689571] CS:  0010 DS:  ES:  CR0: 80050033
[199391.690253] CR2: 7f5ae49a1000 CR3: 0200a003 CR4: 
001626e0
[199391.690984] Call Trace:
[199391.691714]  dm_pool_alloc_data_block+0x3f/0x60 [dm_thin_pool]
[199391.692411]  alloc_data_block.isra.52+0x6d/0x1e0 [dm_thin_pool]
[199391.693142]  process_cell+0x2a3/0x550 [dm_thin_pool]
[199391.693852]  ? sort+0x17b/0x270
[199391.694527]  ? u32_swap+0x10/0x10
[199391.695192]  do_worker+0x268/0x9a0 [dm_thin_pool]
[199391.695890]  process_one_work+0x171/0x370
[199391.696640]  worker_thread+0x49/0x3f0
[199391.697332]  kthread+0xf8/0x130
[199391.697988]  ? max_active_store+0x80/0x80
[199391.698659]  ? kthread_bind+0x10/0x10
[199391.699281]  ret_from_fork+0x1f/0x40
[199391.699930] Modules linked in: dm_snapshot btrfs xor zstd_decompress 
zstd_compress xxhash raid6_pq xfs dm_thin_pool dm_persistent_data dm_bio_prison 
dm_bufio drbd lru_cache xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat 
nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
libcrc32c ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables binfmt_misc 
ip6table_filter ip6_tables bcache xt_comment crc64 iptable_filter netconsole 
bridge 8021q garp stp mrp llc lz4 lz4_compress zram sunrpc x86_pkg_temp_thermal 
intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul pcc_cpufreq ghash_clmulni_intel pcspkr sg ipmi_si 
ipmi_devintf lpc_ich ipmi_msghandler video i2c_i801 mfd_core ip_tables ext4 
mbcache jbd2 mgag200 i2c_algo_bit drm_kms_helper
[199391.705631]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm 
crc32c_intel i2c_core ahci libahci ixgbe libata e1000e arcmsr mdio dca 
dm_mirror dm_region_hash dm_log dm_mod
[199391.708083] ---[ end trace c31536d98046e8ec ]---
[199391.866776] RIP: 0010:sm_disk_new_block+0xa0/0xb0 [dm_persistent_data]
[199391.867960] Code: 22 00 00 49 8b 34 24 e8 4e f9 ff ff 85 c0 75 11 83 7c 24 
04 01 75 13 48 83 83 28 22 00 00 01 eb af 89 c5 eb ab e8 e0 95 9b e0 <0f> 0b 0f 
1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[199391.870379] RSP: 0018:c9000a147c88 EFLAGS: 00010297
[199391.871524] RAX: 0