Re: bfq/ext4 disk IO hangs forever on resume

2017-07-25 Thread Jan Kara
Hello,

On Sun 25-06-17 23:07:56, Alex Xu wrote:
> I get hangs when resuming when using bfq-mq with ext4 on 4.12-rc6+
> (currently a4fd8b3accf43d407472e34403d4b0a4df5c0e71).
> 
> Steps to reproduce:
> 1. boot computer
> 2. systemctl suspend
> 3. wait a few seconds
> 4. press power button
> 5. type "ls" into console or SSH or do anything that does disk IO
> 
> Expected results:
> Command is executed.
> 
> Actual results:
> Command hangs.
> 
> lockdep has no comments, but sysrq-d shows that i_mutex_dir_key and
> jbd2_handle are held by multiple processes, leading me to suspect that
> ext4 is at least partially involved. [0]

Can you still reproduce this?

> sysrq-w lists many blocked processes [1]
> 
> This happens consistently, every time I resume the system from
> suspend-to-RAM using this configuration. Switching to noop IO scheduler
> makes it stop happening. I haven't tried switching filesystems yet.

The two stacktraces you've pasted show that we are waiting for the buffer
lock - most likely we have submitted the buffer for IO and are waiting for it
to complete. Together with the fact that switching to NOOP fixes the problem,
I suspect the IO scheduler somehow fails to ever complete some IO - I've
added the relevant people to CC.

Anyway, if you can still reproduce this, it would be good to get the full
sysrq-w output so that we can confirm that everything is blocked waiting for
IO to complete.
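
For reference, the wait shown in those traces is on the buffer_head lock bit.
Roughly (paraphrased from a 4.12-era tree, not a verbatim copy), the helpers
involved look like this - the only thing that clears BH_Lock and wakes the
sleeper is the IO completion path (unlock_buffer() called from the buffer's
end_io handler), so a request that the scheduler never completes leaves the
task in D state forever:

void __wait_on_buffer(struct buffer_head *bh)
{
        /* sleeps in io_schedule() via bit_wait_io() until BH_Lock clears */
        wait_on_bit_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
}

static inline void wait_on_buffer(struct buffer_head *bh)
{
        might_sleep();
        if (buffer_locked(bh))
                __wait_on_buffer(bh);
}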

> I can do more debugging (enable KASAN or whatever), but usually when I
> bother doing that I find someone has already sent a patch for the issue.

Honza

> [0]
> 
> 4 locks held by systemd/384:
>  #0:  (sb_writers#3){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
>  #1:  (&type->i_mutex_dir_key/1){+.+.+.}, at: [] do_rmdir+0x15e/0x1e0
>  #2:  (&type->i_mutex_dir_key){++}, at: [] vfs_rmdir+0x50/0x130
>  #3:  (jbd2_handle){..}, at: [] start_this_handle+0xff/0x430
> 4 locks held by syncthing/279:
>  #0:  (&f->f_pos_lock){+.+.+.}, at: [] __fdget_pos+0x3e/0x50
>  #1:  (sb_writers#3){.+.+.+}, at: [] vfs_write+0x17c/0x1d0
>  #2:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] ext4_file_write_iter+0x57/0x350
>  #3:  (jbd2_handle){..}, at: [] start_this_handle+0xff/0x430
> 2 locks held by zsh/238:
>  #0:  (&tty->ldisc_sem){.+}, at: [] ldsem_down_read+0x1f/0x30
>  #1:  (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0xb0/0x8b0
> 2 locks held by sddm-greeter/267:
>  #0:  (sb_writers#3){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
>  #1:  (&type->i_mutex_dir_key){++}, at: [] path_openat+0x2d8/0xa10
> 2 locks held by kworker/u16:28/330:
>  #0:  ("events_unbound"){.+.+.+}, at: [] process_one_work+0x1c3/0x420
>  #1:  ((&entry->work)){+.+.+.}, at: [] process_one_work+0x1c3/0x420
> 1 lock held by zsh/382:
>  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [] prepare_bprm_creds+0x30/0x70
> 
> [1]
> 
>   task                        PC stack   pid father
> systemd         D    0   384      0 0x
> Call Trace:
>  __schedule+0x295/0x7c0
>  ? bit_wait+0x50/0x50
>  ? bit_wait+0x50/0x50
>  schedule+0x31/0x80
>  io_schedule+0x11/0x40
>  bit_wait_io+0xc/0x50
>  __wait_on_bit+0x53/0x80
>  ? bit_wait+0x50/0x50
>  out_of_line_wait_on_bit+0x6e/0x80
>  ? autoremove_wake_function+0x30/0x30
>  do_get_write_access+0x20b/0x420
>  jbd2_journal_get_write_access+0x2c/0x60
>  __ext4_journal_get_write_access+0x55/0xa0
>  ext4_delete_entry+0x8c/0x140
>  ? __ext4_journal_start_sb+0x4e/0xa0
>  ext4_rmdir+0x114/0x250
>  vfs_rmdir+0x6e/0x130
>  do_rmdir+0x1a3/0x1e0
>  SyS_unlinkat+0x1d/0x30
>  entry_SYSCALL_64_fastpath+0x18/0xad
> jbd2/sda1-8     D    0    81      2 0x
> Call Trace:
>  __schedule+0x295/0x7c0
>  ? bit_wait+0x50/0x50
>  schedule+0x31/0x80
>  io_schedule+0x11/0x40
>  bit_wait_io+0xc/0x50
>  __wait_on_bit+0x53/0x80
>  ? bit_wait+0x50/0x50
>  out_of_line_wait_on_bit+0x6e/0x80
>  ? autoremove_wake_function+0x30/0x30
>  __wait_on_buffer+0x2d/0x30
>  jbd2_journal_commit_transaction+0xe6a/0x1700
>  kjournald2+0xc8/0x270
>  ? kjournald2+0xc8/0x270
>  ? wake_atomic_t_function+0x50/0x50
>  kthread+0xfe/0x130
>  ? commit_timeout+0x10/0x10
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x27/0x40
> [ more processes follow, some different tracebacks ]
-- 
Jan Kara 
SUSE Labs, CR


Re: [Nbd] [PATCH 1/3] nbd: allow multiple disconnects to be sent

2017-07-25 Thread Wouter Verhelst
On Mon, Jul 24, 2017 at 07:24:30PM -0400, Josef Bacik wrote:
> On Mon, Jul 24, 2017 at 10:08:21PM +0200, Wouter Verhelst wrote:
> > On Fri, Jul 21, 2017 at 10:48:13AM -0400, jo...@toxicpanda.com wrote:
> > > From: Josef Bacik 
> > > 
> > > There's no reason to limit ourselves to one disconnect message per
> > > socket.  Sometimes networks do strange things, might as well let
> > > sysadmins hit the panic button as much as they want.
> > 
> > The protocol spec is pretty clear that any requests sent after the
> > disconnect request was sent out are not guaranteed to be processed
> > anymore.
> > 
> > Doesn't this allow more requests to be sent out? Or is the
> > NBD_DISCONNECT_REQUESTED flag enough to make that impossible?
> > 
> 
> This just allows users to call the disconnect ioctl/netlink thing multiple
> times and have it send the DISCONNECT command if they want.

Right.

> We've had problems with our in-house nbd server missing messages,

That's pretty bad...

> and it's just a pain to have to unstick it because the server messed up.
> It's just for the rare case the server is being weird, not because we
> expect/guarantee that subsequent disconnect commands will be processed.
> Thanks,

Okay, makes sense. Just thought I'd ask :-)
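
For anyone reading along, the difference under discussion is simply whether
the disconnect path is gated on an "already requested" flag. The sketch below
is not the nbd driver itself - just a self-contained userspace illustration
of that gating pattern versus the "send one every time" behaviour; all names
in it are invented:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_flag disconnect_requested = ATOMIC_FLAG_INIT;

/* Gated variant: only the first call actually sends anything. */
static bool send_disconnect_once(void)
{
        if (atomic_flag_test_and_set(&disconnect_requested))
                return false;           /* already requested, silently dropped */
        puts("disconnect request sent");
        return true;
}

/* Ungated variant: still remember that a disconnect was requested,
 * but send one on every call so a misbehaving server can be poked again. */
static bool send_disconnect_always(void)
{
        atomic_flag_test_and_set(&disconnect_requested);
        puts("disconnect request sent");
        return true;
}

int main(void)
{
        send_disconnect_once();         /* sends */
        send_disconnect_once();         /* dropped */
        send_disconnect_always();       /* sends again */
        return 0;
}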

-- 
Could you people please use IRC like normal people?!?

  -- Amaya Rodrigo Sastre, trying to quiet down the buzz in the DebConf 2008
 Hacklab


Re: bfq/ext4 disk IO hangs forever on resume

2017-07-25 Thread Jens Axboe
On 07/25/2017 02:51 AM, Jan Kara wrote:
> Hello,
> 
> On Sun 25-06-17 23:07:56, Alex Xu wrote:
>> I get hangs when resuming when using bfq-mq with ext4 on 4.12-rc6+
>> (currently a4fd8b3accf43d407472e34403d4b0a4df5c0e71).
>>
>> Steps to reproduce:
>> 1. boot computer
>> 2. systemctl suspend
>> 3. wait a few seconds
>> 4. press power button
>> 5. type "ls" into console or SSH or do anything that does disk IO
>>
>> Expected results:
>> Command is executed.
>>
>> Actual results:
>> Command hangs.
>>
>> lockdep has no comments, but sysrq-d shows that i_mutex_dir_key and
>> jbd2_handle are held by multiple processes, leading me to suspect that
>> ext4 is at least partially involved. [0]
> 
> Can you still reproduce this?

Can you try with this patch added:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=765e40b675a9566459ddcb8358ad16f3b8344bbe

-- 
Jens Axboe



[PATCH] nbd: clear disconnected on reconnect

2017-07-25 Thread josef
From: Josef Bacik 

If our device loses its connection for longer than the dead timeout we
will set NBD_DISCONNECTED in order to quickly fail any pending IOs that
flood in after the IOs that were waiting during the dead timer.
However, if we re-connect at some point in the future we'll still see
this DISCONNECTED flag set if we then lose our connection again after
that, which means we won't get notifications for our newly lost
connections.  Fix this by just clearing the DISCONNECTED flag on
reconnect in order to make sure everything works as it's supposed to.

Reported-by: Dan Melnic 
Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 64b19b1..5bdf923 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -923,6 +923,8 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
mutex_unlock(&nsock->tx_lock);
sockfd_put(old);
 
+   clear_bit(NBD_DISCONNECTED, &config->runtime_flags);
+
/* We take the tx_mutex in an error path in the recv_work, so we
 * need to queue_work outside of the tx_mutex.
 */
-- 
2.7.4



Re: [PATCH] nbd: clear disconnected on reconnect

2017-07-25 Thread Jens Axboe
On 07/25/2017 11:31 AM, jo...@toxicpanda.com wrote:
> From: Josef Bacik 
> 
> If our device loses its connection for longer than the dead timeout we
> will set NBD_DISCONNECTED in order to quickly fail any pending IOs that
> flood in after the IOs that were waiting during the dead timer.
> However, if we re-connect at some point in the future we'll still see
> this DISCONNECTED flag set if we then lose our connection again after
> that, which means we won't get notifications for our newly lost
> connections.  Fix this by just clearing the DISCONNECTED flag on
> reconnect in order to make sure everything works as it's supposed to.

Will add for 4.13. Any chance that you can write blktests regression
tests for some of the bugs that you are fixing for nbd? Would be nice
to have that be part of the regular regression suite.

-- 
Jens Axboe



Re: [Xen-users] File-based domU - Slow storage write since xen 4.8

2017-07-25 Thread Keith Busch
On Fri, Jul 21, 2017 at 07:07:06PM +0200, Benoit Depail wrote:
> On 07/21/17 18:07, Roger Pau Monné wrote:
> > 
> > Hm, I'm not sure I follow either. AFAIK this problem came from
> > changing the Linux version in the Dom0 (where the backend, blkback, is
> > running), rather than in the DomU, right?
> > 
> > Regarding the queue/sectors stuff, blkfront uses several blk_queue
> > functions to set those parameters, maybe there's something wrong
> > there, but I cannot really spot what it is:
> > 
> > http://elixir.free-electrons.com/linux/latest/source/drivers/block/xen-blkfront.c#L929
> > 
> > In the past the number of pages that could fit in a single ring
> > request was limited to 11, but some time ago indirect descriptors
> > where introduced in order to lift this limit, and now requests can
> > have a much bigger number of pages.
> > 
> > Could you check the max_sectors_kb of the underlying storage you are
> > using in Dom0?
> > 
> > Roger.
> > 
> I checked the value for the loop device as well
> 
> With 4.4.77 (bad write performance)
> $ cat /sys/block/sda/queue/max_sectors_kb
> 1280
> $ cat /sys/block/loop1/queue/max_sectors_kb
> 127
> 
> 
> With 4.1.42 (normal write performance)
> $ cat /sys/block/sda/queue/max_sectors_kb
> 4096
> $ cat /sys/block/loop1/queue/max_sectors_kb
> 127

Thank you for the confirmations so far. Could you check performance with dom0
running 4.4.77 and domU running 4.1.42, and then the other way around? I'd
like to verify whether this is isolated to blkfront.


[BUG] nvme driver crash

2017-07-25 Thread Shaohua Li
With CONFIG_SMP disabled, the kernel crashes at boot time; here is the log.

[9.593590] nvme :00:03.0: can't allocate MSI-X affinity masks for 1 vectors
[9.595189] ==================================================================
[9.595834] BUG: KASAN: null-ptr-deref in blk_mq_init_queue+0x1c/0x70
[9.596010] Read of size 4 at addr 0020 by task kworker/u2:0/5
[9.596010]
[9.596010] CPU: 0 PID: 5 Comm: kworker/u2:0 Not tainted 4.13.0-rc1+ #1
[9.596010] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-5.el7_3.3 04/01/2014
[9.596010] Workqueue: nvme-wq nvme_scan_work
[9.596010] Call Trace:
[9.596010]  dump_stack+0x19/0x24
[9.596010]  kasan_report+0xe8/0x350
[9.596010]  __asan_load4+0x78/0x80
[9.596010]  blk_mq_init_queue+0x1c/0x70
[9.596010]  nvme_validate_ns+0x17c/0x620
[9.596010]  ? nvme_revalidate_disk+0xf0/0xf0
[9.596010]  ? nvme_submit_sync_cmd+0x30/0x30


[PATCH] brd: fix brd_rw_page() vs copy_to_brd_setup errors

2017-07-25 Thread Dan Williams
As is done in zram_rw_page, pmem_rw_page, and btt_rw_page, don't
call page_endio in the error case since do_mpage_readpage and
__mpage_writepage will resubmit on error. Calling page_endio in the
error case leads to double completion.

Cc: Jens Axboe 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
Noticed this while looking at unrelated brd code...

 drivers/block/brd.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 104b71c0490d..055255ea131d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -327,7 +327,13 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 {
struct brd_device *brd = bdev->bd_disk->private_data;
int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
-   page_endio(page, is_write, err);
+
+   /*
+* In the error case we expect the upper layer to retry, so we
+* can't trigger page_endio yet.
+*/
+   if (err == 0)
+   page_endio(page, is_write, 0);
return err;
 }
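
To make the calling convention the changelog relies on concrete: the upper
layer only completes the page itself when rw_page() fails (by resubmitting
and letting that completion run), so completing it in brd on the error path
as well would run the completion twice. The model below is not kernel code -
all names are invented - it just shows that with the patched behaviour each
path completes the page exactly once:

#include <stdbool.h>
#include <stdio.h>

static int completions;

static void page_done(void)             /* stands in for page_endio() */
{
        completions++;
}

/* rw_page() as patched: complete only on success, return the error otherwise */
static int rw_page(bool fail)
{
        if (fail)
                return -5;              /* caller will resubmit */
        page_done();
        return 0;
}

/* the upper layer's fallback: the resubmission completes the page on error */
static void upper_layer(bool fail)
{
        if (rw_page(fail) != 0)
                page_done();
}

int main(void)
{
        completions = 0;
        upper_layer(true);              /* error path */
        printf("error path completions: %d (want 1)\n", completions);

        completions = 0;
        upper_layer(false);             /* success path */
        printf("success path completions: %d (want 1)\n", completions);
        return 0;
}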