Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.

The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been busy with other stuff until now, so I didn't
upgrade my working trees again until today, in the hope that the iscsi
problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] ------------[ cut here ]------------
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducible: create a pair of iscsi luns,
mkfs, and run xfstests on them. iscsi fails a second after xfstests
mounts the filesystems.

The test machine I'm having all these other problems on? It's stable
and steady as a rock when using PMEM devices. The moment I go to use
/dev/vdc (i.e. run load/perf benchmarks) it starts falling over left,
right and center.

And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat  pfn:14945
[ 2729.751863] page:ea525140 count:-1 mapcount:0 mapping: (null) index:0x0
[ 2729.753763] flags: 0x4000()
[ 2729.756469] raw: 4000
[ 2729.758276] raw: dead0100 dead0200
[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: G B 4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]  <IRQ>
[ 2729.765004]  dump_stack+0x63/0x83
[ 2729.765784]  bad_page+0xc4/0x130
[ 2729.766552]  free_pages_check_bad+0x4f/0x70
[ 2729.767531]  free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513]  ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510]  drain_pages_zone+0x41/0x60
[ 2729.770417]  drain_pages+0x3e/0x60
[ 2729.771215]  drain_local_pages+0x24/0x30
[ 2729.772138]  flush_smp_call_function_queue+0x88/0x160
[ 2729.773317]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742]  smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000]  smp_call_function_interrupt+0xe/0x10
[ 2729.777102]  call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:c9000f0cf500 EFLAGS: 0202 ORIG_RAX: ff03
[ 2729.780852] RAX: 77541291 RBX: 88008b5efe40 RCX: 002e
[ 2729.782514] RDX: 0577 RSI: 05541291 RDI: 0001
[ 2729.784167] RBP: c9000f0cf500 R08: 0007 R09: c9000f0cf678
[ 2729.785818] R10: 0006 R11: 1000 R12: 0061
[ 2729.787480] R13: 0001 R14: 83214e30 R15: 0080
[ 2729.789124]  <EOI>
[ 2729.789626]  __delay+0xf/0x20
[ 2729.790333]  do_raw_spin_lock+0x8c/0x160
[ 2729.791255]  _raw_spin_lock+0x15/0x20
[ 2729.792112]  list_lru_add+0x1a/0x70
[ 2729.792932]  xfs_buf_rele+0x3e7/0x410
[ 2729.793792]  xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099]  shrink_node+0x57/0x90
[ 2729.796905]  do_try_to_free_pages+0xdd/0x230
[ 2729.797914]  try_to_free_pages+0xce/0x1a0
[ 2729.798852]  __alloc_pages_slowpath+0x2df/0x960
[ 2729.799908]  __alloc_pages_nodemask+0x24b/0x290
[ 2729.800963]  new_slab+0x2ac/0x380
[ 2729.801743]  ___slab_alloc.constprop.82+0x336/0x440

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Christoph Hellwig
On Thu, Dec 22, 2016 at 05:30:46PM +1100, Dave Chinner wrote:
> > For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> > so it will be ignored anyway.  That being said both I and the lists
> > got CCed halfway through the thread and I haven't seen the original
> > report, so I'm not really sure what's going on here anyway.
> 
> http://www.gossamer-threads.com/lists/linux/kernel/2587485

This doesn't look like the discard changes, but if Chris wants to test
without them, f9d03f96b988 reverts cleanly.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
> 
> But that check was only for discard (and discard-like) bios, which
> had the magic single page that was sometimes attached but unused.
> 
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway.  That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.

http://www.gossamer-threads.com/lists/linux/kernel/2587485

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > > 
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window.  I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > > 
> > > Hmm. There's not much to go by.
> > > 
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> > 
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd.  I'm saving that off for later, and seeing if I
> > can't do a bisect on the iSCSI issue.
> 
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all. I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
> 
> [628867.607417] ------------[ cut here ]------------
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G W 4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691]  dump_stack+0x63/0x83
> [628867.637937]  __warn+0xcb/0xf0
> [628867.641359]  warn_slowpath_null+0x1d/0x20
> [628867.643362]  shadow_lru_isolate+0x171/0x220
> [628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780]  ? __list_lru_init+0x70/0x70
> [628867.646628]  list_lru_walk_one+0x17/0x20
> [628867.647488]  scan_shadow_nodes+0x34/0x50
> [628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506]  shrink_node+0x57/0x90
> [628867.650233]  do_try_to_free_pages+0xdd/0x230
> [628867.651157]  try_to_free_pages+0xce/0x1a0
> [628867.652342]  __alloc_pages_slowpath+0x2df/0x960
> [628867.653332]  ? __might_sleep+0x4a/0x80
> [628867.654148]  __alloc_pages_nodemask+0x24b/0x290
> [628867.655237]  kmalloc_order+0x21/0x50
> [628867.656016]  kmalloc_order_trace+0x24/0xc0
> [628867.656878]  __kmalloc+0x17d/0x1d0
> [628867.657644]  bucket_table_alloc+0x195/0x1d0
> [628867.658564]  ? __might_sleep+0x4a/0x80
> [628867.659449]  rht_deferred_worker+0x287/0x3c0
> [628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294]  process_one_work+0x1de/0x4d0
> [628867.662208]  worker_thread+0x4b/0x4f0
> [628867.662990]  kthread+0x10c/0x140
> [628867.663687]  ? process_one_work+0x4d0/0x4d0
> [628867.664564]  ? kthread_create_on_node+0x40/0x40
> [628867.665523]  ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
> 
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

The system is back up, and I haven't reproduced this problem yet.
However, benchmark results are way off from where they should be, and
at times the performance is utterly abysmal. The XFS for-next tree
based on the 4.9 kernel shows none of these problems, so I don't
think there's an XFS problem here. Workload is the same 16-way
fsmark workload that I've been using for years as a performance
regression test.

The workload normally averages around 230k files/s - I'm seeing an
average of ~175k files/s on your current kernel. And there are
periods where performance just completely tanks:

#  ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32  -d  /mnt/scratch/0  -d  /mnt/scratch/1
   -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5
   -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d  /mnt/scratch/9
   -d  /mnt/scratch/10  -d  /mnt/scratch/11  -d  /mnt/scratch/12  -d  /mnt/scratch/13
   -d  /mnt/scratch/14  -d  /mnt/scratch/15
#   Version 3.3, 16 thread(s) starting at Thu Dec 22 16:29:20 2016
#   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#   Directories:  Time based hash 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Christoph Hellwig
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Looking around a bit, the only even halfway suspicious scatterlist
> initialization thing I see is commit f9d03f96b988 ("block: improve
> handling of the magic discard payload") which used to have a magic
> hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> now it does __blk_bvec_map_sg() instead.

But that check was only for discard (and discard-like) bios, which
had the magic single page that was sometimes attached but unused.

For "normal" bios the for_each_segment loop iterates over bi_vcnt,
so it will be ignored anyway.  That being said both I and the lists
got CCed halfway through the thread and I haven't seen the original
report, so I'm not really sure what's going on here anyway.
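
The special case being discussed had roughly the following shape before
f9d03f96b988 removed it. This is a toy model for illustration only - the
field names are borrowed from struct bio, but the helper and dispatch
logic are paraphrased from the description above, not taken from the
kernel source:

#include <stdio.h>

/* Toy model of the special case that f9d03f96b988 removed from
 * __blk_bios_map_sg() - paraphrased for illustration, not kernel code. */
struct toy_bio {
	int is_discard;	/* stands in for the request-op check */
	int bi_vcnt;	/* number of attached bio_vecs */
};

/* How many scatterlist segments the old code would have mapped. */
static int old_map_segments(const struct toy_bio *bio)
{
	if (bio->is_discard) {
		/* The "magic" payload page is sometimes attached but
		 * unused: map it only when it is actually present. */
		return bio->bi_vcnt ? 1 : 0;
	}
	/* Normal bios: the for_each_segment loop walks bi_vcnt entries. */
	return bio->bi_vcnt;
}

int main(void)
{
	struct toy_bio discard_without_payload = { 1, 0 };
	struct toy_bio discard_with_payload    = { 1, 1 };

	printf("%d %d\n",
	       old_map_segments(&discard_without_payload),	/* prints 0 */
	       old_map_segments(&discard_with_payload));	/* prints 1 */
	return 0;
}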


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Linus Torvalds
On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.

By "in the dev cycle", do you mean your XFS changes, or have you been
tracking the merge cycle at least for some testing?

> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] ------------[ cut here ]------------
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220

Well, part of the changes during the merge window were the shadow
entry tracking changes that came in through Andrew's tree. Adding
Johannes Weiner to the participants.

> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

Well, I suspect that anything that creates memory pressure will end up
triggering the working set code, so ..

That said, obviously memory corruption could be involved and result in
random issues too, but I wouldn't really expect that in this code.

It would probably be really useful to get more data points - is the
problem reliably in this area, or is it going to be random and all
over the place.

That said:

> And worse, on that last error, the /host/ is now going into meltdown
> (running 4.7.5) with 32 CPUs all burning down in ACPI code:

The obvious question here is how much you trust the environment if the
host ends up also showing problems. Maybe you do end up having hw
issues pop up too.

The primary suspect would presumably be the development kernel you're
testing triggering something, but it has to be asked..

 Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> > 
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window.  I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> > 
> > Hmm. There's not much to go by.
> > 
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
> 
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd.  I'm saving that off for later, and seeing if I
> can't do a bisect on the iSCSI issue.

There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:

[628867.607417] ------------[ cut here ]------------
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G W 4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691]  dump_stack+0x63/0x83
[628867.637937]  __warn+0xcb/0xf0
[628867.641359]  warn_slowpath_null+0x1d/0x20
[628867.643362]  shadow_lru_isolate+0x171/0x220
[628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780]  ? __list_lru_init+0x70/0x70
[628867.646628]  list_lru_walk_one+0x17/0x20
[628867.647488]  scan_shadow_nodes+0x34/0x50
[628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506]  shrink_node+0x57/0x90
[628867.650233]  do_try_to_free_pages+0xdd/0x230
[628867.651157]  try_to_free_pages+0xce/0x1a0
[628867.652342]  __alloc_pages_slowpath+0x2df/0x960
[628867.653332]  ? __might_sleep+0x4a/0x80
[628867.654148]  __alloc_pages_nodemask+0x24b/0x290
[628867.655237]  kmalloc_order+0x21/0x50
[628867.656016]  kmalloc_order_trace+0x24/0xc0
[628867.656878]  __kmalloc+0x17d/0x1d0
[628867.657644]  bucket_table_alloc+0x195/0x1d0
[628867.658564]  ? __might_sleep+0x4a/0x80
[628867.659449]  rht_deferred_worker+0x287/0x3c0
[628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294]  process_one_work+0x1de/0x4d0
[628867.662208]  worker_thread+0x4b/0x4f0
[628867.662990]  kthread+0x10c/0x140
[628867.663687]  ? process_one_work+0x4d0/0x4d0
[628867.664564]  ? kthread_create_on_node+0x40/0x40
[628867.665523]  ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---

Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.

And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
35074 root  -2   0   0  0  0 R  99.0  0.0  12:38.92 acpi_pad/12
35079 root  -2   0   0  0  0 R  99.0  0.0  12:39.40 acpi_pad/16
35080 root  -2   0   0  0  0 R  99.0  0.0  12:39.29 acpi_pad/17
35085 root  -2   0   0  0  0 R  99.0  0.0  12:39.35 acpi_pad/22
35087 root  -2   0   0  0  0 R  99.0  0.0  12:39.13 acpi_pad/24
35090 root  -2   0   0  0  0 R  99.0  0.0  12:38.89 acpi_pad/27
35093 root  -2   0   0  0  0 R  99.0  0.0  12:38.88 acpi_pad/30
35063 root  -2   0   0  0  0 R  98.1  0.0  12:40.64 acpi_pad/1
35065 root  -2   0   0  0  0 R  98.1  0.0  12:40.38 acpi_pad/3
35066 root  -2   0   0  0  0 R  98.1  0.0  12:40.30 acpi_pad/4
35067 root  -2   0   0  0  0 R  98.1  0.0  12:40.82 acpi_pad/5
35077 root  -2   0   0  0  0 R  98.1  0.0  12:39.65 acpi_pad/14
35078 root  -2   0   0  0  0 R  98.1  0.0  12:39.58 acpi_pad/15
35081 root  -2   0   0  0  0 R  98.1  0.0  12:39.32 acpi_pad/18
35072 root  -2   0   0  0  0 R  96.2  

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Chris Leech
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Hi,
> 
> On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> >> Thanks Dave,
> >>
> >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> >> modules loaded (virtio block) so there's something else going on in the
> >> current merge window.  I'll keep an eye on it and make sure there's
> >> nothing iSCSI needs fixing for.
> >
> > OK, so before this slips through the cracks.
> >
> > Linus - your tree as of a few minutes ago still panics immediately
> > when starting xfstests on iscsi devices. It appears to be a
> > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > seem to have bounced it and no-one is looking at it.
> 
> Hmm. There's not much to go by.
> 
> Can somebody in iscsi-land please try to just bisect it - I'm not
> seeing a lot of clues to where this comes from otherwise.

Yeah, my hopes of this being quickly resolved by someone else didn't
work out and whatever is going on in that test VM is looking like a
different kind of odd.  I'm saving that off for later, and seeing if I
can't do a bisect on the iSCSI issue.

Chris


Re: [PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-21 Thread Mauricio Faria de Oliveira

On 12/21/2016 05:50 AM, Christoph Hellwig wrote:
> How do you even get an unaligned residual count?  Except for SES
> processor devices (which will only issue BLOCK_PC commands) this is
> not allowed by SPC:
>
> "The residual count shall be reported in bytes if the peripheral device
>  type in the destination target descriptor is 03h (i.e., processor 
device),

>  and in destination device blocks for all other device type codes.

On 12/21/2016 06:09 AM, Hannes Reinecke wrote:
> Which actually would be pretty much my objection, too.
>
> This would only be applicable for 512e drives, where we _might_ end up
> with a residual smaller than the physical sector size.
> But that should be handled by firmware; after all, that's what the 'e'
> implies, right?

On 12/21/2016 12:01 PM, Martin K. Petersen wrote:

> I agree with Christoph and Hannes. Some of this falls into the gray area
> that's outside of the T10 spec (HBA programming interface guarantees)
> but it seems like a deficiency in the HBA to report a byte count that's
> not a multiple of the logical block size. A block can't be partially
> written. Either it made it or it didn't. Regardless of how the I/O is
> being broken up into frames at the transport level and at which offset
> the transfer was interrupted.


Christoph, Hannes, Martin,

Thank you all for your comments and pointers to the documentation/spec.
I'll take it up with the HBA and storage folks.

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
> 
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window.  I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.

OK, so before this slips through the cracks.

Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.

I'm disappearing for several months at the end of tomorrow, so I
thought I'd better make sure you know about it.  I've also added
linux-scsi and linux-block to the cc list.

Cheers,

Dave.

> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > > 
> > > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> > > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [  159.425710] XFS (sdb): Ending clean mount
> > > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 000c
> > > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > FYI, crash is here:
> > 
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0x81b950bd is in iscsi_tcp_segment_done (drivers/scsi/libiscsi_tcp.c:102).
> > 97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98struct scatterlist *sg, unsigned int offset)
> > 99  {
> > 100 segment->sg = sg;
> > 101 segment->sg_offset = offset;
> > 102 segment->size = min(sg->length - offset,
> > 103 segment->total_size - segment->total_copied);
> > 104 segment->data = NULL;
> > 105 }
> > 106 
> > 
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> > 
> > -Dave.
> > 
> > > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067 PMD 0 [  160.277309]
> > > [  160.277523] Oops:  [#1] PREEMPT SMP
> > > [  160.278004] Modules linked in:
> > > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [  160.280919] task: 88003e28 task.stack: c908
> > > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > > [  160.282980] RAX:  RBX: 880039061730 RCX:
> > > [  160.283854] RDX: 1e00 RSI:  RDI: 880039061730
> > > [  160.284738] RBP: c9083c90 R08: 0200 R09: 05a8
> > > [  160.285627] R10: 9835607d R11:  R12: 0200
> > > [  160.286495] R13:  R14: 8800390615a0 R15: 880039061730
> > > [  160.287362] FS:  () GS:88003fc0() knlGS:
> > > [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> > > [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 06f0
> > > [  160.290084] Call Trace:
> > > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > > [  160.293835]  process_one_work+0x1de/0x4d0
> > > [  160.294388]  worker_thread+0x4b/0x4f0
> > > [  160.294889]  kthread+0x10c/0x140
> > > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > > [  160.296525]  ret_from_fork+0x25/0x30
> > > [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> > > [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: c9083c38
> > > [  160.301584] CR2: 000c
> > > 
> > > 
> > > Known problem, or something new?
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > da...@fromorbit.com
> > > 
> > 
> > -- 
> > Dave Chinner
> > da...@fromorbit.com
> 

-- 
Dave Chinner
da...@fromorbit.com
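
A note on the crash site quoted above: the faulting address 000c (i.e.
0xc) matches the offset of the length member in a 4.9-era struct
scatterlist on x86_64 when CONFIG_DEBUG_SG is off, which is consistent
with sg == NULL being dereferenced at sg->length in the listing. A
minimal standalone sketch, assuming that simplified layout (not the
kernel's verbatim definition):

#include <stdio.h>
#include <stddef.h>

/* Simplified 4.9-era struct scatterlist layout on x86_64 with
 * CONFIG_DEBUG_SG disabled - an assumed layout for illustration,
 * not the kernel's definition. */
struct scatterlist {
	unsigned long page_link;	/* offset 0x00 */
	unsigned int  offset;		/* offset 0x08 */
	unsigned int  length;		/* offset 0x0c */
	unsigned long dma_address;	/* offset 0x10 */
};

int main(void)
{
	/* Prints 0xc: a NULL sg dereferenced at sg->length faults at
	 * exactly the address the oops reports in CR2. */
	printf("offsetof(struct scatterlist, length) = %#zx\n",
	       offsetof(struct scatterlist, length));
	return 0;
}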

Re: [PATCH v2] RFD: switch MMC/SD to use blk-mq multiqueueing

2016-12-21 Thread Ritesh Harjani

Hi,

I may have some silly queries here. Please bear with my limited
understanding of blk-mq.


On 12/20/2016 7:31 PM, Linus Walleij wrote:
> HACK ALERT: DO NOT MERGE THIS! IT IS A FYI PATCH FOR DISCUSSION
> ONLY.
>
> This hack switches the MMC/SD subsystem from using the legacy blk
> layer to using blk-mq. It does this by registering one single
> hardware queue, since MMC/SD has only one command pipe. I kill

Could you please confirm this: would even HW/SW CMDQ in eMMC use only
one hardware queue, with (say ~31) as the queue depth of that HW
queue? Is this understanding correct?


Or would it be possible to have more than one HW queue with a smaller
queue depth per HW queue?




> off the worker thread altogether and let the MQ core logic fire
> sleepable requests directly into the MMC core.
>
> We emulate the 2 elements deep pipeline by specifying queue depth
> 2, which is an elaborate lie that makes the block layer issue
> another request while a previous request is in transit. It's not
> neat but it works.
>
> As the pipeline needs to be flushed by pushing in a NULL request
> after the last block layer request I added a delayed work with a
> timeout of zero. This will fire as soon as the block layer stops
> pushing in requests: as long as there are new requests the MQ
> block layer will just repeatedly cancel this pipeline flush work
> and push new requests into the pipeline, but once the requests
> stop coming the NULL request will be flushed into the pipeline.
>
> It's not pretty but it works... Look at the following performance
> statistics:

I understand that the block drivers are moving to the blk-mq framework.
But keeping that reason apart, do we also anticipate any theoretical
performance gains in moving the mmc driver to blk-mq - for both
legacy eMMC and SW/HW CMDQ in eMMC? And by how much?

It would be even better to know whether adding a scheduler to blk-mq
will make any difference in perf gains in this case.

Do we have any rough estimate or study on that?
This is only out of curiosity and for information purposes.

Regards
Ritesh



> BEFORE this patch:
>
> time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.0GB) copied, 45.145874 seconds, 22.7MB/s
> real    0m 45.15s
> user    0m 0.02s
> sys     0m 7.51s
>
> mount /dev/mmcblk0p1 /mnt/
> cd /mnt/
> time find . > /dev/null
> real    0m 3.70s
> user    0m 0.29s
> sys     0m 1.63s
>
> AFTER this patch:
>
> time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.0GB) copied, 45.285431 seconds, 22.6MB/s
> real    0m 45.29s
> user    0m 0.02s
> sys     0m 6.58s
>
> mount /dev/mmcblk0p1 /mnt/
> cd /mnt/
> time find . > /dev/null
> real    0m 4.37s
> user    0m 0.27s
> sys     0m 1.65s
>
> The results are consistent.
>
> As you can see, for a straight dd-like task, we get more or less the
> same nice parallelism as for the old framework. I have confirmed
> through debugprints that indeed this is because the two-stage pipeline
> is full at all times.
>
> However, for spurious reads in the find command, we already see a big
> performance regression.
>
> This is because there are many small operations requiring a flush of
> the pipeline, which used to happen immediately with the old block
> layer interface code that used to pull a few NULL requests off the
> queue and feed them into the pipeline immediately after the last
> request, but happens after the delayed work is executed in this
> new framework. The delayed work is never quick enough to terminate
> all these small operations even if we schedule it immediately after
> the last request.
>
> AFAICT the only way forward to provide proper performance with MQ
> for MMC/SD is to get the requests to complete out-of-sync, i.e. when
> the driver calls back to MMC/SD core to notify that a request is
> complete, it should not notify any main thread with a completion
> as is done right now, but instead directly call blk_end_request_all()
> and only schedule some extra communication with the card if necessary
> for example to handle an error condition.
>
> This rework needs a bigger rewrite so we can get rid of the paradigm
> of the block layer "driving" the requests through the pipeline.
>
> Signed-off-by: Linus Walleij 
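
For readers following the thread, the registration being described -
one hardware queue with a queue depth of 2 - would look roughly like
this against the 4.9-era blk-mq API. This is a sketch under stated
assumptions: mmc_mq_setup() and mmc_mq_ops are hypothetical names, and
BLK_MQ_F_BLOCKING is assumed to be available in this tree; it is not
the actual patch:

#include <linux/blk-mq.h>
#include <linux/string.h>

/* Sketch only: one hardware queue (MMC/SD has a single command pipe)
 * with queue depth 2 to emulate the two-stage pipeline described
 * above. mmc_mq_ops is a hypothetical blk_mq_ops instance. */
static struct blk_mq_tag_set mmc_tag_set;

static int mmc_mq_setup(void)
{
	memset(&mmc_tag_set, 0, sizeof(mmc_tag_set));
	mmc_tag_set.ops = &mmc_mq_ops;
	mmc_tag_set.nr_hw_queues = 1;	/* one command pipe */
	mmc_tag_set.queue_depth = 2;	/* the "elaborate lie" above */
	mmc_tag_set.numa_node = NUMA_NO_NODE;
	mmc_tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
			    BLK_MQ_F_BLOCKING;	/* sleepable ->queue_rq() */
	return blk_mq_alloc_tag_set(&mmc_tag_set);
}

blk_mq_init_queue() on this tag set would then provide the request
queue, and the NULL-request pipeline flush described above would hang
off a delayed work queued with a zero timeout.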



Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2016-12-21 Thread Jens Axboe
On 12/21/2016 04:59 AM, Bart Van Assche wrote:
> Since this patch is the first patch that introduces a call to
> blk_queue_exit() from a module other than the block layer core,
> shouldn't this patch export the blk_queue_exit() function? An attempt
> to build mq-deadline as a module resulted in the following:
> 
> ERROR: "blk_queue_exit" [block/mq-deadline.ko] undefined!
> make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
> make: *** [Makefile:1198: modules] Error 2
> Execution failed: make all

Yes, it should. I'll make the export for now; I want to move that check
and free/drop into the generic code so that the schedulers don't have to
worry about it.

-- 
Jens Axboe
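
For reference, a minimal sketch of the missing export Bart's build
error points at, assuming the 4.10-era definition of blk_queue_exit()
in block/blk-core.c (the actual fix may differ, e.g. EXPORT_SYMBOL
rather than EXPORT_SYMBOL_GPL):

/* block/blk-core.c - sketch of the export being discussed; the
 * function body matches the 4.10-era definition, the export line
 * is the assumed fix. */
void blk_queue_exit(struct request_queue *q)
{
	percpu_ref_put(&q->q_usage_counter);
}
EXPORT_SYMBOL_GPL(blk_queue_exit);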



Re: [PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-21 Thread Martin K. Petersen
> "Mauricio" == Mauricio Faria de Oliveira  
> writes:

Mauricio,

Mauricio> When a SCSI command (e.g., read operation) is partially
Mauricio> completed with good status and residual bytes (i.e., not all
Mauricio> the bytes from the specified transfer length were transferred)
Mauricio> the SCSI midlayer will update the request/bios with the
Mauricio> completed bytes and requeue the request in order to complete
Mauricio> the remainder/pending bytes.

I agree with Christoph and Hannes. Some of this falls into the gray area
that's outside of the T10 spec (HBA programming interface guarantees)
but it seems like a deficiency in the HBA to report a byte count that's
not a multiple of the logical block size. A block can't be partially
written. Either it made it or it didn't. Regardless of how the I/O is
being broken up into frames at the transport level and at which offset
the transfer was interrupted.

I am also not a fan of the delayed retry stuff which seems somewhat
orthogonal to the problem you're describing.

-- 
Martin K. Petersen  Oracle Linux Engineering
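
Concretely, the invariant Martin describes reduces to a completion-time
check along these lines (a hypothetical sketch, not the submitted
patch; scsi_get_resid() and scsi_device.sector_size are real kernel
interfaces, the helper itself is invented):

#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>

/* Hypothetical helper, not from the submitted patch: true when the
 * reported residual is a whole number of logical blocks, as SPC
 * requires for non-processor device types. */
static bool scsi_resid_block_aligned(struct scsi_cmnd *cmd)
{
	unsigned int lbs = cmd->device->sector_size;

	return lbs == 0 || (scsi_get_resid(cmd) % lbs) == 0;
}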


Re: [PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-21 Thread Hannes Reinecke
On 12/21/2016 08:50 AM, Christoph Hellwig wrote:
> On Tue, Dec 20, 2016 at 12:02:27AM -0200, Mauricio Faria de Oliveira wrote:
>> When a SCSI command (e.g., read operation) is partially completed
>> with good status and residual bytes (i.e., not all the bytes from
>> the specified transfer length were transferred) the SCSI midlayer
>> will update the request/bios with the completed bytes and requeue
>> the request in order to complete the remainder/pending bytes.
>>
>> However, when the device sector size is greater than the 512-byte
>> default/kernel sector size, alignment restrictions and validation
>> apply (both to the starting logical block address, and the number
>> of logical blocks to transfer) -- values must be multiples of the
>> device sector size, otherwise the kernel fails the request in the
>> preparation stage (e.g., sd_setup_read_write_cmnd() at sd.c file):
> 
> How do you even get an unaligned residual count?  Except for SES
> processor devices (which will only issue BLOCK_PC commands) this is
> not allowed by SPC:
> 
> "The residual count shall be reported in bytes if the peripheral device
>  type in the destination target descriptor is 03h (i.e., processor device),
>  and in destination device blocks for all other device type codes."

Which actually would be pretty much my objection, too.

This would only be applicable for 512e drives, where we _might_ end up
with a residual smaller than the physical sector size.
But that should be handled by firmware; after all, that's what the 'e'
implies, right?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)