Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
>
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the XFS
changes were being tested against 4.9-rc kernels. The iscsi problem
showed up when I updated the base kernel from 4.9 to 4.10-current last
week to test the pullreq I was going to send you. I've been busy with
other stuff until now, so I didn't upgrade my working trees again until
today in the hope the iscsi problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] [ cut here ]
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> > shadow_lru_isolate+0x171/0x220
>
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
>
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
>
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
>
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
>
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducible:
Create a pair of iscsi luns, mkfs, run xfstests on them. iscsi fails a
second after xfstests mounts the filesystems.

The test machine I'm having all these other problems on? Stable and
steady as a rock using PMEM devices. Moment I go to use /dev/vdc (i.e.
run load/perf benchmarks) it starts falling over left, right and
center. And I just smacked into this in the bulkstat phase of the
benchmark (mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat pfn:14945
[ 2729.751863] page:ea525140 count:-1 mapcount:0 mapping: (null) index:0x0
[ 2729.753763] flags: 0x4000()
[ 2729.754671] raw: 4000
[ 2729.756469] raw: dead0100 dead0200
[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: GB 4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]
[ 2729.765004] dump_stack+0x63/0x83
[ 2729.765784] bad_page+0xc4/0x130
[ 2729.766552] free_pages_check_bad+0x4f/0x70
[ 2729.767531] free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513] ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510] drain_pages_zone+0x41/0x60
[ 2729.770417] drain_pages+0x3e/0x60
[ 2729.771215] drain_local_pages+0x24/0x30
[ 2729.772138] flush_smp_call_function_queue+0x88/0x160
[ 2729.773317] generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742] smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000] smp_call_function_interrupt+0xe/0x10
[ 2729.777102] call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:c9000f0cf500 EFLAGS: 0202 ORIG_RAX: ff03
[ 2729.780852] RAX: 77541291 RBX: 88008b5efe40 RCX: 002e
[ 2729.782514] RDX: 0577 RSI: 05541291 RDI: 0001
[ 2729.784167] RBP: c9000f0cf500 R08: 0007 R09: c9000f0cf678
[ 2729.785818] R10: 0006 R11: 1000 R12: 0061
[ 2729.787480] R13: 0001 R14: 83214e30 R15: 0080
[ 2729.789124]
[ 2729.789626] __delay+0xf/0x20
[ 2729.790333] do_raw_spin_lock+0x8c/0x160
[ 2729.791255] _raw_spin_lock+0x15/0x20
[ 2729.792112] list_lru_add+0x1a/0x70
[ 2729.792932] xfs_buf_rele+0x3e7/0x410
[ 2729.793792] xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841] shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099] shrink_node+0x57/0x90
[ 2729.796905] do_try_to_free_pages+0xdd/0x230
[ 2729.797914] try_to_free_pages+0xce/0x1a0
[ 2729.798852] __alloc_pages_slowpath+0x2df/0x960
[ 2729.799908] __alloc_pages_nodemask+0x24b/0x290
[ 2729.800963] new_slab+0x2ac/0x380
[ 2729.801743] ___slab_alloc.constprop.82+0x336/0x440
[
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Thu, Dec 22, 2016 at 05:30:46PM +1100, Dave Chinner wrote:
> > For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> > so it will be ignored anyway. That being said both I and the lists
> > got CCed halfway through the thread and I haven't seen the original
> > report, so I'm not really sure what's going on here anyway.
>
> http://www.gossamer-threads.com/lists/linux/kernel/2587485

This doesn't look like the discard changes, but if Chris wants to test
without them f9d03f96b988 reverts cleanly.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
>
> But that check was only for discard (and discard-like) bios which
> had the magic single page that sometimes was unused attached.
>
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway. That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.

http://www.gossamer-threads.com/lists/linux/kernel/2587485

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > >
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window. I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > >
> > > Hmm. There's not much to go by.
> > >
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> >
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd. I'm saving that off for later, and seeing if I
> > can't do a bisect on the iSCSI issue.
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.
> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW 4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691] dump_stack+0x63/0x83
> [628867.637937] __warn+0xcb/0xf0
> [628867.641359] warn_slowpath_null+0x1d/0x20
> [628867.643362] shadow_lru_isolate+0x171/0x220
> [628867.644627] __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780] ? __list_lru_init+0x70/0x70
> [628867.646628] list_lru_walk_one+0x17/0x20
> [628867.647488] scan_shadow_nodes+0x34/0x50
> [628867.648358] shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506] shrink_node+0x57/0x90
> [628867.650233] do_try_to_free_pages+0xdd/0x230
> [628867.651157] try_to_free_pages+0xce/0x1a0
> [628867.652342] __alloc_pages_slowpath+0x2df/0x960
> [628867.653332] ? __might_sleep+0x4a/0x80
> [628867.654148] __alloc_pages_nodemask+0x24b/0x290
> [628867.655237] kmalloc_order+0x21/0x50
> [628867.656016] kmalloc_order_trace+0x24/0xc0
> [628867.656878] __kmalloc+0x17d/0x1d0
> [628867.657644] bucket_table_alloc+0x195/0x1d0
> [628867.658564] ? __might_sleep+0x4a/0x80
> [628867.659449] rht_deferred_worker+0x287/0x3c0
> [628867.660366] ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294] process_one_work+0x1de/0x4d0
> [628867.662208] worker_thread+0x4b/0x4f0
> [628867.662990] kthread+0x10c/0x140
> [628867.663687] ? process_one_work+0x4d0/0x4d0
> [628867.664564] ? kthread_create_on_node+0x40/0x40
> [628867.665523] ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
>
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

The system is back up, and I haven't reproduced this problem yet.
However, benchmark results are way off where they should be, and at
times the performance is utterly abysmal. The XFS for-next tree based
on the 4.9 kernel shows none of these problems, so I don't think
there's an XFS problem here.

Workload is the same 16-way fsmark workload that I've been using for
years as a performance regression test. The workload normally averages
around 230k files/s - I'm seeing an average of ~175k files/s on your
current kernel. And there are periods where performance just
completely tanks:

# ./fs_mark -D 1 -S0 -n 10 -s 0 -L 32 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 -d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 -d /mnt/scratch/15
# Version 3.3, 16 thread(s) starting at Thu Dec 22 16:29:20 2016
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Looking around a bit, the only even halfway suspicious scatterlist
> initialization thing I see is commit f9d03f96b988 ("block: improve
> handling of the magic discard payload") which used to have a magic
> hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> now it does __blk_bvec_map_sg() instead.

But that check was only for discard (and discard-like) bios which
had the magic single page that sometimes was unused attached.

For "normal" bios the for_each_segment loop iterates over bi_vcnt,
so it will be ignored anyway. That being said both I and the lists
got CCed halfway through the thread and I haven't seen the original
report, so I'm not really sure what's going on here anyway.
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner wrote:
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.

By "in the dev cycle", do you mean your XFS changes, or have you been
tracking the merge cycle at least for some testing?

> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> shadow_lru_isolate+0x171/0x220

Well, part of the changes during the merge window were the shadow
entry tracking changes that came in through Andrew's tree. Adding
Johannes Weiner to the participants.

> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

Well, I suspect that anything that creates memory pressure will end up
triggering the working set code, so ..

That said, obviously memory corruption could be involved and result in
random issues too, but I wouldn't really expect that in this code.

It would probably be really useful to get more data points - is the
problem reliably in this area, or is it going to be random and all
over the place.

That said:

> And worse, on that last error, the /host/ is now going into meltdown
> (running 4.7.5) with 32 CPUs all burning down in ACPI code:

The obvious question here is how much you trust the environment if the
host ends up also showing problems. Maybe you do end up having hw
issues pop up too. The primary suspect would presumably be the
development kernel you're testing triggering something, but it has to
be asked..

Linus
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> >
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window. I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> >
> > Hmm. There's not much to go by.
> >
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
>
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd. I'm saving that off for later, and seeing if I
> can't do a bisect on the iSCSI issue.

There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all.
I unmounted the fs, mkfs'd it again, ran the workload again and about
a minute in this fired:

[628867.607417] [ cut here ]
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW 4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691] dump_stack+0x63/0x83
[628867.637937] __warn+0xcb/0xf0
[628867.641359] warn_slowpath_null+0x1d/0x20
[628867.643362] shadow_lru_isolate+0x171/0x220
[628867.644627] __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780] ? __list_lru_init+0x70/0x70
[628867.646628] list_lru_walk_one+0x17/0x20
[628867.647488] scan_shadow_nodes+0x34/0x50
[628867.648358] shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506] shrink_node+0x57/0x90
[628867.650233] do_try_to_free_pages+0xdd/0x230
[628867.651157] try_to_free_pages+0xce/0x1a0
[628867.652342] __alloc_pages_slowpath+0x2df/0x960
[628867.653332] ? __might_sleep+0x4a/0x80
[628867.654148] __alloc_pages_nodemask+0x24b/0x290
[628867.655237] kmalloc_order+0x21/0x50
[628867.656016] kmalloc_order_trace+0x24/0xc0
[628867.656878] __kmalloc+0x17d/0x1d0
[628867.657644] bucket_table_alloc+0x195/0x1d0
[628867.658564] ? __might_sleep+0x4a/0x80
[628867.659449] rht_deferred_worker+0x287/0x3c0
[628867.660366] ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294] process_one_work+0x1de/0x4d0
[628867.662208] worker_thread+0x4b/0x4f0
[628867.662990] kthread+0x10c/0x140
[628867.663687] ? process_one_work+0x4d0/0x4d0
[628867.664564] ? kthread_create_on_node+0x40/0x40
[628867.665523] ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---

Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.
And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:

  PID USER  PR  NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
35074 root  -2   0    0   0   0 R 99.0  0.0 12:38.92 acpi_pad/12
35079 root  -2   0    0   0   0 R 99.0  0.0 12:39.40 acpi_pad/16
35080 root  -2   0    0   0   0 R 99.0  0.0 12:39.29 acpi_pad/17
35085 root  -2   0    0   0   0 R 99.0  0.0 12:39.35 acpi_pad/22
35087 root  -2   0    0   0   0 R 99.0  0.0 12:39.13 acpi_pad/24
35090 root  -2   0    0   0   0 R 99.0  0.0 12:38.89 acpi_pad/27
35093 root  -2   0    0   0   0 R 99.0  0.0 12:38.88 acpi_pad/30
35063 root  -2   0    0   0   0 R 98.1  0.0 12:40.64 acpi_pad/1
35065 root  -2   0    0   0   0 R 98.1  0.0 12:40.38 acpi_pad/3
35066 root  -2   0    0   0   0 R 98.1  0.0 12:40.30 acpi_pad/4
35067 root  -2   0    0   0   0 R 98.1  0.0 12:40.82 acpi_pad/5
35077 root  -2   0    0   0   0 R 98.1  0.0 12:39.65 acpi_pad/14
35078 root  -2   0    0   0   0 R 98.1  0.0 12:39.58 acpi_pad/15
35081 root  -2   0    0   0   0 R 98.1  0.0 12:39.32 acpi_pad/18
35072 root  -2   0    0   0   0 R 96.2
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Hi,
>
> On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner wrote:
> > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> >> Thanks Dave,
> >>
> >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> >> modules loaded (virtio block) so there's something else going on in the
> >> current merge window. I'll keep an eye on it and make sure there's
> >> nothing iSCSI needs fixing for.
> >
> > OK, so before this slips through the cracks.
> >
> > Linus - your tree as of a few minutes ago still panics immediately
> > when starting xfstests on iscsi devices. It appears to be a
> > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > seem to have bounced it and no-one is looking at it.
>
> Hmm. There's not much to go by.
>
> Can somebody in iscsi-land please try to just bisect it - I'm not
> seeing a lot of clues to where this comes from otherwise.

Yeah, my hopes of this being quickly resolved by someone else didn't
work out and whatever is going on in that test VM is looking like a
different kind of odd. I'm saving that off for later, and seeing if I
can't do a bisect on the iSCSI issue.

Chris
Re: [PATCH] scsi: do not requeue requests unaligned with device sector size
On 12/21/2016 05:50 AM, Christoph Hellwig wrote:
> How do you even get an unaligned residual count? Except for SES
> processor devices (which will only issue BLOCK_PC commands) this is
> not allowed by SPC:
>
> "The residual count shall be reported in bytes if the peripheral device
> type in the destination target descriptor is 03h (i.e., processor device),
> and in destination device blocks for all other device type codes."

On 12/21/2016 06:09 AM, Hannes Reinecke wrote:
> Which actually would be pretty much my objection, too.
>
> This would only be applicable for 512e drives, where we _might_ end up
> with a residual smaller than the physical sector size.
> But that should be handled by firmware; after all, that's what the 'e'
> implies, right?

On 12/21/2016 12:01 PM, Martin K. Petersen wrote:
> I agree with Christoph and Hannes. Some of this falls into the gray
> area that's outside of the T10 spec (HBA programming interface
> guarantees) but it seems like a deficiency in the HBA to report a
> byte count that's not a multiple of the logical block size.
>
> A block can't be partially written. Either it made it or it didn't.
> Regardless of how the I/O is being broken up into frames at the
> transport level and at which offset the transfer was interrupted.

Christoph, Hannes, Martin,

Thank you all for your comments and pointers to the
documentation/spec. I'll carry it on with the HBA and storage folks.

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
>
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window. I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.

OK, so before this slips through the cracks.

Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.

I'm disappearing for several months at the end of tomorrow, so I
thought I better make sure you know about it. I've also added
linux-scsi, linux-block to the cc list.

Cheers,

Dave.

> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > >
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > >
> > > [ 159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature
> > > enabled. Use at your own risk!
> > > [ 159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [ 159.425710] XFS (sdb): Ending clean mount
> > > [ 160.274438] BUG: unable to handle kernel NULL pointer dereference at
> > > 000c
> > > [ 160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> >
> > FYI, crash is here:
> >
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0x81b950bd is in iscsi_tcp_segment_done
> > (drivers/scsi/libiscsi_tcp.c:102).
> > 97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98                            struct scatterlist *sg, unsigned int offset)
> > 99  {
> > 100         segment->sg = sg;
> > 101         segment->sg_offset = offset;
> > 102         segment->size = min(sg->length - offset,
> > 103                             segment->total_size - segment->total_copied);
> > 104         segment->data = NULL;
> > 105 }
> > 106
> >
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> >
> > -Dave.
> >
> > > [ 160.276565] PGD 336ed067
> > > [ 160.276885] PUD 31b0d067 PMD 0
> > > [ 160.277309]
> > > [ 160.277523] Oops: [#1] PREEMPT SMP
> > > [ 160.278004] Modules linked in:
> > > [ 160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [ 160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > > [ 160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [ 160.280919] task: 88003e28 task.stack: c908
> > > [ 160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [ 160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > > [ 160.282980] RAX: RBX: 880039061730 RCX:
> > > [ 160.283854] RDX: 1e00 RSI: RDI: 880039061730
> > > [ 160.284738] RBP: c9083c90 R08: 0200 R09: 05a8
> > > [ 160.285627] R10: 9835607d R11: R12: 0200
> > > [ 160.286495] R13: R14: 8800390615a0 R15: 880039061730
> > > [ 160.287362] FS: () GS:88003fc0() knlGS:
> > > [ 160.288340] CS: 0010 DS: ES: CR0: 80050033
> > > [ 160.289113] CR2: 000c CR3: 31a8d000 CR4: 06f0
> > > [ 160.290084] Call Trace:
> > > [ 160.290429] ? inet_sendpage+0x4d/0x140
> > > [ 160.290957] iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [ 160.291597] iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [ 160.292190] iscsi_tcp_task_xmit+0xb8/0x280
> > > [ 160.292771] iscsi_xmit_task+0x53/0xc0
> > > [ 160.293282] iscsi_xmitworker+0x274/0x310
> > > [ 160.293835] process_one_work+0x1de/0x4d0
> > > [ 160.294388] worker_thread+0x4b/0x4f0
> > > [ 160.294889] kthread+0x10c/0x140
> > > [ 160.295333] ? process_one_work+0x4d0/0x4d0
> > > [ 160.295898] ? kthread_create_on_node+0x40/0x40
> > > [ 160.296525] ret_from_fork+0x25/0x30
> > > [ 160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> > > [ 160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: c9083c38
> > > [ 160.301584] CR2: 000c
> > >
> > > Known problem, or something new?
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > da...@fromorbit.com
> >
> > --
> > Dave Chinner
> > da...@fromorbit.com

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2] RFD: switch MMC/SD to use blk-mq multiqueueing
Hi,

I may have some silly queries here. Please bear with my little
understanding on blk-mq.

On 12/20/2016 7:31 PM, Linus Walleij wrote:
> HACK ALERT: DO NOT MERGE THIS! IT IS A FYI PATCH FOR DISCUSSION ONLY.
>
> This hack switches the MMC/SD subsystem from using the legacy blk
> layer to using blk-mq. It does this by registering one single
> hardware queue, since MMC/SD has only one command pipe. I kill

Could you please confirm on this - does even the HW/SW CMDQ in emmc
use only 1 hardware queue with (say ~31) as queue depth of that HW
queue? Is this understanding correct? Or will it be possible to have
more than 1 HW queue with a lesser queue depth per HW queue?

> off the worker thread altogether and let the MQ core logic fire
> sleepable requests directly into the MMC core.
>
> We emulate the 2 elements deep pipeline by specifying queue depth 2,
> which is an elaborate lie that makes the block layer issue another
> request while a previous request is in transit. It's not neat but it
> works.
>
> As the pipeline needs to be flushed by pushing in a NULL request
> after the last block layer request I added a delayed work with a
> timeout of zero. This will fire as soon as the block layer stops
> pushing in requests: as long as there are new requests the MQ block
> layer will just repeatedly cancel this pipeline flush work and push
> new requests into the pipeline, but once the requests stop coming
> the NULL request will be flushed into the pipeline. It's not pretty
> but it works...
>
> Look at the following performance statistics:

I understand that the block drivers are moving to the blk-mq
framework. But keeping that reason apart, do we also anticipate any
theoretical performance gains in moving the mmc driver to blk-mq -
for both the legacy emmc case and SW/HW CMDQ in emmc? And by how
much? It would be even better to know whether adding a scheduler to
blk-mq will make any difference in perf gains in this case. Do we
have any rough estimate or study on that?

This is only out of curiosity and for information purpose.

Regards
Ritesh

> BEFORE this patch:
>
> time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.0GB) copied, 45.145874 seconds, 22.7MB/s
> real 0m 45.15s
> user 0m 0.02s
> sys  0m 7.51s
>
> mount /dev/mmcblk0p1 /mnt/
> cd /mnt/
> time find . > /dev/null
> real 0m 3.70s
> user 0m 0.29s
> sys  0m 1.63s
>
> AFTER this patch:
>
> time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.0GB) copied, 45.285431 seconds, 22.6MB/s
> real 0m 45.29s
> user 0m 0.02s
> sys  0m 6.58s
>
> mount /dev/mmcblk0p1 /mnt/
> cd /mnt/
> time find . > /dev/null
> real 0m 4.37s
> user 0m 0.27s
> sys  0m 1.65s
>
> The results are consistent.
>
> As you can see, for a straight dd-like task, we get more or less the
> same nice parallelism as for the old framework. I have confirmed
> through debugprints that indeed this is because the two-stage
> pipeline is full at all times.
>
> However, for spurious reads in the find command, we already see a big
> performance regression. This is because there are many small
> operations requiring a flush of the pipeline, which used to happen
> immediately with the old block layer interface code that used to pull
> a few NULL requests off the queue and feed them into the pipeline
> immediately after the last request, but happens after the delayed
> work is executed in this new framework. The delayed work is never
> quick enough to terminate all these small operations even if we
> schedule it immediately after the last request.
>
> AFAICT the only way forward to provide proper performance with MQ for
> MMC/SD is to get the requests to complete out-of-sync, i.e. when the
> driver calls back to MMC/SD core to notify that a request is
> complete, it should not notify any main thread with a completion as
> is done right now, but instead directly call blk_end_request_all()
> and only schedule some extra communication with the card if necessary
> for example to handle an error condition.
>
> This rework needs a bigger rewrite so we can get rid of the paradigm
> of the block layer "driving" the requests through the pipeline.
>
> Signed-off-by: Linus Walleij
Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
On 12/21/2016 04:59 AM, Bart Van Assche wrote:
> Since this patch is the first patch that introduces a call to
> blk_queue_exit() from a module other than the block layer core,
> shouldn't this patch export the blk_queue_exit() function? An attempt
> to build mq-deadline as a module resulted in the following:
>
> ERROR: "blk_queue_exit" [block/mq-deadline.ko] undefined!
> make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
> make: *** [Makefile:1198: modules] Error 2
> Execution failed: make all

Yes, it should. I'll make the export for now, I want to move that
check and free/drop into the generic code so that the schedulers
don't have to worry about it.

--
Jens Axboe
Re: [PATCH] scsi: do not requeue requests unaligned with device sector size
> "Mauricio" == Mauricio Faria de Oliveira writes:

Mauricio,

Mauricio> When a SCSI command (e.g., read operation) is partially
Mauricio> completed with good status and residual bytes (i.e., not all
Mauricio> the bytes from the specified transfer length were transferred)
Mauricio> the SCSI midlayer will update the request/bios with the
Mauricio> completed bytes and requeue the request in order to complete
Mauricio> the remainder/pending bytes.

I agree with Christoph and Hannes. Some of this falls into the gray
area that's outside of the T10 spec (HBA programming interface
guarantees) but it seems like a deficiency in the HBA to report a
byte count that's not a multiple of the logical block size.

A block can't be partially written. Either it made it or it didn't,
regardless of how the I/O is being broken up into frames at the
transport level and at which offset the transfer was interrupted.

I am also not a fan of the delayed retry stuff which seems somewhat
orthogonal to the problem you're describing.

--
Martin K. Petersen    Oracle Linux Engineering
Re: [PATCH] scsi: do not requeue requests unaligned with device sector size
On 12/21/2016 08:50 AM, Christoph Hellwig wrote:
> On Tue, Dec 20, 2016 at 12:02:27AM -0200, Mauricio Faria de Oliveira wrote:
>> When a SCSI command (e.g., read operation) is partially completed
>> with good status and residual bytes (i.e., not all the bytes from
>> the specified transfer length were transferred) the SCSI midlayer
>> will update the request/bios with the completed bytes and requeue
>> the request in order to complete the remainder/pending bytes.
>>
>> However, when the device sector size is greater than the 512-byte
>> default/kernel sector size, alignment restrictions and validation
>> apply (both to the starting logical block address, and the number
>> of logical blocks to transfer) -- values must be multiples of the
>> device sector size, otherwise the kernel fails the request in the
>> preparation stage (e.g., sd_setup_read_write_cmnd() at sd.c file):
>
> How do you even get an unaligned residual count? Except for SES
> processor devices (which will only issue BLOCK_PC commands) this is
> not allowed by SPC:
>
> "The residual count shall be reported in bytes if the peripheral device
> type in the destination target descriptor is 03h (i.e., processor device),
> and in destination device blocks for all other device type codes."

Which actually would be pretty much my objection, too.

This would only be applicable for 512e drives, where we _might_ end up
with a residual smaller than the physical sector size.
But that should be handled by firmware; after all, that's what the 'e'
implies, right?

Cheers,

Hannes
--
Dr. Hannes Reinecke          Teamlead Storage & Networking
h...@suse.de                 +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)