problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
On one of my pc's I have 2 PATA disks (one, WDC below is used for booting, the other SAMSUNG is not mounted), plus an IBM SCSI disk using a DPT 2044W controller with eata driver and sometimes a Verbatim Storengo USB stick. On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause during the start-up when the USB stick was inserted but boot normally otherwise. A git-bisect lead to: commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820 Author: Jan KaraDate: Wed Feb 8 08:05:56 2017 +0100 block: Move bdi_unregister() to del_gendisk() Commit 6cd18e711dd8 "block: destroy bdi before blockdev is unregistered." moved bdi unregistration (at that time through bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because it needs to happen before blk_unregister_region() call in del_gendisk() for MD. SCSI though will free up the device number from sd_remove() called through a maze of callbacks from device_del() in __scsi_remove_device() before blk_cleanup_queue() and thus similar races as described in 6cd18e711dd8 can happen for SCSI as well as reported by Omar [1]. Moving bdi_unregister() to del_gendisk() works for MD and fixes the problem for SCSI since del_gendisk() gets called from sd_remove() before freeing the device number. This also makes device_add_disk() (calling bdi_register_owner()) more symmetric with del_gendisk(). [1] http://marc.info/?l=linux-block=148554717109098=2 When booting the bad kernel, I would eventually get a prompt to press the enter key to boot and it eventually started, but the SCSI disk partitions were not found by blkid nor could they be mounted. lsscsi reports: [0:0:6:0]diskIBM DCAS-34330W S65A /dev/sda [1:0:0:0]diskATA WDC WD3200AAJB-0 2C01 /dev/sdc [2:0:0:0]cd/dvd HL-DT-ST DVDRAM GSA-4163B A103 /dev/sr0 [2:0:1:0]diskATA SAMSUNG SP4002H 0-57 /dev/sdd [3:0:0:0]diskVerbatim STORE N GO 5.00 /dev/sdb blkid reports: /dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat" PARTUUID="961d9655-01" /dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3" PTTYPE="dos" PARTUUID="000750bf-01" /dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap" PARTUUID="000750bf-05" /dev/sdc6: UUID="7417-5AFF" TYPE="vfat" PARTUUID="000750bf-06" /dev/sdc7: UUID="96c96a61-8615-4715-86d0-09cb8c62638c" TYPE="ext3" PARTUUID="000750bf-07" /dev/sdd1: LABEL="W-98 SE" UUID="3571-16DE" TYPE="vfat" PARTUUID="43598af3-01" /dev/sdd3: UUID="fd6a052e-c062-4c47-801d-087595635c5d" SEC_TYPE="ext2" TYPE="ext3" PARTUUID="43598af3-03" /dev/sdd5: UUID="026a3f5c-0064-4ae7-869e-519d2cee05e7" SEC_TYPE="ext2" TYPE="ext3" PTTYPE="dos" PARTUUID="43598af3-05" /dev/sdd6: UUID="9a0970fa-74ba-4426-98ac-1e8b81933e0e" TYPE="swap" PARTUUID="43598af3-06" /dev/sdd7: UUID="4912-06CA" TYPE="vfat" PARTUUID="43598af3-07" The boot screen is at: http://www.users.on.net/~arthur.marsh/20170308_18.jpg and the dmesg output from booting the bad kernel is attached. I'm happy to supply any other configuration details needed and run further tests. Regards, Arthur. [0.00] Linux version 4.10.0+ (root@victoria) (gcc version 6.3.0 20170221 (Debian 6.3.0-8) ) #652 SMP PREEMPT Wed Mar 8 04:45:06 ACDT 2017 [0.00] x86/fpu: x87 FPU will use FXSAVE [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0x3ffa] usable [0.00] BIOS-e820: [mem 0x3ffb-0x3ffbdfff] ACPI data [0.00] BIOS-e820: [mem 0x3ffbe000-0x3ffd] ACPI NVS [0.00] BIOS-e820: [mem 0x3ffe-0x3fff] reserved [0.00] BIOS-e820: [mem 0xe000-0xefff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xff78-0x] reserved [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.3 present. [0.00] DMI: System manufacturer System Product Name/A8V-MX, BIOS 0503 12/06/2005 [0.00] e820: update [mem 0x-0x0fff] usable ==> reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] e820: last_pfn = 0x3ffb0 max_arch_pfn = 0x100 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-E uncachable [0.00] F-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask FFC000 write-back [0.00] 1 base 00D000 mask FFF000
problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
On one of my pc's I have 2 PATA disks (one, WDC below is used for booting, the other SAMSUNG is not mounted), plus an IBM SCSI disk using a DPT 2044W controller with eata driver and sometimes a Verbatim Storengo USB stick. On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause during the start-up when the USB stick was inserted but boot normally otherwise. A git-bisect lead to: commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820 Author: Jan Kara Date: Wed Feb 8 08:05:56 2017 +0100 block: Move bdi_unregister() to del_gendisk() Commit 6cd18e711dd8 "block: destroy bdi before blockdev is unregistered." moved bdi unregistration (at that time through bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because it needs to happen before blk_unregister_region() call in del_gendisk() for MD. SCSI though will free up the device number from sd_remove() called through a maze of callbacks from device_del() in __scsi_remove_device() before blk_cleanup_queue() and thus similar races as described in 6cd18e711dd8 can happen for SCSI as well as reported by Omar [1]. Moving bdi_unregister() to del_gendisk() works for MD and fixes the problem for SCSI since del_gendisk() gets called from sd_remove() before freeing the device number. This also makes device_add_disk() (calling bdi_register_owner()) more symmetric with del_gendisk(). [1] http://marc.info/?l=linux-block=148554717109098=2 When booting the bad kernel, I would eventually get a prompt to press the enter key to boot and it eventually started, but the SCSI disk partitions were not found by blkid nor could they be mounted. lsscsi reports: [0:0:6:0]diskIBM DCAS-34330W S65A /dev/sda [1:0:0:0]diskATA WDC WD3200AAJB-0 2C01 /dev/sdc [2:0:0:0]cd/dvd HL-DT-ST DVDRAM GSA-4163B A103 /dev/sr0 [2:0:1:0]diskATA SAMSUNG SP4002H 0-57 /dev/sdd [3:0:0:0]diskVerbatim STORE N GO 5.00 /dev/sdb blkid reports: /dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat" PARTUUID="961d9655-01" /dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3" PTTYPE="dos" PARTUUID="000750bf-01" /dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap" PARTUUID="000750bf-05" /dev/sdc6: UUID="7417-5AFF" TYPE="vfat" PARTUUID="000750bf-06" /dev/sdc7: UUID="96c96a61-8615-4715-86d0-09cb8c62638c" TYPE="ext3" PARTUUID="000750bf-07" /dev/sdd1: LABEL="W-98 SE" UUID="3571-16DE" TYPE="vfat" PARTUUID="43598af3-01" /dev/sdd3: UUID="fd6a052e-c062-4c47-801d-087595635c5d" SEC_TYPE="ext2" TYPE="ext3" PARTUUID="43598af3-03" /dev/sdd5: UUID="026a3f5c-0064-4ae7-869e-519d2cee05e7" SEC_TYPE="ext2" TYPE="ext3" PTTYPE="dos" PARTUUID="43598af3-05" /dev/sdd6: UUID="9a0970fa-74ba-4426-98ac-1e8b81933e0e" TYPE="swap" PARTUUID="43598af3-06" /dev/sdd7: UUID="4912-06CA" TYPE="vfat" PARTUUID="43598af3-07" The boot screen is at: http://www.users.on.net/~arthur.marsh/20170308_18.jpg and the dmesg output from booting the bad kernel is attached. I'm happy to supply any other configuration details needed and run further tests. Regards, Arthur. [0.00] Linux version 4.10.0+ (root@victoria) (gcc version 6.3.0 20170221 (Debian 6.3.0-8) ) #652 SMP PREEMPT Wed Mar 8 04:45:06 ACDT 2017 [0.00] x86/fpu: x87 FPU will use FXSAVE [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0x3ffa] usable [0.00] BIOS-e820: [mem 0x3ffb-0x3ffbdfff] ACPI data [0.00] BIOS-e820: [mem 0x3ffbe000-0x3ffd] ACPI NVS [0.00] BIOS-e820: [mem 0x3ffe-0x3fff] reserved [0.00] BIOS-e820: [mem 0xe000-0xefff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xff78-0x] reserved [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.3 present. [0.00] DMI: System manufacturer System Product Name/A8V-MX, BIOS 0503 12/06/2005 [0.00] e820: update [mem 0x-0x0fff] usable ==> reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] e820: last_pfn = 0x3ffb0 max_arch_pfn = 0x100 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-E uncachable [0.00] F-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask FFC000 write-back [0.00] 1 base 00D000 mask FFF000 write-combining
Re: problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
On Wed 08-03-17 20:37:54, Arthur Marsh wrote: > > On one of my pc's I have 2 PATA disks (one, WDC below is used for booting, > the other SAMSUNG is not mounted), plus an IBM SCSI disk using a DPT 2044W > controller with eata driver and sometimes a Verbatim Storengo USB stick. > > On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause > during the start-up when the USB stick was inserted but boot normally > otherwise. > > A git-bisect lead to: > > commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820 > Author: Jan Kara> Date: Wed Feb 8 08:05:56 2017 +0100 > > block: Move bdi_unregister() to del_gendisk() > > Commit 6cd18e711dd8 "block: destroy bdi before blockdev is > unregistered." moved bdi unregistration (at that time through > bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because > it needs to happen before blk_unregister_region() call in del_gendisk() > for MD. SCSI though will free up the device number from sd_remove() > called through a maze of callbacks from device_del() in > __scsi_remove_device() before blk_cleanup_queue() and thus similar races > as described in 6cd18e711dd8 can happen for SCSI as well as reported by > Omar [1]. > > Moving bdi_unregister() to del_gendisk() works for MD and fixes the > problem for SCSI since del_gendisk() gets called from sd_remove() before > freeing the device number. > > This also makes device_add_disk() (calling bdi_register_owner()) more > symmetric with del_gendisk(). > > [1] http://marc.info/?l=linux-block=148554717109098=2 Thanks for report and the analysis! The real culprit seems to be: [4.614530] BUG: unable to handle kernel NULL pointer dereference at 0258 [4.614606] IP: sd_revalidate_disk+0x12/0x1460 [sd_mod] [4.614668] *pdpt = 370de001 *pde = [4.614788] Oops: [#1] PREEMPT SMP [4.614847] Modules linked in: uas sd_mod usb_storage psmouse pata_via ehci_pci uhci_hcd ehci_hcd via_rhine usbcore eata mii i2c_viapro usb_common libata scsi_mod [4.614970] CPU: 0 PID: 127 Comm: kworker/u4:5 Not tainted 4.10.0+ #652 [4.615035] Hardware name: System manufacturer System Product Name/A8V-MX, BIOS 050312/06/2005 [4.615117] Workqueue: events_unbound async_run_entry_fn [4.615180] task: f71b9080 task.stack: f729c000 [4.615243] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod] [4.615305] EFLAGS: 00010296 CPU: 0 [4.615363] EAX: f7261000 EBX: ECX: EDX: f7196000 [4.615426] ESI: f7196000 EDI: f7261000 EBP: f729deac ESP: f729de44 [4.615489] DS: 007b ES: 007b FS: 00d8 GS: SS: 0068 [4.615552] CR0: 80050033 CR2: 0258 CR3: 370ce6c0 CR4: 06f0 [4.615615] Call Trace: [4.615675] ? kobject_uevent+0xa/0x10 [4.615736] ? blk_integrity_add+0x41/0x50 [4.615797] ? device_add_disk+0x275/0x450 [4.615861] sd_probe_async+0x103/0x1b0 [sd_mod] [4.615923] ? preempt_count_sub+0x26/0x70 [4.615984] async_run_entry_fn+0x30/0x170 [4.616045] ? preempt_count_sub+0x26/0x70 [4.616107] process_one_work+0x235/0x670 [4.616167] ? process_one_work+0x1a3/0x670 [4.616229] worker_thread+0x2f/0x4c0 [4.616288] ? preempt_count_add+0x82/0xc0 [4.616349] ? process_one_work+0x670/0x670 [4.616410] kthread+0xd0/0x100 [4.616467] ? process_one_work+0x670/0x670 [4.616528] ? kthread_create_on_node+0x30/0x30 [4.616591] ret_from_fork+0x21/0x2c [4.616652] Code: 43 02 31 c0 8b 5d f4 c9 c3 8d 75 e4 89 34 24 e8 95 0a fb ff eb d4 8d 76 00 55 89 e5 57 56 53 83 ec 5c 8b 98 0c 03 00 00 89 45 c8 <8b> 93 58 02 00 00 8b 43 04 89 45 d8 8b 82 08 03 00 00 89 45 d0 [4.616840] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod] SS:ESP: 0068:f729de44 [4.616914] CR2: 0258 [4.616973] ---[ end trace b1bff84fe33f3677 ]--- in your log. And this apparently happened because disk->private_data was NULL. Can you try whether attached patches fix the problem for you? Honza > > When booting the bad kernel, I would eventually get a prompt to press the > enter key to boot and it eventually started, but the SCSI disk partitions > were not found by blkid nor could they be mounted. > > lsscsi reports: > > [0:0:6:0]diskIBM DCAS-34330W S65A /dev/sda > [1:0:0:0]diskATA WDC WD3200AAJB-0 2C01 /dev/sdc > [2:0:0:0]cd/dvd HL-DT-ST DVDRAM GSA-4163B A103 /dev/sr0 > [2:0:1:0]diskATA SAMSUNG SP4002H 0-57 /dev/sdd > [3:0:0:0]diskVerbatim STORE N GO 5.00 /dev/sdb > > blkid reports: > > /dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat" > PARTUUID="961d9655-01" > /dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3" > PTTYPE="dos" PARTUUID="000750bf-01" > /dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap" >
Re: problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
On Wed 08-03-17 20:37:54, Arthur Marsh wrote: > > On one of my pc's I have 2 PATA disks (one, WDC below is used for booting, > the other SAMSUNG is not mounted), plus an IBM SCSI disk using a DPT 2044W > controller with eata driver and sometimes a Verbatim Storengo USB stick. > > On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause > during the start-up when the USB stick was inserted but boot normally > otherwise. > > A git-bisect lead to: > > commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820 > Author: Jan Kara > Date: Wed Feb 8 08:05:56 2017 +0100 > > block: Move bdi_unregister() to del_gendisk() > > Commit 6cd18e711dd8 "block: destroy bdi before blockdev is > unregistered." moved bdi unregistration (at that time through > bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because > it needs to happen before blk_unregister_region() call in del_gendisk() > for MD. SCSI though will free up the device number from sd_remove() > called through a maze of callbacks from device_del() in > __scsi_remove_device() before blk_cleanup_queue() and thus similar races > as described in 6cd18e711dd8 can happen for SCSI as well as reported by > Omar [1]. > > Moving bdi_unregister() to del_gendisk() works for MD and fixes the > problem for SCSI since del_gendisk() gets called from sd_remove() before > freeing the device number. > > This also makes device_add_disk() (calling bdi_register_owner()) more > symmetric with del_gendisk(). > > [1] http://marc.info/?l=linux-block=148554717109098=2 Thanks for report and the analysis! The real culprit seems to be: [4.614530] BUG: unable to handle kernel NULL pointer dereference at 0258 [4.614606] IP: sd_revalidate_disk+0x12/0x1460 [sd_mod] [4.614668] *pdpt = 370de001 *pde = [4.614788] Oops: [#1] PREEMPT SMP [4.614847] Modules linked in: uas sd_mod usb_storage psmouse pata_via ehci_pci uhci_hcd ehci_hcd via_rhine usbcore eata mii i2c_viapro usb_common libata scsi_mod [4.614970] CPU: 0 PID: 127 Comm: kworker/u4:5 Not tainted 4.10.0+ #652 [4.615035] Hardware name: System manufacturer System Product Name/A8V-MX, BIOS 050312/06/2005 [4.615117] Workqueue: events_unbound async_run_entry_fn [4.615180] task: f71b9080 task.stack: f729c000 [4.615243] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod] [4.615305] EFLAGS: 00010296 CPU: 0 [4.615363] EAX: f7261000 EBX: ECX: EDX: f7196000 [4.615426] ESI: f7196000 EDI: f7261000 EBP: f729deac ESP: f729de44 [4.615489] DS: 007b ES: 007b FS: 00d8 GS: SS: 0068 [4.615552] CR0: 80050033 CR2: 0258 CR3: 370ce6c0 CR4: 06f0 [4.615615] Call Trace: [4.615675] ? kobject_uevent+0xa/0x10 [4.615736] ? blk_integrity_add+0x41/0x50 [4.615797] ? device_add_disk+0x275/0x450 [4.615861] sd_probe_async+0x103/0x1b0 [sd_mod] [4.615923] ? preempt_count_sub+0x26/0x70 [4.615984] async_run_entry_fn+0x30/0x170 [4.616045] ? preempt_count_sub+0x26/0x70 [4.616107] process_one_work+0x235/0x670 [4.616167] ? process_one_work+0x1a3/0x670 [4.616229] worker_thread+0x2f/0x4c0 [4.616288] ? preempt_count_add+0x82/0xc0 [4.616349] ? process_one_work+0x670/0x670 [4.616410] kthread+0xd0/0x100 [4.616467] ? process_one_work+0x670/0x670 [4.616528] ? kthread_create_on_node+0x30/0x30 [4.616591] ret_from_fork+0x21/0x2c [4.616652] Code: 43 02 31 c0 8b 5d f4 c9 c3 8d 75 e4 89 34 24 e8 95 0a fb ff eb d4 8d 76 00 55 89 e5 57 56 53 83 ec 5c 8b 98 0c 03 00 00 89 45 c8 <8b> 93 58 02 00 00 8b 43 04 89 45 d8 8b 82 08 03 00 00 89 45 d0 [4.616840] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod] SS:ESP: 0068:f729de44 [4.616914] CR2: 0258 [4.616973] ---[ end trace b1bff84fe33f3677 ]--- in your log. And this apparently happened because disk->private_data was NULL. Can you try whether attached patches fix the problem for you? Honza > > When booting the bad kernel, I would eventually get a prompt to press the > enter key to boot and it eventually started, but the SCSI disk partitions > were not found by blkid nor could they be mounted. > > lsscsi reports: > > [0:0:6:0]diskIBM DCAS-34330W S65A /dev/sda > [1:0:0:0]diskATA WDC WD3200AAJB-0 2C01 /dev/sdc > [2:0:0:0]cd/dvd HL-DT-ST DVDRAM GSA-4163B A103 /dev/sr0 > [2:0:1:0]diskATA SAMSUNG SP4002H 0-57 /dev/sdd > [3:0:0:0]diskVerbatim STORE N GO 5.00 /dev/sdb > > blkid reports: > > /dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat" > PARTUUID="961d9655-01" > /dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3" > PTTYPE="dos" PARTUUID="000750bf-01" > /dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap" >