problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820

2017-03-08 Thread Arthur Marsh


On one of my pc's I have 2 PATA disks (one, WDC below is used for 
booting, the other SAMSUNG is not mounted), plus an IBM SCSI disk using 
a DPT 2044W controller with eata driver and sometimes a Verbatim 
Storengo USB stick.


On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause 
during the start-up when the USB stick was inserted but boot normally 
otherwise.


A git-bisect lead to:

commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
Author: Jan Kara 
Date:   Wed Feb 8 08:05:56 2017 +0100

block: Move bdi_unregister() to del_gendisk()

Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
unregistered." moved bdi unregistration (at that time through
bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
it needs to happen before blk_unregister_region() call in del_gendisk()
for MD. SCSI though will free up the device number from sd_remove()
called through a maze of callbacks from device_del() in
__scsi_remove_device() before blk_cleanup_queue() and thus similar 
races

as described in 6cd18e711dd8 can happen for SCSI as well as reported by
Omar [1].

Moving bdi_unregister() to del_gendisk() works for MD and fixes the
problem for SCSI since del_gendisk() gets called from sd_remove() 
before

freeing the device number.

This also makes device_add_disk() (calling bdi_register_owner()) more
symmetric with del_gendisk().

[1] http://marc.info/?l=linux-block=148554717109098=2

When booting the bad kernel, I would eventually get a prompt to press 
the enter key to boot and it eventually started, but the SCSI disk 
partitions were not found by blkid nor could they be mounted.


lsscsi reports:

[0:0:6:0]diskIBM  DCAS-34330W  S65A  /dev/sda
[1:0:0:0]diskATA  WDC WD3200AAJB-0 2C01  /dev/sdc
[2:0:0:0]cd/dvd  HL-DT-ST DVDRAM GSA-4163B A103  /dev/sr0
[2:0:1:0]diskATA  SAMSUNG SP4002H  0-57  /dev/sdd
[3:0:0:0]diskVerbatim STORE N GO   5.00  /dev/sdb

blkid reports:

/dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat" 
PARTUUID="961d9655-01"
/dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3" 
PTTYPE="dos" PARTUUID="000750bf-01"
/dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap" 
PARTUUID="000750bf-05"

/dev/sdc6: UUID="7417-5AFF" TYPE="vfat" PARTUUID="000750bf-06"
/dev/sdc7: UUID="96c96a61-8615-4715-86d0-09cb8c62638c" TYPE="ext3" 
PARTUUID="000750bf-07"
/dev/sdd1: LABEL="W-98 SE" UUID="3571-16DE" TYPE="vfat" 
PARTUUID="43598af3-01"
/dev/sdd3: UUID="fd6a052e-c062-4c47-801d-087595635c5d" SEC_TYPE="ext2" 
TYPE="ext3" PARTUUID="43598af3-03"
/dev/sdd5: UUID="026a3f5c-0064-4ae7-869e-519d2cee05e7" SEC_TYPE="ext2" 
TYPE="ext3" PTTYPE="dos" PARTUUID="43598af3-05"
/dev/sdd6: UUID="9a0970fa-74ba-4426-98ac-1e8b81933e0e" TYPE="swap" 
PARTUUID="43598af3-06"

/dev/sdd7: UUID="4912-06CA" TYPE="vfat" PARTUUID="43598af3-07"

The boot screen is at:

http://www.users.on.net/~arthur.marsh/20170308_18.jpg

and the dmesg output from booting the bad kernel is attached.

I'm happy to supply any other configuration details needed and run 
further tests.


Regards,

Arthur.



[0.00] Linux version 4.10.0+ (root@victoria) (gcc version 6.3.0 
20170221 (Debian 6.3.0-8) ) #652 SMP PREEMPT Wed Mar 8 04:45:06 ACDT 2017
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x3ffa] usable
[0.00] BIOS-e820: [mem 0x3ffb-0x3ffbdfff] ACPI data
[0.00] BIOS-e820: [mem 0x3ffbe000-0x3ffd] ACPI NVS
[0.00] BIOS-e820: [mem 0x3ffe-0x3fff] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xff78-0x] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.3 present.
[0.00] DMI: System manufacturer System Product Name/A8V-MX, BIOS 0503   
 12/06/2005
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x3ffb0 max_arch_pfn = 0x100
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-E uncachable
[0.00]   F-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask FFC000 write-back
[0.00]   1 base 00D000 mask FFF000 

problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820

2017-03-08 Thread Arthur Marsh


On one of my pc's I have 2 PATA disks (one, WDC below is used for 
booting, the other SAMSUNG is not mounted), plus an IBM SCSI disk using 
a DPT 2044W controller with eata driver and sometimes a Verbatim 
Storengo USB stick.


On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause 
during the start-up when the USB stick was inserted but boot normally 
otherwise.


A git-bisect lead to:

commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
Author: Jan Kara 
Date:   Wed Feb 8 08:05:56 2017 +0100

block: Move bdi_unregister() to del_gendisk()

Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
unregistered." moved bdi unregistration (at that time through
bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
it needs to happen before blk_unregister_region() call in del_gendisk()
for MD. SCSI though will free up the device number from sd_remove()
called through a maze of callbacks from device_del() in
__scsi_remove_device() before blk_cleanup_queue() and thus similar 
races

as described in 6cd18e711dd8 can happen for SCSI as well as reported by
Omar [1].

Moving bdi_unregister() to del_gendisk() works for MD and fixes the
problem for SCSI since del_gendisk() gets called from sd_remove() 
before

freeing the device number.

This also makes device_add_disk() (calling bdi_register_owner()) more
symmetric with del_gendisk().

[1] http://marc.info/?l=linux-block=148554717109098=2

When booting the bad kernel, I would eventually get a prompt to press 
the enter key to boot and it eventually started, but the SCSI disk 
partitions were not found by blkid nor could they be mounted.


lsscsi reports:

[0:0:6:0]diskIBM  DCAS-34330W  S65A  /dev/sda
[1:0:0:0]diskATA  WDC WD3200AAJB-0 2C01  /dev/sdc
[2:0:0:0]cd/dvd  HL-DT-ST DVDRAM GSA-4163B A103  /dev/sr0
[2:0:1:0]diskATA  SAMSUNG SP4002H  0-57  /dev/sdd
[3:0:0:0]diskVerbatim STORE N GO   5.00  /dev/sdb

blkid reports:

/dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat" 
PARTUUID="961d9655-01"
/dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3" 
PTTYPE="dos" PARTUUID="000750bf-01"
/dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap" 
PARTUUID="000750bf-05"

/dev/sdc6: UUID="7417-5AFF" TYPE="vfat" PARTUUID="000750bf-06"
/dev/sdc7: UUID="96c96a61-8615-4715-86d0-09cb8c62638c" TYPE="ext3" 
PARTUUID="000750bf-07"
/dev/sdd1: LABEL="W-98 SE" UUID="3571-16DE" TYPE="vfat" 
PARTUUID="43598af3-01"
/dev/sdd3: UUID="fd6a052e-c062-4c47-801d-087595635c5d" SEC_TYPE="ext2" 
TYPE="ext3" PARTUUID="43598af3-03"
/dev/sdd5: UUID="026a3f5c-0064-4ae7-869e-519d2cee05e7" SEC_TYPE="ext2" 
TYPE="ext3" PTTYPE="dos" PARTUUID="43598af3-05"
/dev/sdd6: UUID="9a0970fa-74ba-4426-98ac-1e8b81933e0e" TYPE="swap" 
PARTUUID="43598af3-06"

/dev/sdd7: UUID="4912-06CA" TYPE="vfat" PARTUUID="43598af3-07"

The boot screen is at:

http://www.users.on.net/~arthur.marsh/20170308_18.jpg

and the dmesg output from booting the bad kernel is attached.

I'm happy to supply any other configuration details needed and run 
further tests.


Regards,

Arthur.



[0.00] Linux version 4.10.0+ (root@victoria) (gcc version 6.3.0 
20170221 (Debian 6.3.0-8) ) #652 SMP PREEMPT Wed Mar 8 04:45:06 ACDT 2017
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x3ffa] usable
[0.00] BIOS-e820: [mem 0x3ffb-0x3ffbdfff] ACPI data
[0.00] BIOS-e820: [mem 0x3ffbe000-0x3ffd] ACPI NVS
[0.00] BIOS-e820: [mem 0x3ffe-0x3fff] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xff78-0x] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.3 present.
[0.00] DMI: System manufacturer System Product Name/A8V-MX, BIOS 0503   
 12/06/2005
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x3ffb0 max_arch_pfn = 0x100
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-E uncachable
[0.00]   F-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask FFC000 write-back
[0.00]   1 base 00D000 mask FFF000 write-combining

Re: problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820

2017-03-08 Thread Jan Kara
On Wed 08-03-17 20:37:54, Arthur Marsh wrote:
> 
> On one of my pc's I have 2 PATA disks (one, WDC below is used for booting,
> the other SAMSUNG is not mounted), plus an IBM SCSI disk using a DPT 2044W
> controller with eata driver and sometimes a Verbatim Storengo USB stick.
> 
> On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause
> during the start-up when the USB stick was inserted but boot normally
> otherwise.
> 
> A git-bisect lead to:
> 
> commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
> Author: Jan Kara 
> Date:   Wed Feb 8 08:05:56 2017 +0100
> 
> block: Move bdi_unregister() to del_gendisk()
> 
> Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
> unregistered." moved bdi unregistration (at that time through
> bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
> it needs to happen before blk_unregister_region() call in del_gendisk()
> for MD. SCSI though will free up the device number from sd_remove()
> called through a maze of callbacks from device_del() in
> __scsi_remove_device() before blk_cleanup_queue() and thus similar races
> as described in 6cd18e711dd8 can happen for SCSI as well as reported by
> Omar [1].
> 
> Moving bdi_unregister() to del_gendisk() works for MD and fixes the
> problem for SCSI since del_gendisk() gets called from sd_remove() before
> freeing the device number.
> 
> This also makes device_add_disk() (calling bdi_register_owner()) more
> symmetric with del_gendisk().
> 
> [1] http://marc.info/?l=linux-block=148554717109098=2

Thanks for report and the analysis! The real culprit seems to be:

 [4.614530] BUG: unable to handle kernel NULL pointer dereference at 
0258
 [4.614606] IP: sd_revalidate_disk+0x12/0x1460 [sd_mod]
 [4.614668] *pdpt = 370de001 *pde =  
 [4.614788] Oops:  [#1] PREEMPT SMP
 [4.614847] Modules linked in: uas sd_mod usb_storage psmouse pata_via 
ehci_pci uhci_hcd ehci_hcd via_rhine usbcore eata mii i2c_viapro usb_common 
libata scsi_mod
 [4.614970] CPU: 0 PID: 127 Comm: kworker/u4:5 Not tainted 4.10.0+ #652
 [4.615035] Hardware name: System manufacturer System Product Name/A8V-MX, 
BIOS 050312/06/2005
 [4.615117] Workqueue: events_unbound async_run_entry_fn
 [4.615180] task: f71b9080 task.stack: f729c000
 [4.615243] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod]
 [4.615305] EFLAGS: 00010296 CPU: 0
 [4.615363] EAX: f7261000 EBX:  ECX:  EDX: f7196000
 [4.615426] ESI: f7196000 EDI: f7261000 EBP: f729deac ESP: f729de44
 [4.615489]  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
 [4.615552] CR0: 80050033 CR2: 0258 CR3: 370ce6c0 CR4: 06f0
 [4.615615] Call Trace:
 [4.615675]  ? kobject_uevent+0xa/0x10
 [4.615736]  ? blk_integrity_add+0x41/0x50
 [4.615797]  ? device_add_disk+0x275/0x450
 [4.615861]  sd_probe_async+0x103/0x1b0 [sd_mod]
 [4.615923]  ? preempt_count_sub+0x26/0x70
 [4.615984]  async_run_entry_fn+0x30/0x170
 [4.616045]  ? preempt_count_sub+0x26/0x70
 [4.616107]  process_one_work+0x235/0x670
 [4.616167]  ? process_one_work+0x1a3/0x670
 [4.616229]  worker_thread+0x2f/0x4c0
 [4.616288]  ? preempt_count_add+0x82/0xc0
 [4.616349]  ? process_one_work+0x670/0x670
 [4.616410]  kthread+0xd0/0x100
 [4.616467]  ? process_one_work+0x670/0x670
 [4.616528]  ? kthread_create_on_node+0x30/0x30
 [4.616591]  ret_from_fork+0x21/0x2c
 [4.616652] Code: 43 02 31 c0 8b 5d f4 c9 c3 8d 75 e4 89 34 24 e8 95 0a fb 
ff eb d4 8d 76 00 55 89 e5 57 56 53 83 ec 5c 8b 98 0c 03 00 00 89 45 c8 <8b> 93 
58 02 00 00 8b 43 04 89 45 d8 8b 82 08 03 00 00 89 45 d0
 [4.616840] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod] SS:ESP: 
0068:f729de44
 [4.616914] CR2: 0258
 [4.616973] ---[ end trace b1bff84fe33f3677 ]---

in your log. And this apparently happened because disk->private_data was
NULL. Can you try whether attached patches fix the problem for you?

Honza


> 
> When booting the bad kernel, I would eventually get a prompt to press the
> enter key to boot and it eventually started, but the SCSI disk partitions
> were not found by blkid nor could they be mounted.
> 
> lsscsi reports:
> 
> [0:0:6:0]diskIBM  DCAS-34330W  S65A  /dev/sda
> [1:0:0:0]diskATA  WDC WD3200AAJB-0 2C01  /dev/sdc
> [2:0:0:0]cd/dvd  HL-DT-ST DVDRAM GSA-4163B A103  /dev/sr0
> [2:0:1:0]diskATA  SAMSUNG SP4002H  0-57  /dev/sdd
> [3:0:0:0]diskVerbatim STORE N GO   5.00  /dev/sdb
> 
> blkid reports:
> 
> /dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat"
> PARTUUID="961d9655-01"
> /dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3"
> PTTYPE="dos" PARTUUID="000750bf-01"
> /dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap"
> 

Re: problem with block: Move bdi_unregister() to del_gendisk() commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820

2017-03-08 Thread Jan Kara
On Wed 08-03-17 20:37:54, Arthur Marsh wrote:
> 
> On one of my pc's I have 2 PATA disks (one, WDC below is used for booting,
> the other SAMSUNG is not mounted), plus an IBM SCSI disk using a DPT 2044W
> controller with eata driver and sometimes a Verbatim Storengo USB stick.
> 
> On recent 4.10.0+ kernel builds (i386), the resulting kernel would pause
> during the start-up when the USB stick was inserted but boot normally
> otherwise.
> 
> A git-bisect lead to:
> 
> commit 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
> Author: Jan Kara 
> Date:   Wed Feb 8 08:05:56 2017 +0100
> 
> block: Move bdi_unregister() to del_gendisk()
> 
> Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
> unregistered." moved bdi unregistration (at that time through
> bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
> it needs to happen before blk_unregister_region() call in del_gendisk()
> for MD. SCSI though will free up the device number from sd_remove()
> called through a maze of callbacks from device_del() in
> __scsi_remove_device() before blk_cleanup_queue() and thus similar races
> as described in 6cd18e711dd8 can happen for SCSI as well as reported by
> Omar [1].
> 
> Moving bdi_unregister() to del_gendisk() works for MD and fixes the
> problem for SCSI since del_gendisk() gets called from sd_remove() before
> freeing the device number.
> 
> This also makes device_add_disk() (calling bdi_register_owner()) more
> symmetric with del_gendisk().
> 
> [1] http://marc.info/?l=linux-block=148554717109098=2

Thanks for report and the analysis! The real culprit seems to be:

 [4.614530] BUG: unable to handle kernel NULL pointer dereference at 
0258
 [4.614606] IP: sd_revalidate_disk+0x12/0x1460 [sd_mod]
 [4.614668] *pdpt = 370de001 *pde =  
 [4.614788] Oops:  [#1] PREEMPT SMP
 [4.614847] Modules linked in: uas sd_mod usb_storage psmouse pata_via 
ehci_pci uhci_hcd ehci_hcd via_rhine usbcore eata mii i2c_viapro usb_common 
libata scsi_mod
 [4.614970] CPU: 0 PID: 127 Comm: kworker/u4:5 Not tainted 4.10.0+ #652
 [4.615035] Hardware name: System manufacturer System Product Name/A8V-MX, 
BIOS 050312/06/2005
 [4.615117] Workqueue: events_unbound async_run_entry_fn
 [4.615180] task: f71b9080 task.stack: f729c000
 [4.615243] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod]
 [4.615305] EFLAGS: 00010296 CPU: 0
 [4.615363] EAX: f7261000 EBX:  ECX:  EDX: f7196000
 [4.615426] ESI: f7196000 EDI: f7261000 EBP: f729deac ESP: f729de44
 [4.615489]  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
 [4.615552] CR0: 80050033 CR2: 0258 CR3: 370ce6c0 CR4: 06f0
 [4.615615] Call Trace:
 [4.615675]  ? kobject_uevent+0xa/0x10
 [4.615736]  ? blk_integrity_add+0x41/0x50
 [4.615797]  ? device_add_disk+0x275/0x450
 [4.615861]  sd_probe_async+0x103/0x1b0 [sd_mod]
 [4.615923]  ? preempt_count_sub+0x26/0x70
 [4.615984]  async_run_entry_fn+0x30/0x170
 [4.616045]  ? preempt_count_sub+0x26/0x70
 [4.616107]  process_one_work+0x235/0x670
 [4.616167]  ? process_one_work+0x1a3/0x670
 [4.616229]  worker_thread+0x2f/0x4c0
 [4.616288]  ? preempt_count_add+0x82/0xc0
 [4.616349]  ? process_one_work+0x670/0x670
 [4.616410]  kthread+0xd0/0x100
 [4.616467]  ? process_one_work+0x670/0x670
 [4.616528]  ? kthread_create_on_node+0x30/0x30
 [4.616591]  ret_from_fork+0x21/0x2c
 [4.616652] Code: 43 02 31 c0 8b 5d f4 c9 c3 8d 75 e4 89 34 24 e8 95 0a fb 
ff eb d4 8d 76 00 55 89 e5 57 56 53 83 ec 5c 8b 98 0c 03 00 00 89 45 c8 <8b> 93 
58 02 00 00 8b 43 04 89 45 d8 8b 82 08 03 00 00 89 45 d0
 [4.616840] EIP: sd_revalidate_disk+0x12/0x1460 [sd_mod] SS:ESP: 
0068:f729de44
 [4.616914] CR2: 0258
 [4.616973] ---[ end trace b1bff84fe33f3677 ]---

in your log. And this apparently happened because disk->private_data was
NULL. Can you try whether attached patches fix the problem for you?

Honza


> 
> When booting the bad kernel, I would eventually get a prompt to press the
> enter key to boot and it eventually started, but the SCSI disk partitions
> were not found by blkid nor could they be mounted.
> 
> lsscsi reports:
> 
> [0:0:6:0]diskIBM  DCAS-34330W  S65A  /dev/sda
> [1:0:0:0]diskATA  WDC WD3200AAJB-0 2C01  /dev/sdc
> [2:0:0:0]cd/dvd  HL-DT-ST DVDRAM GSA-4163B A103  /dev/sr0
> [2:0:1:0]diskATA  SAMSUNG SP4002H  0-57  /dev/sdd
> [3:0:0:0]diskVerbatim STORE N GO   5.00  /dev/sdb
> 
> blkid reports:
> 
> /dev/sdb1: LABEL="STORENGO" UUID="B08B-79DA" TYPE="vfat"
> PARTUUID="961d9655-01"
> /dev/sdc1: UUID="bfdeb6d6-0b77-4beb-a63d-bdc3e455b8ea" TYPE="ext3"
> PTTYPE="dos" PARTUUID="000750bf-01"
> /dev/sdc5: UUID="26b7280a-f40c-49dd-a086-dbbb9b7e3def" TYPE="swap"
>