Re: Oops: NULL pointer dereference - RIP: isci_task_abort_task+0x30/0x3e0 [isci]

2018-01-08 Thread Hannes Reinecke
On 01/08/2018 11:11 AM, Christoph Hellwig wrote:
> Hannes said he was going to look into this, which makes sense
> given that he designed the async abort code.
> 
> On Fri, Jan 05, 2018 at 01:13:48PM +0100, Yves-Alexis Perez wrote:
>> Hi,
>>
>> since kernel 4.11 (sorry it took so long to report) I have a box failing to
>> boot with a NULL pointer dereference (the box is stuck there afterwards).
>>
>> The bug has also been reported to the Debian BTS
>> (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=882414) and a suggestion
>> to revert 90965761 has been made. I can confirm it fixes the boot issue.
>>
>> I don't have the complete stack trace at hand but there's an example in the
>> Debian bug. The machine is a Dell Precision T5600 with the following SATA
>> controllers:
>>
>> 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05)
>> 05:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 05)
>>
>> If you need more information or need me to test something, please ask.
>>
>> Regards,
>> -- 
>> Yves-Alexis
> 
> ---end quoted text---
> 
Looks like we're calling lldd_abort_task() with a NULL argument.
Will be sending a patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   zSeries & Storage
h...@suse.com  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: Oops: NULL pointer dereference - RIP: isci_task_abort_task+0x30/0x3e0 [isci]

2018-01-08 Thread Christoph Hellwig
Hannes said he was going to look into this, which makes sense
given that he designed the async abort code.

On Fri, Jan 05, 2018 at 01:13:48PM +0100, Yves-Alexis Perez wrote:
> Hi,
> 
> since kernel 4.11 (sorry it took so long to report) I have a box failing to
> boot with a NULL pointer dereference (the box is stuck there afterwards).
> 
> The bug has also been reported to the Debian BTS
> (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=882414) and a suggestion
> to revert 90965761 has been made. I can confirm it fixes the boot issue.
> 
> I don't have the complete stack trace at hand but there's an example in the
> Debian bug. The machine is a Dell Precision T5600 with the following SATA
> controllers:
> 
> 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05)
> 05:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 05)
> 
> If you need more information or need me to test something, please ask.
> 
> Regards,
> -- 
> Yves-Alexis

---end quoted text---


Re: Oops: NULL pointer dereference - RIP: isci_task_abort_task+0x30/0x3e0 [isci]

2018-01-06 Thread Stefan Priebe - Profihost AG

On 06.01.2018 at 12:40, Simon Leinen wrote:
> Yves-Alexis Perez wrote:
>> since kernel 4.11 (sorry it took so long to report) I have a box
>> failing to boot with a NULL pointer dereference (the box is stuck
>> there afterwards).
> 
> I get the same result on a Quanta server with several 4.13 and 4.14
> kernels (from the Ubuntu "mainline" and Xenial hwe-edge PPAs).
> 
> This (I guess) problem had been reported by Stefan Priebe under
> "isci regression in 4.11.0-rc2 by scsi: libsas: allow async aborts"
> on 8 November, 2017[1].  That report didn't elicit any response here.

Yes, and Christoph Hellwig hasn't responded yet either, so I reverted that
commit on my own as well.

Stefan


Re: Oops: NULL pointer dereference - RIP: isci_task_abort_task+0x30/0x3e0 [isci]

2018-01-06 Thread Simon Leinen
Yves-Alexis Perez wrote:
> since kernel 4.11 (sorry it took so long to report) I have a box
> failing to boot with a NULL pointer dereference (the box is stuck
> there afterwards).

I get the same result on a Quanta server with several 4.13 and 4.14
kernels (from the Ubuntu "mainline" and Xenial hwe-edge PPAs).

This (I guess) problem had been reported by Stefan Priebe under
"isci regression in 4.11.0-rc2 by scsi: libsas: allow async aborts"
on 8 November, 2017[1].  That report didn't elicit any response here.

> The bug has also been reported to the Debian BTS ([2]) and a
> suggestion to revert 90965761 has been made. I can confirm it fixes the
> boot issue.

The Debian people have implemented the suggestion to revert 90965761 as
of their 4.14.12-1 kernel package[2].

> I don't have the complete stack trace at hand but there's an example
> in the Debian bug.

Here's a stack trace from my server.  It was copied and pasted from a
serial console (IPMI SOL), I hope it's complete.

  [9.184043] BUG: unable to handle kernel NULL pointer dereference at   (null)
  [9.184055] IP: isci_task_abort_task+0x43/0x400 [isci]
  [9.184056] PGD 0
  [9.184056] P4D 0
  [9.184057]
  [9.184058] Oops:  [#1] SMP
  [9.184060] Modules linked in: aesni_intel(+) aes_x86_64 crypto_simd glue_helper cryptd mei_me intel_cstate intel_rapl_perf mei shpchp lpc_ich ipmi_si(+) mac_hid kvm_intel kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf ipmi_msghandler autofs4 btrfs xor raid6_pq ast ttm drm_kms_helper ixgbe igb syscopyarea isci sysfillrect i2c_algo_bit dca sysimgblt libsas fb_sys_fops ptp mdio drm scsi_transport_sas pps_core wmi
  [9.184084] CPU: 18 PID: 434 Comm: kworker/u48:1 Not tainted 4.13.0-21-generic #24~16.04.1-Ubuntu
  [9.184084] Hardware name: Quanta S210-X12RS V2/S210-X12RS V2, BIOS S2RQ4A08 08/12/2013
  [9.184090] Workqueue: scsi_tmf_0 scmd_eh_abort_handler
  [9.184091] task: 96507bb05d00 task.stack: a2de87bb4000
  [9.184095] RIP: 0010:isci_task_abort_task+0x43/0x400 [isci]
  [9.184095] RSP: 0018:a2de87bb7c88 EFLAGS: 00010246
  [9.184096] RAX:  RBX: 9650782f11a8 RCX:
  [9.184097] RDX:  RSI: 9650782f11a8 RDI:
  [9.184097] RBP: a2de87bb7e28 R08:  R09: 0001
  [9.184098] R10: b8cb R11: 02f3 R12: 9650782f1148
  [9.184098] R13: 9650758cb800 R14: 0008 R15:
  [9.184099] FS:  () GS:9660bf38() knlGS:
  [9.184100] CS:  0010 DS:  ES:  CR0: 80050033
  [9.184100] CR2:  CR3: 4b009000 CR4: 001406e0
  [9.184101] Call Trace:
  [9.184107]  ? cpumask_next_and+0x31/0x50
  [9.184110]  ? load_balance+0x1b5/0x9c0
  [9.184114]  ? sched_clock+0x9/0x10
  [9.184116]  ? sched_clock+0x9/0x10
  [9.184117]  ? sched_clock+0x9/0x10
  [9.184120]  ? sched_clock_cpu+0x11/0xb0
  [9.184121]  ? pick_next_task_fair+0x3c7/0x560
  [9.184123]  ? __switch_to+0x211/0x510
  [9.184125]  ? put_prev_entity+0x27/0x100
  [9.184129]  sas_eh_abort_handler+0x30/0x50 [libsas]
  [9.184131]  scmd_eh_abort_handler+0x74/0x230
  [9.184135]  process_one_work+0x156/0x410
  [9.184136]  worker_thread+0x4b/0x460
  [9.184138]  kthread+0x109/0x140
  [9.184139]  ? process_one_work+0x410/0x410
  [9.184140]  ? kthread_create_on_node+0x70/0x70
  [9.184143]  ret_from_fork+0x25/0x30
  [9.184144] Code: 08 48 81 ec 78 01 00 00 c7 85 78 fe ff ff 00 00 00 00 c7 85 80 fe ff ff 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 <48> 8b 07 48 8b 40 30 48 8b 80 90 02 00 00 4c 8b a0 28 01 00 00
  [9.184160] RIP: isci_task_abort_task+0x43/0x400 [isci] RSP: a2de87bb7c88
  [9.184161] CR2:
  [9.184162] ---[ end trace bf9920b58fca631f ]---

> The machine is a Dell Precision T5600 with the following SATA
> controllers:

> 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05)
> 05:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 05)

Mine is a Quanta S210-X12RS server with only one SATA controller:

08:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 05)

Connected to that SATA controller are two Samsung 850 EVO 250GB SSDs and
one 3TB WD Red disk.

> If you need more information or need me to test something, please ask.

Likewise.

Best regards,
-- 
Simon.

[1] https://marc.info/?l=linux-scsi&m=151013394701914
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=882414