------- Comment From mauri...@br.ibm.com 2016-12-12 11:19 EDT-------

Canonical,
The patch has been accepted into mainline/4.9 [1]. It was submitted to the kernel-team mailing list a while ago [2], but is not in the archives yet.

Updated netboot files (lpfc module) are required.

Thanks!

[1] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
    https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/lpfc?id=2319f847a8910cff1d46c9b66aa1dd7cc3e836a9
[2] subject: "[SRU][Xenial HWE 4.8][Yakkety][PATCH] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()"

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1648873

Title:
  Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode, sig: 5 [#1] " (lpfc)

Status in debian-installer package in Ubuntu:
  New
Status in linux package in Ubuntu:
  Triaged
Status in debian-installer source package in Xenial:
  New
Status in linux source package in Xenial:
  Invalid
Status in debian-installer source package in Yakkety:
  New
Status in linux source package in Yakkety:
  New

Bug description:
  == Comment: #33 - Mauricio Faria De Oliveira - 2016-12-09 06:49:57 ==

  Hi Canonical,

  Can you please apply this patch [1] to 16.10 and 16.04.x HWE (4.8)?
  It fixes a regression introduced in 4.8.

  As you can see, it's in the SCSI maintainer (James Bottomley)'s 'fixes' branch, but didn't make 4.9-rc8 (maybe he considered it late for this one).

  We have installer, boot, and post-boot problems due to this one.
  It'd be good if the netboot images for the 16.04.x HWE kernel can get it too.

  Thank you,

  [1] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
      http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=fixes&id=2319f847a8910cff1d46c9b66aa1dd7cc3e836a9

  Historical context:

  == Comment: #0 - HARSHA THYAGARAJA - 2016-11-21 02:39:35 ==

  ---Problem Description---
  Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode." (kernel: 4.8.0-27)

  Machine Type = Power8 baremetal

  ---boot type---
  QEMU direct boot kernel/initrd

  ---Kernel cmdline used to launch install---
  On a Power8 server, using kernel and initrd images:
  netcfg/disable_dhcp=true netcfg/confirm_static=true netcfg/choose_interface=98:BE:94:00:4C:68 netcfg/get_ipaddress=9.47.67.159/20 netcfg/get_gateway=9.47.79.254 netcfg/get_nameservers=

  ---Install repository type---
  Internet repository

  ---Install repository Location---
  http://ports.ubuntu.com/ubuntu-ports/dists/yakkety/main/installer-ppc64el/current/images/netboot/ubuntu-installer/ppc64el/

  ---Point of failure---
  Other failure during installation (stage 1)

  == Comment: #1 - HARSHA THYAGARAJA - 2016-11-21 02:41:54 ==

  The netboot install fails and call traces are seen at the disk detection step.

  == Comment: #8 - Mauricio Faria De Oliveira - 2016-11-21 15:58:25 ==

  Finally got it.

  The assembly offset/code plus the trap signal point to this BUG_ON(), and the second condition triggered the trap.
  Checking why piocb is not NULL but piocb->vport is NULL.
  This might have happened in the lpfc_linkdown_port() path, in the stack trace.
  Would need a more readable console log (i.e., dmesg, as requested in comments 5, 3) to help understand it.

  --
  static int
  lpfc_sli_ringtxcmpl_put(struct lpfc_hba *phba, struct lpfc_sli_ring *pring,
                          struct lpfc_iocbq *piocb)
  {
          lockdep_assert_held(&phba->hbalock);

          BUG_ON(!piocb || !piocb->vport);
          <...>
  }

  [ 226.147886] NIP [d00000000b7324c0] lpfc_sli_ringtxcmpl_put+0x48/0x120 [lpfc]

  0x2478 + 0x48 = 0x24c0 (tdnei; trap doubleword not equal immediate)

  0000000000002478 <lpfc_sli_ringtxcmpl_put>:
  <...>
      2498: 78 2b bf 7c mr r31,r5 // r31 is *piocb (r5 is the 3rd function parameter)
      249c: 78 1b 7d 7c mr r29,r3
      24a0: 78 23 9e 7c mr r30,r4
      24a4: 01 00 00 48 bl 24a4 <lpfc_sli_ringtxcmpl_put+0x2c> // probably converted at module load time to the lockdep_assert_held() call
      24a8: 00 00 00 60 nop
      24ac: 00 00 bf 2f cmpdi cr7,r31,0 // compare piocb with 0. checking for NULL.
      24b0: 70 00 9e 41 beq cr7,2520 <lpfc_sli_ringtxcmpl_put+0xa8> // if equal to zero, branch out. done w/ the former part of the OR check.
      24b4: e8 00 3f e9 ld r9,232(r31) // an offset of piocb. probably piocb->vport in the BUG_ON
      24b8: 74 00 29 7d cntlzd r9,r9 // count leading zeroes. if r9 is null (0), leading zeroes is 64 (binary: 0100_0000, bit 6 is 1, and the 6 LSbs [bits 5-0] are 0)
      24bc: 82 d1 29 79 rldicl r9,r9,58,6 // rotate left 58 (i.e., those 6 LSbs are now MSbs, and that bit 6 from 64 was rotated in the register and is now bit 0, the LSb), then AND the 6 MSbs w/ 0-bits, and all the lower bits with 1-bits (i.e., save the LSb).
      24c0: 00 00 09 0b tdnei r9,0 // trap if not equal to zero. (i.e., if the whole r9 was zero, with 64 leading/consecutive zeroes, then bit 6 is 1 and it becomes bit 0; since bit 0 is now 1, r9 is non-zero, and the trap triggers.) this checked the latter part of the OR.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1648873/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp