This is a request to backport a fix that landed in the 4.10 kernel to
16.04 and 16.10.  Could the Kernel team please review and respond?  Thanks.


On 02/23/2017 12:20 AM, Launchpad Bug Tracker wrote:
> bugproxy (bugproxy) has assigned this bug to you for Ubuntu:
> == Comment: #1 - Application Cdeadmin <> - 2016-06-02 
> 15:28:27 ==
> ==== State: Open by: anitrap on 01 June 2016 17:36:39 ====
> Contact: Anitra Powell  ( )
> Backup: Dion Bell (
> Primary BMC (1603G):
> =====================================================
> # cat /proc/ractrends/Helper/FwInfo
> FW_VERSION=2.13.91819
> FW_DATE=Mar 10 2016
> FW_DESC=8335 SRC BUILD RR9 03102016
> #
> PNOR (1603G):
> ========================
> # ipmitool -H -I lanplus -U ADMIN -P admin fru list 47
> Product Name          : OpenPOWER Firmware
> Product Version       : IBM-firestone-ibm-OP8_v1.7_1.62
> Product Extra         : hostboot-bc98d0b-1a29dff
> Product Extra         : occ-0362706-16fdfa7
> Product Extra         : skiboot-5.1.13
> Product Extra         : hostboot-binaries-43d5a59
> Product Extra         : firestone-xml-e7b4fa2-c302f0e
> Product Extra         : capp-ucode-105cb8f
> Partition Info:
> =================
>        ver - OS, HTX, Firmware and Machine details
>                            OS: GNU/Linux
>                    OS Version: Ubuntu 16.04 LTS \n \l
>                Kernel Version: 4.4.8c0ffee0+
>                   HTX Version: htxubuntu-396
>                     Host Name: fsbmc30p1
>             Machine Serial No: 210995A
>            Machine Type/Model: 8335-GCA
> root@fsbmc30p1:~# uname -a
> Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le 
> ppc64le ppc64le GNU/Linux
> FlashGT NVMe setup:
> ===================
> 1 FlashGT card in slot 1 running in superpipe mode with 128 LUNs per port 
> (total of 256 LUNs).
> lsscsi
> [0:0:0:0]    disk    ATA      ST1000NX0313     BE33  /dev/sda
> [1:0:0:0]    disk    ATA      ST1000NX0313     BE33  /dev/sdb
> [4:0:0:0]    disk    NVMe     SAMSUNG MZ1LV960 3011  /dev/sdc
> [4:1:0:0]    disk    NVMe     SAMSUNG MZ1LV960 3011  /dev/sdd
> [5:0:0:0]    cd/dvd  AMI      Virtual CDROM0   1.00  /dev/sr0
> [5:0:0:1]    cd/dvd  AMI      Virtual CDROM1   1.00  /dev/sr1
> [5:0:0:2]    cd/dvd  AMI      Virtual CDROM2   1.00  /dev/sr2
> [5:0:0:3]    cd/dvd  AMI      Virtual CDROM3   1.00  /dev/sr3
> [6:0:0:0]    disk    AMI      Virtual Floppy0  1.00  /dev/sde
> [6:0:0:1]    disk    AMI      Virtual Floppy1  1.00  /dev/sdf
> [6:0:0:2]    disk    AMI      Virtual Floppy2  1.00  /dev/sdg
> [6:0:0:3]    disk    AMI      Virtual Floppy3  1.00  /dev/sdh
> [7:0:0:0]    disk    AMI      Virtual HDisk0   1.00  /dev/sdi
> [7:0:0:1]    disk    AMI      Virtual HDisk1   1.00  /dev/sdj
> [7:0:0:2]    disk    AMI      Virtual HDisk2   1.00  /dev/sdk
> [7:0:0:3]    disk    AMI      Virtual HDisk3   1.00  /dev/sdl
> [7:0:0:4]    disk    AMI      Virtual HDisk4   1.00  /dev/sdm
> lspci | grep -i acc
> 0004:01:00.0 Processing accelerators: IBM Device 0601 (rev 01)
> ls -l /sys/class/cxl
> total 0
> lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0 -> 
> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0
> lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0m -> 
> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0/afu0.0m
> lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0s -> 
> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0/afu0.0s
> lrwxrwxrwx 1 root root 0 May 31 13:27 card0 -> 
> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0
> lscfg | grep afu
> + afu0.0           Slot1/card0/afu0.0
> + afu0.0m          Slot1/card0/afu0.0/afu0.0m
> + afu0.0s          Slot1/card0/afu0.0/afu0.0s
> /opt/ibm/capikv/bin/cxlfstatus
> CXL Flash Device Status
> Found 0601 0004:01:00.0 Slot1
>     Device:       SCSI  Block       Mode                          LUN WWID
>        sg2:    4:0:0:0,   sdc, superpipe, 60025380025382463300046000000000
>        sg3:    4:1:0:0,   sdd, superpipe, 60025380025382463300052000000000
> dpkg -l | grep capi
> 4el      no description given               3.0-1970-3042652                
> ppc6
> 4el      no description given               3.0-1970-3042652                
> ppc6
> root@fsbmc30p1:/tmp# dpkg -l | grep afu
> ii  afuimage                                3.0-1970-3042652                
> all          no description given
> cat /opt/ibm/capikv/version.txt
> 1970-3042652
> /opt/ibm/capikv/afu/cxl_afu_dump /dev/cxl/afu0.0m -v
> AFU Version     = 160525N1
>  NVMe0 Version = BTV73011
>  NVMe0 NEXT    = BTV73011
>  NVMe0 STATUS  = 0x702
>  NVMe1 Version = BTV73011
>  NVMe1 NEXT    = BTV73011
>  NVMe1 STATUS  = 0x702
> cat /tmp/test_lun_mode
> 128
> Problem:
> ===========
> While running soft bootme (shutdown -r from the OS every hour), I noticed htx 
> errors after the 9th & 17th reboots of the partition.  At this point they seem 
> like different issues, so I am opening 2 separate defects.  I've already 
> opened defect SW354759 for the first set of htx errors and assigned it to 
> htx_screen.
> This defect is for the issue that happened after the 17th reboot (Jun 1 @ 6am).
> On the 18th reboot (Jun 1 @ 7am), the shutdown -r command failed, and I
> had to manually power down the system.
> I guess I will open to surelock_screen first since it seems similar to
> the one Dion opened up while running 128 virtual LUNs per port (defect .  For
> this fail, other exercisers eventually failed also.
> Test Info:
> ============
> - running Soft bootme (shutdown -r every hour)
> - mdt.bu + hxecom  (GPUs were running). I copied a modified mdt.bu to another 
> mdt file so I would not see any errors in htx after reboot.
> Sample of HTX errors (for this defect)
> ==============================
> /dev/sg2.53       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
> READCMP5  numopers=     20000  loop=      4956  blk=0x4eee 
> len=      4096   offset=0   Seed Values= 37882, 44181, 50758 
> Data Pattern Seed Values = 37882, 44182, 50758    LBA Fencepost = 0xb94a
> cblk_read error - Device or resource busy
> /dev/sg2.18       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
> READCMP9  numopers=     20000  loop=      1501  blk=0x93f1 
> len=      4096   offset=0   Seed Values= 37847, 44740, 50780 
> Data Pattern Seed Values = 37847, 44741, 50780    LBA Fencepost = 0xb275
> cblk_read error - Device or resource busy
> /dev/sg2.98       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
> READCMP5  numopers=     20000  loop=     10365  blk=0x86d5 
> len=      4096   offset=0   Seed Values= 37927, 41320, 50710 
> Data Pattern Seed Values = 37927, 41321, 50710    LBA Fencepost = 0xbc7c
> cblk_read error - Device or resource busy
> /dev/sg2.116      Jun  1 06:30:45 2016 err=00000005 sev=4 hxesurelock   
> RDCMP10  numopers=     20000  loop=      6383  blk=0xc33d 
> len=      4096   offset=0   Seed Values= 37945, 49039, 50726 
> Data Pattern Seed Values = 37945, 49040, 50726    LBA Fencepost = 0xd0b0
> cblk_read error - Input/output error
> /dev/fpu17        Jun  1 06:30:51 2016 err=0000000b sev=1 hxefpu64      
> pthread_create call failed with rc: 11, errno: 11, Resource temporarily 
> unavailable
> /dev/fpu17        Jun  1 06:30:51 2016 err=0000000b sev=1 hxefpu64      
> Hardware Exerciser stopped on an error
> /dev/sctu43       Jun  1 06:30:51 2016 err=0000000b sev=1 hxesctu       
> pthread_create call failed with rc: 11, errno: 11, Resource temporarily 
> unavailable
> /dev/sctu43       Jun  1 06:30:51 2016 err=0000000b sev=1 hxesctu       
> Hardware Exerciser stopped on an error
> Logs:
> ======
> /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1
> /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/htxerr
> /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/syslog
> /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/kern.log
> /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/bootme.log
> sample of syslog during first htx error:
> ================================================
> Jun  1 06:19:20 fsbmc30p1 systemd[1]: Started Cleanup of Temporary 
> Directories.
> Jun  1 06:25:01 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next 
> retry is Wed Jun  1 06:25:31 2016 [v8.16.0 try ]
> Jun  1 06:25:01 fsbmc30p1 CRON[99327]: (root) CMD (test -x /usr/sbin/anacron 
> || ( cd / && run-parts --report /etc/cron.daily ))
> Jun  1 06:26:53 fsbmc30p1 CXLBLK[37882]: 
> cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
> 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
> Jun  1 06:26:53 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next 
> retry is Wed Jun  1 06:27:23 2016 [v8.16.0 try ]
> Jun  1 06:26:53 fsbmc30p1 CXLBLK[37847]: 
> cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
> 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
> Jun  1 06:26:53 fsbmc30p1 CXLBLK[37927]: 
> cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
> 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
> Jun  1 06:26:59 fsbmc30p1 CXLBLK[37961]: 
> cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
> 0x607,for chunk->dev_name = /dev/sg3, chunk index = 0
> Jun  1 06:26:59 fsbmc30p1 CXLBLK[37954]: 
> cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
> 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
> Jun  1 06:26:59 fsbmc30p1 CXLBLK[37887]: 
> cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
> 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
> Jun  1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 
> 200250 ns
> sample from kern.log during fail:
> =================================
> Jun  1 06:08:11 fsbmc30p1 kernel: [  250.251041] nvidia-uvm: Loaded the UVM 
> driver in lite mode, major device number 241
> Jun  1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 
> 200250 ns
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.764382] hxesurelock[40392]: 
> unhandled signal 11 at 0000000000000024 nip 00003fff84602978 lr 
> 00003fff84602974 code 30001
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868242] Unable to handle kernel 
> paging request for data at address 0x0000000c
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868599] Faulting instruction 
> address: 0xc00000000035e2b0
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868865] Oops: Kernel access of bad 
> area, sig: 11 [#1]
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868928] SMP NR_CPUS=2048 NUMA PowerNV
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868992] Modules linked in: 
> nvidia_uvm(POE) iptable_filter ip_tables x_tables nvidia(POE) ipmi_devintf 
> joydev input_leds mac_hid opal_prd ofpart cmdlinepart powernv_flash mtd at24 
> ipmi_powernv ipmi_msghandler uio_pdrv_genirq uio ibmpowernv powernv_rng 
> binfmt_misc nfsd ib_iser auth_rpcgss rdma_cm iw_cm ib_cm nfs_acl ib_sa ib_mad 
> lockd ib_core grace ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi 
> scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov 
> async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath 
> linear mlx4_en hid_generic usbhid hid uas usb_storage cxlflash ast bnx2x 
> i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
> drm cxl vxlan mlx4_core ahci ip6_udp_tunnel udp_tunnel libahci mdio libcrc32c
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870299] CPU: 80 PID: 40392 Comm: 
> hxesurelock Tainted: P           OE   4.4.8c0ffee0+ #2
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870379] task: c000007935fe23a0 ti: 
> c000007910810000 task.ti: c000007910810000
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870476] NIP: c00000000035e2b0 LR: 
> c00000000035e280 CTR: 0000000000000000
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870552] REGS: c0000079108135e0 TRAP: 
> 0300   Tainted: P           OE    (4.4.8c0ffee0+)
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870642] MSR: 9000000100009033 
> <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28053988  XER: 00000000
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] CFAR: c000000000008468 DAR: 
> 000000000000000c DSISR: 40000000 SOFTE: 1 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR00: c00000000035e280 
> c000007910813860 c000000001594600 0000000000000000 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR04: c000007823192400 
> 000000000002574f 0000000000000001 0000000000000000 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR08: c0000079241b8a00 
> 0000000000000000 00000000000044fb 65776f702f62696c 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR12: 2d656c3436637072 
> c00000000fb6f800 00000000464c457f 0000000000010c78 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR16: 0000000000000000 
> 0000000000000039 d000000034fa04c5 0000000000010000 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR20: 00000000000000cd 
> 0000000000000550 0000000000010000 00000000039e0000 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR24: 00003fffffffffff 
> c000007910813af8 c000007823192600 c00000793f57b980 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR28: c00000793f573e80 
> 00003fffffffffff 000000000000001f c000007926f29790 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872149] NIP [c00000000035e2b0] 
> elf_core_dump+0xd60/0x1300
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872277] LR [c00000000035e280] 
> elf_core_dump+0xd30/0x1300
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872351] Call Trace:
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872407] [c000007910813860] 
> [c00000000035e280] elf_core_dump+0xd30/0x1300 (unreliable)
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872527] [c000007910813a60] 
> [c00000000036898c] do_coredump+0xcec/0x11e0
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872625] [c000007910813c20] 
> [c0000000000ce7a0] get_signal+0x540/0x7b0
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872705] [c000007910813d10] 
> [c000000000017344] do_signal+0x54/0x2b0
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872785] [c000007910813e00] 
> [c00000000001776c] do_notify_resume+0xbc/0xd0
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872877] [c000007910813e30] 
> [c000000000009838] ret_from_except_lite+0x64/0x68
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872963] Instruction dump:
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.873004] 60000000 2fa30000 409effa8 
> e95f0050 39200000 794737e3 4082ffa4 e91f00a0 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.873148] 2fa80000 419e002c e92800f8 
> e9290000 <8129000c> 79279fe3 41820018 7948efe3 
> Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.884655] ---[ end trace 
> f8abb6e0d0322daa ]---
> gsave info: 
> ==============
> GSA Location: 
> /gsa/ausgsa/projects/s/sift/hst/trial_data/Surelock/Ubuntu/flashgt/fsbmc30p1_ubuntu1604_FlashGT_bootme_test5/FAIL201606011024
> <===== This is from RTC side description =====>
> See the Discussion field for the initial comments from CQ.
> </===== This is from RTC side description =====>
> ==== State: Open by: mpvageli on 02 June 2016 14:20:06 ====
>  Oops: Kernel access of bad area, sig: 11 [#1]
> # ipmitool -H -I lanplus -U ADMIN -P admin fru list 47
> Product Name          : OpenPOWER Firmware
> Product Version       : IBM-firestone-ibm-OP8_v1.7_1.62
> Product Extra         : hostboot-bc98d0b-1a29dff
> Product Extra         : occ-0362706-16fdfa7
> Product Extra         : skiboot-5.1.13
> Product Extra         : hostboot-binaries-43d5a59
> Product Extra         : firestone-xml-e7b4fa2-c302f0e
> Product Extra         : capp-ucode-105cb8f
> == Comment: #9 - VIPIN K. PARASHAR <> - 2016-06-07 
> 12:04:49 ==
> root@fsbmc30p1:~# lsb_release -a
> No LSB modules are available.
> Distributor ID:       Ubuntu
> Description:  Ubuntu 16.04 LTS
> Release:      16.04
> Codename:     xenial
> root@fsbmc30p1:~# cat /etc/*release
> NAME="Ubuntu"
> VERSION="16.04 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04 LTS"
> VERSION_ID="16.04"
> HOME_URL="";
> root@fsbmc30p1:~# uname -a
> Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le 
> ppc64le ppc64le GNU/Linux
> root@fsbmc30p1:~#
> == Comment: #24 - VIPIN K. PARASHAR <> - 2016-07-07 
> 07:14:05 ==
> From kernel logs
> ===========
> [ 7087.918089] device enP3p5s0f2 left promiscuous mode
> [ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
> [ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
> [ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 
> nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
> [ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 
> nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
> [ 8816.526807] Unable to handle kernel paging request for data at address 
> 0x0000000c
> [ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
> [ 8816.530233] Unable to handle kernel paging request for data at address 
> 0x0000000c
> [ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
> 3f:mon> t
> [c000000686a13a60] c00000000036898c do_coredump+0xcec/0x11e0
> [c000000686a13c20] c0000000000ce7a0 get_signal+0x540/0x7b0
> [c000000686a13d10] c000000000017344 do_signal+0x54/0x2b0
> [c000000686a13e00] c00000000001776c do_notify_resume+0xbc/0xd0
> [c000000686a13e30] c000000000009838 ret_from_except_lite+0x64/0x68
> --- Exception: 300 (Data Access) at 00003fff890b2ee8
> SP (3fff83c2c490) is in userspace
> 3f:mon> r
> R00 = c00000000035e280   R16 = 0000000000000000
> R01 = c000000686a13860   R17 = 0000000000000042
> R02 = c000000001594600   R18 = d000000021b104fa
> R03 = 0000000000000000   R19 = 0000000000010000
> R04 = c000002fb7463400   R20 = 00000000000000cd
> R05 = 00000000000001bf   R21 = 0000000000000628
> R06 = 0000000000000001   R22 = 0000000000010000
> R07 = 0000000000000000   R23 = 0000000000250000
> R08 = c00000281af21500   R24 = 00003fffffffffff
> R09 = 0000000000000000   R25 = c000000686a13af8
> R10 = 00000000000044fb   R26 = c000002fb7463800
> R11 = 6c2d656c34366370   R27 = c000002ff0e05cc0
> R12 = 756e672d78756e69   R28 = c000002ff0e05c40
> R13 = c00000000fb65680   R29 = 00003fffffffffff
> R14 = 00000000464c457f   R30 = 0000000000000016
> R15 = 0000000000010e70   R31 = c000002fb94bd3b8
> pc  = c00000000035e2b0 elf_core_dump+0xd60/0x1300
> cfar= c000000000008468 slb_miss_realmode+0x50/0x78
> lr  = c00000000035e280 elf_core_dump+0xd30/0x1300
> msr = 9000000100009033   cr  = 28053828
> ctr = 0000000000000000   xer = 0000000000000000   trap =  300
> dar = 000000000000000c   dsisr = 40000000
> 3f:mon> 
> hxesurelock process has segfaulted and kernel has crashed while
> dumping core.
> == Comment: #87 - Frederic Barrat <> - 2017-02-21 
> 11:50:40 ==
> Fix is in kernel v4.10:
> bdecf76e319a29735d828575f4a9269f0e17c547
> "cxl: Fix coredump generation when cxl_get_fd() is used"
> We'd like to have it backported to 16.10 and 16.04 LTS.
> ** Affects: ubuntu
>      Importance: Undecided
>      Assignee: Taco Screen team (taco-screen-team)
>          Status: New
> ** Tags: architecture-ppc64le bugnameltc-142129 severity-critical 
> targetmilestone-inin16042
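A note for whoever triages the oops quoted above: the following is a minimal sketch, not taken from the bug report, that decodes the bracketed word in the instruction dump (<8129000c>) as a Power ISA D-form instruction and recomputes the effective address from GPR09 in the register dump.  The function name decode_dform and the constants are illustrative, and only primary opcode 32 (lwz) is recognised here.

```python
# Sketch: decode the faulting instruction word from the oops and check that
# the computed effective address matches the reported DAR.
# Assumption: the word is a D-form instruction; only lwz (opcode 32) is named.

def decode_dform(word):
    """Split a 32-bit Power ISA D-form instruction into its fields."""
    opcode = (word >> 26) & 0x3F   # primary opcode (top 6 bits)
    rt = (word >> 21) & 0x1F       # target register
    ra = (word >> 16) & 0x1F       # base register
    d = word & 0xFFFF              # 16-bit displacement
    if d & 0x8000:                 # sign-extend the displacement
        d -= 0x10000
    return opcode, rt, ra, d

FAULTING_WORD = 0x8129000C  # <8129000c> from the "Instruction dump" lines
GPR = {9: 0x0}              # GPR09 as shown in the register dump
DAR = 0xC                   # DAR: 000000000000000c

opcode, rt, ra, d = decode_dform(FAULTING_WORD)
base = 0 if ra == 0 else GPR[ra]          # RA=0 means a literal zero base
effective_address = (base + d) & 0xFFFFFFFFFFFFFFFF

if opcode == 32:
    print(f"lwz r{rt},{d}(r{ra}) -> EA {effective_address:#x}")
print("matches DAR:", effective_address == DAR)
```

The computed address equals the DAR value (0x0c) reported in the oops, which is consistent with a load through a NULL pointer inside elf_core_dump.  It does not by itself identify which pointer was NULL.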

Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.

  FlashGT Integration and Setup: fsbmc30: After 17th reboot of soft
  bootme, HTX & Linux errors seen with 256 virtual LUNs

Status in linux package in Ubuntu:

Bug description:
  == Comment: #1 - Application Cdeadmin <> - 2016-06-02 
15:28:27 ==
  ==== State: Open by: anitrap on 01 June 2016 17:36:39 ====

  Contact: Anitra Powell  ( )
  Backup: Dion Bell (

  Primary BMC (1603G):
  # cat /proc/ractrends/Helper/FwInfo
  FW_DATE=Mar 10 2016
  FW_DESC=8335 SRC BUILD RR9 03102016

  PNOR (1603G):
  # ipmitool -H -I lanplus -U ADMIN -P admin fru list 47
  Product Name          : OpenPOWER Firmware
  Product Version       : IBM-firestone-ibm-OP8_v1.7_1.62
  Product Extra         : hostboot-bc98d0b-1a29dff
  Product Extra         : occ-0362706-16fdfa7
  Product Extra         : skiboot-5.1.13
  Product Extra         : hostboot-binaries-43d5a59
  Product Extra         : firestone-xml-e7b4fa2-c302f0e
  Product Extra         : capp-ucode-105cb8f

  Partition Info:
         ver - OS, HTX, Firmware and Machine details

                             OS: GNU/Linux
                     OS Version: Ubuntu 16.04 LTS \n \l
                 Kernel Version: 4.4.8c0ffee0+
                    HTX Version: htxubuntu-396
                      Host Name: fsbmc30p1
              Machine Serial No: 210995A
             Machine Type/Model: 8335-GCA

  root@fsbmc30p1:~# uname -a
  Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le 
ppc64le ppc64le GNU/Linux

  FlashGT NVMe setup:
  1 FlashGT card in slot 1 running in superpipe mode with 128 LUNs per port 
(total of 256 LUNs).

  [0:0:0:0]    disk    ATA      ST1000NX0313     BE33  /dev/sda
  [1:0:0:0]    disk    ATA      ST1000NX0313     BE33  /dev/sdb
  [4:0:0:0]    disk    NVMe     SAMSUNG MZ1LV960 3011  /dev/sdc
  [4:1:0:0]    disk    NVMe     SAMSUNG MZ1LV960 3011  /dev/sdd
  [5:0:0:0]    cd/dvd  AMI      Virtual CDROM0   1.00  /dev/sr0
  [5:0:0:1]    cd/dvd  AMI      Virtual CDROM1   1.00  /dev/sr1
  [5:0:0:2]    cd/dvd  AMI      Virtual CDROM2   1.00  /dev/sr2
  [5:0:0:3]    cd/dvd  AMI      Virtual CDROM3   1.00  /dev/sr3
  [6:0:0:0]    disk    AMI      Virtual Floppy0  1.00  /dev/sde
  [6:0:0:1]    disk    AMI      Virtual Floppy1  1.00  /dev/sdf
  [6:0:0:2]    disk    AMI      Virtual Floppy2  1.00  /dev/sdg
  [6:0:0:3]    disk    AMI      Virtual Floppy3  1.00  /dev/sdh
  [7:0:0:0]    disk    AMI      Virtual HDisk0   1.00  /dev/sdi
  [7:0:0:1]    disk    AMI      Virtual HDisk1   1.00  /dev/sdj
  [7:0:0:2]    disk    AMI      Virtual HDisk2   1.00  /dev/sdk
  [7:0:0:3]    disk    AMI      Virtual HDisk3   1.00  /dev/sdl
  [7:0:0:4]    disk    AMI      Virtual HDisk4   1.00  /dev/sdm

  lspci | grep -i acc
  0004:01:00.0 Processing accelerators: IBM Device 0601 (rev 01)

  ls -l /sys/class/cxl
  total 0
  lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0 -> 
  lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0m -> 
  lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0s -> 
  lrwxrwxrwx 1 root root 0 May 31 13:27 card0 -> 

  lscfg | grep afu
  + afu0.0           Slot1/card0/afu0.0
  + afu0.0m          Slot1/card0/afu0.0/afu0.0m
  + afu0.0s          Slot1/card0/afu0.0/afu0.0s

  CXL Flash Device Status

  Found 0601 0004:01:00.0 Slot1
      Device:       SCSI  Block       Mode                          LUN WWID
         sg2:    4:0:0:0,   sdc, superpipe, 60025380025382463300046000000000
         sg3:    4:1:0:0,   sdd, superpipe, 60025380025382463300052000000000

  dpkg -l | grep capi
  4el      no description given               3.0-1970-3042652                
  4el      no description given               3.0-1970-3042652                

  root@fsbmc30p1:/tmp# dpkg -l | grep afu
  ii  afuimage                                3.0-1970-3042652                
all          no description given

  cat /opt/ibm/capikv/version.txt

  /opt/ibm/capikv/afu/cxl_afu_dump /dev/cxl/afu0.0m -v
  AFU Version     = 160525N1

   NVMe0 Version = BTV73011
   NVMe0 NEXT    = BTV73011
   NVMe0 STATUS  = 0x702

   NVMe1 Version = BTV73011
   NVMe1 NEXT    = BTV73011
   NVMe1 STATUS  = 0x702

  cat /tmp/test_lun_mode

  While running soft bootme (shutdown -r from the OS every hour), I noticed htx 
errors after the 9th & 17th reboots of the partition.  At this point they seem like 
different issues, so I am opening 2 separate defects.  I've already opened 
defect SW354759 for the first set of htx errors and assigned it to htx_screen.

  This defect is for the issue that happened after the 17th reboot (Jun 1 @
  6am). On the 18th reboot (Jun 1 @ 7am), the shutdown -r command
  failed, and I had to manually power down the system.

  I guess I will open to surelock_screen first since it seems similar to
  the one Dion opened up while running 128 virtual LUNs per port (defect .  For
  this fail, other exercisers eventually failed also.

  Test Info:
  - running Soft bootme (shutdown -r every hour)
  - mdt.bu + hxecom  (GPUs were running). I copied a modified mdt.bu to another 
mdt file so I would not see any errors in htx after reboot.

  Sample of HTX errors (for this defect)
  /dev/sg2.53       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
  READCMP5  numopers=     20000  loop=      4956  blk=0x4eee 
  len=      4096   offset=0   Seed Values= 37882, 44181, 50758 
  Data Pattern Seed Values = 37882, 44182, 50758    LBA Fencepost = 0xb94a
  cblk_read error - Device or resource busy

  /dev/sg2.18       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
  READCMP9  numopers=     20000  loop=      1501  blk=0x93f1 
  len=      4096   offset=0   Seed Values= 37847, 44740, 50780 
  Data Pattern Seed Values = 37847, 44741, 50780    LBA Fencepost = 0xb275
  cblk_read error - Device or resource busy

  /dev/sg2.98       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
  READCMP5  numopers=     20000  loop=     10365  blk=0x86d5 
  len=      4096   offset=0   Seed Values= 37927, 41320, 50710 
  Data Pattern Seed Values = 37927, 41321, 50710    LBA Fencepost = 0xbc7c
  cblk_read error - Device or resource busy

  /dev/sg2.116      Jun  1 06:30:45 2016 err=00000005 sev=4 hxesurelock   
  RDCMP10  numopers=     20000  loop=      6383  blk=0xc33d 
  len=      4096   offset=0   Seed Values= 37945, 49039, 50726 
  Data Pattern Seed Values = 37945, 49040, 50726    LBA Fencepost = 0xd0b0
  cblk_read error - Input/output error

  /dev/fpu17        Jun  1 06:30:51 2016 err=0000000b sev=1 hxefpu64      
  pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable

  /dev/fpu17        Jun  1 06:30:51 2016 err=0000000b sev=1 hxefpu64      
  Hardware Exerciser stopped on an error

  /dev/sctu43       Jun  1 06:30:51 2016 err=0000000b sev=1 hxesctu       
  pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable

  /dev/sctu43       Jun  1 06:30:51 2016 err=0000000b sev=1 hxesctu       
  Hardware Exerciser stopped on an error



  sample of syslog during first htx error:
  Jun  1 06:19:20 fsbmc30p1 systemd[1]: Started Cleanup of Temporary Directories.
  Jun  1 06:25:01 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next 
retry is Wed Jun  1 06:25:31 2016 [v8.16.0 try ]
  Jun  1 06:25:01 fsbmc30p1 CRON[99327]: (root) CMD (test -x /usr/sbin/anacron 
|| ( cd / && run-parts --report /etc/cron.daily ))
  Jun  1 06:26:53 fsbmc30p1 CXLBLK[37882]: 
cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:53 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next 
retry is Wed Jun  1 06:27:23 2016 [v8.16.0 try ]
  Jun  1 06:26:53 fsbmc30p1 CXLBLK[37847]: 
cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:53 fsbmc30p1 CXLBLK[37927]: 
cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
0x607,for chunk->dev_name = /dev/sg2, chunk index = 0

  Jun  1 06:26:59 fsbmc30p1 CXLBLK[37961]: 
cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
0x607,for chunk->dev_name = /dev/sg3, chunk index = 0
  Jun  1 06:26:59 fsbmc30p1 CXLBLK[37954]: 
cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:59 fsbmc30p1 CXLBLK[37887]: 
cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 
0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 
200250 ns

  sample from kern.log during fail:
  Jun  1 06:08:11 fsbmc30p1 kernel: [  250.251041] nvidia-uvm: Loaded the UVM 
driver in lite mode, major device number 241
  Jun  1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 
200250 ns
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.764382] hxesurelock[40392]: 
unhandled signal 11 at 0000000000000024 nip 00003fff84602978 lr 
00003fff84602974 code 30001
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868242] Unable to handle kernel 
paging request for data at address 0x0000000c
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868599] Faulting instruction 
address: 0xc00000000035e2b0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868865] Oops: Kernel access of bad 
area, sig: 11 [#1]
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868928] SMP NR_CPUS=2048 NUMA PowerNV
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868992] Modules linked in: 
nvidia_uvm(POE) iptable_filter ip_tables x_tables nvidia(POE) ipmi_devintf 
joydev input_leds mac_hid opal_prd ofpart cmdlinepart powernv_flash mtd at24 
ipmi_powernv ipmi_msghandler uio_pdrv_genirq uio ibmpowernv powernv_rng 
binfmt_misc nfsd ib_iser auth_rpcgss rdma_cm iw_cm ib_cm nfs_acl ib_sa ib_mad 
lockd ib_core grace ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath 
linear mlx4_en hid_generic usbhid hid uas usb_storage cxlflash ast bnx2x 
i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
drm cxl vxlan mlx4_core ahci ip6_udp_tunnel udp_tunnel libahci mdio libcrc32c
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870299] CPU: 80 PID: 40392 Comm: 
hxesurelock Tainted: P           OE   4.4.8c0ffee0+ #2
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870379] task: c000007935fe23a0 ti: 
c000007910810000 task.ti: c000007910810000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870476] NIP: c00000000035e2b0 LR: 
c00000000035e280 CTR: 0000000000000000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870552] REGS: c0000079108135e0 TRAP: 
0300   Tainted: P           OE    (4.4.8c0ffee0+)
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870642] MSR: 9000000100009033 
<SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28053988  XER: 00000000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] CFAR: c000000000008468 DAR: 
000000000000000c DSISR: 40000000 SOFTE: 1 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR00: c00000000035e280 
c000007910813860 c000000001594600 0000000000000000 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR04: c000007823192400 
000000000002574f 0000000000000001 0000000000000000 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR08: c0000079241b8a00 
0000000000000000 00000000000044fb 65776f702f62696c 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR12: 2d656c3436637072 
c00000000fb6f800 00000000464c457f 0000000000010c78 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR16: 0000000000000000 
0000000000000039 d000000034fa04c5 0000000000010000 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR20: 00000000000000cd 
0000000000000550 0000000000010000 00000000039e0000 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR24: 00003fffffffffff 
c000007910813af8 c000007823192600 c00000793f57b980 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR28: c00000793f573e80 
00003fffffffffff 000000000000001f c000007926f29790 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872149] NIP [c00000000035e2b0] 
elf_core_dump+0xd60/0x1300
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872277] LR [c00000000035e280] 
elf_core_dump+0xd30/0x1300
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872351] Call Trace:
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872407] [c000007910813860] 
[c00000000035e280] elf_core_dump+0xd30/0x1300 (unreliable)
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872527] [c000007910813a60] 
[c00000000036898c] do_coredump+0xcec/0x11e0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872625] [c000007910813c20] 
[c0000000000ce7a0] get_signal+0x540/0x7b0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872705] [c000007910813d10] 
[c000000000017344] do_signal+0x54/0x2b0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872785] [c000007910813e00] 
[c00000000001776c] do_notify_resume+0xbc/0xd0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872877] [c000007910813e30] 
[c000000000009838] ret_from_except_lite+0x64/0x68
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872963] Instruction dump:
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.873004] 60000000 2fa30000 409effa8 
e95f0050 39200000 794737e3 4082ffa4 e91f00a0 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.873148] 2fa80000 419e002c e92800f8 
e9290000 <8129000c> 79279fe3 41820018 7948efe3 
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.884655] ---[ end trace 
f8abb6e0d0322daa ]---

  gsave info: 
  GSA Location: 

  <===== This is from RTC side description =====>
  See the Discussion field for the initial comments from CQ.
  </===== This is from RTC side description =====>
  ==== State: Open by: mpvageli on 02 June 2016 14:20:06 ====

   Oops: Kernel access of bad area, sig: 11 [#1]

  # ipmitool -H -I lanplus -U ADMIN -P admin fru list 47
  Product Name          : OpenPOWER Firmware
  Product Version       : IBM-firestone-ibm-OP8_v1.7_1.62
  Product Extra         : hostboot-bc98d0b-1a29dff
  Product Extra         : occ-0362706-16fdfa7
  Product Extra         : skiboot-5.1.13
  Product Extra         : hostboot-binaries-43d5a59
  Product Extra         : firestone-xml-e7b4fa2-c302f0e
  Product Extra         : capp-ucode-105cb8f

  == Comment: #9 - VIPIN K. PARASHAR <> - 2016-06-07 12:04:49 ==
  root@fsbmc30p1:~# lsb_release -a
  No LSB modules are available.
  Distributor ID:       Ubuntu
  Description:  Ubuntu 16.04 LTS
  Release:      16.04
  Codename:     xenial
  root@fsbmc30p1:~# cat /etc/*release
  VERSION="16.04 LTS (Xenial Xerus)"
  PRETTY_NAME="Ubuntu 16.04 LTS"
  root@fsbmc30p1:~# uname -a
  Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

  == Comment: #24 - VIPIN K. PARASHAR <> - 2016-07-07 07:14:05 ==
  From kernel logs

  [ 7087.918089] device enP3p5s0f2 left promiscuous mode
  [ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
  [ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
  [ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
  [ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
  [ 8816.526807] Unable to handle kernel paging request for data at address 
  [ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
  [ 8816.530233] Unable to handle kernel paging request for data at address 
  [ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
  3f:mon> t
  [c000000686a13a60] c00000000036898c do_coredump+0xcec/0x11e0
  [c000000686a13c20] c0000000000ce7a0 get_signal+0x540/0x7b0
  [c000000686a13d10] c000000000017344 do_signal+0x54/0x2b0
  [c000000686a13e00] c00000000001776c do_notify_resume+0xbc/0xd0
  [c000000686a13e30] c000000000009838 ret_from_except_lite+0x64/0x68
  --- Exception: 300 (Data Access) at 00003fff890b2ee8
  SP (3fff83c2c490) is in userspace
  3f:mon> r
  R00 = c00000000035e280   R16 = 0000000000000000
  R01 = c000000686a13860   R17 = 0000000000000042
  R02 = c000000001594600   R18 = d000000021b104fa
  R03 = 0000000000000000   R19 = 0000000000010000
  R04 = c000002fb7463400   R20 = 00000000000000cd
  R05 = 00000000000001bf   R21 = 0000000000000628
  R06 = 0000000000000001   R22 = 0000000000010000
  R07 = 0000000000000000   R23 = 0000000000250000
  R08 = c00000281af21500   R24 = 00003fffffffffff
  R09 = 0000000000000000   R25 = c000000686a13af8
  R10 = 00000000000044fb   R26 = c000002fb7463800
  R11 = 6c2d656c34366370   R27 = c000002ff0e05cc0
  R12 = 756e672d78756e69   R28 = c000002ff0e05c40
  R13 = c00000000fb65680   R29 = 00003fffffffffff
  R14 = 00000000464c457f   R30 = 0000000000000016
  R15 = 0000000000010e70   R31 = c000002fb94bd3b8
  pc  = c00000000035e2b0 elf_core_dump+0xd60/0x1300
  cfar= c000000000008468 slb_miss_realmode+0x50/0x78
  lr  = c00000000035e280 elf_core_dump+0xd30/0x1300
  msr = 9000000100009033   cr  = 28053828
  ctr = 0000000000000000   xer = 0000000000000000   trap =  300
  dar = 000000000000000c   dsisr = 40000000

  The hxesurelock process segfaulted, and the kernel then crashed while
  dumping its core.
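
As a cross-check on the register dumps above, the faulting instruction can be decoded by hand. The word marked in the instruction dump of the first oops is <8129000c>. Assuming the standard Power ISA D-form layout (6-bit opcode, 5-bit RT, 5-bit RA, 16-bit signed displacement), it decodes to lwz r9,12(r9). With R09 = 0 in both dumps, the effective address comes out as 0xc, which matches the reported DAR. The sketch below is illustrative only: decode_dform is a hypothetical helper, not part of any kernel or debugging tool, and it assumes the second crash faulted on the same instruction because it reports the same NIP.

```python
LWZ_OPCODE = 32  # primary opcode for "lwz" (Load Word and Zero), D-form


def decode_dform(word):
    """Split a 32-bit Power ISA D-form instruction word into its fields.

    Hypothetical helper for illustration; returns (opcode, rt, ra, d),
    where d is the sign-extended 16-bit displacement.
    """
    opcode = (word >> 26) & 0x3F
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    d = word & 0xFFFF
    if d & 0x8000:          # sign-extend the displacement
        d -= 0x10000
    return opcode, rt, ra, d


# Instruction word marked <...> in the oops "Instruction dump".
opcode, rt, ra, d = decode_dform(0x8129000C)

# Value of r9 taken from the register dumps (GPR09 / R09 = 0).
r9 = 0x0

# For a D-form load with RA != 0, EA = (RA) + D, truncated to 64 bits.
effective_address = (r9 + d) & 0xFFFFFFFFFFFFFFFF

print(f"opcode={opcode} rt={rt} ra={ra} d={d}")  # opcode=32 rt=9 ra=9 d=12
print(hex(effective_address))                     # 0xc
```

This is consistent with a NULL pointer being dereferenced at offset 0xc inside elf_core_dump. It does not by itself identify which structure is involved.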

  == Comment: #87 - Frederic Barrat <> - 2017-02-21 11:50:40 ==
  Fix is in kernel v4.10:
  "cxl: Fix coredump generation when cxl_get_fd() is used"

  We'd like to have it backported to 16.10 and 16.04 LTS.
