[Kernel-packages] [Bug 1667239] Comment bridged from LTC Bugzilla

bugproxy Tue, 04 Jul 2017 06:16:41 -0700

------- Comment From [email protected] 2017-07-04 09:03 EDT-------
This CMVC defect is being cancelled by the CDE Bridge because the corresponding 
CQ Defect [SW354783] was transferred out of the bridge domain.
Here are the additional details:
New Subsystem = ppc_triage
New Release = unspecified
New Component = ubuntu_linux
New OwnerInfo = Chavez, Luciano ([email protected])
To continue tracking this issue, please follow CQ defect [SW354783].


Opened defect SW355478 on the new fail to see if it is the same issue.  I
made it sev 1 since the system is in XMON right now, which is preventing
further testing.

Like I mentioned earlier, the fail could be related to this defect.

For this defect...

The "Oops: Kernel access of bad area, sig: 11 [#1]" in the logs happens
during HTX run.

On the reboot (that happened ~30 minutes after first error), I saw partition 
hang/crash.  I had to use ipmitool to power down system.
The current xmon crash in SW355478 / 142348 is different from the
one being tracked in this bug. Will wait for a recreate of the original issue.

The FlashGT HST team still needs to recreate this issue.

SW357236 "HTX fail during superpipe 128 per LUN testing...during Guardband 
Testing" is now marked as a duplicate of this SW354783.
Per comment from JVP (SW357236 submitter), he is attempting a recreate again 
with the latest Firmware for his Tuleta-L.
We will monitor that attempt at recreate, and reopen this SW354783 if a new 
recreate is achieved.

This original recreate attempt on Firestone, fsbmc30, may be delayed, as
it is currently tied up with debugging a link training issue.

<Automated Update> The severity of defect SW354783 was increased from 2
to 1 because defect SW358210 was rejected as the duplicate of defect
SW354783 and the severity of defect SW358210 was higher than 2

The defect submitter, Dion, is out on vacation until 7/11.  So that we can make
progress on this most recent recreate, SW358210 (dup'd to this SW354783),
I request that the defect owner, Luciano/ScreenTeam, please reopen this SW354783
and continue live debug on the held system from SW358210:

#=#=# 2016-07-05 17:12:28 (CDT) #=#=#
Action = [reopen]

I'm not quite sure how to handle this defect (I'll ping Mark Smith).

Dion's defect
SW358210 : FlashGT STC GA3: capiredp01: TMF timed out and Unable to handle 
kernel paging request before system drops into xmon debugger, was running HTX 
for superpipe with 1600 virtual luns across 4 FlashGT NVME cards

was just dup'd to this one.

That system is in the XMON debugger now and can be debugged to 1) verify
it is the same issue and 2) maybe try to find the root cause (his defect can be
re-opened if it is not the same issue).
#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#
Not able to look at SW358210.
Looking into machine capiredp01 box.
Machine details:

FSP: capiredfsp.aus.stglabs.ibm.com (dev/FipSdev)
Partition: capiredp01.aus.stglabs.ibm.com
IPMI console: ipmitool -I lanplus -H capiredfsp.aus.stglabs.ibm.com -P abc123 sol activate

Fail on "capiredfsp" seems same as reported in this bug.
hxesurelock process has segfaulted and kernel has crashed
while generating core dump.

cde00 ([email protected]) added native attachment
/tmp/AIXOS05866176/dmesg_backtrace_capiredfsp on 2016-07-07 06:19:39

Hi Dominic,
Can you please have someone from the kernel team look at this?
The HTX (hxesurelock) process has segfaulted and the kernel has crashed while
generating core. Kernel logs are attached to the bug. The machine is sitting in
xmon and available for debug.
(In reply to comment #25)
> Hi Dominic,
>            Can you please have some one from kernel team look at this ?
> HTX (hxesurelock) process has segfaulted and kernel has crashed while
> generating core. Attached kernel logs with bug . Machine is sitting in
> xmon and available for debug.

Vipin,

I cannot ssh to capiredfsp.aus.stglabs.ibm.com (dev/FipSdev). Is the
machine still in xmon?

(In reply to comment #26)
> Vipin,
> I cannot ssh to capiredfsp.aus.stglabs.ibm.com (dev/FipSdev). Is the machine
> still in xmon?

Yes, it's still sitting in xmon. You can open a console via IPMI.
Please see comment 22 for machine access details.

Just wanted to point out the send_tmf timeout (at the end of the kernel
log) before the crash even though I am not sure it is the cause. The
system is in xmon. Please advise if additional debug data need to be
collected. Thanks.

Snippet at the end of the kernel log:

[ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
[ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
[ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
[ 8816.530233] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
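
To make the timing in the snippet above easier to see, here is a minimal Python sketch that pulls the timestamp out of each kernel log line and reports how long after the last TMF timeout each later event was logged. The `LOG` sample is copied from the snippet (with mail-wrapped lines joined); the helper names `parse` and `seconds_after_last_tmf` are made up for illustration and are not part of any tool mentioned in this bug.

```python
import re

# Sample copied from the kernel log above, with mail-wrapped lines rejoined.
LOG = """\
[ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
[ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
"""

# "[ seconds.micros] message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(log):
    """Return a list of (timestamp_seconds, message) tuples."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return events


def seconds_after_last_tmf(events):
    """List (delay_seconds, message) for events logged after the last TMF timeout."""
    tmf_times = [t for t, msg in events if "TMF timed out" in msg]
    if not tmf_times:
        return []
    last = max(tmf_times)
    return [(round(t - last, 6), msg) for t, msg in events if t > last]


if __name__ == "__main__":
    for delay, msg in seconds_after_last_tmf(parse(LOG)):
        print(f"+{delay:.3f}s  {msg[:60]}")
```

On this sample the first segfault is logged roughly 10.3 seconds after the second TMF timeout. That ordering fits the observation above, but by itself it does not show that the timeout caused the crash.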

Snippet of the send_tmf() code:
                cmd_checkin(cmd);
                spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
                cfg->tmf_active = false;
                spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);
                goto out;
        }

        /* Wait up to 5 seconds for tmf_active to be cleared. */
        spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
        to = msecs_to_jiffies(5000);
        to = wait_event_interruptible_lock_irq_timeout(cfg->tmf_waitq,
                                                       !cfg->tmf_active,
                                                       cfg->tmf_slock,
                                                       to);
        if (!to) {
                /* Timeout elapsed with the TMF still marked active. */
                cfg->tmf_active = false;
                dev_err(dev, "%s: TMF timed out!\n", __func__);
                rc = -1;
        }
        spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);

Boqun,

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1667239

Title:
  FlashGT Integration and Setup: fsbmc30: After 17th reboot of soft
  bootme, HTX & Linux errors seen with 256 virtual LUNs

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Yakkety:
  Fix Released

Bug description:
  == Comment: #1 - Application Cdeadmin <[email protected]> - 2016-06-02 15:28:27 ==
  ==== State: Open by: anitrap on 01 June 2016 17:36:39 ====

  Contact: Anitra Powell  ([email protected] )
  Backup: Dion Bell ([email protected])

  
  Primary BMC (1603G):
  =====================================================
  # cat /proc/ractrends/Helper/FwInfo
  FW_VERSION=2.13.91819
  FW_DATE=Mar 10 2016
  FW_BUILDTIME=10:59:31 CDT
  FW_DESC=8335 SRC BUILD RR9 03102016
  FW_PRODUCTID=1
  FW_RELEASEID=RR9
  FW_CODEBASEVERSION=2.X
  #

  PNOR (1603G):
  ========================
  # ipmitool -H 127.0.0.1 -I lanplus -U ADMIN -P admin fru list 47
  Product Name          : OpenPOWER Firmware
  Product Version       : IBM-firestone-ibm-OP8_v1.7_1.62
  Product Extra         : hostboot-bc98d0b-1a29dff
  Product Extra         : occ-0362706-16fdfa7
  Product Extra         : skiboot-5.1.13
  Product Extra         : hostboot-binaries-43d5a59
  Product Extra         : firestone-xml-e7b4fa2-c302f0e
  Product Extra         : capp-ucode-105cb8f

  Partition Info:
  =================
         ver 1.5.4.3 - OS, HTX, Firmware and Machine details

                             OS: GNU/Linux
                     OS Version: Ubuntu 16.04 LTS \n \l
                 Kernel Version: 4.4.8c0ffee0+
                    HTX Version: htxubuntu-396
                      Host Name: fsbmc30p1
              Machine Serial No: 210995A
             Machine Type/Model: 8335-GCA

  root@fsbmc30p1:~# uname -a
  Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

  FlashGT NVMe setup:
  ===================
  1 FlashGT card in slot 1 running in superpipe mode with 128 LUNs per port (total of 256 LUNs).

  lsscsi
  [0:0:0:0]    disk    ATA      ST1000NX0313     BE33  /dev/sda
  [1:0:0:0]    disk    ATA      ST1000NX0313     BE33  /dev/sdb
  [4:0:0:0]    disk    NVMe     SAMSUNG MZ1LV960 3011  /dev/sdc
  [4:1:0:0]    disk    NVMe     SAMSUNG MZ1LV960 3011  /dev/sdd
  [5:0:0:0]    cd/dvd  AMI      Virtual CDROM0   1.00  /dev/sr0
  [5:0:0:1]    cd/dvd  AMI      Virtual CDROM1   1.00  /dev/sr1
  [5:0:0:2]    cd/dvd  AMI      Virtual CDROM2   1.00  /dev/sr2
  [5:0:0:3]    cd/dvd  AMI      Virtual CDROM3   1.00  /dev/sr3
  [6:0:0:0]    disk    AMI      Virtual Floppy0  1.00  /dev/sde
  [6:0:0:1]    disk    AMI      Virtual Floppy1  1.00  /dev/sdf
  [6:0:0:2]    disk    AMI      Virtual Floppy2  1.00  /dev/sdg
  [6:0:0:3]    disk    AMI      Virtual Floppy3  1.00  /dev/sdh
  [7:0:0:0]    disk    AMI      Virtual HDisk0   1.00  /dev/sdi
  [7:0:0:1]    disk    AMI      Virtual HDisk1   1.00  /dev/sdj
  [7:0:0:2]    disk    AMI      Virtual HDisk2   1.00  /dev/sdk
  [7:0:0:3]    disk    AMI      Virtual HDisk3   1.00  /dev/sdl
  [7:0:0:4]    disk    AMI      Virtual HDisk4   1.00  /dev/sdm

  lspci | grep -i acc
  0004:01:00.0 Processing accelerators: IBM Device 0601 (rev 01)

  ls -l /sys/class/cxl
  total 0
  lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0 -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0
  lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0m -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0/afu0.0m
  lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0s -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0/afu0.0s
  lrwxrwxrwx 1 root root 0 May 31 13:27 card0 -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0

  lscfg | grep afu
  + afu0.0           Slot1/card0/afu0.0
  + afu0.0m          Slot1/card0/afu0.0/afu0.0m
  + afu0.0s          Slot1/card0/afu0.0/afu0.0s

  /opt/ibm/capikv/bin/cxlfstatus
  CXL Flash Device Status

  Found 0601 0004:01:00.0 Slot1
      Device:       SCSI  Block       Mode                          LUN WWID
         sg2:    4:0:0:0,   sdc, superpipe, 60025380025382463300046000000000
         sg3:    4:1:0:0,   sdd, superpipe, 60025380025382463300052000000000

  dpkg -l | grep capi
  [package name truncated]                    3.0-1970-3042652                ppc64el      no description given
  [package name truncated]                    3.0-1970-3042652                ppc64el      no description given

  root@fsbmc30p1:/tmp# dpkg -l | grep afu
  ii  afuimage                                3.0-1970-3042652                all          no description given

  cat /opt/ibm/capikv/version.txt
  1970-3042652

  /opt/ibm/capikv/afu/cxl_afu_dump /dev/cxl/afu0.0m -v
  AFU Version     = 160525N1

   NVMe0 Version = BTV73011
   NVMe0 NEXT    = BTV73011
   NVMe0 STATUS  = 0x702

   NVMe1 Version = BTV73011
   NVMe1 NEXT    = BTV73011
   NVMe1 STATUS  = 0x702

  cat /tmp/test_lun_mode
  128

  Problem:
  ===========
  While running soft bootme (shutdown -r from the OS every hour), I
  noticed htx errors after the 9th and 17th reboots of the partition.  At
  this point they seem like different issues, so I am opening 2 separate
  defects.  I've already opened defect SW354759 for the first set of htx
  errors and assigned it to htx_screen.

  This defect is for issue that happened after 17th reboot (Jun 1 @
  6am). On the 18th reboot (Jun 1 @ 7am), the shutdown -r command
  failed... I had to manually power down system.

  I guess I will open to surelock_screen first since it seems similar to
  the one Dion opened up while running 128 virtual LUNs per port (defect
  http://w3.rchland.ibm.com/projects/bestquest/?defect=SW353881) .  For
  this fail, other exercisers eventually failed also.

  Test Info:
  ============
  - running Soft bootme (shutdown -r every hour)
  - mdt.bu + hxecom  (GPUs were running). I copied a modified mdt.bu to another mdt file so I would not see any errors in htx after reboot.

  Sample of HTX errors (for this defect)
  ==============================
  /dev/sg2.53       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
  READCMP5  numopers=     20000  loop=      4956  blk=0x4eee 
  len=      4096   offset=0   Seed Values= 37882, 44181, 50758 
  Data Pattern Seed Values = 37882, 44182, 50758    LBA Fencepost = 0xb94a
  cblk_read error - Device or resource busy

  /dev/sg2.18       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
  READCMP9  numopers=     20000  loop=      1501  blk=0x93f1 
  len=      4096   offset=0   Seed Values= 37847, 44740, 50780 
  Data Pattern Seed Values = 37847, 44741, 50780    LBA Fencepost = 0xb275
  cblk_read error - Device or resource busy

  /dev/sg2.98       Jun  1 06:26:53 2016 err=00000010 sev=4 hxesurelock   
  READCMP5  numopers=     20000  loop=     10365  blk=0x86d5 
  len=      4096   offset=0   Seed Values= 37927, 41320, 50710 
  Data Pattern Seed Values = 37927, 41321, 50710    LBA Fencepost = 0xbc7c
  cblk_read error - Device or resource busy

  /dev/sg2.116      Jun  1 06:30:45 2016 err=00000005 sev=4 hxesurelock   
  RDCMP10  numopers=     20000  loop=      6383  blk=0xc33d 
  len=      4096   offset=0   Seed Values= 37945, 49039, 50726 
  Data Pattern Seed Values = 37945, 49040, 50726    LBA Fencepost = 0xd0b0
  cblk_read error - Input/output error

  /dev/fpu17        Jun  1 06:30:51 2016 err=0000000b sev=1 hxefpu64      
  pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable

  /dev/fpu17        Jun  1 06:30:51 2016 err=0000000b sev=1 hxefpu64      
  Hardware Exerciser stopped on an error

  /dev/sctu43       Jun  1 06:30:51 2016 err=0000000b sev=1 hxesctu       
  pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable

  /dev/sctu43       Jun  1 06:30:51 2016 err=0000000b sev=1 hxesctu       
  Hardware Exerciser stopped on an error
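
The `err=` fields in the HTX entries above look like hexadecimal errno values: the message on each entry matches the usual Linux errno text for that number. A small Python sketch to check that reading follows. Treating `err=` as an errno is an assumption drawn from these samples, not something stated in the HTX output, and the function name `decode_htx_err` is hypothetical. The numeric values are those used on Linux.

```python
import errno
import os


def decode_htx_err(field):
    """Interpret an HTX 'err=' hex field as a Linux errno (an assumption)."""
    code = int(field, 16)
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


if __name__ == "__main__":
    # err= values copied from the HTX entries above
    for field in ("00000010", "00000005", "0000000b"):
        code, name, text = decode_htx_err(field)
        print(f"err={field} -> {code} {name}: {text}")
```

Under that assumption, 0x10 is EBUSY ("Device or resource busy"), 0x05 is EIO ("Input/output error") and 0x0b is errno 11, which agrees with the "rc: 11, errno: 11" text in the pthread_create failures.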

  Logs:
  ======
  /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1

  /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/htxerr
  /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/syslog
  /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/kern.log
  /gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/bootme.log

  sample of syslog during first htx error:
  ================================================
  Jun  1 06:19:20 fsbmc30p1 systemd[1]: Started Cleanup of Temporary Directories.
  Jun  1 06:25:01 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next retry is Wed Jun  1 06:25:31 2016 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
  Jun  1 06:25:01 fsbmc30p1 CRON[99327]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
  Jun  1 06:26:53 fsbmc30p1 CXLBLK[37882]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:53 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next retry is Wed Jun  1 06:27:23 2016 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
  Jun  1 06:26:53 fsbmc30p1 CXLBLK[37847]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:53 fsbmc30p1 CXLBLK[37927]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0

  Jun  1 06:26:59 fsbmc30p1 CXLBLK[37961]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg3, chunk index = 0
  Jun  1 06:26:59 fsbmc30p1 CXLBLK[37954]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:59 fsbmc30p1 CXLBLK[37887]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
  Jun  1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 200250 ns

  sample from kern.log during fail:
  =================================
  Jun  1 06:08:11 fsbmc30p1 kernel: [  250.251041] nvidia-uvm: Loaded the UVM driver in lite mode, major device number 241
  Jun  1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 200250 ns
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.764382] hxesurelock[40392]: unhandled signal 11 at 0000000000000024 nip 00003fff84602978 lr 00003fff84602974 code 30001
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868242] Unable to handle kernel paging request for data at address 0x0000000c
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868599] Faulting instruction address: 0xc00000000035e2b0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868865] Oops: Kernel access of bad area, sig: 11 [#1]
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868928] SMP NR_CPUS=2048 NUMA PowerNV
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.868992] Modules linked in: nvidia_uvm(POE) iptable_filter ip_tables x_tables nvidia(POE) ipmi_devintf joydev input_leds mac_hid opal_prd ofpart cmdlinepart powernv_flash mtd at24 ipmi_powernv ipmi_msghandler uio_pdrv_genirq uio ibmpowernv powernv_rng binfmt_misc nfsd ib_iser auth_rpcgss rdma_cm iw_cm ib_cm nfs_acl ib_sa ib_mad lockd ib_core grace ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mlx4_en hid_generic usbhid hid uas usb_storage cxlflash ast bnx2x i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm cxl vxlan mlx4_core ahci ip6_udp_tunnel udp_tunnel libahci mdio libcrc32c
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870299] CPU: 80 PID: 40392 Comm: hxesurelock Tainted: P           OE   4.4.8c0ffee0+ #2
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870379] task: c000007935fe23a0 ti: c000007910810000 task.ti: c000007910810000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870476] NIP: c00000000035e2b0 LR: c00000000035e280 CTR: 0000000000000000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870552] REGS: c0000079108135e0 TRAP: 0300   Tainted: P           OE    (4.4.8c0ffee0+)
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870642] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28053988  XER: 00000000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] CFAR: c000000000008468 DAR: 000000000000000c DSISR: 40000000 SOFTE: 1
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR00: c00000000035e280 c000007910813860 c000000001594600 0000000000000000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR04: c000007823192400 000000000002574f 0000000000000001 0000000000000000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR08: c0000079241b8a00 0000000000000000 00000000000044fb 65776f702f62696c
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR12: 2d656c3436637072 c00000000fb6f800 00000000464c457f 0000000000010c78
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR16: 0000000000000000 0000000000000039 d000000034fa04c5 0000000000010000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR20: 00000000000000cd 0000000000000550 0000000000010000 00000000039e0000
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR24: 00003fffffffffff c000007910813af8 c000007823192600 c00000793f57b980
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR28: c00000793f573e80 00003fffffffffff 000000000000001f c000007926f29790
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872149] NIP [c00000000035e2b0] elf_core_dump+0xd60/0x1300
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872277] LR [c00000000035e280] elf_core_dump+0xd30/0x1300
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872351] Call Trace:
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872407] [c000007910813860] [c00000000035e280] elf_core_dump+0xd30/0x1300 (unreliable)
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872527] [c000007910813a60] [c00000000036898c] do_coredump+0xcec/0x11e0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872625] [c000007910813c20] [c0000000000ce7a0] get_signal+0x540/0x7b0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872705] [c000007910813d10] [c000000000017344] do_signal+0x54/0x2b0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872785] [c000007910813e00] [c00000000001776c] do_notify_resume+0xbc/0xd0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872877] [c000007910813e30] [c000000000009838] ret_from_except_lite+0x64/0x68
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.872963] Instruction dump:
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.873004] 60000000 2fa30000 409effa8 e95f0050 39200000 794737e3 4082ffa4 e91f00a0
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.873148] 2fa80000 419e002c e92800f8 e9290000 <8129000c> 79279fe3 41820018 7948efe3
  Jun  1 06:28:16 fsbmc30p1 kernel: [ 1454.884655] ---[ end trace f8abb6e0d0322daa ]---
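
The bracketed word in the instruction dump, `<8129000c>`, is the instruction at the faulting NIP. As a rough cross-check of the oops, the Python sketch below decodes only the two Power ISA load forms needed here (`lwz`, primary opcode 32, D-form; `ld`, primary opcode 58 with the low two bits 00, DS-form) and applies them to the last three words up to and including the marked one. It is a minimal illustration, not a disassembler, and the function name `decode_load` is hypothetical.

```python
def decode_load(word):
    """Decode a 32-bit Power ISA word if it is lwz (D-form) or ld (DS-form).

    Returns (mnemonic, rt, ra, displacement), or None for anything else.
    """
    opcode = (word >> 26) & 0x3F      # primary opcode, top 6 bits
    rt = (word >> 21) & 0x1F          # target register
    ra = (word >> 16) & 0x1F          # base register
    imm = word & 0xFFFF               # low 16 bits
    if opcode == 32:                  # lwz RT, D(RA)
        disp = imm - 0x10000 if imm & 0x8000 else imm
        return ("lwz", rt, ra, disp)
    if opcode == 58 and (imm & 0x3) == 0:   # ld RT, DS(RA)
        ds = imm & 0xFFFC
        disp = ds - 0x10000 if ds & 0x8000 else ds
        return ("ld", rt, ra, disp)
    return None


if __name__ == "__main__":
    # Words copied from the end of the instruction dump above
    for word in (0xE92800F8, 0xE9290000, 0x8129000C):
        mnem, rt, ra, disp = decode_load(word)
        print(f"{word:08x}  {mnem} r{rt},{disp}(r{ra})")
```

The marked word decodes to `lwz r9,12(r9)`. GPR09 is 0 in the register dump, so the effective address is 0xc, which matches DAR 000000000000000c and the reported "data at address 0x0000000c". In other words, the kernel read a 32-bit field at offset 12 through a NULL pointer inside elf_core_dump; which structure that pointer belongs to cannot be told from the dump alone.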

  gsave info: 
  ==============
  GSA Location: /gsa/ausgsa/projects/s/sift/hst/trial_data/Surelock/Ubuntu/flashgt/fsbmc30p1_ubuntu1604_FlashGT_bootme_test5/FAIL201606011024

  <===== This is from RTC side description =====>
  See the Discussion field for the initial comments from CQ.
  </===== This is from RTC side description =====>
  ==== State: Open by: mpvageli on 02 June 2016 14:20:06 ====

   Oops: Kernel access of bad area, sig: 11 [#1]

  # ipmitool -H 127.0.0.1 -I lanplus -U ADMIN -P admin fru list 47
  Product Name          : OpenPOWER Firmware
  Product Version       : IBM-firestone-ibm-OP8_v1.7_1.62
  Product Extra         : hostboot-bc98d0b-1a29dff
  Product Extra         : occ-0362706-16fdfa7
  Product Extra         : skiboot-5.1.13
  Product Extra         : hostboot-binaries-43d5a59
  Product Extra         : firestone-xml-e7b4fa2-c302f0e
  Product Extra         : capp-ucode-105cb8f

  == Comment: #9 - VIPIN K. PARASHAR <[email protected]> - 2016-06-07 12:04:49 ==
  root@fsbmc30p1:~# lsb_release -a
  No LSB modules are available.
  Distributor ID:       Ubuntu
  Description:  Ubuntu 16.04 LTS
  Release:      16.04
  Codename:     xenial
  root@fsbmc30p1:~# cat /etc/*release
  DISTRIB_ID=Ubuntu
  DISTRIB_RELEASE=16.04
  DISTRIB_CODENAME=xenial
  DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"
  NAME="Ubuntu"
  VERSION="16.04 LTS (Xenial Xerus)"
  ID=ubuntu
  ID_LIKE=debian
  PRETTY_NAME="Ubuntu 16.04 LTS"
  VERSION_ID="16.04"
  HOME_URL="http://www.ubuntu.com/"
  SUPPORT_URL="http://help.ubuntu.com/"
  BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
  UBUNTU_CODENAME=xenial
  root@fsbmc30p1:~# uname -a
  Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux
  root@fsbmc30p1:~#

  == Comment: #24 - VIPIN K. PARASHAR <[email protected]> - 2016-07-07 07:14:05 ==
  From kernel logs
  ===========

  [ 7087.918089] device enP3p5s0f2 left promiscuous mode
  [ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
  [ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
  [ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
  [ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
  [ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c
  [ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
  [ 8816.530233] Unable to handle kernel paging request for data at address 0x0000000c
  [ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
  3f:mon> t
  [c000000686a13a60] c00000000036898c do_coredump+0xcec/0x11e0
  [c000000686a13c20] c0000000000ce7a0 get_signal+0x540/0x7b0
  [c000000686a13d10] c000000000017344 do_signal+0x54/0x2b0
  [c000000686a13e00] c00000000001776c do_notify_resume+0xbc/0xd0
  [c000000686a13e30] c000000000009838 ret_from_except_lite+0x64/0x68
  --- Exception: 300 (Data Access) at 00003fff890b2ee8
  SP (3fff83c2c490) is in userspace
  3f:mon> r
  R00 = c00000000035e280   R16 = 0000000000000000
  R01 = c000000686a13860   R17 = 0000000000000042
  R02 = c000000001594600   R18 = d000000021b104fa
  R03 = 0000000000000000   R19 = 0000000000010000
  R04 = c000002fb7463400   R20 = 00000000000000cd
  R05 = 00000000000001bf   R21 = 0000000000000628
  R06 = 0000000000000001   R22 = 0000000000010000
  R07 = 0000000000000000   R23 = 0000000000250000
  R08 = c00000281af21500   R24 = 00003fffffffffff
  R09 = 0000000000000000   R25 = c000000686a13af8
  R10 = 00000000000044fb   R26 = c000002fb7463800
  R11 = 6c2d656c34366370   R27 = c000002ff0e05cc0
  R12 = 756e672d78756e69   R28 = c000002ff0e05c40
  R13 = c00000000fb65680   R29 = 00003fffffffffff
  R14 = 00000000464c457f   R30 = 0000000000000016
  R15 = 0000000000010e70   R31 = c000002fb94bd3b8
  pc  = c00000000035e2b0 elf_core_dump+0xd60/0x1300
  cfar= c000000000008468 slb_miss_realmode+0x50/0x78
  lr  = c00000000035e280 elf_core_dump+0xd30/0x1300
  msr = 9000000100009033   cr  = 28053828
  ctr = 0000000000000000   xer = 0000000000000000   trap =  300
  dar = 000000000000000c   dsisr = 40000000
  3f:mon> 

  
  hxesurelock process has segfaulted and kernel has crashed while
  dumping core.
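
Several registers in the xmon dump hold text rather than pointers. Since the partition is ppc64le, reading a register value as little-endian bytes gives the characters in memory order. A minimal Python sketch (the helper name `reg_to_ascii` is made up for illustration):

```python
def reg_to_ascii(value, width=8):
    """Render a register value as the bytes it holds in little-endian memory order."""
    raw = value.to_bytes(width, "little")
    # Replace non-printable bytes with '.' so the output stays readable.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Values copied from the xmon 'r' output above
    regs = {
        "R11": 0x6C2D656C34366370,
        "R12": 0x756E672D78756E69,
        "R14": 0x00000000464C457F,
    }
    for name, value in regs.items():
        print(name, repr(reg_to_ascii(value)))
```

R11 and R12 read as `pc64le-l` and `inux-gnu`, which looks like part of a `powerpc64le-linux-gnu` path, and R14 begins with the ELF magic bytes (0x7f followed by `ELF`). That is consistent with the crash happening while the kernel was writing out the core file, but it does not identify the faulting structure.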

  == Comment: #87 - Frederic Barrat <[email protected]> - 2017-02-21 11:50:40 ==
  Fix is in kernel v4.10:
  bdecf76e319a29735d828575f4a9269f0e17c547
  "cxl: Fix coredump generation when cxl_get_fd() is used"

  We'd like to have it backported to 16.10 and 16.04 LTS.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1667239/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
