[Kernel-packages] [Bug 1762844] Comment bridged from LTC Bugzilla

bugproxy Sat, 21 Apr 2018 20:51:09 -0700

------- Comment From [email protected] 2018-04-21 23:33 EDT-------
Continuing from the description at #84...


This is going to be a long post. I debated putting it in as an
attachment, but placing it in the main body will probably
help with searching in the future.

===========================================================

So the pool_mayday_timeout() routine basically walked into the
weird/corrupted work item on the pool workqueue corresponding to
cpu 0x68 (104). So the critical question is who may have put
that item there with the strange work->data value
data = R10: 0000000000002040

BTW, pool_mayday_timeout got the pool thus:

static void pool_mayday_timeout(struct timer_list *t)
{
        struct worker_pool *pool = from_timer(pool, t, mayday_timer);
        /* and then ... */
        list_for_each_entry(work, &pool->worklist, entry)
                send_mayday(work);
}

This is the timer:

struct timer_list {
entry = {
next = 0x5deadbeef0000200,
pprev = 0x0
},
expires = 0x1018d5103,
function = 0xc00000000012e790 <pool_mayday_timeout>,
flags = 0x1000068
}

crash> rd jiffies
c000000001713b00:  00000001018d5380                    .S......

The corresponding worker pool:
-----------------------------
struct worker_pool {
lock = {
{
rlock = {
raw_lock = {
slock = 0x80000068
}
}
}
},
cpu = 0x68,
node = 0x8,
id = 0xd0, <<<<---- Note the id!
flags = 0x1,
watchdog_ts = 0x1018d50b0,
worklist = {
next = 0xc000000fe2a0b020,
prev = 0xc000000fe2a075b0
},
nr_workers = 0x2,
nr_idle = 0x0,
idle_list = {
next = 0xc000200e60eb7db8,
prev = 0xc000200e60eb7db8
},
idle_timer = {
entry = {
next = 0x5deadbeef0000200,
pprev = 0x0
},
expires = 0x1018c6fa3,
function = 0xc00000000012e980 <idle_worker_timeout>,
flags = 0x41c80068
},
mayday_timer = {
entry = {
next = 0x5deadbeef0000200,
pprev = 0x0
},
expires = 0x1018d5103,
function = 0xc00000000012e790 <pool_mayday_timeout>,
flags = 0x1000068
},
...
workers = {
next = 0xc0002000cb9fa138,
prev = 0xc0002000cb9f1708
},
detach_completion = 0x0,
worker_ida = {
ida_rt = {
gfp_mask = 0x7000000,
rnode = 0xc000200e51923bd8
}
},
attrs = 0xc000000ff92cddf8,
hash_node = {
next = 0x0,
pprev = 0x0
},
refcnt = 0x1,
nr_running = {
counter = 0x0
},
rcu = {
next = 0x0,
func = 0x0
}
}

This pool->id will be used to compute a marker for a
deleted/executed item in the future.

Now let's walk through the work items on the worklist:
------------------------------------------------------
The TWO entries there are:
c000000fe2a075b0
c000000fe2a0b020
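
The crash commands that follow are given addresses 8 bytes below these
list pointers, because pool->worklist links the embedded "entry" member
and not the start of the work_struct. Here is a minimal sketch of that
container_of() style arithmetic. The mirror struct and the helper name
are hypothetical; the layout (data at offset 0, entry at offset 8) is
assumed from the 64-bit dumps in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 64-bit work_struct layout seen in the dump
 * (no lockdep/debug fields assumed): data, then the list entry, then func. */
struct work_struct_mirror {
    uint64_t data;                          /* atomic_long_t data     */
    struct { uint64_t next, prev; } entry;  /* struct list_head entry */
    uint64_t func;                          /* work_func_t func       */
};

/* The list entry is assumed to sit 8 bytes into the structure. */
static_assert(offsetof(struct work_struct_mirror, entry) == 8,
              "entry is expected at offset 8");

/* Same arithmetic as the kernel's container_of(): subtract the offset of
 * the embedded member from the address of that member. */
static uint64_t entry_to_work(uint64_t entry_addr)
{
    return entry_addr - offsetof(struct work_struct_mirror, entry);
}
```

With the two list pointers above, entry_to_work() yields c000000fe2a075a8
and c000000fe2a0b018, which are the addresses passed to work_struct below.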

Work struct #1
===============
crash> work_struct c000000fe2a075a8
struct work_struct {
data = {
counter = 0xc000200e60eba305 <<<---- DATA GOOD!!! (a pwq)
},
entry = {
next = 0xc000200e60eb7da0,
prev = 0xc000000fe2a0b020
},
func = 0xc00800000b1af5a8
}

Work struct #2
===============
crash> work_struct 0xc000000fe2a0b018
struct work_struct {
data = {
counter = 0x2040 <<<------  DATA BAD!!!
},
entry = {
next = 0xc000000fe2a0b020,
prev = 0xc000000fe2a0b020
},
func = 0xc00800000b1af5a8
}

Note that Work struct #2 is the PROBLEM work item!!
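
To make the GOOD/BAD distinction concrete, here is a rough sketch of how
work->data can be decoded. The constants are my reading of
include/linux/workqueue.h for a 4.15-era 64-bit build without
CONFIG_DEBUG_OBJECTS_WORK, which is an assumption about this kernel; the
helper names are hypothetical. A queued item carries a pwq pointer plus
flag bits, while an off-queue item carries only a shifted pool id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values (4.15-era, 64-bit, no CONFIG_DEBUG_OBJECTS_WORK). */
#define WORK_STRUCT_PENDING      (1ULL << 0)
#define WORK_STRUCT_PWQ          (1ULL << 2)
#define WORK_STRUCT_FLAG_BITS    8
#define WORK_STRUCT_WQ_DATA_MASK (~((1ULL << WORK_STRUCT_FLAG_BITS) - 1))
#define WORK_OFFQ_POOL_SHIFT     5

/* The PWQ flag must live in the low bits that the pointer mask strips. */
static_assert((WORK_STRUCT_PWQ & WORK_STRUCT_WQ_DATA_MASK) == 0,
              "flag bit must lie below the pointer bits");

/* True when data carries a pool_workqueue pointer (item is queued). */
static bool data_has_pwq(uint64_t data)
{
    return (data & WORK_STRUCT_PWQ) != 0;
}

/* The pwq pointer, or 0 (NULL) when the PWQ bit is clear. */
static uint64_t data_to_pwq(uint64_t data)
{
    return data_has_pwq(data) ? (data & WORK_STRUCT_WQ_DATA_MASK) : 0;
}

/* Pool id of an off-queue item (only meaningful when PWQ is clear). */
static uint64_t data_to_pool_id(uint64_t data)
{
    return data >> WORK_OFFQ_POOL_SHIFT;
}
```

Under these assumptions, 0xc000200e60eba305 decodes to pwq
0xc000200e60eba300 with PENDING and PWQ set, while 0x2040 has no PWQ bit
and decodes to pool id 0x102. Any code that expects a pwq for an item on
the worklist would get NULL for Work struct #2.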

One important thing to note: BOTH THE WORK ENTRIES HAVE THE SAME
WORK FUNCTION - i.e. the same entity likely created this work.

crash> dis 0xc00800000b1af5a8 1
0xc00800000b1af5a8 <qlt_free_session_done>:     addis   r2,r12,4

So these work entries were created by the QLogic driver!

<continues below>

------- Comment From [email protected] 2018-04-21 23:36 EDT-------
<from the previous segment>

The call
=======
void qlt_unreg_sess(struct fc_port *sess)
...
INIT_WORK(&sess->free_work, qlt_free_session_done);
schedule_work(&sess->free_work);

Now we can look at the embedding structure, which is the following:

crash> fc_port c000000fe2a0af58
struct fc_port {
list = {
next = 0xc000000fe2a074e8,
prev = 0xc000000fe2a09f68
},
vha = 0xc000200e458b69a0,
node_name = "P\005\ah\001\000\241\245",
port_name = "P\005\ah\001\020\241\245",
d_id = {
b24 = 0x8cfdc0,
b = {
al_pa = 0xc0,
area = 0xfd,
domain = 0x8c,
rsvd_1 = 0x0
}
},
loop_id = 0x1000,
old_loop_id = 0x0,
conf_compl_supported = 0x0,
deleted = 0x2,
local = 0x0,
logout_on_delete = 0x1,
logo_ack_needed = 0x0,
keep_nport_handle = 0x0,
send_els_logo = 0x0,
login_pause = 0x0,
login_succ = 0x0,
query = 0x0,
nvme_del_work = {
data = {
counter = 0x0
},
entry = {
next = 0x0,
prev = 0x0
},
func = 0x0
},
nvme_del_done = {
done = 0x0,
wait = {
lock = {
{
rlock = {
raw_lock = {
slock = 0x0
}
}
}
},
head = {
next = 0x0,
prev = 0x0
}
}
},
nvme_prli_service_param = 0x0,
nvme_flag = 0x0,
conflict = 0x0,
logout_completed = 0x0,
generation = 0x0,
se_sess = 0x0,
sess_kref = {
refcount = {
refs = {
counter = 0x0
}
}
},
tgt = 0x0,
expires = 0x0,
del_list_entry = {
next = 0x0,
prev = 0x0
},
free_work = {  <<<<-------  THIS IS OUR (CORRUPTED) WORK!
data = {
counter = 0x2040
},
entry = {
next = 0xc000000fe2a0b020,
prev = 0xc000000fe2a0b020
},
func = 0xc00800000b1af5a8 <qlt_free_session_done>
},
plogi_link = {0x0, 0x0},
tgt_id = 0x0,
old_tgt_id = 0x0,
fcp_prio = 0x0,
fabric_port_name = " \b\000'\370\037N\261",
fp_speed = 0xffff,
port_type = FCT_TARGET,
state = {
counter = 0x3
},
flags = 0xb,
login_retry = 0x1d,
rport = 0xc000000fd91399c8,
drport = 0x0,
supported_classes = 0x8,
fc4_type = 0x8,
fc4f_nvme = 0x0,
scan_state = 0x2,
n2n_flag = 0x0,
last_queue_full = 0x0,
last_ramp_up = 0x0,
port_id = 0x0,
nvme_remote_port = 0x0,
retry_delay_timestamp = 0x0,
tgt_session = 0x0,
ct_desc = {
ct_sns = 0xc000200e4fe30000,
ct_sns_dma = 0x800200e4fe30000
},
disc_state = DSC_GNL,
fw_login_state = DSC_LS_PORT_UNAVAIL,
plogi_nack_done_deadline = 0x0,
login_gen = 0x8,
last_login_gen = 0x8,
rscn_gen = 0x0,
last_rscn_gen = 0x0,
chip_reset = 0x0,
gnl_entry = {
next = 0xc000200e458b7078,
prev = 0xc000200e458b7078
},
del_work = {
data = {
counter = 0x2040
},
entry = {
next = 0xc000000fe2a0b100,
prev = 0xc000000fe2a0b100
},
func = 0xc00800000b1abb18 <qla24xx_delete_sess_fn>
},
iocb = 
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
current_login_state = 0x0,
last_login_state = 0x0,
n2n_done = {
done = 0x0,
wait = {
lock = {
{
rlock = {
raw_lock = {
slock = 0x0
}
}
}
},
head = {
next = 0x0,
prev = 0x0
}
}
}
}

Looking at the above, we see that del_work has a similar value!!!
(although it properly shows an empty list)

del_work = {
data = {
counter = 0x2040
},
entry = {
next = 0xc000000fe2a0b100,
prev = 0xc000000fe2a0b100
},
func = 0xc00800000b1abb18 <qla24xx_delete_sess_fn>
},

So, it appears that 0x2040 is some kind of a SPECIAL MARKER!!
=============================================================

We display the port/node names ...

crash> rd c000000fe2a0af70 16
c000000fe2a0af70:  a5a1000168070550 a5a1100168070550   P..h....P..h....
                   ^^^node_name^^^^ ^^^port_name^^^^
c000000fe2a0af80:  00001000008cfdc0 0000000000000014   ................

Also:
crash> rd c000000fe2a0b04d 16
c000000fe2a0b04d:  b14e1ff827000820 0300000005ffff00    ..'..N.........
                   ^^fabric_port_name

The Relevant States
===================
crash> fc_port_t.deleted c000000fe2a0af58
deleted = 0x2
QLA_SESS_DELETED
crash> fc_port_t.disc_state c000000fe2a0af58
disc_state = DSC_GNL
crash> fc_port_t.fw_login_state c000000fe2a0af58
fw_login_state = DSC_LS_PORT_UNAVAIL
crash> fc_port_t.flags c000000fe2a0af58
flags = 0xb
FCF_FABRIC_DEVICE | FCF_LOGIN_NEEDED | FCF_ASYNC_SENT

DSC_GNL is set in qla24xx_async_gnl/qla24xx_n2n_handle_login
------------------------------------------------------------

There are two active timers here:
---------------------------------
timer = {
entry = {
next = 0xc000200e45889560,
pprev = 0xc000200e608a13e0
},
expires = 0x1018d52ea,
function = 0xc00800000b12c148 <qla2x00_timer>,
flags = 0x17800050
}

struct timer_list {
entry = {
next = 0x0,
pprev = 0xc000200e458b6c80
},
expires = 0x1018d52ea,
function = 0xc00800000b12c148 <qla2x00_timer>,
flags = 0x17800050
}

Our scsi_qla_host is:

crash> rd c000200e458b6a08 16
c000200e458b6a08:  5f78787832616c71 0000000000000033   qla2xxx_3.......
                                                       ^^^^^^^^^^

The host seems to be managing several fc_ports, and I am going to
catalog their port names...

List of fc_ports
================
c000000fe2a074e8
c000000fe2a09f68
c000000fe2a0af58

crash> rd 0xc000000fe2a09f68 8 <FCP#1>
c000000fe2a09f68:  c000000fe2a0af58 c000200e458b69b0   X........i.E. ..
c000000fe2a09f78:  c000200e458b69a0 63a1000168070550   .i.E. ..P..h...c
                                    ^^^node_name^^^^
c000000fe2a09f88:  63a1100168070550 00000002008cf9c0   P..h...c........
                   ^^^port_name^^^^
c000000fe2a09f98:  0000000000000212 0000000000000000   ................

crash> rd c000000fe2a0af58 8 <FCP#2>
c000000fe2a0af58:  c000000fe2a074e8 c000000fe2a09f68   .t......h.......
c000000fe2a0af68:  c000200e458b69a0 a5a1000168070550   .i.E. ..P..h....
                                    ^^^node_name^^^^
c000000fe2a0af78:  a5a1100168070550 00001000008cfdc0   P..h............
                   ^^^port_name^^^^
c000000fe2a0af88:  0000000000000014 0000000000000000   ................

crash> rd c000000fe2a074e8 8 <FCP#3>
c000000fe2a074e8:  c000200e458b69b0 c000000fe2a0af58   .i.E. ..X.......
c000000fe2a074f8:  c000200e458b69a0 7842cdfa90000020   .i.E. .. .....Bx
                                    ^^^node_name^^^^
c000000fe2a07508:  7842cdfa90000010 0000000000261800   ......Bx..&.....
                   ^^^port_name^^^^
c000000fe2a07518:  0000000000000213 0000000000000000   ................

So there are 3 fc_ports!!

And the problem occurred on FCP#2!

PROBLEM FCP
===========
FCP#2 = c000000fe2a0af58
port_name = a5a1100168070550
FCPort 50:05:07:68:01:10:a1:a5 (in the logs below)
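
The rd output prints each 64-bit word in host (little-endian) order, so
the byte at the lowest address is the rightmost pair of hex digits. A
small sketch of the conversion from the dumped word to the colon notation
used in the logs follows; the helper names are hypothetical and a
little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats a 64-bit word, as printed by "rd" on a little-endian host,
 * into colon-separated WWN notation. out must hold at least 24 bytes. */
static void rd_word_to_wwn(uint64_t word, char out[24])
{
    assert(out);
    for (int i = 0; i < 8; i++) {
        /* Byte i in memory is the i-th least significant byte of word. */
        unsigned byte = (unsigned)((word >> (8 * i)) & 0xff);
        snprintf(out + 3 * i, 4, "%02x%s", byte, i < 7 ? ":" : "");
    }
}

/* Checks a dumped word against a WWN string taken from the logs. */
static int wwn_matches(uint64_t word, const char *wwn)
{
    char buf[24];

    rd_word_to_wwn(word, buf);
    return strcmp(buf, wwn) == 0;
}
```

For the problem port, the word a5a1100168070550 comes out as
50:05:07:68:01:10:a1:a5, matching the FCPort string above.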

#########################################################################

<continues below>

------- Comment From [email protected] 2018-04-21 23:39 EDT-------
As mentioned before, the relevant usage of fc_port->free_work is:

void qlt_unreg_sess(struct fc_port *sess)
...
INIT_WORK(&sess->free_work, qlt_free_session_done);
schedule_work(&sess->free_work);

Since we have a problem with free_work here, let's determine how it is
supposed to be actually manipulated in this case.

Work insertion into the worklist
================================
qla24xx_delete_sess_fn   (and also tcm_qla2xxx_release_session)
-> qlt_unreg_sess(fcport)
   -> INIT_WORK(&sess->free_work, qlt_free_session_done);
      -> __init_work((_work), _onstack);
         (_work)->data = (atomic_long_t) WORK_DATA_INIT(); = 0xfffffffe0 -- DATA INIT!

   -> schedule_work(&sess->free_work)
      -> queue_work(system_wq, work)
         -> queue_work_on(WORK_CPU_UNBOUND, wq, work)
            __queue_work(cpu, wq, work);
               cpu = wq_select_unbound_cpu(raw_smp_processor_id());
               pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
               spin_lock(&pwq->pool->lock);
               insert_work(pwq, work, worklist, work_flags);
                  struct worker_pool *pool = pwq->pool;
                  set_work_pwq(work, pwq, extra_flags);
                     set_work_data(work, (unsigned long)pwq, WORK_STRUCT_PENDING | WORK_STRUCT_PWQ | extra_flags)
                        work->data = pwq | flags | work_static(work) --DATA PWQ
                  list_add_tail(&work->entry, head); <<<<-----NOTE

When a work item is taken off the worklist
==========================================
static int worker_thread(void *__worker)
-> struct worker_pool *pool = __worker->pool;
-> work = list_first_entry(&pool->worklist, struct work_struct, entry)
-> process_one_work(worker, work)
   -> list_del_init(&work->entry); <<<<<---------NOTE
   -> set_work_pool_and_clear_pending(work, pool->id);
      -> set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0)
         work->data = pool->id << 5 -- DATA CLEAR
   -> worker->current_func(work)
      ==
      qlt_free_session_done

I did some calculations and found that for:
-- cpu:0x68 (104) with id=0xd0, DATA CLEAR = 0x1a00 = SPECIAL MARKER
-- cpu:0x81 (129) with id=0x102, DATA CLEAR = 0x2040 = SPECIAL MARKER
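
The calculation is just the shift quoted in the DATA CLEAR step. A minimal
sketch, assuming a 64-bit build where WORK_OFFQ_POOL_SHIFT is 5 and the
pool id field is 31 bits wide (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define WORK_OFFQ_POOL_SHIFT 5   /* the "<< 5" quoted above           */
#define WORK_OFFQ_POOL_BITS  31  /* assumed width on a 64-bit build   */
#define WORK_OFFQ_POOL_NONE  ((1ULL << WORK_OFFQ_POOL_BITS) - 1)

/* Cross-check: the "no pool" id shifted into place gives the DATA INIT
 * value 0xfffffffe0 quoted in the insertion path. */
static_assert((WORK_OFFQ_POOL_NONE << WORK_OFFQ_POOL_SHIFT) == 0xfffffffe0ULL,
              "DATA INIT value should match WORK_DATA_INIT()");

/* Value left in work->data once PENDING is cleared for a given pool. */
static uint64_t offq_marker(uint64_t pool_id)
{
    return pool_id << WORK_OFFQ_POOL_SHIFT;
}
```

offq_marker(0xd0) is 0x1a00 and offq_marker(0x102) is 0x2040. Both pool
ids in this dump are twice the CPU number, which would be consistent with
two standard pools per CPU, but that is an inference from these two data
points and not something the dump states.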

=======================================================================================
Now, for our crash, it sure appears that the DATA CLEAR was set on cpu:0x81, but
the item is currently inserted in the worklist for cpu:0x68. THAT IS THE PROBLEM!
=======================================================================================
Let's now try to reconstruct the possible scenario for our case:

CPU #0x81                               CPU #0x68
=========                               =========
worker_thread(void *__worker)
-> process_one_work(worker, work)
-> list_del_init(&work->entry);

1. <W/FCP#2 deleted from pwq list[0x81];
   previous schedule_work on 0x81>

                                        qla24xx_delete_sess_fn
                                        -> qlt_unreg_sess(fcport)
                                        -> INIT_WORK

                                        2. (_work)->data = 0xfffffffe0 -- DATA INIT!

                                        -> schedule_work(&sess->free_work)
                                        -> queue_work(system_wq, work)
                                        -> queue_work_on(WORK_CPU_UNBOUND, wq, work)
                                        -> __queue_work(cpu, wq, work);
                                        -> insert_work(pwq, work, worklist, work_flags);
                                        -> set_work_pwq(work, pwq, extra_flags);

                                        3. work->data = pwq | WORK_STRUCT_PENDING |
                                           WORK_STRUCT_PWQ...

                                        -> list_add_tail(&work->entry, head);

                                        4. <W/FCP#2 inserted into pwq list[0x68]>

-> set_work_pool_and_clear_pending(work, pool->id);
-> set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0)

5. work->data = pool->id (==0x102 for cpu 0x81) << 5 == 0x2040 -- DATA CLEAR

                                        6. Now mayday timer walks across the [0x68] list,
                                           encounters this work and crashes!!
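
The ordering in steps 2 to 5 can be replayed in a tiny single-threaded
model. This is only a sketch of the suspected interleaving, not kernel
code: the struct and function names are hypothetical, the flag values are
assumed for a 4.15-era 64-bit build, and the pwq value is borrowed from
Work struct #1 on the same pool.

```c
#include <assert.h>
#include <stdint.h>

#define WORK_STRUCT_PENDING  (1ULL << 0)
#define WORK_STRUCT_PWQ      (1ULL << 2)
#define WORK_OFFQ_POOL_SHIFT 5

/* Hypothetical model of one work item. */
struct model_work {
    uint64_t data;    /* stands in for work->data                      */
    int on_list_cpu;  /* CPU whose worklist links the item, -1 if none */
};

/* Queuing side (steps 3 and 4): set_work_pwq() then list_add_tail(). */
static void model_queue(struct model_work *w, uint64_t pwq, int cpu)
{
    assert((pwq & 0xff) == 0);  /* pwq pointers leave the flag bits clear */
    w->data = pwq | WORK_STRUCT_PENDING | WORK_STRUCT_PWQ;
    w->on_list_cpu = cpu;
}

/* Execution side (step 5): set_work_pool_and_clear_pending(). */
static void model_clear_pending(struct model_work *w, uint64_t pool_id)
{
    w->data = pool_id << WORK_OFFQ_POOL_SHIFT;
}

/* Replays steps 2 to 5 in the suspected order and returns the result. */
static struct model_work replay_race(void)
{
    struct model_work w = { .data = 0xfffffffe0ULL, .on_list_cpu = -1 };

    model_queue(&w, 0xc000200e60eba300ULL, 0x68);  /* steps 3-4, CPU 0x68 */
    model_clear_pending(&w, 0x102);                /* step 5, CPU 0x81    */
    return w;
}
```

The replay ends with an item that is linked on the 0x68 list but whose
data is 0x2040 with the PWQ bit clear, which is the state found in the
dump.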

If the mayday timer had not walked into it, the next time the worker
thread on cpu 0x68 executed:

worker_thread(void *__worker)
-> process_one_work(worker, work)

it would have crashed when it tried to use work->data!! This tallies with
some of the other crashes we have seen:

8a:mon> t
[c000200e2a9bbd20] c000000000132f78 worker_thread+0x98/0x630
[c000200e2a9bbdc0] c00000000013bba8 kthread+0x1a8/0x1b0
[c000200e2a9bbe30] c00000000000b528 ret_from_kernel_thread+0x5c/0xb4
8a:mon> e
cpu 0x8a: Vector: 300 (Data Access) at [c000200e2a9bba10]
pc: c00000000013297c: process_one_work+0x3c/0x5a0
lr: c000000000132f78: worker_thread+0x98/0x630
sp: c000200e2a9bbc90
msr: 9000000000009033
dar: 8
dsisr: 40000000
current = 0xc000200e2a0c2e00
paca    = 0xc000000007a7ee00   softe: 0        irq_happened: 0x01
pid   = 111847, comm = kworker/138:1
Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 4.15.0-15.16-generic 4.15.15)

Note, the prior
worker->current_func(work)
==
qlt_free_session_done
would proceed since it is not expected to access work->data.

The key here is that the pool->lock is not going to protect us in this
case, since the two paths involve different pools/CPUs and therefore
different locks. In this case, the user of the workqueue (here the
driver) HAS TO BE CAREFUL in its reuse of the work structure.
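
As an illustration of what "careful" could mean, here is a userspace
sketch of one possible guard: only the first caller is allowed to
initialize and queue the work, and later callers back off. This is a
hypothetical model using C11 atomics, not the driver's code and not a
claim about how the driver should actually be fixed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the session; not the driver's fc_port. */
struct model_sess {
    atomic_flag free_queued;  /* set once the free work has been queued */
    int times_queued;         /* how often the work was handed over     */
};

static void model_sess_init(struct model_sess *s)
{
    assert(s);
    atomic_flag_clear(&s->free_queued);
    s->times_queued = 0;
}

/* Returns true if this caller won the race and queued the work. */
static bool model_unreg_sess(struct model_sess *s)
{
    assert(s);
    if (atomic_flag_test_and_set(&s->free_queued))
        return false;   /* already queued: do not re-init or re-queue   */
    s->times_queued++;  /* stands in for INIT_WORK() + schedule_work()  */
    return true;
}
```

With such a guard, two back-to-back unreg calls on the same session would
result in a single queued work item instead of a re-initialized one.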

To test whether different instances of the work->func can run at the
same time, I looked for multiple instances in the dump.
And voila!

PID: 78610  TASK: c00020003769db80  CPU: 104  COMMAND: "kworker/104:3"
#0 [c0002006c825b940] __schedule at c000000000d05d24
#1 [c0002006c825ba10] schedule at c000000000d065b0
#2 [c0002006c825ba30] schedule_timeout at c000000000d0b3d0
#3 [c0002006c825bb30] msleep at c0000000001b5e2c
#4 [c0002006c825bb60] qlt_free_session_done at c00800000b1afaf0 [qla2xxx]
#5 [c0002006c825bc90] process_one_work at c000000000132bd8
#6 [c0002006c825bd20] worker_thread at c000000000132f78
#7 [c0002006c825bdc0] kthread at c00000000013bba8
#8 [c0002006c825be30] ret_from_kernel_thread at c00000000000b528

PID: 101342  TASK: c00020066bd84b80  CPU: 129  COMMAND: "kworker/129:22"
#0 [c0000008838c7940] __schedule at c000000000d05d24
#1 [c0000008838c7a10] schedule at c000000000d065b0
#2 [c0000008838c7a30] schedule_timeout at c000000000d0b3d0
#3 [c0000008838c7b30] msleep at c0000000001b5e2c
#4 [c0000008838c7b60] qlt_free_session_done at c00800000b1afaf0 [qla2xxx]
#5 [c0000008838c7c90] process_one_work at c000000000132bd8
#6 [c0000008838c7d20] worker_thread at c000000000132f78
#7 [c0000008838c7dc0] kthread at c00000000013bba8
#8 [c0000008838c7e30] ret_from_kernel_thread at c00000000000b528

Both of the threads stuck here in qlt_free_session_done() are for FCP#1
(c000000fe2a09f68, node_name 63a1000168070550). This is not the problem
fc_port, but the point is to show that the same work function can
execute simultaneously on two worker threads.

for PID 101342
--------------
crash> work_struct c000000fe2a0a028
struct work_struct {
data = {
counter = 0x2040
},
entry = {
next = 0xc000000fe2a0a030,
prev = 0xc000000fe2a0a030
},
func = 0xc00800000b1af5a8 <qlt_free_session_done>
}

And also for PID 78610
----------------------
crash> work_struct c000000fe2a0a028
struct work_struct {
data = {
counter = 0x2040
},
entry = {
next = 0xc000000fe2a0a030,
prev = 0xc000000fe2a0a030
},
func = 0xc00800000b1af5a8 <qlt_free_session_done>
}

As can be seen, these two threads are operating on the same work object.
Basically the two threads are stuck in qlt_free_session_done - the
callback - at:

while (!READ_ONCE(sess->logout_completed)) {
        ...
        msleep(100);
        ...
}

Now having these two works existing at the same time is not an error
per se but it requires care in the scheduling (and general sync
requirements) to ensure that two parallel threads don't trample
each other.

In this case, unfortunately, it appears that the threads on CPU 104 and
CPU 129 interfered with each other in the scheduling/execution of the
work object. CPU 129 was executing the worker function when the qla2xxx
driver scheduled another instance of the same work object. As a result,
after the object was deleted from 129's queue it was inserted into
104's queue, and work->data was first written with the correct pwq
info (from the scheduling/insertion path) but then overwritten with
the marker value from the execution/deletion path. We are therefore
left with a work object on the list that carries the special value
instead of the proper (pwq) queue value.

##########################################################################

From the logs, the last action on the problem fc_port:

One callback/session unreg done
===============================
[104457.977481] qla2xxx [0030:01:00.1]-f887:3: qlt_free_session_done: sess 00000000c6388dff logout completed

Two work items have been queued
===============================
[104458.678494] qla2xxx [0030:01:00.1]-290a:3: qlt_unreg_sess sess 00000000c6388dff for deletion 50:05:07:68:01:10:a1:a5
[104458.937849] qla2xxx [0030:01:00.1]-290a:3: qlt_unreg_sess sess 00000000c6388dff for deletion 50:05:07:68:01:10:a1:a5

For our problem fc_port, we see these last two work scheduling actions
VERY CLOSE to each other. We don't have any record of the execution
of the work function (but the particular target management entry
point was also not being traced).

On boslcp3, the logs have shown hung tasks, hard/soft lockups and all
kinds of other mayhem in previous runs. These have often correlated
with guest crashes etc. I suspect some of these are related to the
issue we are dealing with (when a worker thread is sent into la-la
land while holding a lock, say, who knows what the repercussions will be).

It is, of course, possible that there are additional issues lurking
underneath.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1762844

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into
  xmon after moving to 4.15.0-15.16 kernel

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Bionic:
  Triaged

Bug description:
  Problem Description:
  ===================
  Host crashed & enters into xmon after updating to the 4.15.0-15.16 kernel.

  Steps to re-create:
  ==================

  1. boslcp3 is up with BMC:118 & PNOR: 20180330 levels
  2. Installed boslcp3 with latest kernel 
      4.15.0-13-generic 
  3. Enabled "-proposed" kernel in /etc/apt/sources.list file
  4. Ran sudo apt-get update & apt-get upgrade

  5. root@boslcp3:~# ls /boot
  abi-4.15.0-13-generic         retpoline-4.15.0-13-generic
  abi-4.15.0-15-generic         retpoline-4.15.0-15-generic
  config-4.15.0-13-generic      System.map-4.15.0-13-generic
  config-4.15.0-15-generic      System.map-4.15.0-15-generic
  grub                          vmlinux
  initrd.img                    vmlinux-4.15.0-13-generic
  initrd.img-4.15.0-13-generic  vmlinux-4.15.0-15-generic
  initrd.img-4.15.0-15-generic  vmlinux.old
  initrd.img.old

  6. Rebooted & booted with 4.15.0-15 kernel
  7. Enabled xmon by editing file "vi /etc/default/grub" and ran update-grub
  8. Rebooted host.
  9. Booted with 4.15.0-15 & provided root/password credentials at the login prompt

  10. Host crashed & enters into XMON state with 'Unable to handle
  kernel paging request'

  root@boslcp3:~# [   66.295233] Unable to handle kernel paging request for data at address 0x8882f6ed90e9151a
  [   66.295297] Faulting instruction address: 0xc00000000038a110
  cpu 0x50: Vector: 380 (Data Access Out of Range) at [c00000000692f650]
      pc: c00000000038a110: kmem_cache_alloc_node+0x2f0/0x350
      lr: c00000000038a0fc: kmem_cache_alloc_node+0x2dc/0x350
      sp: c00000000692f8d0
     msr: 9000000000009033
     dar: 8882f6ed90e9151a
    current = 0xc00000000698fd00
    paca    = 0xc00000000fab7000   softe: 0        irq_happened: 0x01
      pid   = 1762, comm = systemd-journal
  Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 4.15.0-15.16-generic 4.15.15)
  enter ? for help
  [c00000000692f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
  [c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c00000000692f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0
  [c00000000692fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c00000000692fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c00000000692fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c00000000692fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c00 (System Call) at 000074826f6fa9c4
  SP (7ffff5dc5510) is in userspace
  50:mon>
  50:mon>

  11. Attached Host console logs

  I rebooted the host just to see if it would hit the issue again and
  this time I didn't even get to the login prompt but it crashed in the
  same location:

  50:mon> r
  R00 = c000000000389fd4   R16 = c000200e0b20fdc0
  R01 = c000200e0b20f8d0   R17 = 0000000000000048
  R02 = c0000000016eb400   R18 = 000000000001fe80
  R03 = 0000000000000001   R19 = 0000000000000000
  R04 = 0048ca1cff37803d   R20 = 0000000000000000
  R05 = 0000000000000688   R21 = 0000000000000000
  R06 = 0000000000000001   R22 = 0000000000000048
  R07 = 0000000000000687   R23 = 4882d6e3c8b7ab55
  R08 = 48ca1cff37802b68   R24 = c000200e5851df01
  R09 = 0000000000000000   R25 = 8882f6ed90e67454
  R10 = 0000000000000000   R26 = c000000000b2ec6c
  R11 = c000000000d10f78   R27 = c000000ff901ee00
  R12 = 0000000000002000   R28 = ffffffffffffffff
  R13 = c00000000fab7000   R29 = 00000000015004c0
  R14 = c000200e4c973fc8   R30 = c000200e5851df01
  R15 = c000200e4c974238   R31 = c000000ff901ee00
  pc  = c00000000038a110 kmem_cache_alloc_node+0x2f0/0x350
  cfar= c000000000016e1c arch_local_irq_restore+0x1c/0x90
  lr  = c00000000038a0fc kmem_cache_alloc_node+0x2dc/0x350
  msr = 9000000000009033   cr  = 28002844
  ctr = c00000000061e1b0   xer = 0000000000000000   trap =  380
  dar = 8882f6ed90e67454   dsisr = c000200e40bd8400
  50:mon> t
  [c000200e0b20f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
  [c000200e0b20f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c000200e0b20f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c000200e0b20fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c000200e0b20fae0] c000000000c56ae4 unix_stream_sendmsg+0x264/0x5c0
  [c000200e0b20fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c000200e0b20fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c000200e0b20fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c000200e0b20fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c01 (System Call) at 00007d16a993a940
  SP (7ffffbee2270) is in userspace

  Mirroring to Canonical to advise them that this might be a possible
  regression. Didn't see any obvious changes in this area in the
  changelog published at
  https://launchpad.net/ubuntu/+source/linux/4.15.0-15.16 but it would
  be good to have Canonical help reviewing the deltas as we try to
  isolate this further.

