------- Comment From jac...@de.ibm.com 2017-03-13 09:57 EDT-------
Next iteration: I was able to make OpenAFS for the proposed kernel fixed and so 
I was able to start the next pdebuild job .... and again I run into a hang 
scenario:

Mar 13 14:30:32 mclint multipathd[881]: mpatha: sda - tur checker timed out
Mar 13 14:30:32 mclint multipathd[881]: 8:0: reinstated
Mar 13 14:30:32 mclint multipathd[881]: mpatha: sdc - tur checker timed out
Mar 13 14:30:32 mclint multipathd[881]: 8:32: reinstated
Mar 13 14:30:33 mclint multipathd[881]: mpatha: sdb - tur checker timed out
Mar 13 14:30:33 mclint multipathd[881]: 8:16: reinstated
Mar 13 14:30:36 mclint multipathd[881]: mpatha: sdd - tur checker timed out
Mar 13 14:30:36 mclint rsyslogd-2007: action 'action 10' suspended, next retry 
is Mon Mar 13 14:32:06 2017 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Mar 13 14:30:36 mclint multipathd[881]: 8:48: reinstated

I just pulled a dump and compress it, eventually I'll upload it to Box.
I'll go on and try to reproduce this without OpenAFS and I'm very
confident that this hang is not related to AFS at all ...

General dump info:
Dump format........: s390mv
Version............: 5
Dump created.......: Mon, 13 Mar 2017 14:42:39 +0100
Dump ended.........: Mon, 13 Mar 2017 14:43:17 +0100
Dump CPU ID........: 9efc729648000
UTS node name......: mclint
UTS kernel release.: 4.4.0-67-generic
UTS kernel version.: #88 SMP Wed Mar 8 14:48:51 UTC 2017
Build arch.........: s390x (64 bit)
System arch........: s390x (64 bit)
CPU count (online).: 2
CPU count (real)...: 4
Dump memory range..: 8192 MB
Real memory range..: 8192 MB

Memory map:
0000000000000000 - 00000001b831afff (7043 MB)
00000001b831b000 - 00000001ffffffff (1149 MB)

Dump device info:
Volume 0: 0.0.8409 (online/active)
Volume 1: 0.0.840a (online/active)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1670634

Title:
  blk-mq: possible deadlock on CPU hot(un)plug

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - Carsten Jacobi <jac...@de.ibm.com> - 2017-03-07 03:35:31 ==
  I'm evaluating Ubuntu-Xenial on z for development purposes, the test system 
is installed in an LPAR with one FCP-LUN which is accessable by 4 pathes (all 
pathes are configured).
  The system hangs regularly when I make packages with "pdebuild" using the 
pbuilder packaging suit.
  The local Linux development team helped me out with a pre-analysis that I can 
post here (thanks a lot for that):

  With the default settings and under a certain workload,
  blk_mq seems to get into a presumed "deadlock".
  Possibly this happens on CPU hot(un)plug.

  After the I/O stalled, a dump was pulled manually.
  The following information is from the crash dump pre-analysis.

  $ zgetdump -i dump.0
  General dump info:
    Dump format........: elf
    Version............: 1
    UTS node name......: mclint
    UTS kernel release.: 4.4.0-65-generic
    UTS kernel version.: #86-Ubuntu SMP Thu Feb 23 17:54:37 UTC 2017
    System arch........: s390x (64 bit)
    CPU count (online).: 2
    Dump memory range..: 8192 MB
  Memory map:
    0000000000000000 - 00000001b831afff (7043 MB)
    00000001b831b000 - 00000001ffffffff (1149 MB)

  Things look similarly with HWE kernel ubuntu16.04-4.8.0-34.36~16.04.1.

        KERNEL: vmlinux.full
      DUMPFILE: dump.0
          CPUS: 2
          DATE: Fri Mar  3 14:31:07 2017
        UPTIME: 02:11:20
  LOAD AVERAGE: 13.00, 12.92, 11.37
         TASKS: 411
      NODENAME: mclint
       RELEASE: 4.4.0-65-generic
       VERSION: #86-Ubuntu SMP Thu Feb 23 17:54:37 UTC 2017
       MACHINE: s390x  (unknown Mhz)
        MEMORY: 7.8 GB
         PANIC: ""
           PID: 0
       COMMAND: "swapper/0"
          TASK: bad528  (1 of 2)  [THREAD_INFO: b78000]
           CPU: 0
         STATE: TASK_RUNNING (ACTIVE)
          INFO: no panic task found

  crash> dev -d
  MAJOR GENDISK            NAME       REQUEST_QUEUE      TOTAL ASYNC  SYNC   DRV
  ...
      8 1e1d6d800          sda        1e1d51210              0 23151 4294944145 
N/A(MQ)
      8 1e4e06800          sdc        2081b18                0 23148 4294944148 
N/A(MQ)
      8 1f07800            sdb        20c7568                0 23195 4294944101 
N/A(MQ)
      8 1e4e06000          sdd        1e4e31210              0 23099 4294944197 
N/A(MQ)
    252 1e1d6c800          dm-0       1e1d51b18              9     1     8 
N/A(MQ)
  ...

  So both dm-mpath and sd have requests pending in their block multiqueue.
  The large numbers of sd look strange and seem to be the unsigned formatting 
of the values shown for async multiplied by -1.

  [    0.798256] Linux version 4.4.0-65-generic (buildd@z13-011) (gcc version 
5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #86-Ubuntu SMP Thu Feb 23 
17:54:37 UTC 2017 (Ubuntu 4.4.0-65.86-generic 4.4.49)
  [    0.798262] setup: Linux is running natively in 64-bit mode
  [    0.798290] setup: Max memory size: 8192MB
  [    0.798298] setup: Reserving 196MB of memory at 7996MB for crashkernel 
(System RAM: 7996MB)

  [    0.836923] Kernel command line: root=/dev/mapper/mclint_vg-root
  rootflags=subvol=@ crashkernel=196M BOOT_IMAGE=0

  [ 5281.179428] INFO: task xfsaild/dm-11:1604 blocked for more than 120 
seconds.
  [ 5281.179437]       Not tainted 4.4.0-65-generic #86-Ubuntu
  [ 5281.179438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 5281.179440] xfsaild/dm-11   D 00000000007bcf52     0  1604      2 
0x00000000
  [ 5281.179444]        00000001e931c230 00000000001a6964 00000001e6f9b958 
00000001e6f9b9d8
                        00000001e15795f0 00000001e6f9b988 0000000000ce8c00 
00000001ea805c70
                        00000001ea805c00 0000000000ba5ed0 00000001e931c1d0 
00000001e1579b20
                        00000001ea805c00 00000001e15795f0 00000001ea805c00 
0000000000000000
                        00000000007d3978 00000000007bc9f8 00000001e6f9b9d8 
00000001e6f9ba40
  [ 5281.179454] Call Trace:
  [ 5281.179461] ([<00000000007bc9f8>] __schedule+0x300/0x810)
  [ 5281.179462]  [<00000000007bcf52>] schedule+0x4a/0xb0
  [ 5281.179465]  [<00000000007c02aa>] schedule_timeout+0x232/0x2a8
  [ 5281.179466]  [<00000000007bde50>] wait_for_common+0x110/0x1c8
  [ 5281.179472]  [<000000000017b602>] flush_work+0x42/0x58
  [ 5281.179564]  [<000003ff805e14ba>] xlog_cil_force_lsn+0x7a/0x238 [xfs]
  [ 5281.179589]  [<000003ff805dee82>] _xfs_log_force+0x9a/0x2e8 [xfs]
  [ 5281.179615]  [<000003ff805df114>] xfs_log_force+0x44/0x100 [xfs]
  [ 5281.179640]  [<000003ff805ec668>] xfsaild+0x170/0x798 [xfs]
  [ 5281.179643]  [<000000000018335a>] kthread+0x10a/0x110
  [ 5281.179645]  [<00000000007c0ff6>] kernel_thread_starter+0x6/0xc
  [ 5281.179646]  [<00000000007c0ff0>] kernel_thread_starter+0x0/0xc

  see below

  [ 5281.179664] INFO: task cpuplugd:2260 blocked for more than 120 seconds.
  [ 5281.179665]       Not tainted 4.4.0-65-generic #86-Ubuntu
  [ 5281.179666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 5281.179668] cpuplugd        D 00000000007bcf52     0  2260      1 
0x00000000
  [ 5281.179670]        00000001e782e318 00000000001a6964 000000007bc4ba58 
000000007bc4bad8
                        00000001d4076db0 000000007bc4ba88 0000000000ce8c00 
00000001ea805c70
                        00000001ea805c00 0000000000ba5ed0 00000001e782e2b8 
00000001d40772e0
                        00000001ea805c00 00000001d4076db0 00000001ea805c00 
0000000000000000
                        00000000007d3978 00000000007bc9f8 000000007bc4bad8 
000000007bc4bb40
  [ 5281.179678] Call Trace:
  [ 5281.179680] ([<00000000007bc9f8>] __schedule+0x300/0x810)
  [ 5281.179681]  [<00000000007bcf52>] schedule+0x4a/0xb0
  [ 5281.179685]  [<0000000000516cc2>] blk_mq_freeze_queue_wait+0x62/0xc8
  [ 5281.179687]  [<0000000000519412>] blk_mq_queue_reinit_notify+0x11a/0x240
  [ 5281.179690]  [<00000000001844c6>] notifier_call_chain+0x56/0x98
  [ 5281.179692]  [<000000000018466a>] __raw_notifier_call_chain+0x2a/0x38
  [ 5281.179696]  [<00000000001605ac>] _cpu_up+0x10c/0x1b0
  [ 5281.179698]  [<0000000000160738>] cpu_up+0xe8/0x108
  [ 5281.179700]  [<00000000005d08be>] cpu_subsys_online+0x56/0xb0
  [ 5281.179703]  [<00000000005ca1c2>] device_online+0x82/0xc0
  [ 5281.179704]  [<00000000005ca28a>] online_store+0x8a/0x98
  [ 5281.179710]  [<00000000003a4d12>] kernfs_fop_write+0x13a/0x190
  [ 5281.179712]  [<000000000031218c>] vfs_write+0x94/0x1a0
  [ 5281.179714]  [<0000000000312e9e>] SyS_write+0x66/0xd8
  [ 5281.179715]  [<00000000007c0e3e>] system_call+0xd6/0x264
  [ 5281.179718]  [<000003ff803df478>] zlib_tr_flush_block+0x650/0x830 
[zlib_deflate]

  Cpuplugd performs CPU hot(un)plug based on configurable rules.
  
https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ludd/ludd_r_cpuplugdcmd.html
  
https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ludd/ludd_t_cpu_act.html
  
https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ludd/ludd_r_numa_know_cpu.html

  [ 5281.179769] INFO: task kworker/0:2:23669 blocked for more than 120 seconds.
  [ 5281.179770]       Not tainted 4.4.0-65-generic #86-Ubuntu
  [ 5281.179771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 5281.179773] kworker/0:2     D 00000000007bcf52     0 23669      2 
0x00000000
  [ 5281.179801] Workqueue: xfs-cil/dm-11 xlog_cil_push_work [xfs]
  [ 5281.179802]        00000001c925e318 00000000001a6964 0000000068c5f9f8 
0000000068c5fa78
                        00000001da99b6d8 0000000068c5fa10 0000000000ce8c00 
00000001ea805c70
                        00000001ea805c00 00000001e782e94c 00000001c925e2b8 
00000001da99bc08
                        00000001ea805c00 00000001da99b6d8 00000001ea805c00 
0000000000000000
                        00000000007d3978 00000000007bc9f8 0000000068c5fa78 
0000000068c5fae0
  [ 5281.179810] Call Trace:
  [ 5281.179812] ([<00000000007bc9f8>] __schedule+0x300/0x810)
  [ 5281.179813]  [<00000000007bcf52>] schedule+0x4a/0xb0
  [ 5281.179839]  [<000003ff805de144>] xlog_state_get_iclog_space+0x124/0x338 
[xfs]
  [ 5281.179864]  [<000003ff805de702>] xlog_write+0x1ea/0x800 [xfs]
  [ 5281.179890]  [<000003ff805e09a6>] xlog_cil_push+0x286/0x508 [xfs]
  [ 5281.179891]  [<000000000017c400>] process_one_work+0x1a0/0x4f8
  [ 5281.179893]  [<000000000017c7a2>] worker_thread+0x4a/0x530
  [ 5281.179894]  [<000000000018335a>] kthread+0x10a/0x110
  [ 5281.179896]  [<00000000007c0ff6>] kernel_thread_starter+0x6/0xc
  [ 5281.179898]  [<00000000007c0ff0>] kernel_thread_starter+0x0/0xc

  While above kworker executes a work item for a long duration, other
  processes block in turn on flush_work for a long duration.

  [ 5281.179728] INFO: task kworker/0:1:4454 blocked for more than 120 seconds.
  [ 5281.179730]       Not tainted 4.4.0-65-generic #86-Ubuntu
  [ 5281.179731] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 5281.179732] kworker/0:1     D 00000000007bcf52     0  4454      2 
0x00000000
  [ 5281.179738] Workqueue: events vmstat_shepherd
  [ 5281.179739]        00000001c925ac40 00000000001a6964 000000007eb8bb38 
000000007eb8bbb8
                        00000001e782e2b8 000000007eb8bb50 0000000000ce8c00 
00000001ea805c70
                        00000001c925abe0 00000001c925b274 00000001c925abe0 
00000001e782e7e8
                        00000001ea805c00 00000001e782e2b8 00000001ea805c00 
0000000000000000
                        00000000007d3978 00000000007bc9f8 000000007eb8bbb8 
000000007eb8bc20
  [ 5281.179747] Call Trace:
  [ 5281.179749] ([<00000000007bc9f8>] __schedule+0x300/0x810)
  [ 5281.179750]  [<00000000007bcf52>] schedule+0x4a/0xb0
  [ 5281.179752]  [<00000000007bd39a>] schedule_preempt_disabled+0x2a/0x38
  [ 5281.179753]  [<00000000007becc4>] __mutex_lock_slowpath+0xcc/0x170
  [ 5281.179755]  [<00000000007bedc6>] mutex_lock+0x5e/0x78
  [ 5281.179756]  [<000000000015fba0>] get_online_cpus+0x40/0x68
  [ 5281.179757]  [<00000000002a3ccc>] vmstat_shepherd+0x44/0x168
  [ 5281.179759]  [<000000000017c400>] process_one_work+0x1a0/0x4f8
  [ 5281.179761]  [<000000000017c7a2>] worker_thread+0x4a/0x530
  [ 5281.179762]  [<000000000018335a>] kthread+0x10a/0x110
  [ 5281.179764]  [<00000000007c0ff6>] kernel_thread_starter+0x6/0xc
  [ 5281.179765]  [<00000000007c0ff0>] kernel_thread_starter+0x0/0xc

  This work item cannot progress maybe because cpuplugd:2260 above
  "hangs" in the cpu hotplug notifier chain.

  The low level device driver (here zfcp) is completely idle without any
  pending I/O after the lockup happened and all its paths are in good
  state and could service I/O but it simply does not get any new I/O
  requests from the upper layers (scsi / block). Zfcp does not implement
  blk_mq so dm or scsi translate which works in general but fails with
  above workload. There were no other undesired events, i.e. no path
  interruptions nor any recovery in zfcp.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1670634/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to