Hi, we have a very strange issue with one volume. We are using LVM and DRBD with a diskless replica.
After we start the workload on the diskless node it works fine, but after some time all I/O hangs and we get the following messages in dmesg.

From m8c25 (diskless node):

```
[Sun Mar 10 05:42:28 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 6
[Sun Mar 10 05:42:28 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 6
[Sun Mar 10 05:42:34 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 5
[Sun Mar 10 05:42:34 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 6
[Sun Mar 10 05:42:40 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 4
[Sun Mar 10 05:42:40 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 5
[Sun Mar 10 05:42:46 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 3
[Sun Mar 10 05:42:46 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 4
[Sun Mar 10 05:42:52 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 2
[Sun Mar 10 05:42:52 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 3
[Sun Mar 10 05:42:59 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 1
[Sun Mar 10 05:42:59 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 2
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104 m12c4: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 1
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: ack_receiver terminated
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: Terminating ack_recv thread
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: sending new current UUID: C07BEF398E0AD984
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: Preparing remote state change 1414712872
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: Committing remote state change 1414712872 (primary_nodes=8)
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104 m12c4: pdsk( DUnknown -> Outdated )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104 m7c14: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: Connection closed
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: conn( NetworkFailure -> Unconnected )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: Restarting receiver thread
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: IO ERROR: neither local nor remote data, sector 25237760+8
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: conn( Unconnected -> Connecting )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: IO ERROR: neither local nor remote data, sector 1001578856+8
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: IO ERROR: neither local nor remote data, sector 713186240+8
```

From m7c14 (first data node):

```
[Sun Mar 10 05:45:34 2019] INFO: task kworker/u16:3:3676 blocked for more than 120 seconds.
[Sun Mar 10 05:45:34 2019] Tainted: P O 4.15.18-7-pve #1
[Sun Mar 10 05:45:34 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:45:34 2019] kworker/u16:3 D 0 3676 2 0x80000000
[Sun Mar 10 05:45:34 2019] Workqueue: drbd1104_submit do_submit [drbd]
[Sun Mar 10 05:45:34 2019] Call Trace:
[Sun Mar 10 05:45:34 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:45:34 2019]  schedule+0x36/0x80
[Sun Mar 10 05:45:34 2019]  do_submit+0x325/0x5f0 [drbd]
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:45:34 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:45:34 2019]  process_one_work+0x1e0/0x400
[Sun Mar 10 05:45:34 2019]  ? __drbd_make_request+0x4e0/0x4e0 [drbd]
[Sun Mar 10 05:45:34 2019]  ? process_one_work+0x1e0/0x400
[Sun Mar 10 05:45:34 2019]  worker_thread+0x4b/0x420
[Sun Mar 10 05:45:34 2019]  kthread+0x105/0x140
[Sun Mar 10 05:45:34 2019]  ? process_one_work+0x400/0x400
[Sun Mar 10 05:45:34 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:45:34 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:45:34 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:45:34 2019]  ret_from_fork+0x35/0x40
[Sun Mar 10 05:45:34 2019] INFO: task drbd_r_csi-9400:18969 blocked for more than 120 seconds.
[Sun Mar 10 05:45:34 2019] Tainted: P O 4.15.18-7-pve #1
[Sun Mar 10 05:45:34 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
```
```
[Sun Mar 10 05:45:34 2019] drbd_r_csi-9400 D 0 18969 2 0x80000004
[Sun Mar 10 05:45:34 2019] Call Trace:
[Sun Mar 10 05:45:34 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:45:34 2019]  schedule+0x36/0x80
[Sun Mar 10 05:45:34 2019]  conn_wait_ee_empty+0x72/0x9f [drbd]
[Sun Mar 10 05:45:34 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:45:34 2019]  conn_disconnect+0x1c0/0x780 [drbd]
[Sun Mar 10 05:45:34 2019]  ? w_flush+0x50/0x50 [drbd]
[Sun Mar 10 05:45:34 2019]  drbd_receiver+0x2a2/0x670 [drbd]
[Sun Mar 10 05:45:34 2019]  drbd_thread_setup+0x76/0x180 [drbd]
[Sun Mar 10 05:45:34 2019]  kthread+0x105/0x140
[Sun Mar 10 05:45:34 2019]  ? __drbd_next_peer_device_ref+0x150/0x150 [drbd]
[Sun Mar 10 05:45:34 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:45:34 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:45:34 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:45:34 2019]  ret_from_fork+0x35/0x40
```

From m7c14 (second data node):

```
[Sun Mar 10 05:46:55 2019] INFO: task drbd_r_csi-9400:21586 blocked for more than 120 seconds.
[Sun Mar 10 05:46:55 2019] Tainted: P O 4.15.18-7-pve #1
[Sun Mar 10 05:46:55 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:46:55 2019] drbd_r_csi-9400 D 0 21586 2 0x80000000
[Sun Mar 10 05:46:55 2019] Call Trace:
[Sun Mar 10 05:46:55 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:46:55 2019]  schedule+0x36/0x80
[Sun Mar 10 05:46:55 2019]  receive_peer_dagtag+0x162/0x2f0 [drbd]
[Sun Mar 10 05:46:55 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:46:55 2019]  ? got_twopc_reply+0x1f0/0x1f0 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_receiver+0x4ed/0x670 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_thread_setup+0x76/0x180 [drbd]
[Sun Mar 10 05:46:55 2019]  kthread+0x105/0x140
[Sun Mar 10 05:46:55 2019]  ? __drbd_next_peer_device_ref+0x150/0x150 [drbd]
[Sun Mar 10 05:46:55 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:46:55 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:46:55 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:46:55 2019]  ret_from_fork+0x35/0x40
[Sun Mar 10 05:46:55 2019] INFO: task drbd_r_csi-9400:18504 blocked for more than 120 seconds.
[Sun Mar 10 05:46:55 2019] Tainted: P O 4.15.18-7-pve #1
[Sun Mar 10 05:46:55 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:46:55 2019] drbd_r_csi-9400 D 0 18504 2 0x80000004
[Sun Mar 10 05:46:55 2019] Call Trace:
[Sun Mar 10 05:46:55 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:46:55 2019]  schedule+0x36/0x80
[Sun Mar 10 05:46:55 2019]  conn_wait_ee_empty+0x72/0x9f [drbd]
[Sun Mar 10 05:46:55 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:46:55 2019]  conn_disconnect+0x1c0/0x780 [drbd]
[Sun Mar 10 05:46:55 2019]  ? w_flush+0x50/0x50 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_receiver+0x2a2/0x670 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_thread_setup+0x76/0x180 [drbd]
[Sun Mar 10 05:46:55 2019]  kthread+0x105/0x140
[Sun Mar 10 05:46:55 2019]  ? __drbd_next_peer_device_ref+0x150/0x150 [drbd]
[Sun Mar 10 05:46:55 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:46:55 2019]  ret_from_fork+0x35/0x40
[Sun Mar 10 05:46:55 2019] INFO: task kworker/u16:0:18946 blocked for more than 120 seconds.
[Sun Mar 10 05:46:55 2019] Tainted: P O 4.15.18-7-pve #1
[Sun Mar 10 05:46:55 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:46:55 2019] kworker/u16:0 D 0 18946 2 0x80000000
[Sun Mar 10 05:46:55 2019] Workqueue: drbd1104_submit do_submit [drbd]
[Sun Mar 10 05:46:55 2019] Call Trace:
[Sun Mar 10 05:46:55 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:46:55 2019]  schedule+0x36/0x80
[Sun Mar 10 05:46:55 2019]  do_submit+0x325/0x5f0 [drbd]
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:46:55 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:46:55 2019]  process_one_work+0x1e0/0x400
[Sun Mar 10 05:46:55 2019]  ? __drbd_make_request+0x4e0/0x4e0 [drbd]
[Sun Mar 10 05:46:55 2019]  ? process_one_work+0x1e0/0x400
[Sun Mar 10 05:46:55 2019]  worker_thread+0x4b/0x420
[Sun Mar 10 05:46:55 2019]  kthread+0x105/0x140
[Sun Mar 10 05:46:55 2019]  ? process_one_work+0x400/0x400
[Sun Mar 10 05:46:55 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:46:55 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:46:55 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:46:55 2019]  ret_from_fork+0x35/0x40
```

`drbdadm status` shows `NetworkFailure` on the diskless node:

```
csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 role:Secondary
  disk:Consistent
  m7c14 role:Secondary
    peer-disk:Consistent
  m8c25 connection:NetworkFailure
  m8c29 role:Secondary
    peer-disk:Diskless
```

`linstor r l` shows `Consistent` for the diskless node:

```
| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 | m12c4 | 7158 | Unused | Consistent |
| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 | m7c14 | 7158 | Unused | UpToDate   |
| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 | m8c25 | 7158 | Unused | Diskless   |
| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 | m8c29 | 7158 | Unused | Diskless   |
```

Any `drbdadm down` or `drbdadm disconnect` operation gets stuck on the data nodes and never completes; only a reboot helps.

We had this issue with one volume, so we created another one, copied all the files from the old volume to the new one, and ran the workload on the new volume. The problem occurred with the new volume too, even though different nodes with different hardware were used.

Environment:

backend: lvm
filesystem: ext4
linstor version: v0.7.5
kernel version: 4.15.18-7-pve
drbd version: version: 9.0.14-1 (api:2/proto:86-113)
GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by @gitlab-runner-docker1-0, 2018-10-09 16:33:22
Transports (api:16): tcp (9.0.14-1)

Could you please explain where exactly the problem could be and where to dig further? Thanks in advance!
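P.S. To notice this state earlier on our other resources, we now flag any connection that reports an explicit non-Connected state in the `drbdadm status` output. This is just a small grep/awk sketch: the sample text below is copied from the status shown above (the `status_sample` variable is only for illustration); on a live node you would pipe `drbdadm status` into the awk instead.

```shell
#!/bin/sh
# Flag DRBD connections reporting an explicit non-Connected state.
# Sample copied from the `drbdadm status` output above; on a live node:
#   drbdadm status | awk '/connection:/ && !/connection:Connected/ { print $1, $2 }'
status_sample='csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 role:Secondary
  disk:Consistent
  m7c14 role:Secondary
    peer-disk:Consistent
  m8c25 connection:NetworkFailure
  m8c29 role:Secondary
    peer-disk:Diskless'

printf '%s\n' "$status_sample" |
  awk '/connection:/ && !/connection:Connected/ { print $1, $2 }'
# prints: m8c25 connection:NetworkFailure
```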
- kvaps

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user