We are missing the following fix for the 3.16 series:

commit a88215312c5ed74697973f6c9f0fce718bcf18ad
Author: Philipp Reisner <philipp.reis...@linbit.com>
Date:   Mon Nov 10 17:21:11 2014 +0100

    drbd: fix race between role change and handshake

    Symptoms:
    If DRBD was "cleanly shut down" (all in sync, both Secondary before
    disconnect, identical data generation uuids), and then one side was
    promoted *during* the next connection handshake, the role change
    could confuse the handshake.
    The Primary would get stuck in WFBitmapS, the Secondary would log
    unexpected cstate (Connected) in receive_bitmap and get stuck in
    WFBitmapT.

    Fix:
    The test in is_valid_soft_transition wrong. It works because the
    not allowed actions (promote/attach) do not touch the cstate. The
    previous condition failed to demand a cstate change in one clause.

    In order to avoid deadlocks give up the state_mutex while waiting
    for the transient state to go away.

    Conflicts:
        drbd/drbd_state.c
        drbd/drbd_state.h
        drbd/drbd_wrappers.h

    Signed-off-by: Philipp Reisner <philipp.reis...@linbit.com>
    Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
    Signed-off-by: Jens Axboe <ax...@fb.com>

That probably fixes this issue. I will provide a PPA with a kernel to be
tested.
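
For readers unfamiliar with the last point of the commit message ("give up the state_mutex while waiting for the transient state to go away"), here is a minimal userspace sketch of that general pattern using POSIX threads. It is not the DRBD code: the struct, the enum values CSTATE_TRANSIENT and CSTATE_CONNECTED, and the functions promote(), finish_handshake() and demo() are all hypothetical names made up for illustration. The only point it shows is that the waiter must not keep holding the mutex while it sleeps, otherwise the thread that ends the transient state can never take it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these names are DRBD's. */
enum conn_state { CSTATE_TRANSIENT, CSTATE_CONNECTED };

struct resource {
    pthread_mutex_t state_mutex;
    pthread_cond_t  state_changed;
    enum conn_state cstate;
    bool            primary;
};

void resource_init(struct resource *r)
{
    pthread_mutex_init(&r->state_mutex, NULL);
    pthread_cond_init(&r->state_changed, NULL);
    r->cstate = CSTATE_TRANSIENT;   /* handshake still in progress */
    r->primary = false;
}

/* Handshake side: needs state_mutex to leave the transient state. */
void *finish_handshake(void *arg)
{
    struct resource *r = arg;

    pthread_mutex_lock(&r->state_mutex);
    r->cstate = CSTATE_CONNECTED;
    pthread_cond_broadcast(&r->state_changed);
    pthread_mutex_unlock(&r->state_mutex);
    return NULL;
}

/*
 * Role change side: refuse to promote while the state is transient.
 * pthread_cond_wait() releases state_mutex while sleeping and re-acquires
 * it before returning, so the handshake thread is never locked out.
 */
void promote(struct resource *r)
{
    pthread_mutex_lock(&r->state_mutex);
    while (r->cstate == CSTATE_TRANSIENT)
        pthread_cond_wait(&r->state_changed, &r->state_mutex);
    r->primary = true;
    pthread_mutex_unlock(&r->state_mutex);
}

/* Returns 1 if the promotion completed and the handshake had finished. */
int demo(void)
{
    struct resource r;
    pthread_t handshake;
    int ok;

    resource_init(&r);
    if (pthread_create(&handshake, NULL, finish_handshake, &r) != 0)
        return -1;
    promote(&r);
    pthread_join(handshake, NULL);

    ok = r.primary && r.cstate == CSTATE_CONNECTED;
    pthread_cond_destroy(&r.state_changed);
    pthread_mutex_destroy(&r.state_mutex);
    return ok;
}
```

Calling demo() from a small main() and building with -pthread should be enough to try it. If promote() instead spun on cstate with the mutex held, finish_handshake() would block forever in pthread_mutex_lock(), which is the kind of deadlock the commit message says it avoids.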

** Description changed:

- tinoco@freenode said:
- you are facing a probable race condition for drbd
- checking if the commit (fixing a race condition) is on the 3.16 kernel
- we are missing this fix:
- commit a88215312c5ed74697973f6c9f0fce718bcf18ad
- Author: Philipp Reisner <philipp.reis...@linbit.com>
- Date: Mon Nov 10 17:21:11 2014 +0100
- drbd: fix race between role change and handshake
- probably
- i need you to open a bug in launchpad for the "linux" package
- and let me know the number
- i'll provide the fix and ask the kernel team to fix this
+ It was brought to my attention the following kernel panic:

- Distributor ID: Ubuntu
- Description: Ubuntu 14.04.2 LTS
+ [1191751.738854] request: minor=1, resource=vm-appserver; but that minor belongs to resource libvirt
+ [1191759.892350] drbd vm-database: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
+ [1191759.892359] drbd vm-database: asender terminated
+ [1191759.892362] drbd vm-database: Terminating drbd_a_vm-datab
+ [1191759.892471] drbd vm-database: Connection closed
+ [1191759.892480] drbd vm-database: conn( Disconnecting -> StandAlone )
+ [1191759.892481] drbd vm-database: receiver terminated
+ [1191759.892485] drbd vm-database: Terminating drbd_r_vm-datab
+ [1191759.892497] block drbd6: disk( UpToDate -> Failed )
+ [1191759.902311] block drbd6: bitmap WRITE of 0 pages took 0 jiffies
+ [1191759.902315] block drbd6: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
+ [1191759.902322] block drbd6: disk( Failed -> Diskless )
+ [1191759.902565] block drbd6: drbd_bm_resize called with capacity == 0
+ [1191759.902585] drbd vm-database: Terminating drbd_w_vm-datab
+ [1191992.802513] INFO: task drbdsetup:20254 blocked for more than 120 seconds.
+ [1191992.834141] Not tainted 3.16.0-31-generic #41~14.04.1-Ubuntu
+ [1191992.862889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
+ [1191992.900367] drbdsetup D ffff88085fc53440 0 20254 1 0x00000004
+ [1191992.900371] ffff880845967ab8 0000000000000086 ffff880abe65c750 ffff880845967fd8
+ [1191992.900374] 0000000000013440 0000000000013440 ffff8808546d65e0 ffff8808b5365800
+ [1191992.900378] 0000000000001e00 0000000000000400 000000000000000e ffff8808b5365a30
+ [1191992.900388] Call Trace:
+ [1191992.900398] [<ffffffff817675c9>] schedule+0x29/0x70
+ [1191992.900407] [<ffffffffc078fba5>] _drbd_request_state+0x65/0xb0 [drbd]
+ [1191992.900413] [<ffffffff810b4d10>] ? prepare_to_wait_event+0x100/0x100
+ [1191992.900418] [<ffffffffc0787a3e>] adm_detach.part.52+0x3e/0x100 [drbd]
+ [1191992.900422] [<ffffffffc07842f1>] ? drbd_adm_prepare.isra.48+0xd1/0x4e0 [drbd]
+ [1191992.900426] [<ffffffffc0787c49>] drbd_adm_detach+0x149/0x150 [drbd]
+ [1191992.900431] [<ffffffff81692a39>] genl_family_rcv_msg+0x199/0x380
+ [1191992.900432] [<ffffffff81692c20>] ? genl_family_rcv_msg+0x380/0x380
+ [1191992.900434] [<ffffffff81692cb1>] genl_rcv_msg+0x91/0xd0
+ [1191992.900436] [<ffffffff81690d39>] netlink_rcv_skb+0xa9/0xc0
+ [1191992.900438] [<ffffffff81691238>] genl_rcv+0x28/0x40
+ [1191992.900439] [<ffffffff816903f3>] netlink_unicast+0xf3/0x200
+ [1191992.900441] [<ffffffff81690815>] netlink_sendmsg+0x315/0x680
+ [1191992.900448] [<ffffffff81333d7d>] ? aa_sk_perm.isra.4+0x6d/0x150
+ [1191992.900452] [<ffffffff8164625e>] sock_aio_write+0xfe/0x130
+ [1191992.900456] [<ffffffff811d358a>] do_sync_write+0x5a/0x90
+ [1191992.900458] [<ffffffff811d4005>] vfs_write+0x195/0x1f0
+ [1191992.900461] [<ffffffff811d4ac6>] SyS_write+0x46/0xb0
+ [1191992.900464] [<ffffffff8176b66d>] system_call_fastpath+0x1a/0x1f
+ [1191992.900465] sending NMI to all CPUs:

- Linux bluegrass4 3.16.0-31-generic #41~14.04.1-Ubuntu SMP Wed Feb 11
- 19:30:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
-
- Kernel panic, dump at:
+ Dump at:
  http://telsasoft.com/tmp/varcrash-201503271156.tar.bz2

** Description changed:

  It was brought to my attention the following kernel panic:

+ SYSTEM MAP: /boot/System.map-3.16.0-31-generic
+ DEBUG KERNEL: /usr/lib/debug/boot/vmlinux-3.16.0-31-generic
+ DUMPFILE: ./dump.201503271156 [PARTIAL DUMP]
+ CPUS: 12
+ DATE: Fri Mar 27 12:56:49 2015
+ UPTIME: 13 days, 19:14:00
+ LOAD AVERAGE: 1.72, 0.67, 0.28
+ TASKS: 340
+ NODENAME: bluegrass3
+ RELEASE: 3.16.0-31-generic
+ VERSION: #41~14.04.1-Ubuntu SMP Wed Feb 11 19:30:13 UTC 2015
+ MACHINE: x86_64 (2397 Mhz)
+ MEMORY: 63.9 GB
+ PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
+
  [1191751.738854] request: minor=1, resource=vm-appserver; but that minor belongs to resource libvirt
- [1191759.892350] drbd vm-database: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
+ [1191759.892350] drbd vm-database: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
  [1191759.892359] drbd vm-database: asender terminated
  [1191759.892362] drbd vm-database: Terminating drbd_a_vm-datab
  [1191759.892471] drbd vm-database: Connection closed
- [1191759.892480] drbd vm-database: conn( Disconnecting -> StandAlone )
+ [1191759.892480] drbd vm-database: conn( Disconnecting -> StandAlone )
  [1191759.892481] drbd vm-database: receiver terminated
  [1191759.892485] drbd vm-database: Terminating drbd_r_vm-datab
- [1191759.892497] block drbd6: disk( UpToDate -> Failed )
+ [1191759.892497] block drbd6: disk( UpToDate -> Failed )
  [1191759.902311] block drbd6: bitmap WRITE of 0 pages took 0 jiffies
  [1191759.902315] block drbd6: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
- [1191759.902322] block drbd6: disk( Failed -> Diskless )
+ [1191759.902322] block drbd6: disk( Failed -> Diskless )
  [1191759.902565] block drbd6: drbd_bm_resize called with capacity == 0
  [1191759.902585] drbd vm-database: Terminating drbd_w_vm-datab
  [1191992.802513] INFO: task drbdsetup:20254 blocked for more than 120 seconds.
  [1191992.834141] Not tainted 3.16.0-31-generic #41~14.04.1-Ubuntu
  [1191992.862889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [1191992.900367] drbdsetup D ffff88085fc53440 0 20254 1 0x00000004
  [1191992.900371] ffff880845967ab8 0000000000000086 ffff880abe65c750 ffff880845967fd8
  [1191992.900374] 0000000000013440 0000000000013440 ffff8808546d65e0 ffff8808b5365800
  [1191992.900378] 0000000000001e00 0000000000000400 000000000000000e ffff8808b5365a30
  [1191992.900388] Call Trace:
  [1191992.900398] [<ffffffff817675c9>] schedule+0x29/0x70
  [1191992.900407] [<ffffffffc078fba5>] _drbd_request_state+0x65/0xb0 [drbd]
  [1191992.900413] [<ffffffff810b4d10>] ? prepare_to_wait_event+0x100/0x100
  [1191992.900418] [<ffffffffc0787a3e>] adm_detach.part.52+0x3e/0x100 [drbd]
  [1191992.900422] [<ffffffffc07842f1>] ? drbd_adm_prepare.isra.48+0xd1/0x4e0 [drbd]
  [1191992.900426] [<ffffffffc0787c49>] drbd_adm_detach+0x149/0x150 [drbd]
  [1191992.900431] [<ffffffff81692a39>] genl_family_rcv_msg+0x199/0x380
  [1191992.900432] [<ffffffff81692c20>] ? genl_family_rcv_msg+0x380/0x380
  [1191992.900434] [<ffffffff81692cb1>] genl_rcv_msg+0x91/0xd0
  [1191992.900436] [<ffffffff81690d39>] netlink_rcv_skb+0xa9/0xc0
  [1191992.900438] [<ffffffff81691238>] genl_rcv+0x28/0x40
  [1191992.900439] [<ffffffff816903f3>] netlink_unicast+0xf3/0x200
  [1191992.900441] [<ffffffff81690815>] netlink_sendmsg+0x315/0x680
  [1191992.900448] [<ffffffff81333d7d>] ? aa_sk_perm.isra.4+0x6d/0x150
  [1191992.900452] [<ffffffff8164625e>] sock_aio_write+0xfe/0x130
  [1191992.900456] [<ffffffff811d358a>] do_sync_write+0x5a/0x90
  [1191992.900458] [<ffffffff811d4005>] vfs_write+0x195/0x1f0
  [1191992.900461] [<ffffffff811d4ac6>] SyS_write+0x46/0xb0
  [1191992.900464] [<ffffffff8176b66d>] system_call_fastpath+0x1a/0x1f
  [1191992.900465] sending NMI to all CPUs:

  Dump at:
  http://telsasoft.com/tmp/varcrash-201503271156.tar.bz2

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1439872

Title:
  kernel panic involving drbd

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  It was brought to my attention the following kernel panic:

  SYSTEM MAP: /boot/System.map-3.16.0-31-generic
  DEBUG KERNEL: /usr/lib/debug/boot/vmlinux-3.16.0-31-generic
  DUMPFILE: ./dump.201503271156 [PARTIAL DUMP]
  CPUS: 12
  DATE: Fri Mar 27 12:56:49 2015
  UPTIME: 13 days, 19:14:00
  LOAD AVERAGE: 1.72, 0.67, 0.28
  TASKS: 340
  NODENAME: bluegrass3
  RELEASE: 3.16.0-31-generic
  VERSION: #41~14.04.1-Ubuntu SMP Wed Feb 11 19:30:13 UTC 2015
  MACHINE: x86_64 (2397 Mhz)
  MEMORY: 63.9 GB
  PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"

  [1191751.738854] request: minor=1, resource=vm-appserver; but that minor belongs to resource libvirt
  [1191759.892350] drbd vm-database: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
  [1191759.892359] drbd vm-database: asender terminated
  [1191759.892362] drbd vm-database: Terminating drbd_a_vm-datab
  [1191759.892471] drbd vm-database: Connection closed
  [1191759.892480] drbd vm-database: conn( Disconnecting -> StandAlone )
  [1191759.892481] drbd vm-database: receiver terminated
  [1191759.892485] drbd vm-database: Terminating drbd_r_vm-datab
  [1191759.892497] block drbd6: disk( UpToDate -> Failed )
  [1191759.902311] block drbd6: bitmap WRITE of 0 pages took 0 jiffies
  [1191759.902315] block drbd6: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
  [1191759.902322] block drbd6: disk( Failed -> Diskless )
  [1191759.902565] block drbd6: drbd_bm_resize called with capacity == 0
  [1191759.902585] drbd vm-database: Terminating drbd_w_vm-datab
  [1191992.802513] INFO: task drbdsetup:20254 blocked for more than 120 seconds.
  [1191992.834141] Not tainted 3.16.0-31-generic #41~14.04.1-Ubuntu
  [1191992.862889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [1191992.900367] drbdsetup D ffff88085fc53440 0 20254 1 0x00000004
  [1191992.900371] ffff880845967ab8 0000000000000086 ffff880abe65c750 ffff880845967fd8
  [1191992.900374] 0000000000013440 0000000000013440 ffff8808546d65e0 ffff8808b5365800
  [1191992.900378] 0000000000001e00 0000000000000400 000000000000000e ffff8808b5365a30
  [1191992.900388] Call Trace:
  [1191992.900398] [<ffffffff817675c9>] schedule+0x29/0x70
  [1191992.900407] [<ffffffffc078fba5>] _drbd_request_state+0x65/0xb0 [drbd]
  [1191992.900413] [<ffffffff810b4d10>] ? prepare_to_wait_event+0x100/0x100
  [1191992.900418] [<ffffffffc0787a3e>] adm_detach.part.52+0x3e/0x100 [drbd]
  [1191992.900422] [<ffffffffc07842f1>] ? drbd_adm_prepare.isra.48+0xd1/0x4e0 [drbd]
  [1191992.900426] [<ffffffffc0787c49>] drbd_adm_detach+0x149/0x150 [drbd]
  [1191992.900431] [<ffffffff81692a39>] genl_family_rcv_msg+0x199/0x380
  [1191992.900432] [<ffffffff81692c20>] ? genl_family_rcv_msg+0x380/0x380
  [1191992.900434] [<ffffffff81692cb1>] genl_rcv_msg+0x91/0xd0
  [1191992.900436] [<ffffffff81690d39>] netlink_rcv_skb+0xa9/0xc0
  [1191992.900438] [<ffffffff81691238>] genl_rcv+0x28/0x40
  [1191992.900439] [<ffffffff816903f3>] netlink_unicast+0xf3/0x200
  [1191992.900441] [<ffffffff81690815>] netlink_sendmsg+0x315/0x680
  [1191992.900448] [<ffffffff81333d7d>] ? aa_sk_perm.isra.4+0x6d/0x150
  [1191992.900452] [<ffffffff8164625e>] sock_aio_write+0xfe/0x130
  [1191992.900456] [<ffffffff811d358a>] do_sync_write+0x5a/0x90
  [1191992.900458] [<ffffffff811d4005>] vfs_write+0x195/0x1f0
  [1191992.900461] [<ffffffff811d4ac6>] SyS_write+0x46/0xb0
  [1191992.900464] [<ffffffff8176b66d>] system_call_fastpath+0x1a/0x1f
  [1191992.900465] sending NMI to all CPUs:

  Dump at:
  http://telsasoft.com/tmp/varcrash-201503271156.tar.bz2

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1439872/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp