Hi Everyone,
we're hitting the same issue described here: http://lists.linbit.com/pipermail/drbd-user/2018-November/024667.html, in the exact same setup. Sadly, nobody answered that thread, so I'm opening a new one.

Packages installed on both nodes:

root@node2:~# dpkg -l | grep 'drbd\|linstor'
ii  drbd-dkms          9.0.17-1   all    RAID 1 over TCP/IP for Linux module source
ii  drbd-utils         9.9.0-1    amd64  RAID 1 over TCP/IP for Linux (user utilities)
ii  drbdtop            0.2.1-1    amd64  like top, but for drbd
ii  linstor-client     0.9.5-1    all    Linstor client command line tool
ii  linstor-common     0.9.10-1   all    DRBD distributed resource management utility
ii  linstor-proxmox    3.0.3-1    all    DRBD distributed resource management utility
ii  linstor-satellite  0.9.10-1   all    DRBD distributed resource management utility
ii  python-linstor     0.9.5-1    all    Linstor python api library

Proxmox error message:

()
SUCCESS:
    Description:
        New resource definition 'vm-102-state-test123' created.
    Details:
        Resource definition 'vm-102-state-test123' UUID is: 7eab2e30-5a5a-4e57-a4e2-d5da7c58f770
SUCCESS:
    Description:
        Resource definition 'vm-102-state-test123' modified.
    Details:
        Resource definition 'vm-102-state-test123' UUID is: 7eab2e30-5a5a-4e57-a4e2-d5da7c58f770
SUCCESS:
    New volume definition with number '0' of resource definition 'vm-102-state-test123' created.
SUCCESS:
    Description:
        Resource 'vm-102-state-test123' successfully autoplaced on 2 nodes
    Details:
        Used storage pool: 'drbdpool'
        Used nodes: 'node1', 'node2'
SUCCESS:
    Created resource 'vm-102-state-test123' on 'node2'
SUCCESS:
    Created resource 'vm-102-state-test123' on 'node1'
SUCCESS:
    Description:
        Resource 'vm-102-state-test123' on 'node2' ready
    Details:
        Auto-placing resource: vm-102-state-test123
SUCCESS:
    Description:
        Resource 'vm-102-state-test123' on 'node1' ready
    Details:
        Auto-placing resource: vm-102-state-test123
SUCCESS:
    Description:
        New snapshot 'snap_vm-102-disk-1_test123' of resource 'vm-102-disk-1' registered.
    Details:
        Snapshot 'snap_vm-102-disk-1_test123' of resource 'vm-102-disk-1' UUID is: 832dbdf2-b3dc-46f9-8f1d-ac9d20c1790c
SUCCESS:
    Suspended IO of 'vm-102-disk-1' on 'node2' for snapshot
SUCCESS:
    Suspended IO of 'vm-102-disk-1' on 'node1' for snapshot
ERROR:
    Description:
        (Node: 'node2') Preparing resources for layer StorageLayer failed
    Cause:
        External command timed out
    Details:
        External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
    Description:
        (Node: 'node1') Preparing resources for layer StorageLayer failed
    Cause:
        External command timed out
    Details:
        External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
    Description:
        (Node: 'node2') Preparing resources for layer StorageLayer failed
    Cause:
        External command timed out
    Details:
        External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
ERROR:
    Description:
        (Node: 'node1') Preparing resources for layer StorageLayer failed
    Cause:
        External command timed out
    Details:
        External command: lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
VM 102 qmp command 'savevm-end' failed - unable to connect to VM 102 qmp socket - timeout after 5992 retries
snapshot create failed: starting cleanup
error with cfs lock 'storage-drbdpool': Could not remove vm-102-state-test123: got lock timeout - aborting command
TASK ERROR: Could not create cluster wide snapshot for: vm-102-disk-1: exit code 10

Hanging tasks:

root@node2:~# systemctl status linstor-satellite.service
● linstor-satellite.service - LINSTOR Satellite Service
   Loaded: loaded (/lib/systemd/system/linstor-satellite.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/linstor-satellite.service.d
           └─override.conf
   Active: active (running) since Tue 2019-05-28 17:42:11 CEST; 30min ago
 Main PID: 4612 (java)
    Tasks: 303 (limit: 14745)
   Memory: 263.1M
      CPU: 24.237s
   CGroup: /system.slice/linstor-satellite.service
           ├─ 4612 java -Xms32M -XX:MinHeapFreeRatio=40 -XX:MaxHeapFreeRatio=70 -classpath /usr/share/linstor-server/lib/satellite-0.9.10.jar:/usr/share/linstor-server/lib/reactive-streams-1.0.2.jar:/usr/share/linstor-server/lib/logback-classic-1.2.3.ja
           ├─ 5406 drbdsetup events2 all
           ├─16954 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─17236 vgs -o vg_name --noheadings
           ├─17525 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─17788 vgs -o vg_name --noheadings
           ├─18075 vgs -o vg_name --noheadings
           ├─18323 vgs -o vg_name --noheadings
           ├─18612 vgs -o vg_name --noheadings
           ├─18889 vgs -o vg_name --noheadings
           ├─19162 vgs -o vg_name --noheadings
           ├─19442 vgs -o vg_name --noheadings
           ├─19700 vgs -o vg_name --noheadings
           ├─19984 vgs -o vg_name --noheadings
           ├─20242 vgs -o vg_name --noheadings
           ├─20523 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─20779 vgs -o vg_name --noheadings
           ├─21047 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─21310 vgs -o vg_name --noheadings
           ├─21572 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─21821 vgs -o vg_name --noheadings
           ├─22083 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─22346 vgs -o vg_name --noheadings
           ├─22612 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
           ├─22864 vgs -o vg_name --noheadings
           └─23135 lvs -o lv_name,lv_path,lv_size,vg_name,pool_lv,data_percent,lv_attr --separator ; --noheadings --units k --nosuffix drbdpool
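Those lvs/vgs scans never return; they just keep piling up under the satellite. For anyone who wants to poke at one of the stuck scanners directly, this is what we'd check next (PID 16954 taken from the CGroup listing above; both need root, and /proc/<pid>/stack should show where the process is blocked on the kernel side):

root@node2:~# ls -l /proc/16954/fd      # which block devices the stuck lvs has open
root@node2:~# cat /proc/16954/stack     # kernel stack; presumably the same __drbd_make_request as in the traces below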
Kernel stack trace:

[ 967.562965] INFO: task kvm:18314 blocked for more than 120 seconds.
[ 967.563540]       Tainted: P           O     4.15.18-14-pve #1
[ 967.563974] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.564414] kvm             D    0 18314      1 0x00000000
[ 967.564419] Call Trace:
[ 967.564435]  __schedule+0x3e0/0x870
[ 967.564440]  ? touch_pmd+0x32/0x40
[ 967.564442]  schedule+0x36/0x80
[ 967.564460]  __drbd_make_request+0x27c/0x5b0 [drbd]
[ 967.564465]  ? wait_woken+0x80/0x80
[ 967.564472]  drbd_make_request+0x4f/0x80 [drbd]
[ 967.564477]  generic_make_request+0x123/0x2f0
[ 967.564479]  submit_bio+0x73/0x140
[ 967.564480]  ? submit_bio+0x73/0x140
[ 967.564483]  blkdev_direct_IO+0x3c0/0x400
[ 967.564486]  ? do_io_getevents+0xd0/0xd0
[ 967.564489]  ? generic_update_time+0x7b/0xd0
[ 967.564490]  ? current_time+0x32/0x70
[ 967.564494]  generic_file_read_iter+0xb1/0xbb0
[ 967.564496]  ? generic_file_read_iter+0xb1/0xbb0
[ 967.564498]  ? __check_object_size+0xb3/0x190
[ 967.564500]  ? _copy_from_user+0x36/0x70
[ 967.564502]  blkdev_read_iter+0x35/0x40
[ 967.564503]  aio_read+0xfc/0x140
[ 967.564505]  do_io_submit+0x530/0x6c0
[ 967.564506]  ? do_io_submit+0x530/0x6c0
[ 967.564508]  SyS_io_submit+0x10/0x20
[ 967.564509]  ? SyS_io_submit+0x10/0x20
[ 967.564512]  do_syscall_64+0x73/0x130
[ 967.564514]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 967.564517] RIP: 0033:0x7f1f217d6717
[ 967.564517] RSP: 002b:00007f1de9999558 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[ 967.564519] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1f217d6717
[ 967.564520] RDX: 00007f1de9999570 RSI: 0000000000000001 RDI: 00007f1f23117000
[ 967.564521] RBP: 00007f1de9999570 R08: 0000000000000008 R09: 0000000000000000
[ 967.564521] R10: 00007f1efc94ba60 R11: 0000000000000246 R12: 00007f1de9999560
[ 967.564522] R13: 00007f1efc94ba38 R14: 00007f1efc94ba10 R15: 000000114f118a00
[ 967.564531] INFO: task lvs:18730 blocked for more than 120 seconds.
[ 967.564975]       Tainted: P           O     4.15.18-14-pve #1
[ 967.565425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.565894] lvs             D    0 18730   5826 0x00000004
[ 967.565896] Call Trace:
[ 967.565898]  __schedule+0x3e0/0x870
[ 967.565900]  schedule+0x36/0x80
[ 967.565907]  __drbd_make_request+0x27c/0x5b0 [drbd]
[ 967.565909]  ? wait_woken+0x80/0x80
[ 967.565915]  drbd_make_request+0x4f/0x80 [drbd]
[ 967.565917]  generic_make_request+0x123/0x2f0
[ 967.565918]  submit_bio+0x73/0x140
[ 967.565920]  ? submit_bio+0x73/0x140
[ 967.565921]  __blkdev_direct_IO_simple+0x1b1/0x380
[ 967.565922]  ? __blkdev_get+0x240/0x4d0
[ 967.565924]  ? bdget+0x120/0x120
[ 967.565925]  blkdev_direct_IO+0x3f1/0x400
[ 967.565926]  ? blkdev_direct_IO+0x3f1/0x400
[ 967.565929]  ? dput+0xc7/0x1f0
[ 967.565931]  ? mntput+0x24/0x40
[ 967.565932]  ? terminate_walk+0x91/0xf0
[ 967.565934]  ? path_openat+0x6f7/0x14a0
[ 967.565935]  ? putname+0x54/0x60
[ 967.565936]  generic_file_read_iter+0xb1/0xbb0
[ 967.565938]  ? generic_file_read_iter+0xb1/0xbb0
[ 967.565944]  ? kref_debug_sub+0x3c/0x60 [drbd]
[ 967.565946]  ? _copy_to_user+0x2a/0x40
[ 967.565948]  ? cp_new_stat+0x152/0x180
[ 967.565949]  blkdev_read_iter+0x35/0x40
[ 967.565951]  new_sync_read+0xe4/0x130
[ 967.565952]  __vfs_read+0x29/0x40
[ 967.565954]  vfs_read+0x96/0x130
[ 967.565955]  SyS_read+0x55/0xc0
[ 967.565957]  do_syscall_64+0x73/0x130
[ 967.565958]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 967.565959] RIP: 0033:0x7f1d049a9910
[ 967.565960] RSP: 002b:00007ffec8316918 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 967.565961] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1d049a9910
[ 967.565962] RDX: 0000000000001000 RSI: 0000556670ae3000 RDI: 0000000000000004
[ 967.565963] RBP: 00007ffec8316970 R08: 00007f1d04c68258 R09: 0000000000001000
[ 967.565963] R10: 0000000000000030 R11: 0000000000000246 R12: 0000556670ae3000
[ 967.565964] R13: 0000000000000000 R14: 0000000000000004 R15: 00007ffec83169d0
[ 967.565968] INFO: task lvs:18773 blocked for more than 120 seconds.
[ 967.566433]       Tainted: P           O     4.15.18-14-pve #1
[ 967.566944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.567435] lvs             D    0 18773  18766 0x00000004
[ 967.567438] Call Trace:
[ 967.567441]  __schedule+0x3e0/0x870
[ 967.567443]  schedule+0x36/0x80
[ 967.567450]  __drbd_make_request+0x27c/0x5b0 [drbd]
[ 967.567452]  ? wait_woken+0x80/0x80
[ 967.567458]  drbd_make_request+0x4f/0x80 [drbd]
[ 967.567460]  generic_make_request+0x123/0x2f0
[ 967.567462]  submit_bio+0x73/0x140
[ 967.567464]  ? submit_bio+0x73/0x140
[ 967.567465]  __blkdev_direct_IO_simple+0x1b1/0x380
[ 967.567467]  ? __blkdev_get+0x240/0x4d0
[ 967.567468]  ? bdget+0x120/0x120
[ 967.567470]  blkdev_direct_IO+0x3f1/0x400
[ 967.567471]  ? blkdev_direct_IO+0x3f1/0x400
[ 967.567472]  ? dput+0xc7/0x1f0
[ 967.567474]  ? mntput+0x24/0x40
[ 967.567476]  ? terminate_walk+0x91/0xf0
[ 967.567478]  ? path_openat+0x6f7/0x14a0
[ 967.567479]  ? putname+0x54/0x60
[ 967.567480]  generic_file_read_iter+0xb1/0xbb0
[ 967.567482]  ? generic_file_read_iter+0xb1/0xbb0
[ 967.567488]  ? kref_debug_sub+0x3c/0x60 [drbd]
[ 967.567490]  ? _copy_to_user+0x2a/0x40
[ 967.567492]  ? cp_new_stat+0x152/0x180
[ 967.567493]  blkdev_read_iter+0x35/0x40
[ 967.567494]  new_sync_read+0xe4/0x130
[ 967.567497]  __vfs_read+0x29/0x40
[ 967.567499]  vfs_read+0x96/0x130
[ 967.567500]  SyS_read+0x55/0xc0
[ 967.567502]  do_syscall_64+0x73/0x130
[ 967.567504]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 967.567505] RIP: 0033:0x7f313938e910
[ 967.567506] RSP: 002b:00007fffa10d77c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 967.567507] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f313938e910
[ 967.567508] RDX: 0000000000001000 RSI: 0000564dbe7da000 RDI: 0000000000000004
[ 967.567509] RBP: 00007fffa10d7820 R08: 00007f313964d248 R09: 0000000000001000
[ 967.567509] R10: 0000000000000030 R11: 0000000000000246 R12: 0000564dbe7da000
[ 967.567510] R13: 0000000000000000 R14: 0000000000000004 R15: 00007fffa10d7880
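If we read those traces right, the lvs processes themselves are blocked inside __drbd_make_request, i.e. the LVM scan is reading from the very DRBD devices whose IO LINSTOR has just suspended for the snapshot, and LINSTOR's own lvs then times out behind it. The workaround we're considering (an untested sketch; it assumes none of our PVs actually sits on a DRBD device, so rejecting /dev/drbd* from scanning loses nothing) would be a global_filter in /etc/lvm/lvm.conf on both nodes:

    devices {
        # Reject DRBD devices during LVM scans, accept everything else.
        # Untested -- do not use if any PV lives on top of a DRBD device.
        global_filter = [ "r|^/dev/drbd|", "a|.*|" ]
    }

With that in place, lvs/vgs should only ever touch the backing disks and no longer hang behind a suspended DRBD resource. Does that sound right, or is the satellite expected to cope with unfiltered scans?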
We also have LVM_THIN storage pools. Hopefully somebody can help us resolve this thread, and the one from November as well.

Kind regards,
Alex
