Hi all,

This morning an OSD in our office cluster crashed:

Oct 26 12:52:17 sanmarko ceph-osd[2161]: ceph-osd: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
Oct 26 12:52:17 sanmarko ceph-osd[2161]: *** Caught signal (Aborted) **
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  in thread 7fb2a6722700 thread_name:filestore_sync
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb2b1805140]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  2: gsignal()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  3: abort()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7fb2b12ba40f]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7fb2b12c9662]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  6: pthread_mutex_lock()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  7: (JournalingObjectStore::ApplyManager::commit_start()+0xbc) [0x555bd4e5ab2c]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  8: (FileStore::sync_entry()+0x32e) [0x555bd4e2448e]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  9: (FileStore::SyncThread::entry()+0xd) [0x555bd4e4ed2d]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fb2b17f9ea7]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  11: clone()
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 2021-10-26T12:52:17.458+0200 7fb2a6722700 -1 *** Caught signal (Aborted) **
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  in thread 7fb2a6722700 thread_name:filestore_sync
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb2b1805140]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  2: gsignal()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  3: abort()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7fb2b12ba40f]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7fb2b12c9662]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  6: pthread_mutex_lock()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  7: (JournalingObjectStore::ApplyManager::commit_start()+0xbc) [0x555bd4e5ab2c]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  8: (FileStore::sync_entry()+0x32e) [0x555bd4e2448e]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  9: (FileStore::SyncThread::entry()+0xd) [0x555bd4e4ed2d]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fb2b17f9ea7]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  11: clone()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 12:52:17 sanmarko ceph-osd[2161]:      0> 2021-10-26T12:52:17.458+0200 7fb2a6722700 -1 *** Caught signal (Aborted) **
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  in thread 7fb2a6722700 thread_name:filestore_sync
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb2b1805140]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  2: gsignal()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  3: abort()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7fb2b12ba40f]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7fb2b12c9662]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  6: pthread_mutex_lock()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  7: (JournalingObjectStore::ApplyManager::commit_start()+0xbc) [0x555bd4e5ab2c]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  8: (FileStore::sync_entry()+0x32e) [0x555bd4e2448e]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  9: (FileStore::SyncThread::entry()+0xd) [0x555bd4e4ed2d]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fb2b17f9ea7]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  11: clone()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 12:52:17 sanmarko ceph-osd[2161]:      0> 2021-10-26T12:52:17.458+0200 7fb2a6722700 -1 *** Caught signal (Aborted) **
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  in thread 7fb2a6722700 thread_name:filestore_sync
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb2b1805140]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  2: gsignal()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  3: abort()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  4: /lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7fb2b12ba40f]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  5: /lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7fb2b12c9662]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  6: pthread_mutex_lock()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  7: (JournalingObjectStore::ApplyManager::commit_start()+0xbc) [0x555bd4e5ab2c]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  8: (FileStore::sync_entry()+0x32e) [0x555bd4e2448e]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  9: (FileStore::SyncThread::entry()+0xd) [0x555bd4e4ed2d]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fb2b17f9ea7]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  11: clone()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 12:52:17 sanmarko systemd[1]: [email protected]: Main process exited, code=killed, status=6/ABRT
Oct 26 12:52:17 sanmarko systemd[1]: [email protected]: Failed with result 'signal'.
Oct 26 12:52:17 sanmarko systemd[1]: [email protected]: Consumed 1h 7min 59.828s CPU time.
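In case somebody wants the full dump: I assume the crash module captured this abort too (if the ceph-crash service is running on the node), so it should be retrievable with something like the following; the crash id is just a placeholder:

# ceph crash ls
# ceph crash info <crash-id>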

This is a filestore OSD. This node has 4 OSDs: 1 SSD OSD and 3 HDD OSDs with their journals on the system SSD.
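
If the objectstore type per OSD is relevant, it can be checked from the mon side; for example, for the crashed OSD (osd.6, going by the log file name):

# ceph osd metadata 6 | grep osd_objectstore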

I don't see any useful detail in ceph-osd.6.log... it's just full of entries like these right before the crash (they look like this even back at -9999):
   -15> 2021-10-26T12:52:11.422+0200 7fb29ff23700 10 monclient: tick
   -14> 2021-10-26T12:52:11.422+0200 7fb29ff23700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-26T12:51:41.424126+0200)
   -13> 2021-10-26T12:52:12.422+0200 7fb29ff23700 10 monclient: tick
   -12> 2021-10-26T12:52:12.422+0200 7fb29ff23700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-26T12:51:42.424200+0200)
   -11> 2021-10-26T12:52:13.422+0200 7fb29ff23700 10 monclient: tick
   -10> 2021-10-26T12:52:13.422+0200 7fb29ff23700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-26T12:51:43.424270+0200)
    -9> 2021-10-26T12:52:14.422+0200 7fb29ff23700 10 monclient: tick
    -8> 2021-10-26T12:52:14.422+0200 7fb29ff23700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-26T12:51:44.424342+0200)
    -7> 2021-10-26T12:52:15.422+0200 7fb29ff23700 10 monclient: tick
    -6> 2021-10-26T12:52:15.422+0200 7fb29ff23700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-26T12:51:45.424414+0200)
    -5> 2021-10-26T12:52:15.726+0200 7fb29370a700  5 osd.6 39883 heartbeat osd_stat(store_statfs(0x999ad46000/0xd0000/0xe8c3af0000, data 0x999ae16000/0x999ae16000, compress 0x0/0x0/0x0, omap 0x2d2d2dc, meta 0x0), peers [0,2,3,4,7,8,9,10,11,15] op hist [])
    -4> 2021-10-26T12:52:16.422+0200 7fb29ff23700 10 monclient: tick
    -3> 2021-10-26T12:52:16.422+0200 7fb29ff23700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-26T12:51:46.424489+0200)
    -2> 2021-10-26T12:52:17.422+0200 7fb29ff23700 10 monclient: tick
    -1> 2021-10-26T12:52:17.422+0200 7fb29ff23700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-10-26T12:51:47.424562+0200)
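
If it would help to have more detail the next time this happens, I guess I could raise the FileStore/journal debug levels for that OSD at runtime, something along these lines (20 is just my guess at a useful verbosity, not a value anyone recommended for this case):

# ceph config set osd.6 debug_filestore 20
# ceph config set osd.6 debug_journal 20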


I don't see those entries after the daemon was automatically restarted.

Any idea what could be the issue?

System has plenty of RAM:

# free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        28Gi       752Mi        65Mi        96Gi        96Gi
Swap:             0B          0B          0B

And the CPU is ~90% idle (16 cores).

# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)
pve-manager: 7.0-13 (running version: 7.0-13/7aa7e488)
pve-kernel-helper: 7.1-2
pve-kernel-5.11: 7.0-8
pve-kernel-5.4: 6.4-6
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.124-1-pve: 5.4.124-2
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-10
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-12
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.11-1
proxmox-backup-file-restore: 2.0.11-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-10
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-4
pve-firmware: 3.3-2
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-16
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Thanks


Eneko Lacunza
Technical Director
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


