Hi,
I've recently upgraded from Nautilus 14.2.2 to 14.2.6, and I've also been
adding some new OSDs to the cluster. It looks as though either the
backplane I added has power issues or the RAID card I added has bad
memory: several newish, known-good drives were bounced out of their JBOD
configs. (I know running OSDs behind a RAID card is bad practice; work is
underway to replace it with an HBA.) The cephfs data pool is erasure coded
k=4 m=2, and the rbd pool is replication 3.
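
For reference, this is roughly how I pulled those redundancy settings (the
erasure code profile name below is a placeholder, not the exact name in my
cluster):

    # Pool sizes and which EC profile is in use:
    ceph osd pool ls detail
    # Confirms k=4 m=2 on the cephfs data pool's profile:
    ceph osd erasure-code-profile get <ec-profile-name>

The affected OSDs now abort on startup with the following in the journal: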

    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (()+0x4dddd7)
[0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3:
(BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4:
(BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5:
(OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6: (main()+0x195b)
[0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7:
(__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8: (()+0x578be5)
[0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 0> 2020-01-22
14:35:49.324 7f92a2012a80 -1 *** Caught signal (Aborted) **
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: in thread
7f92a2012a80 thread_name:ceph-osd
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: ceph version 14.2.6
(f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 1: (()+0xf5f0)
[0x7f929f06d5f0]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (gsignal()+0x37)
[0x7f929de61337]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3: (abort()+0x148)
[0x7f929de62a28]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x199) [0x55ed69e03c5e]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5: (()+0x4dddd7)
[0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6:
(BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7:
(BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8:
(OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 9: (main()+0x195b)
[0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 10:
(__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 11: (()+0x578be5)
[0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: NOTE: a copy of the
executable, or `objdump -rdS <executable>` is needed to interpret this.
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: -10> 2020-01-22
14:35:49.291 7f92a2012a80 -1 rocksdb: Corruption: missing start of
fragmented record(2)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: -9> 2020-01-22
14:35:49.291 7f92a2012a80 -1 bluestore(/var/lib/ceph/osd/ceph-26) _open_db
erroring opening db:
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: -1> 2020-01-22
14:35:49.320 7f92a2012a80 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluest
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluestore/BlueStore.cc:
10135: FAILED ceph_assert(
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: ceph version 14.2.6
(f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14a) [0x55ed69e03c0f]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (()+0x4dddd7)
[0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3:
(BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4:
(BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5:
(OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6: (main()+0x195b)
[0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7:
(__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8: (()+0x578be5)
[0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 0> 2020-01-22
14:35:49.324 7f92a2012a80 -1 *** Caught signal (Aborted) **
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: in thread
7f92a2012a80 thread_name:ceph-osd
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: ceph version 14.2.6
(f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 1: (()+0xf5f0)
[0x7f929f06d5f0]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (gsignal()+0x37)
[0x7f929de61337]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3: (abort()+0x148)
[0x7f929de62a28]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x199) [0x55ed69e03c5e]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5: (()+0x4dddd7)
[0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6:
(BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7:
(BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8:
(OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 9: (main()+0x195b)
[0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 10:
(__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 11: (()+0x578be5)
[0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: NOTE: a copy of the
executable, or `objdump -rdS <executable>` is needed to interpret this.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: ceph-osd@26.service: main
process exited, code=killed, status=6/ABRT
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: Unit ceph-osd@26.service
entered failed state.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: ceph-osd@26.service
failed.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: ceph-osd@26.service
holdoff time over, scheduling restart.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: Stopped Ceph object
storage daemon osd.26.

I've attempted to run fsck and repair on these OSDs, but I get an error
there as well:

    [root@kvm2 ~]# ceph-bluestore-tool repair --path
/var/lib/ceph/osd/ceph-20 --deep 1
    2020-01-22 14:31:34.346 7f0399e4bc00 -1 rocksdb: Corruption: missing
start of fragmented record(2)
    2020-01-22 14:31:34.346 7f0399e4bc00 -1
bluestore(/var/lib/ceph/osd/ceph-20) _open_db erroring opening db:
    error from fsck: (5) Input/output error
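
In case it helps, this is roughly how I plan to re-run fsck with a debug
log captured to a file for each affected OSD (flags taken from the
ceph-bluestore-tool man page; the log level value is a guess on my part):

    # Deep fsck, writing verbose output to a file for later inspection
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-20 \
        --deep 1 -l /tmp/osd-20-fsck.log --log-level 20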

I'd really like to avoid having to start over. I have several backups, but
I doubt they're all up to date and complete. I can get more verbose logs
if necessary, though I'm not sure how to enable them :/
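
If extra logging would help, this is my best guess at how to turn it up
(the bluestore and rocksdb subsystems are the ones in the crash above;
debug_bluefs and the 20/20 levels are extra guesses on my part):

    # Raise debug levels for osd.26 before the next start attempt
    ceph config set osd.26 debug_bluestore 20/20
    ceph config set osd.26 debug_rocksdb 20/20
    ceph config set osd.26 debug_bluefs 20/20
    systemctl restart ceph-osd@26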

-- 

Justin Engwer
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
