After restarting several OSD daemons in our Ceph cluster a couple of days ago, 
a couple of our OSDs won’t come online. The services start and then crash with 
the error below. One PG is marked incomplete and will not peer. The pool is 
erasure coded (2+1) and is currently set to size=3, min_size=2. The incomplete 
PG reports that it is not peering due to:

    "comment": "not enough complete instances of this PG"

and:

    "down_osds_we_would_probe": [
        7,
        16
    ],
OSD 7 is completely lost (the drive is dead), and OSD 16 will not come online 
(see the log output below).
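
To make the shard arithmetic behind the situation concrete, here is a minimal 
sketch. It assumes, for illustration only, that two of the PG's three shards 
lived on OSDs 7 and 16; the third OSD ID (3) and the helper name 
can_reconstruct are invented and do not come from the cluster. A 2+1 profile 
needs at least k=2 of its k+m=3 shards to rebuild an object.

```python
# Erasure-code profile from the post: 2 data chunks + 1 coding chunk.
K, M = 2, 1


def can_reconstruct(shard_osds, unavailable_osds, k=K):
    """Return True if at least k shards of the PG sit on reachable OSDs."""
    reachable = [osd for osd in shard_osds if osd not in unavailable_osds]
    return len(reachable) >= k


# Hypothetical placement: OSD 3 is an invented ID for the one healthy shard.
shard_osds = [7, 16, 3]
assert len(shard_osds) == K + M  # size=3 for a 2+1 pool

both_down = can_reconstruct(shard_osds, {7, 16})  # OSD 7 dead, OSD 16 crashing
osd16_back = can_reconstruct(shard_osds, {7})     # only OSD 7 lost

print("recoverable with 7 and 16 down:", both_down)
print("recoverable once 16 is back:   ", osd16_back)
```

Under these assumptions the PG cannot be rebuilt while both OSDs are down, but 
it could be if OSD 16 were brought back, which is why getting that OSD to 
start matters here.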

We’ve spent several days searching the ceph-users list and tweaking OSD 
configuration settings, to no avail. We are reaching out here as a last-ditch 
effort before we have to give up on the PG.


tcmalloc: large alloc 1073741824 bytes == 0x560ada35c000 @  0x7f5c1081e4ef 
0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e9466cb 0x7f5c0e946774 0x7f5c0e9469df 
0x560a8fdb7db0 0x560a8fda8d28 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 
0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 
0x560a8f514373

tcmalloc: large alloc 2147483648 bytes == 0x560b1a35c000 @  0x7f5c1081e4ef 
0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e9466cb 0x7f5c0e946774 0x7f5c0e9469df 
0x560a8fdb7db0 0x560a8fda8d28 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 
0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 
0x560a8f514373

tcmalloc: large alloc 4294967296 bytes == 0x560b9a35c000 @  0x7f5c1081e4ef 
0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e9466cb 0x7f5c0e946774 0x7f5c0e9469df 
0x560a8fdb7db0 0x560a8fda8d28 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 
0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 
0x560a8f514373

tcmalloc: large alloc 3840745472 bytes == 0x560a9a334000 @  0x7f5c1081e4ef 
0x7f5c1083dbd6 0x7f5c0e945ab9 0x7f5c0e945c76 0x7f5c0e94623e 0x560a8fdea280 
0x560a8fda8f36 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 0x560a8f9f8f88 
0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 0x560a8f514373

tcmalloc: large alloc 2728992768 bytes == 0x560e779ee000 @  0x7f5c1081e4ef 
0x7f5c1083f010 0x560a8faa5674 0x560a8faa7125 0x560a8fa835a7 0x560a8fa5aa3c 
0x560a8fa5c238 0x560a8fa77dcc 0x560a8fe439ef 0x560a8fe43c03 0x560a8fe5acd4 
0x560a8fda75ec 0x560a8fda9260 0x560a8fdaa6b6 0x560a8fdab973 0x560a8fdacbb6 
0x560a8f9f8f88 0x560a8f983d83 0x560a8f9b5d7e 0x560a8f474069 0x7f5c0dfc5445 
0x560a8f514373

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: In function 
'void KernelDevice::_aio_thread()' thread 7f5c0a749700 time 2019-03-13 
12:46:39.632156

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: 384: FAILED 
assert(0 == "unexpected aio error")

2019-03-13 12:46:39.632132 7f5c0a749700 -1 bdev(0x560a99c05000 
/var/lib/ceph/osd/ceph-16/block) aio to 4817558700032~2728988672 but returned: 
2147479552

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
(stable)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x560a8fadd2a0]

 2: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 4: (()+0x7e25) [0x7f5c0efb0e25]

 5: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

2019-03-13 12:46:39.633822 7f5c0a749700 -1 
/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: In function 
'void KernelDevice::_aio_thread()' thread 7f5c0a749700 time 2019-03-13 
12:46:39.632156

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: 384: FAILED 
assert(0 == "unexpected aio error")



 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
(stable)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x560a8fadd2a0]

 2: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 4: (()+0x7e25) [0x7f5c0efb0e25]

 5: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.



    -1> 2019-03-13 12:46:39.632132 7f5c0a749700 -1 bdev(0x560a99c05000 
/var/lib/ceph/osd/ceph-16/block) aio to 4817558700032~2728988672 but returned: 
2147479552

     0> 2019-03-13 12:46:39.633822 7f5c0a749700 -1 
/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: In function 
'void KernelDevice::_aio_thread()' thread 7f5c0a749700 time 2019-03-13 
12:46:39.632156

/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/KernelDevice.cc: 384: FAILED 
assert(0 == "unexpected aio error")



 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
(stable)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x560a8fadd2a0]

 2: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 4: (()+0x7e25) [0x7f5c0efb0e25]

 5: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.



*** Caught signal (Aborted) **

 in thread 7f5c0a749700 thread_name:bstore_aio

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
(stable)

 1: (()+0xa41911) [0x560a8fa9e911]

 2: (()+0xf6d0) [0x7f5c0efb86d0]

 3: (gsignal()+0x37) [0x7f5c0dfd9277]

 4: (abort()+0x148) [0x7f5c0dfda968]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x284) [0x560a8fadd414]

 6: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 8: (()+0x7e25) [0x7f5c0efb0e25]

 9: (clone()+0x6d) [0x7f5c0e0a1bad]

2019-03-13 12:46:39.635955 7f5c0a749700 -1 *** Caught signal (Aborted) **

 in thread 7f5c0a749700 thread_name:bstore_aio



 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
(stable)

 1: (()+0xa41911) [0x560a8fa9e911]

 2: (()+0xf6d0) [0x7f5c0efb86d0]

 3: (gsignal()+0x37) [0x7f5c0dfd9277]

 4: (abort()+0x148) [0x7f5c0dfda968]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x284) [0x560a8fadd414]

 6: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 8: (()+0x7e25) [0x7f5c0efb0e25]

 9: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.



     0> 2019-03-13 12:46:39.635955 7f5c0a749700 -1 *** Caught signal (Aborted) 
**

 in thread 7f5c0a749700 thread_name:bstore_aio



 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
(stable)

 1: (()+0xa41911) [0x560a8fa9e911]

 2: (()+0xf6d0) [0x7f5c0efb86d0]

 3: (gsignal()+0x37) [0x7f5c0dfd9277]

 4: (abort()+0x148) [0x7f5c0dfda968]

 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x284) [0x560a8fadd414]

 6: (KernelDevice::_aio_thread()+0xd34) [0x560a8fa7fe24]

 7: (KernelDevice::AioCompletionThread::entry()+0xd) [0x560a8fa8517d]

 8: (()+0x7e25) [0x7f5c0efb0e25]

 9: (clone()+0x6d) [0x7f5c0e0a1bad]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.



Aborted
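
One detail in the log above may be worth checking: the bdev line reports an 
aio of 2728988672 bytes that returned 2147479552. That returned value equals 
0x7ffff000, which is the most a single read or write call transfers on Linux 
(INT_MAX rounded down to a 4 KiB page). The sketch below only does that 
arithmetic on the logged numbers. The helper name parse_aio_error is invented 
for illustration, and the idea that this cap is what trips the assert is a 
hypothesis, not something confirmed against the Ceph source.

```python
import re

# Log line copied from the OSD output above, joined onto one line.
LOG_LINE = (
    "bdev(0x560a99c05000 /var/lib/ceph/osd/ceph-16/block) "
    "aio to 4817558700032~2728988672 but returned: 2147479552"
)

# Linux transfers at most this many bytes per read/write call
# (INT_MAX rounded down to a 4 KiB page boundary).
LINUX_MAX_RW_COUNT = 0x7FFFF000


def parse_aio_error(line):
    """Return (offset, requested_length, returned) from a bdev aio error line."""
    match = re.search(r"aio to (\d+)~(\d+) but returned: (-?\d+)", line)
    if match is None:
        raise ValueError("not a bdev aio error line")
    return tuple(int(group) for group in match.groups())


offset, requested, returned = parse_aio_error(LOG_LINE)

is_short_transfer = 0 <= returned < requested
hit_kernel_cap = returned == LINUX_MAX_RW_COUNT
missing_bytes = requested - returned

print("requested bytes:", requested)
print("returned bytes: ", returned)
print("short transfer: ", is_short_transfer)
print("equals 0x7ffff000:", hit_kernel_cap)
print("bytes not transferred:", missing_bytes)
```

If the hypothesis holds, the drive itself may not be returning an I/O error at 
all; the OSD would instead be issuing a single read larger than 2 GiB and 
treating the short result as fatal.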

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
