Something I had suggested off-list (repeated here in case anyone else finds
themselves in a similar scenario):

Since only one PG is dead and the OSD now seems to be alive enough to
start/mount: consider taking a backup of the affected PG with

ceph-objectstore-tool --op export --pgid X.YY

(That might also take a long time.)

That export can later be imported into any other OSD if these three dead
OSDs turn out to be a lost cause.
(Risk: importing the PG somewhere else might kill that OSD as well,
depending on the nature of the problem; I suggested new OSDs as the import
target.)
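
For reference, a fuller invocation could look roughly like this (all paths,
OSD IDs and the PG ID are placeholders, and the OSD has to be stopped while
the tool runs against it):

  # export the PG from the (stopped) source OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NNN \
      --op export --pgid X.YY --file /some/backup/dir/X.YY.export

  # later, if needed, import it into another (stopped) OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-MMM \
      --op import --file /some/backup/dir/X.YY.export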

Paul

On Thu, Jun 13, 2019 at 3:52 PM Sage Weil <s...@newdream.net> wrote:

> On Thu, 13 Jun 2019, Harald Staub wrote:
> > Idea received from Wido den Hollander:
> > bluestore rocksdb options = "compaction_readahead_size=0"
> >
> > With this option, I just tried to start 1 of the 3 crashing OSDs, and it
> > came up! I did this with "ceph osd set noin" for now.
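> >
> > (For completeness, a rough sketch of how that option can be applied in
> > ceph.conf; the [osd.266] section is just an example scope, a global [osd]
> > section would work as well:
> >
> >   [osd.266]
> >   bluestore rocksdb options = "compaction_readahead_size=0"
> >
> > and then restart that OSD.)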
>
> Yay!
>
> > Later it aborted:
> >
> > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had timed out after 15
> > 2019-06-13 13:11:11.862 7f2a19f5f700  1 heartbeat_map reset_timeout
> > 'OSD::osd_op_tp thread 0x7f2a19f5f700' had suicide timed out after 150
> > 2019-06-13 13:11:11.862 7f2a37982700  0 --1-
> > v1:[2001:620:5ca1:201::119]:6809/3426631 >>
> > v1:[2001:620:5ca1:201::144]:6821/3627456 conn(0x564f65c0c000
> 0x564f26d6d800
> > :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=18075 cs=1
> l=0).handle_connect_reply_2
> > connect got RESETSESSION
> > 2019-06-13 13:11:11.862 7f2a19f5f700 -1 *** Caught signal (Aborted) **
> >  in thread 7f2a19f5f700 thread_name:tp_osd_tp
> >
> >  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> > (stable)
> >  1: (()+0x12890) [0x7f2a3a818890]
> >  2: (pthread_kill()+0x31) [0x7f2a3a8152d1]
> >  3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
> const*,
> > unsigned long)+0x24b) [0x564d732ca2bb]
> >  4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
> unsigned
> > long, unsigned long)+0x255) [0x564d732ca895]
> >  5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5a0)
> > [0x564d732eb560]
> >  6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x564d732ed5d0]
> >  7: (()+0x76db) [0x7f2a3a80d6db]
> >  8: (clone()+0x3f) [0x7f2a395ad88f]
> >
> > I guess that this is because of pending backfilling and the noin flag?
> > Afterwards it restarted by itself and came up. I stopped it again for now.
>
> I think that increasing the various suicide timeout options will allow
> it to stay up long enough to clean up the ginormous objects:
>
>  ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
>
> > It looks healthy so far:
> > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> > fsck success
> >
> > Now we have to choose how to continue, trying to reduce the risk of
> > losing data (most bucket indexes are intact currently). My guess would be
> > to let this OSD (which was not the primary) go in and hope that it
> > recovers. In case of a problem, maybe we could still use the other OSDs
> > "somehow"? In case of success, we would bring back the other OSDs as well?
> >
> > OTOH we could try to continue with the key dump from earlier today.
>
> I would start all three osds the same way, with 'noout' set on the
> cluster.  You should try to avoid triggering recovery because it will have
> a hard time getting through the big index object on that bucket (i.e., it
> will take a long time, and might trigger some blocked ios and so forth).
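>
> Roughly (a sketch; the last two flags are just one way to keep backfill
> and recovery from kicking in, not something strictly required):
>
>  ceph osd set noout
>  ceph osd set norecover
>  ceph osd set nobackfill
>
> and unset them again once the OSDs are stable.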
>
> (Side note that since you started the OSD read-write using the internal
> copy of rocksdb, don't forget that the external copy you extracted
> (/mnt/ceph/db?) is now stale!)
>
> sage
>
> >
> > Any opinions?
> >
> > Thanks!
> >  Harry
> >
> > On 13.06.19 09:32, Harald Staub wrote:
> > > On 13.06.19 00:33, Sage Weil wrote:
> > > [...]
> > > > One other thing to try before taking any drastic steps (as described
> > > > below):
> > > >
> > > >   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck
> > >
> > > This gives: fsck success
> > >
> > > and the large alloc warnings:
> > >
> > > tcmalloc: large alloc 2145263616 bytes == 0x562412e10000 @
> 0x7fed890d6887
> > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> 0x562385420ae1
> > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > tcmalloc: large alloc 4290519040 bytes == 0x562492bf2000 @
> 0x7fed890d6887
> > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> 0x562385420ae1
> > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > tcmalloc: large alloc 8581029888 bytes == 0x562593068000 @
> 0x7fed890d6887
> > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> 0x562385420ae1
> > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > tcmalloc: large alloc 17162051584 bytes == 0x562792fea000 @
> 0x7fed890d6887
> > > 0x562385370229 0x5623853703a3 0x5623856c51ec 0x56238566dce2
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> 0x562385420ae1
> > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > > tcmalloc: large alloc 13559291904 bytes == 0x562b92eec000 @
> 0x7fed890d6887
> > > 0x562385370229 0x56238537181b 0x562385723a99 0x56238566dd25
> 0x56238566fa05
> > > 0x562385681d41 0x562385476201 0x5623853d5737 0x5623853ef418
> 0x562385420ae1
> > > 0x5623852901c2 0x7fed7ddddb97 0x56238536977a
> > >
> > > Thanks!
> > >   Harry
> > >
> > > [...]
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
