Nice to hear this was resolved in the end. Coming back to the beginning -- is it clear to anyone what the root cause was, and how other users can prevent this from happening? Maybe some better default configs could warn users earlier about too-large omaps?
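For reference, the large-omap health warning is driven by two OSD thresholds that can be lowered so deep scrub flags oversized bucket index objects sooner. A rough sketch below; the option names are the ones I know from Nautilus, and the values are arbitrary examples rather than tested recommendations:

  # Warn during deep scrub when a single object has more than this many
  # omap keys (recent releases default to 200000, if I remember right).
  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 100000

  # Warn when a single object's total omap data exceeds this many bytes
  # (example value: 256 MiB).
  ceph config set osd osd_deep_scrub_large_omap_object_value_size_threshold 268435456

On the RGW side, rgw_dynamic_resharding and rgw_max_objs_per_shard are what should normally keep a single bucket index object from growing to that size in the first place.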
Cheers, Dan

On Thu, Jun 13, 2019 at 7:36 PM Harald Staub <harald.st...@switch.ch> wrote:
>
> Looks fine (at least so far), thank you all!
>
> After having exported all 3 copies of the bad PG, we decided to try it
> in-place. We also set norebalance to make sure that no data is moved.
> When the PG was up, the resharding finished with a "success" message.
> The large omap warning is gone after deep-scrubbing the PG.
>
> Then we set the 3 OSDs to out. Soon after, one after the other was down
> (maybe for 2 minutes) and we got degraded PGs, but only once.
>
> Thank you!
> Harry
>
> On 13.06.19 16:14, Sage Weil wrote:
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> >> On 13.06.19 15:52, Sage Weil wrote:
> >>> On Thu, 13 Jun 2019, Harald Staub wrote:
> >> [...]
> >>> I think that increasing the various suicide timeout options will allow
> >>> it to stay up long enough to clean up the ginormous objects:
> >>>
> >>>    ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
> >>
> >> ok
> >>
> >>>> It looks healthy so far:
> >>>>    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> >>>>    fsck success
> >>>>
> >>>> Now we have to choose how to continue, trying to reduce the risk of
> >>>> losing data (most bucket indexes are intact currently). My guess would
> >>>> be to let this OSD (which was not the primary) go in and hope that it
> >>>> recovers. In case of a problem, maybe we could still use the other
> >>>> OSDs "somehow"? In case of success, we would bring back the other
> >>>> OSDs as well?
> >>>>
> >>>> OTOH we could try to continue with the key dump from earlier today.
> >>>
> >>> I would start all three OSDs the same way, with 'noout' set on the
> >>> cluster. You should try to avoid triggering recovery because it will
> >>> have a hard time getting through the big index object on that bucket
> >>> (i.e., it will take a long time, and might trigger some blocked IOs
> >>> and so forth).
> >>
> >> This I do not understand: how would I avoid recovery?
> >
> > Well, simply doing 'ceph osd set noout' is sufficient to avoid recovery,
> > I suppose. But in any case, getting at least 2 of the existing
> > copies/OSDs online (assuming your pool's min_size=2) will mean you can
> > finish the reshard process and clean up the big object without copying
> > the PG anywhere.
> >
> > I think you may as well do all 3 OSDs this way, then clean up the big
> > object -- that way in the end no data will have to move.
> >
> > This is Nautilus, right? If you scrub the PGs in question, that will
> > also now raise the health alert if there are any remaining big omap
> > objects... if that warning goes away, you'll know you're done cleaning
> > up. A final rocksdb compaction should then be enough to remove any
> > remaining weirdness from rocksdb's internal layout.
> >
> >>> (Side note: since you started the OSD read-write using the internal
> >>> copy of rocksdb, don't forget that the external copy you extracted
> >>> (/mnt/ceph/db?) is now stale!)
> >>
> >> As suggested by Paul Emmerich (see the next e-mail in this thread), I
> >> exported this PG. It did not take that long (20 minutes).
> >
> > Great :)
> >
> > sage
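For the archives, here is my rough reading of the sequence that worked here, condensed into commands. The OSD id NNN, the PG id PGID, and the export path are placeholders, not the actual values from this thread:

  # Keep the cluster from shuffling data while the affected OSDs are handled.
  ceph osd set noout
  ceph osd set norebalance

  # Give the affected OSDs enough headroom to work through the huge index object.
  ceph config set osd.NNN osd_op_thread_suicide_timeout 2h

  # With the OSD stopped: optionally export the PG first as a backup,
  # as Paul suggested.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NNN \
      --pgid PGID --op export --file /mnt/backup/PGID.export

  # Sanity-check the OSD's store before starting it again.
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-NNN fsck

  # Once the PG is up and the reshard has completed, confirm the large
  # omap warning clears, and compact away any leftover rocksdb cruft.
  ceph pg deep-scrub PGID
  ceph daemon osd.NNN compact       # run on the OSD's host

  # Finally, clear the flags again.
  ceph osd unset norebalance
  ceph osd unset noout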