Is this reproducible with crushtool?

  ceph osd getcrushmap -o crushmap
  crushtool -i crushmap --update-item XX 1.0 osd.XX \
      --loc host hostname-that-doesnt-exist-yet -o crushmap.modified
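If the --update-item step goes through cleanly, decompiling the result
should show the not-yet-existing host bucket. A quick way to check — just
a sketch, assuming the file names from the commands above:

  # decompile the modified map to text and look for the new host bucket
  crushtool -d crushmap.modified -o crushmap.modified.txt
  grep -B1 -A5 hostname-that-doesnt-exist-yet crushmap.modified.txt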
(Replace XX with the ID of the OSD you tried to add.)

Does it still happen if the crushmap is decompiled and recompiled
(crushtool -d and crushtool -c)?

Posting your (binary) crushmap would be helpful for debugging this
(see crushtool -d for what information this file contains).

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, Aug 23, 2019 at 1:10 PM Florian Haas <flor...@citynetwork.eu> wrote:
>
> Hi everyone,
>
> There are a couple of bug reports about this in Redmine, but only one
> (unanswered) mailing list message [1] that I could find. So I figured I'd
> raise the issue here again and copy the original reporters of the bugs
> (they are BCC'd, because in case they are no longer subscribed it
> wouldn't be appropriate to share their email addresses with the list).
>
> This is about https://tracker.ceph.com/issues/40029 and
> https://tracker.ceph.com/issues/39978 (the latter of which was recently
> closed as a duplicate of the former).
>
> In short, it appears that at least in luminous and mimic (I haven't
> tried nautilus yet), a new OSD can crash a mon while it tries to inject
> itself into the crush map under its host bucket, when that host bucket
> does not exist yet.
>
> What's worse: once the OSD's "ceph osd new" process has crashed the
> leader mon, a new leader is elected, and if "ceph osd new" is still
> running on the OSD node, it promptly connects to that mon and kills it
> too. This continues until enough mons have died for quorum to be lost.
>
> The recovery steps appear to involve
>
> - killing the "ceph osd new" process,
> - restarting mons until you regain quorum,
> - and then running "ceph osd purge" to drop the problematic OSD entry
>   from the crushmap and osdmap.
>
> The issue can apparently be worked around by adding the host buckets to
> the crushmap manually before adding the new OSDs, but surely this isn't
> intended to be a prerequisite, at least not to the point of mons
> crashing otherwise?
>
> I am also guessing that this is some weird corner case rooted in an
> unusual combination of contributing factors, because otherwise more
> people would have been bitten by this problem.
>
> Anyone able to share their thoughts on this one? Have more people run
> into this?
>
> Cheers,
> Florian
>
> [1]
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034880.html
> — interestingly, I could find this message in the pipermail archive but
> not in the one that my MUA keeps for me. So perhaps that message wasn't
> delivered to all subscribers, which might be why it has gone unanswered.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
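For completeness, a rough sketch of the workaround Florian describes
(pre-creating the host bucket before the new OSD registers itself); the
host name "newhost" and the default root here are illustrative
placeholders, not taken from the thread:

  # create the host bucket by hand and place it under the default root
  ceph osd crush add-bucket newhost host
  ceph osd crush move newhost root=default

and of the cleanup step from the recovery procedure, again with XX
standing in for the problematic OSD's ID:

  # drop the stuck OSD entry from the crushmap and osdmap
  ceph osd purge XX --yes-i-really-mean-it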