Problem Resolved:  Reason - NoDangClue

I had the broken monitor sitting there trying to join and failing, just
watching the debug log scroll.  I then stopped ceph-mon-01 and restarted it
in debug mode to watch its messages, and also to see whether the debug
output on ceph-mon-02 showed it reading them all.
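For the record, the bounce looked roughly like this (a sketch; the mon ID and unit name are assumed to follow the standard ceph-mon systemd layout):

```shell
# Stop the systemd-managed monitor so it releases its store and ports.
systemctl stop ceph-mon@ceph-mon-01

# Run it in the foreground with debug output to stderr and watch it rejoin.
ceph-mon -d --cluster ceph --id ceph-mon-01 --setuser ceph --setgroup ceph

# Once quorum looks healthy again, Ctrl-C the foreground process
# and hand it back to systemd.
systemctl start ceph-mon@ceph-mon-01
```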

Not only did ceph-mon-02 report the information promptly and correctly, it
ended up joining the cluster as well.    Using highly technical IT terms:
stuff got all wanged up, so I hit it on the side again to unstick it.  I
really wish I had a root cause on this one.   Now to open another email
chain regarding Ceph device / SMART monitoring and it killing managers.

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu



________________________________________
From: Eugen Block <ebl...@nde.ag>
Sent: Monday, November 9, 2020 11:22 AM
To: Paul Mezzanini
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Mon went down and won't come back

I thought it might be related to reported issues where the MONs were
specified with IP:PORT, but that can be ruled out.
Does the current monmap match your actual setup? You wrote the keys
are correct, but maybe there's still a keyring left in 'ceph auth ls'?
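One way to check both of those (a sketch; assumes an admin keyring is available on the node you run it from):

```shell
# Print the monmap the cluster is actually using; the names and
# addresses listed should match the intended mon hosts exactly.
ceph mon dump

# Or extract the monmap to a file and inspect it offline.
ceph mon getmap -o /tmp/monmap
monmaptool --print /tmp/monmap

# Look for stale monitor-related entries in the auth database
# (the shared monitor key is the 'mon.' entity).
ceph auth ls | grep -A3 '^mon\.'
```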


Zitat von Paul Mezzanini <pfm...@rit.edu>:

> Correct, just comma separated IP addresses
>
>
> CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
> intended only for the person(s) or entity to which it is addressed and may
> contain confidential and/or privileged material. Any review, retransmission,
> dissemination or other use of, or taking of any action in reliance upon this
> information by persons or entities other than the intended recipient is
> prohibited. If you received this in error, please contact the sender and
> destroy any copies of this information.
> ------------------------
>
> ________________________________________
> From: Eugen Block <ebl...@nde.ag>
> Sent: Friday, November 6, 2020 9:00 AM
> To: Paul Mezzanini
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Mon went down and won't come back
>
> So the mon_host line is without a port, correct, just the IP?
>
>
> Zitat von Paul Mezzanini <pfm...@rit.edu>:
>
>> Relevant ceph.conf file lines:
>> [global]
>> mon initial members = ceph-mon-01,ceph-mon-03
>> mon host = IPFor01,IPFor03
>> mon max pg per osd = 400
>> mon pg warn max object skew = -1
>>
>> [mon]
>> mon allow pool delete = true
>>
>>
>>
>> ceph config has:
>> global   advanced  mon_max_pg_per_osd                        400
>> global   advanced  mon_pg_warn_max_object_skew               -1.000000
>> global   dev       mon_warn_on_pool_pg_num_not_power_of_two  false
>> mon      advanced  mon_allow_pool_delete                     true
>>
>>
>> I'm slowly pulling it all into ceph config; I just haven't sat
>> down to verify it and deploy the stub config everywhere.   The
>> non-power-of-two warning is disabled because I'm slowly walking a
>> pool back to a lower PG count and I was sick of the health warn :)
>> (same with mon_max_pg_per_osd, but I'm well under 400 now, so I
>> could purge that line)
>>
>> Again, lightly sanitized.  The actual IPs do match forward and reverse DNS.
>>
>>
>> ________________________________________
>> From: Eugen Block <ebl...@nde.ag>
>> Sent: Friday, November 6, 2020 3:41 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: Mon went down and won't come back
>>
>> Hi,
>>
>> can you share your ceph.conf (mon section)?
>>
>>
>> Zitat von Paul Mezzanini <pfm...@rit.edu>:
>>
>>> Hi everyone,
>>>
>>> I figure it's time to pull in more brain power on this one.  We had
>>> an NVMe mostly die in one of our monitors, which caused the
>>> machine's write latency to spike.  Ceph did the RightThing(tm):
>>> when that monitor fell out of quorum it was ignored.  I pulled the
>>> bad drive out of the array and tried to bring the mon and mgr back
>>> in (our monitors double-duty as managers).
>>>
>>> The manager came up with zero problems, but the monitor got stuck probing.
>>>
>>> I removed the bad host from the monmap and stood up a new monitor
>>> on an OSD node to get back to three active.  That new node joined
>>> perfectly using the same methods I've tried on the old one.
>>>
>>> Network appears to be clean between all hosts.  Packet captures show
>>> them chatting just fine.  Since we are getting ready to upgrade from
>>> RHEL7 to RHEL8 I took this as an opportunity to reinstall the
>>> monitor as an 8 box to get that process rolling.  Box is now on
>>> RHEL8 with no changes to how ceph-mon is acting.
>>>
>>> I install machines with a kickstart and use our own ansible roles to
>>> get it 95% into service.  I then follow the manual install
>>> instructions
>>> (https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#adding-monitors).
>>>
>>> Time is in sync, /var/lib/ceph/mon/* is owned by the right UID, keys
>>> are in sync, configs are in sync.  I pulled the old mon out of "mon
>>> initial members" and "mon host".  `nc` can talk to all the ports in
>>> question and we've tried it with firewalld off as well (ditto with
>>> selinux).  Cleaned up some stale DNS and even tried a different IP
>>> (same DNS name). I started all of this with 14.2.12 but .13 was
>>> released while debugging so I've got that on the broken monitor at
>>> the moment.
>>>
>>> I manually start the daemon in debug mode (/usr/bin/ceph-mon -d
>>> --cluster ceph --id ceph-mon-02 --setuser ceph --setgroup ceph)
>>> until it's joined in then use the systemd scripts to start it once
>>> it's clean.  The current state is:
>>>
>>> (Lightly sanitized output)
>>> :snip:
>>> 2020-11-04 11:38:57.049 7f4232fb3540  0 mon.ceph-mon-02 does not
>>> exist in monmap, will attempt to join an existing cluster
>>> 2020-11-04 11:38:57.049 7f4232fb3540  0 using public_addr
>>> v2:Num.64:0/0 -> [v2:Num.64:3300/0,v1:Num.64:6789/0]
>>> 2020-11-04 11:38:57.050 7f4232fb3540  0 starting mon.ceph-mon-02
>>> rank -1 at public addrs [v2:Num.64:3300/0,v1:Num.64:6789/0] at bind
>>> addrs [v2:Num.64:3300/0,v1:Num.64:6789/0] mon_data
>>> /var/lib/ceph/mon/ceph-ceph-mon-02 fsid
>>> 8514c8d5-4cd3-4dee-b460-27633e3adb1a
>>> 2020-11-04 11:38:57.051 7f4232fb3540  1 mon.ceph-mon-02@-1(???) e25
>>> preinit fsid 8514c8d5-4cd3-4dee-b460-27633e3adb1a
>>> 2020-11-04 11:38:57.051 7f4232fb3540  1 mon.ceph-mon-02@-1(???) e25
>>> initial_members ceph-mon-01,ceph-mon-03, filtering seed monmap
>>> 2020-11-04 11:38:57.051 7f4232fb3540  0 mon.ceph-mon-02@-1(???).mds
>>> e430081 new map
>>> 2020-11-04 11:38:57.051 7f4232fb3540  0 mon.ceph-mon-02@-1(???).mds
>>> e430081 print_map
>>> :snip:
>>> 2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd
>>> e1198618 crush map has features 288514119978713088, adjusting msgr
>>> requires
>>> 2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd
>>> e1198618 crush map has features 288514119978713088, adjusting msgr
>>> requires
>>> 2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd
>>> e1198618 crush map has features 3314933069571702784, adjusting msgr
>>> requires
>>> 2020-11-04 11:38:57.053 7f4232fb3540  0 mon.ceph-mon-02@-1(???).osd
>>> e1198618 crush map has features 288514119978713088, adjusting msgr
>>> requires
>>> 2020-11-04 11:38:57.054 7f4232fb3540  1
>>> mon.ceph-mon-02@-1(???).paxosservice(auth 54141..54219) refresh
>>> upgraded, format 0 -> 3
>>> 2020-11-04 11:38:57.069 7f421d891700  1 mon.ceph-mon-02@-1(probing)
>>> e25 handle_auth_request failed to assign global_id
>>>  ^^^ last line repeated every few seconds until process killed
>>>
>>> I've exhausted everything I can think of so I've just been doing the
>>> scientific shotgun (one slug at a time) approach to see what
>>> changes.  Does anyone else have any ideas?
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>


