The monitors can't replace an MDS while they're busy trying to elect a leader
amongst themselves. Note the repeated monitor elections happening at the same
time as the MDS failure.

A monitor election is *often* transparent to clients, since they and the
OSDs only need the monitors when something changes in the cluster. But when
losing an MDS coincides with losing monitors, the clients can't get their
MDS requests served, and the monitors can't do a fast failover of the MDS
either, because they are busy trying to re-establish a quorum.
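
If it helps to see the interaction spelled out, here's a toy sketch (plain
Python, not actual monitor code; the 15s grace is just the mds_beacon_grace
default, and the daemon names are lifted from the log below): the leader only
performs the failover while a quorum exists, so an expired beacon grace
achieves nothing for as long as an election is in flight.

from dataclasses import dataclass

MDS_BEACON_GRACE = 15.0  # seconds; the default mds_beacon_grace

@dataclass
class MonQuorum:
    has_quorum: bool = True
    last_mds_beacon: float = 0.0           # the downed MDS never beacons again
    standby_available: bool = True
    active_mds: str = "mds.dub-sitv-ceph-02"

    def tick(self, now: float) -> None:
        laggy = (now - self.last_mds_beacon) > MDS_BEACON_GRACE
        if not laggy:
            return
        if not self.has_quorum:
            # Election in progress: no leader can commit a new MDSMap, so the
            # failover waits even though the beacon grace has expired.
            print(f"t={now:4.0f}s  MDS laggy, no quorum -> failover deferred")
        elif self.standby_available:
            self.active_mds = "mds.dub-sitv-ceph-01"
            self.standby_available = False
            print(f"t={now:4.0f}s  quorum back -> standby promoted to rank 0")

mons = MonQuorum()
for t in range(0, 40, 5):
    mons.has_quorum = not (15 <= t < 30)   # pretend an election runs from 15s to 30s
    mons.tick(float(t))

The real monitors obviously do far more than this, but the ordering
constraint is the same: no quorum, no new MDSMap.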

(There are also the time sync warnings going on; those may or may not be
causing issues here, but they certainly aren't helping anything!)
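
If you want to see at a glance how bad the skew is, something like this will
pull the warnings out of the cluster log (just a convenience sketch; the
"max 0.05s" in the message is the default mon_clock_drift_allowed):

import re

SKEW_RE = re.compile(r"clock skew (?P<skew>[\d.]+)s > max (?P<max>[\d.]+)s")

def skew_warnings(lines):
    # Yield (observed skew, allowed drift) for every clock-skew warning.
    for line in lines:
        m = SKEW_RE.search(line)
        if m:
            yield float(m["skew"]), float(m["max"])

log = ["... [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s"]
for skew, allowed in skew_warnings(log):
    print(f"skew {skew:.3f}s is {skew / allowed:.0f}x the allowed {allowed}s")
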
-Greg

On Fri, Sep 7, 2018 at 7:24 PM Bryan Henderson <bry...@giraffe-data.com>
wrote:

> > It's mds_beacon_grace.  Set that on the monitor to control the
> > replacement of laggy MDS daemons,
>
> Sounds like William's issue is something else.  William shuts down MDS 2
> and MON 4 simultaneously.  The log shows that some time later (we don't
> know how long), MON 3 detects that MDS 2 is gone ("MDS_ALL_DOWN"), but
> does nothing about it until 30 seconds later, which happens to be when
> MDS 2 and MON 4 come back.  At that point, MON 3 reports that the rank
> has been reassigned to MDS 1.
>
> 'mds_beacon_grace' determines when a monitor declares MDS_ALL_DOWN, right?
>
> I think if things are working as designed, the log should show MON 3
> reassigning the rank to MDS 1 immediately after it reports MDS 2 is gone.
>
>
> From the original post:
>
> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 55 : cluster [ERR] Health check failed: 1 filesystem is offline
> (MDS_ALL_DOWN)
> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0
> 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
> dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum
> dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down;
> 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs
> inactive, 115 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504
> objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg
> inactive, 69 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
> availability: 1 pg inactive, 69 pgs peering)
> 2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0
> 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
> 2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0
> 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future,
> clocks not synchronized
> 2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0
> 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons
> dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
> 2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down,
> quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
> 2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max
> 0.05s
> 2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down;
> Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs
> degraded
> 2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
> 2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 85 : cluster [WRN] Health check failed: 1 filesystem is degraded
> (FS_DEGRADED)
> 2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 86 : cluster [ERR] Health check failed: 1 filesystem is offline
> (MDS_ALL_DOWN)
> 2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to
> filesystem cephfs as rank 0
> 2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0
> 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is
> offline)
>
>
>
> --
> Bryan Henderson                                   San Jose, California
>
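
To put a number on the delay Bryan describes: it can be read straight off the
quoted timestamps (throwaway Python, nothing clever):

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# First MDS_ALL_DOWN and the standby assignment, from the log above.
mds_all_down = datetime.strptime("2018-08-25 03:30:02.936554", FMT)
standby_assigned = datetime.strptime("2018-08-25 03:30:35.282139", FMT)

gap = (standby_assigned - mds_all_down).total_seconds()
print(f"rank 0 reassigned {gap:.1f}s after MDS_ALL_DOWN")   # ~32.3s

So roughly 32 seconds from the first MDS_ALL_DOWN to the standby taking
rank 0, which matches Bryan's reading of the log.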
