Thanks for the response, Greg. We did originally have co-located MDS and MON daemons, but
realised early on that this wasn't a good idea and separated them onto different hosts. Our
MDS daemons now run on ceph-01 and ceph-02, and our MONs on ceph-03, 04 and 05.
Unfortunately we still see this issue when we reboot ceph-02 (MDS) and ceph-04 (MON)
together: we expect ceph-01 to become the active MDS, but often it doesn't.

On 30 Aug 2018, at 17:46, Gregory Farnum <gfar...@redhat.com> wrote:

Yes, this is a consequence of co-locating the MDS and monitors — if the MDS 
reports to its co-located monitor and both fail, the monitor cluster has to go 
through its own failure detection and then wait for a full MDS timeout period 
after that before it marks the MDS down. :(
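
For reference, the timeout in question is the MDS beacon grace, on top of the monitors' own
election/lease timing. The values in play can be checked on a running daemon with something
like the following (daemon names are taken from this thread purely as examples):

    ceph daemon mds.dub-sitv-ceph-01 config show | grep -E 'mds_beacon_(interval|grace)'   # on the MDS host
    ceph daemon mon.dub-sitv-ceph-03 config show | grep mon_lease                          # on a mon host

As described above, the mon cannot start counting down mds_beacon_grace for the dead MDS
until the surviving monitors have finished their own failure detection and re-formed quorum,
which is where the extra delay comes from.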

We might conceivably be able to optimize for this, but there's no general solution. If you
need to co-locate, one thing that would make it better without being a lot of work is to
have the MDS connect to one of the monitors on a different host. You can do that by
restricting the list of monitors you feed it in ceph.conf, although that doesn't guarantee
it won't end up connecting to its own monitor if there are failures or reconnects after
first startup.
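
For a co-located deployment, that suggestion boils down to a ceph.conf fragment along these
lines. This is only a sketch: the daemon name is hypothetical, the addresses are two of the
monitors that appear later in this thread, and it assumes the usual per-section override
behaviour applies to mon_host:

    [mds.colo-node]
        # Hypothetical MDS that shares its host with a mon. List only the
        # monitors on OTHER hosts so the initial session is opened to a mon
        # that won't reboot together with this MDS.
        mon_host = 10.18.53.32:6789, 10.18.186.208:6789

As noted above, this only influences which monitor the MDS contacts first; it is not a
guarantee after later failures or reconnects.
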
-Greg

On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.law...@irdeto.com> wrote:
Hi.

We have a 5-node Ceph cluster (see the ceph -s output at the bottom of this email). During
resiliency tests we have an occasional problem when we reboot the active MDS instance and a
MON instance together, i.e. dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to fail
over to the standby instance dub-sitv-ceph-01, which is in standby-replay mode, and 80% of
the time it does so with no problems. However, 20% of the time it doesn't, and the
MDS_ALL_DOWN health check is not cleared until 30 seconds later, when the rebooted
dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.
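
The failover (or lack of it) can be watched live from an admin node during the test with
something like the command below; the filesystem name matches the status output at the
bottom of this mail:

    # poll the MDS map and MDS-related health once a second during the reboot test
    watch -n 1 'ceph fs status cephfs; ceph health detail | grep -i mds'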

When the MDS successfully fails over to the standby, we see the following in ceph.log:

2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

When the active MDS role does not fail over to the standby, the MDS_ALL_DOWN check is not
cleared until after the rebooted instances have come back up, e.g.:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

In the MDS log we've noticed that when the issue occurs, at precisely the time the active
MDS and MON nodes are rebooted, the standby MDS instance briefly stops logging "replay_done
(as standby)". This is shown in the log excerpt below, where there is a 9-second gap in
these messages.

2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)

I've tried to reproduce the issue by rebooting each MDS instance in turn repeatedly, 5
minutes apart, but so far haven't been able to, so my assumption is that rebooting an MDS
and a MON instance at the same time is a significant factor.
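
For reference, that reproduction attempt amounts to a loop like the one below (a sketch
only: passwordless ssh/sudo is assumed, and the loop is run repeatedly):

    # reboot each MDS host in turn, 5 minutes apart
    for host in dub-sitv-ceph-01 dub-sitv-ceph-02; do
        ssh "$host" sudo reboot
        sleep 300   # 5 minutes between reboots
    done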

Our mds_standby* configuration is set as follows:

    "mon_force_standby_active": "true",
    "mds_standby_for_fscid": "-1",
    "mds_standby_for_name": "",
    "mds_standby_for_rank": "0",
    "mds_standby_replay": "true",

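For reference, the values above can be dumped from a running MDS via its admin socket with
something like this (the daemon name is just an example; config show lists every option,
including the mon_* ones):

    ceph daemon mds.dub-sitv-ceph-01 config show | grep standby
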
The cluster status is as follows:

  cluster:
    id:     f774b9b2-d514-40d9-85ab-d0389724b6c0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05
    mgr: dub-sitv-ceph-04(active), standbys: dub-sitv-ceph-03, dub-sitv-ceph-05
    mds: cephfs-1/1/1 up  {0=dub-sitv-ceph-02=up:active}, 1 up:standby-replay
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   2 pools, 200 pgs
    objects: 554  objects, 980 MiB
    usage:   7.9 GiB used, 1.9 TiB / 2.0 TiB avail
    pgs:     200 active+clean

  io:
    client:   1.5 MiB/s rd, 810 KiB/s wr, 286 op/s rd, 218 op/s wr

Hope someone can help!
William Lawton


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com