Nothing too crazy in the non-default settings. Some of those OSD settings were in place while I was testing recovery speeds and need to be brought back closer to defaults. I was setting nodown before, but it seems to mask the problem. While it's good for stopping the osdmap churn, OSDs would come up, get marked up, but at some point go down again (even though the process was still running) while staying up in the map. Then when I'd unset nodown the cluster would immediately mark 250+ OSDs down again and I'd be back where I started.
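
Just for completeness, the flag toggling itself was nothing special, the usual:

ceph osd set nodown      # hold off down-markings while OSDs restart
ceph osd unset nodown    # this is the point where 250+ OSDs get marked down again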

This morning I went ahead and finished the OSD upgrades to mimic to remove that variable. I've looked for networking problems but haven't found any. Two of the mons are on the same switch. I've also tried shutting down different mons to see if a single one was the problem, but they keep calling elections no matter which mix of them is up. Part of it feels like a networking problem, but I haven't been able to find a culprit yet, as everything was working normally before starting the upgrade. Other than the constant mon elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it doesn't last long since at some point the OSDs start trying to fail each other through their heartbeats:

2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] : osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 10.1.9.13:6801/275806 is reporting failure:1
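
A quick way to see whether these failure reports cluster around particular OSDs or hosts is to tally the reporters out of the lead mon's log, roughly like this (the log path and the awk field handling are just assumptions based on the lines above):

# count which OSDs are filing the most failure reports (reporter follows "from")
grep 'is reporting failure:1' /var/log/ceph/ceph-mon.sephmon1.log \
    | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i+1)}' \
    | sort | uniq -c | sort -rn | head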

I'm working on getting things mostly good again with everything on mimic and will see if it behaves better.
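
Once everything is reporting mimic, the plan is to pin the minimum OSD release as David suggested, i.e.:

ceph versions                        # confirm every daemon shows a 13.2.x (mimic) build
ceph osd require-osd-release mimic   # only once all OSDs are actually on mimic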

Thanks for your input on this David.


[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280

[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 200000
mds cache memory limit = 4294967296

[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60
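
To double check what the running daemons actually picked up (rather than trusting ceph.conf), spot checks through the admin socket work from the relevant host; osd.0 below is just an example ID:

ceph daemon mon.sephmon1 config get rocksdb_cache_size    # run on sephmon1
ceph daemon osd.0 config get osd_heartbeat_grace          # run on the host carrying osd.0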


On 09/08/2018 11:24 PM, David Turner wrote:
What osd/mon/etc config settings do you have that are not default? It might be worth utilizing nodown to stop osds from marking each other down and finish the upgrade to be able to set the minimum osd version to mimic. Stop the osds in a node, manually mark them down, start them back up in mimic. Depending on how bad things are, setting pause on the cluster to just finish the upgrade faster might not be a bad idea either.

This should be a simple question, have you confirmed that there are no networking problems between the MONs while the elections are happening?

On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:

    Hey Sage,

    I've posted the file with my email address for the user. It is
    with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The
    mons are calling for elections about every minute so I let this
    run for a few elections and saw this node become the leader a
    couple times. Debug logs start around 23:27:30. I had managed to
    get about 850/857 osds up, but it seems that within the last 30
    min it has all gone bad again due to the OSDs reporting each other
    as failed. We relaxed the osd_heartbeat_interval to 30 and
    osd_heartbeat_grace to 60 in an attempt to slow down how quickly
    OSDs are trying to fail each other. I'll put in the
    rocksdb_cache_size setting.
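
    (For anyone following along, heartbeat values like that can also be
    pushed out at runtime instead of waiting on restarts, something like:

    ceph tell osd.* injectargs '--osd_heartbeat_interval 30 --osd_heartbeat_grace 60'

    though rocksdb_cache_size on the mons still needs the restart Sage
    mentioned.)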

    Thanks for taking a look.

    Kevin

    On 09/08/2018 06:04 PM, Sage Weil wrote:
    Hi Kevin,

    I can't think of any major luminous->mimic changes off the top of my head
    that would impact CPU usage, but it's always possible there is something
    subtle.  Can you ceph-post-file the full log from one of your mons
    (preferably the leader)?

    You might try adjusting the rocksdb cache size.. try setting

      rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

    on the mons and restarting?

    Thanks!
    sage

    On Sat, 8 Sep 2018, Kevin Hrpcek wrote:

    Hello,

    I've had a Luminous -> Mimic upgrade go very poorly and my cluster is
    stuck with almost all pgs down. One problem is that the mons have started
    to re-elect a new quorum leader almost every minute. This is making it
    difficult to monitor the cluster and even run any commands on it, since
    at least half the time a ceph command times out or takes over a minute to
    return results. I've looked at the debug logs and it appears there is
    some timeout of about a minute occurring with paxos. The msg_dispatch
    thread of the mons is often running a core at 100% for about a minute
    (user time, no iowait). Running strace on it shows the process is going
    through all of the mon db files (about 6 GB in store.db/*.sst). Does
    anyone have an idea of what this timeout is or why my mons are always
    re-electing? One theory I have is that msg_dispatch can't process the
    SSTs fast enough, hits some timeout for a health check, and the mon drops
    itself from the quorum since it thinks it isn't healthy. I've been
    thinking of introducing a new mon to the cluster on hardware with a
    better CPU to see if that can process the SSTs within this timeout.
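
    In case anyone wants to see the same thing, the hot thread is easy to
    catch with something along these lines (the thread id is a placeholder):

    top -H -p $(pidof ceph-mon)      # msg_dispatch sitting at ~100% user time
    strace -p <tid-of-msg_dispatch>  # shows it walking the store.db/*.sst files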

    My cluster has the mons, mds, and mgr plus 30/41 osd servers on mimic,
    and 11/41 osd servers on luminous. The original problem started when I
    restarted the osds on one of the hosts. The cluster reacted poorly to
    them going down and went into a frenzy of taking down other osds and
    remapping. I eventually got that stable and the PGs were 90% good with
    the finish line in sight, and then the mons started their issue of
    re-electing every minute. Now I can't keep any decent amount of PGs up
    for more than a few hours. This started on Wednesday...

    Any help would be greatly appreciated.

    Thanks,
    Kevin

    --Debug snippet from a mon at reelection time
    2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242 maybe_resize_cluster in 1 max 1
    2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mds e14242 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s seconds
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to 13742 would only trim 238 < paxos_service_trim_min 250
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 auth
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 check_rotate updated rotating
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657 encode_pending v 120658
    2018-09-07 20:08:08.655 7f57b92cd700  5 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) trigger_propose not active, will propose later
    2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mgr e2234 tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8844s seconds
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734 would only trim 221 < paxos_service_trim_min 250
    2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
    2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health check_member_health
    2018-09-07 20:08:08.657 7f57bcdd0700  1 -- 10.1.9.202:6789/0 >> - conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=447 -
    2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
    2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health check_leader_health
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim trim_to 1715 would only trim 181 < paxos_service_trim_min 250
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
    2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 sync_trim_providers
    2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
    2018-09-07 20:08:08.662 7f57b92cd700  1 mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) accept timeout, calling fresh election
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 sync_reset_requester
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 unregister_cluster_logger
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 cancel_probe_timeout (none scheduled)
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 _reset
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 cancel_probe_timeout (none scheduled)
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 timecheck_finish
    2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_tick_stop
    2018-09-07 20:08:08.662 7f57b92cd700 15 mon.sephmon2@1(probing) e17 health_interval_stop
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_event_cancel
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing) e17 scrub_reset
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxos(paxos updating c 132917556..132918214) restart -- canceling timeouts
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(mdsmap 13504..14242) restart
    2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(probing).paxosservice(osdmap 917471..987503) restart





_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
