Hi David, thanks for your feedback.
With that in mind, I did rm a 15 TB RBD pool about an hour or so before this
happened. I wouldn't have thought it was related, because nothing unusual was
going on after I removed it -- not even high system load. But considering
what you said, I think it could have been due to OSD operations related to
that pool removal.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Aug 9, 2017 at 10:15 AM, David Turner <drakonst...@gmail.com> wrote:

> I just want to point out that there are many different types of network
> issues that don't involve entire networks: a bad NIC, a bad/loose cable, a
> service on a server restarting or modifying the network stack, etc.
>
> That said, there are other things that can prevent an MDS service, or any
> service, from responding to the mons and being wrongly marked down. It
> happens to OSDs often enough that they even have the ability to note in
> their logs that they were wrongly marked down. That usually happens when
> the service is so busy with an operation that it can't get to the request
> from the mon fast enough, so it gets marked down. This could also be
> environmental on the MDS server: if something else on the host is using
> too many resources, preventing the MDS service from having what it needs,
> this could easily happen.
>
> What level of granularity do you have in your monitoring to tell what your
> system state was when this happened? Is there a time of day when it is
> more likely to happen (expect to find a cron job at that time)?
>
> On Wed, Aug 9, 2017, 8:37 AM Webert de Souza Lima <webert.b...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I recently had an MDS outage because the MDS suicided due to "dne in the
>> mds map".
>> I've asked about this here before, and I know it happens because the
>> monitors removed this MDS from the mds map even though it was alive.
>>
>> The weird thing is that there were no network-related issues at the
>> time; if there had been, they would have impacted many other systems.
>>
>> I found this in the mon logs, and I'd like to understand it better:
>> lease_timeout -- calling new election
>>
>> Full logs:
>>
>> 2017-08-08 23:12:33.286908 7f2b8398d700 1 leveldb: Manual compaction at
>> level-1 from 'pgmap_pg\x009.a' @ 1830392430 : 1 .. 'paxos\x0057687834'
>> @ 0 : 0; will stop at (end)
>>
>> 2017-08-08 23:12:36.885087 7f2b86f9a700 0
>> mon.bhs1-mail02-ds03@2(peon).data_health(3524)
>> update_stats avail 81% total 19555 MB, used 2632 MB, avail 15907 MB
>> 2017-08-08 23:13:29.357625 7f2b86f9a700 1
>> mon.bhs1-mail02-ds03@2(peon).paxos(paxos
>> updating c 57687834..57688383) lease_timeout -- calling new election
>> 2017-08-08 23:13:29.358965 7f2b86799700 0 log_channel(cluster) log [INF]
>> : mon.bhs1-mail02-ds03 calling new monitor election
>> 2017-08-08 23:13:29.359128 7f2b86799700 1
>> mon.bhs1-mail02-ds03@2(electing).elector(3524)
>> init, last seen epoch 3524
>> 2017-08-08 23:13:35.383530 7f2b86799700 1 mon.bhs1-mail02-ds03@2(peon).osd
>> e12617 e12617: 19 osds: 19 up, 19 in
>> 2017-08-08 23:13:35.605839 7f2b86799700 0 mon.bhs1-mail02-ds03@2(peon).mds
>> e18460 print_map
>> e18460
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
>> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>>
>> Filesystem 'cephfs' (2)
>> fs_name cephfs
>> epoch 18460
>> flags 0
>> created 2016-08-01 11:07:47.592124
>> modified 2017-07-03 10:32:44.426431
>> tableserver 0
>> root 0
>> session_timeout 60
>> session_autoclose 300
>> max_file_size 1099511627776
>> last_failure 0
>> last_failure_osd_epoch 12617
>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
>> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>> max_mds 1
>> in 0
>> up {0=1574278}
>> failed
>> damaged
>> stopped
>> data_pools 8,9
>> metadata_pool 7
>> inline_data disabled
>> 1574278: 10.0.2.4:6800/2556733458 'd' mds.0.18460 up:replay seq 1
>> laggy since 2017-08-08 23:13:35.174109 (standby for rank 0)
>>
>> 2017-08-08 23:13:35.606303 7f2b86799700 0 log_channel(cluster) log [INF]
>> : mon.bhs1-mail02-ds03 calling new monitor election
>> 2017-08-08 23:13:35.606361 7f2b86799700 1
>> mon.bhs1-mail02-ds03@2(electing).elector(3526)
>> init, last seen epoch 3526
>> 2017-08-08 23:13:36.885540 7f2b86f9a700 0
>> mon.bhs1-mail02-ds03@2(peon).data_health(3528)
>> update_stats avail 81% total 19555 MB, used 2636 MB, avail 15903 MB
>> 2017-08-08 23:13:38.311777 7f2b86799700 0 mon.bhs1-mail02-ds03@2(peon).mds
>> e18461 print_map
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
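For reference, the two timers that seem to be at play here are the mon lease
(a peon calls a new election when the lease from the leader expires before
being renewed -- that's the "lease_timeout -- calling new election" line) and
the MDS beacon grace (the mons fail an MDS whose beacons stop arriving in
time). A quick way to see what a mon is actually running with is the admin
socket; the defaults in the comments are the documented ones, worth verifying
for your release, and osd.0 below is just an example id:

    # Current values on this mon (documented defaults: mon_lease = 5s,
    # mds_beacon_interval = 4s, mds_beacon_grace = 15s):
    ceph daemon mon.bhs1-mail02-ds03 config get mon_lease
    ceph daemon mon.bhs1-mail02-ds03 config get mds_beacon_grace

    # If the OSDs were still grinding through the removed pool's objects,
    # long-running ops around 23:13 would show up per OSD:
    ceph daemon osd.0 dump_ops_in_flight

If the pool removal does turn out to be the trigger, raising mds_beacon_grace
on the mons is a common way to ride out short stalls without the MDS being
failed.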
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com