I really need some help with this.

This is happening very frequently and I can't seem to figure out why.
My services rely on cephfs and when this happens, the mds suicides.

It's always the same; see the logs from the last occurrence:

host bhs1-mail03-ds03:

2017-08-19 06:35:54.072809 7f44c60f1700  1
mon.bhs1-mail03-ds03@2(peon).paxos(paxos
updating c 8317388..8318114) lease_timeout -- calling new election
2017-08-19 06:35:54.073777 7f44c58f0700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds03 calling new monitor election

host bhs1-mail03-ds02:

2017-08-19 06:35:54.163225 7fa963d5d700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds02 calling new monitor election
2017-08-19 06:35:54.163373 7fa963d5d700  1
mon.bhs1-mail03-ds02@1(electing).elector(59)
init, last seen epoch 59
2017-08-19 06:35:56.066938 7fa96455e700  0
mon.bhs1-mail03-ds02@1(electing).data_health(58)
update_stats avail 76% total 19555 MB, used 3553 MB, avail 14986 MB
2017-08-19 06:35:59.655130 7fa96455e700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds02@1 won leader election with quorum 1,2
2017-08-19 06:36:00.087081 7fa96455e700  0 log_channel(cluster) log [INF] :
HEALTH_WARN; 1 mons down, quorum 1,2 bhs1-mail03-ds02,bhs1-mail03-ds03
2017-08-19 06:36:00.456332 7fa965f7f700  0 log_channel(cluster) log [INF] :
monmap e2: 3 mons at {bhs1-mail03-ds01=
10.0.0.101:6789/0,bhs1-mail03-ds02=10.0.0.102:6789/0,bhs1-mail03-ds03=10.0.0.103:6789/0
}
2017-08-19 06:36:00.456511 7fa965f7f700  0 log_channel(cluster) log [INF] :
pgmap v4088049: 1664 pgs: 1 active+clean+scrubbing+deep, 1663 active+clean;
6840 GB data, 20295 GB used, 62641 GB / 82937 GB avail; 1019
 kB/s rd, 142 kB/s wr, 24 op/s
2017-08-19 06:36:00.456588 7fa965f7f700  0 log_channel(cluster) log [INF] :
fsmap e1366: 1/1/1 up {0=bhs1-mail03-ds02=up:active}, 1 up:standby-replay


host bhs1-mail03-ds01:

2017-08-19 06:36:09.995951 7f94e861c700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds01 calling new monitor election
2017-08-19 06:36:09.996005 7f94e861c700  1
mon.bhs1-mail03-ds01@0(electing).elector(59)
init, last seen epoch 59
2017-08-19 06:36:17.653441 7f94e861c700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds01 calling new monitor election
2017-08-19 06:36:17.653505 7f94e861c700  1
mon.bhs1-mail03-ds01@0(electing).elector(61)
init, last seen epoch 61
2017-08-19 06:36:27.603721 7f94e861c700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds01@0 won leader election with quorum 0,1,2
2017-08-19 06:36:28.579178 7f94e861c700  0 log_channel(cluster) log [INF] :
HEALTH_OK

host bhs1-mail03-ds02 (the mds host):

2017-08-19 06:36:06.858100 7f9295a76700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.103:0/3970353996 pipe(0x55b450c7c000 sd=19 :6801 s=2 pgs=5396 cs=1
l=0 c=0x55b4526de000).fault with nothing to send, going to standby
2017-08-19 06:36:06.936091 7f9295874700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.110:0/3302883508 pipe(0x55b450c7e800 sd=20 :6801 s=2 pgs=38167 cs=1
l=0 c=0x55b4526de300).fault with nothing to send, going to standby
2017-08-19 06:36:06.936122 7f9294e6a700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.101:0/2088724837 pipe(0x55b45cef0800 sd=22 :6801 s=2 pgs=360 cs=1
l=0 c=0x55b45daf3b00).fault with nothing to send, going to standby
2017-08-19 06:36:06.936169 7f9295672700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.111:0/2802182284 pipe(0x55b450c7d400 sd=21 :6801 s=2 pgs=30804 cs=1
l=0 c=0x55b4526de180).fault with nothing to send, going to standby
2017-08-19 06:36:11.412799 7f929c788700  1 mds.bhs1-mail03-ds02
handle_mds_map i (10.0.0.102:6801/267422746) dne in the mdsmap, respawning
myself
2017-08-19 06:36:11.412808 7f929c788700  1 mds.bhs1-mail03-ds02 respawn
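
In case it helps to see the timeline, here is a minimal sketch of one way to
pull the relevant lines out of the mon and mds logs and print them in time
order (the log paths and the grep patterns are just assumptions based on the
lines above, nothing official):

#!/usr/bin/env python
# Rough sketch: collect mon lease_timeout / election events and mds
# "dne in the mdsmap" / respawn events, then print them sorted by time,
# to see how long the monitors were electing before the mds was dropped.
# Assumes default log paths under /var/log/ceph/ and the standard
# "YYYY-MM-DD HH:MM:SS.ffffff" timestamp at the start of each line.
# (Run it per host, or on copies of the logs gathered in one place.)
import glob
import re

PATTERNS = (
    "lease_timeout",
    "calling new monitor election",
    "won leader election",
    "dne in the mdsmap",
    "respawn",
)
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")

events = []
for path in glob.glob("/var/log/ceph/ceph-m*.log"):  # mon and mds logs
    with open(path) as fh:
        for line in fh:
            m = TS_RE.match(line)
            if m and any(p in line for p in PATTERNS):
                events.append((m.group(1), path, line.strip()))

for ts, path, line in sorted(events):
    print("%s  %s  %s" % (ts, path, line))

In the occurrence above, that gives roughly 33 seconds between the first
lease_timeout (06:35:54) and the monitors settling back into full quorum
(06:36:27), with the mds getting dropped from the mdsmap in between.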




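One thing I also want to double-check before changing anything is whether the
lease/beacon timings are still at their defaults; from what I understand,
mon_lease on the monitors and mds_beacon_interval / mds_beacon_grace on the
mds are the options involved (please correct me if that's wrong). A quick
sketch using the admin socket, with the daemon names from my cluster:

#!/usr/bin/env python
# Dump the timing options that seem relevant here, via the admin socket.
# Assumes it is run on the host where the daemon lives and that the
# daemon names below match the cluster (they are from my setup).
import json
import subprocess

def show(daemon, keys):
    out = subprocess.check_output(["ceph", "daemon", daemon, "config", "show"])
    cfg = json.loads(out.decode("utf-8"))
    for key in keys:
        print("%s %s = %s" % (daemon, key, cfg.get(key)))

show("mon.bhs1-mail03-ds02", ["mon_lease"])
show("mds.bhs1-mail03-ds02", ["mds_beacon_interval", "mds_beacon_grace"])
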
Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Aug 9, 2017 at 10:53 AM, Webert de Souza Lima <webert.b...@gmail.com> wrote:

> Hi David,
>
> thanks for your feedback.
>
> With that in mind: I did rm a 15TB RBD pool about an hour or so before
> this happened.
> I wouldn't have thought it was related, because there was nothing different
> going on after I removed it, not even high system load.
>
> But considering what you said, I think it could have been due to OSD
> operations related to that pool removal.
>
>
>
>
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
> On Wed, Aug 9, 2017 at 10:15 AM, David Turner <drakonst...@gmail.com>
> wrote:
>
>> I just want to point out that there are many different types of network
>> issues that don't involve entire networks: a bad nic, a bad/loose cable, a
>> service on a server restarting or modifying the network stack, etc.
>>
>> That said, there are other things that can prevent an mds service, or any
>> service, from responding to the mons and being wrongly marked down. It
>> happens to osds often enough that they even have the ability to write in
>> their logs that they were wrongly marked down. That usually happens when
>> the service is so busy with an operation that it can't get to the request
>> from the mon fast enough and it gets marked down. This could also be
>> environmental on the mds server. If something else on the host is using
>> too many resources, preventing the mds service from having what it needs,
>> this could easily happen.
>>
>> What level of granularity do you have in your monitoring to tell what
>> your system state was when this happened? Is there a time of day when it
>> is more likely to happen (if so, expect to find a cron job at that time)?
>>
>> On Wed, Aug 9, 2017, 8:37 AM Webert de Souza Lima <webert.b...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I recently had an mds outage because the mds suicided due to "dne in the
>>> mds map".
>>> I've asked about it here before, and I know it happens because the
>>> monitors removed this mds from the mds map even though it was alive.
>>>
>>> The weird thing is that there were no network-related issues happening at
>>> the time; if there had been, they would have impacted many other systems.
>>>
>>> I found this in the mon logs, and I'd like to understand it better:
>>>  lease_timeout -- calling new election
>>>
>>> full logs:
>>>
>>> 2017-08-08 23:12:33.286908 7f2b8398d700  1 leveldb: Manual compaction at
>>> level-1 from 'pgmap_pg\x009.a' @ 1830392430 : 1 .. 'paxos\x0057687834' @ 0
>>> : 0; will stop at (end)
>>>
>>> 2017-08-08 23:12:36.885087 7f2b86f9a700  0 mon.bhs1-mail02-ds03@2
>>> (peon).data_health(3524) update_stats avail 81% total 19555 MB, used
>>> 2632 MB, avail 15907 MB
>>> 2017-08-08 23:13:29.357625 7f2b86f9a700  1 mon.bhs1-mail02-ds03@2
>>> (peon).paxos(paxos updating c 57687834..57688383) lease_timeout --
>>> calling new election
>>> 2017-08-08 23:13:29.358965 7f2b86799700  0 log_channel(cluster) log
>>> [INF] : mon.bhs1-mail02-ds03 calling new monitor election
>>> 2017-08-08 23:13:29.359128 7f2b86799700  1 mon.bhs1-mail02-ds03@2
>>> (electing).elector(3524) init, last seen epoch 3524
>>> 2017-08-08 23:13:35.383530 7f2b86799700  1 mon.bhs1-mail02-ds03@2
>>> (peon).osd e12617 e12617: 19 osds: 19 up, 19 in
>>> 2017-08-08 23:13:35.605839 7f2b86799700  0 mon.bhs1-mail02-ds03@2
>>> (peon).mds e18460 print_map
>>> e18460
>>> enable_multiple, ever_enabled_multiple: 0,0
>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
>>> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>>>
>>> Filesystem 'cephfs' (2)
>>> fs_name cephfs
>>> epoch   18460
>>> flags   0
>>> created 2016-08-01 11:07:47.592124
>>> modified        2017-07-03 10:32:44.426431
>>> tableserver     0
>>> root    0
>>> session_timeout 60
>>> session_autoclose       300
>>> max_file_size   1099511627776
>>> last_failure    0
>>> last_failure_osd_epoch  12617
>>> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
>>> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>>> max_mds 1
>>> in      0
>>> up      {0=1574278}
>>> failed
>>> damaged
>>> stopped
>>> data_pools      8,9
>>> metadata_pool   7
>>> inline_data     disabled
>>> 1574278:        10.0.2.4:6800/2556733458 'd' mds.0.18460 up:replay seq
>>> 1 laggy since 2017-08-08 23:13:35.174109 (standby for rank 0)
>>>
>>>
>>>
>>> 2017-08-08 23:13:35.606303 7f2b86799700  0 log_channel(cluster) log
>>> [INF] : mon.bhs1-mail02-ds03 calling new monitor election
>>> 2017-08-08 23:13:35.606361 7f2b86799700  1 mon.bhs1-mail02-ds03@2
>>> (electing).elector(3526) init, last seen epoch 3526
>>> 2017-08-08 23:13:36.885540 7f2b86f9a700  0 mon.bhs1-mail02-ds03@2
>>> (peon).data_health(3528) update_stats avail 81% total 19555 MB, used
>>> 2636 MB, avail 15903 MB
>>> 2017-08-08 23:13:38.311777 7f2b86799700  0 mon.bhs1-mail02-ds03@2
>>> (peon).mds e18461 print_map
>>>
>>>
>>> Regards,
>>>
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> *Belo Horizonte - Brasil*
>>>
>>
>
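
Following up on David's question above about monitoring granularity: I don't
have per-minute state history on these hosts yet, so the plan is to run
something small like the sketch below from cron until the next occurrence,
just to capture load and cluster health over time (the paths and details are
only an example):

#!/usr/bin/env python
# Tiny state sampler meant to be run from cron every minute, so the next
# time the mds gets dropped there is a record of what the host and the
# cluster looked like. Assumes "ceph health" works for the user running it.
import subprocess
import time

LOG = "/var/log/ceph-state-sampler.log"  # example path

def run(args):
    try:
        return subprocess.check_output(args).decode("utf-8", "replace").strip()
    except Exception as exc:
        return "ERROR: %s" % exc

with open(LOG, "a") as fh:
    now = time.strftime("%Y-%m-%d %H:%M:%S")
    loadavg = open("/proc/loadavg").read().strip()
    health = run(["ceph", "health"])
    fh.write("%s loadavg=%s health=%s\n" % (now, loadavg, health))

with a crontab entry along the lines of:

* * * * * /usr/bin/python /usr/local/bin/ceph_state_sampler.py

(the script path there is just an example).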
