Hi David,

Thanks for your feedback.

With that in mind: I did remove a 15TB RBD pool about an hour before this
happened.
I wouldn't have thought it was related, because nothing unusual was going on
after I removed it, not even high system load.

But considering what you said, it could well have been OSD operations
related to that pool removal.
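
For reference, here's a rough sketch of how I plan to line the mon log
events up against the time of the pool removal. This isn't something I've
run yet; the log path and the removal timestamp below are assumptions for
illustration:

#!/usr/bin/env python3
# Scan a mon log for lease_timeout / election events and print how long
# after the pool removal each one happened. Path and timestamp are
# placeholders, not actual values from my cluster.
import re
from datetime import datetime

MON_LOG = "/var/log/ceph/ceph-mon.bhs1-mail02-ds03.log"  # assumed path
POOL_RM_AT = datetime(2017, 8, 8, 22, 10)  # approx. time of the pool rm

PATTERNS = ("lease_timeout", "calling new monitor election")

with open(MON_LOG) as f:
    for line in f:
        if any(p in line for p in PATTERNS):
            # mon log lines start with "YYYY-MM-DD HH:MM:SS.ffffff"
            m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                print(ts - POOL_RM_AT, line.strip())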

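As for your question about granularity: to check whether there's a
time-of-day pattern (a cron job would show up as a recurring hour), I
figure something like this sketch over the same mon log should do. Again,
the log path is an assumption:

#!/usr/bin/env python3
# Count "calling new monitor election" events per hour of day; a spike at
# one particular hour would point at a scheduled job, as David suggests.
import re
from collections import Counter

MON_LOG = "/var/log/ceph/ceph-mon.bhs1-mail02-ds03.log"  # assumed path

hours = Counter()
with open(MON_LOG) as f:
    for line in f:
        if "calling new monitor election" in line:
            m = re.match(r"\d{4}-\d{2}-\d{2} (\d{2}):", line)
            if m:
                hours[int(m.group(1))] += 1

# Crude histogram of elections per hour of day.
for hour, count in sorted(hours.items()):
    print("%02d:00  %s" % (hour, "#" * count))
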
Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Aug 9, 2017 at 10:15 AM, David Turner <drakonst...@gmail.com> wrote:

> I just want to point out that there are many different types of network
> issues that don't involve entire networks: a bad nic, a bad/loose cable, a
> service on a server restarting or modifying the network stack, etc.
>
> That said, there are other things that can prevent an mds service, or any
> service, from responding to the mons and being wrongly marked down. It
> happens to osds often enough that they even have the ability to note in
> their logs that they were wrongly marked down. That usually happens when
> the service is so busy with an operation that it can't get to the request
> from the mon fast enough, and it gets marked down. This could also be
> environmental on the mds server: if something else on the host is using
> too many resources, preventing the mds service from having what it needs,
> this could easily happen.
>
> What level of granularity do you have in your monitoring to tell what your
> system state was when this happened? Is there a time of day when it is more
> likely to happen (expect to find a cron job at that time)?
>
> On Wed, Aug 9, 2017, 8:37 AM Webert de Souza Lima <webert.b...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I recently had an mds outage because the mds suicided due to "dne in the
>> mds map".
>> I've asked about it here before, and I know that happens because the
>> monitors removed this mds from the mds map even though it was alive.
>>
>> The weird thing is that there were no network-related issues at the time;
>> if there had been, they would have impacted many other systems.
>>
>> I found this in the mon logs, and I'd like to understand it better:
>>  lease_timeout -- calling new election
>>
>> full logs:
>>
>> 2017-08-08 23:12:33.286908 7f2b8398d700  1 leveldb: Manual compaction at level-1 from 'pgmap_pg\x009.a' @ 1830392430 : 1 .. 'paxos\x0057687834' @ 0 : 0; will stop at (end)
>>
>> 2017-08-08 23:12:36.885087 7f2b86f9a700  0 mon.bhs1-mail02-ds03@2(peon).data_health(3524) update_stats avail 81% total 19555 MB, used 2632 MB, avail 15907 MB
>> 2017-08-08 23:13:29.357625 7f2b86f9a700  1 mon.bhs1-mail02-ds03@2(peon).paxos(paxos updating c 57687834..57688383) lease_timeout -- calling new election
>> 2017-08-08 23:13:29.358965 7f2b86799700  0 log_channel(cluster) log [INF] : mon.bhs1-mail02-ds03 calling new monitor election
>> 2017-08-08 23:13:29.359128 7f2b86799700  1 mon.bhs1-mail02-ds03@2(electing).elector(3524) init, last seen epoch 3524
>> 2017-08-08 23:13:35.383530 7f2b86799700  1 mon.bhs1-mail02-ds03@2(peon).osd e12617 e12617: 19 osds: 19 up, 19 in
>> 2017-08-08 23:13:35.605839 7f2b86799700  0 mon.bhs1-mail02-ds03@2(peon).mds e18460 print_map
>> e18460
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>>
>> Filesystem 'cephfs' (2)
>> fs_name cephfs
>> epoch   18460
>> flags   0
>> created 2016-08-01 11:07:47.592124
>> modified        2017-07-03 10:32:44.426431
>> tableserver     0
>> root    0
>> session_timeout 60
>> session_autoclose       300
>> max_file_size   1099511627776
>> last_failure    0
>> last_failure_osd_epoch  12617
>> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>> max_mds 1
>> in      0
>> up      {0=1574278}
>> failed
>> damaged
>> stopped
>> data_pools      8,9
>> metadata_pool   7
>> inline_data     disabled
>> 1574278:        10.0.2.4:6800/2556733458 'd' mds.0.18460 up:replay seq 1 laggy since 2017-08-08 23:13:35.174109 (standby for rank 0)
>>
>>
>>
>> 2017-08-08 23:13:35.606303 7f2b86799700  0 log_channel(cluster) log [INF] : mon.bhs1-mail02-ds03 calling new monitor election
>> 2017-08-08 23:13:35.606361 7f2b86799700  1 mon.bhs1-mail02-ds03@2(electing).elector(3526) init, last seen epoch 3526
>> 2017-08-08 23:13:36.885540 7f2b86f9a700  0 mon.bhs1-mail02-ds03@2(peon).data_health(3528) update_stats avail 81% total 19555 MB, used 2636 MB, avail 15903 MB
>> 2017-08-08 23:13:38.311777 7f2b86799700  0 mon.bhs1-mail02-ds03@2(peon).mds e18461 print_map
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>>
>