This is an expensive operation.  You want to slow it down, not burden the OSDs. 
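For example (a rough sketch; these are the Nautilus defaults as far as I
remember, adjust to your cluster), I would put the throttles back and let the
split grind along in the background:

  ceph config set mgr target_max_misplaced_ratio 0.05
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-sleep-hdd 0.1'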

> On Mar 21, 2020, at 5:46 AM, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote:
> 
> Each node has 64 GB of RAM, so it should be enough (12 OSDs x 4 GB = 48 GB used).
> 
>> On 21/03/2020 13.14, XuYun wrote:
>> BlueStore requires more than 4 GB of memory per OSD. Do you have enough memory?
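>>
>> To check or pin the BlueStore memory target per OSD, something like this
>> should work (4 GiB is the default osd_memory_target, if I remember correctly):
>>
>>  ceph config get osd.0 osd_memory_target             # current value, in bytes
>>  ceph config set osd osd_memory_target 4294967296    # 4 GiB for all OSDs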
>> 
>>> On Mar 21, 2020, at 8:09 PM, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote:
>>> 
>>> Hello,
>>> 
>>> I have a Ceph cluster running version 14.2.7
>>> (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable).
>>> 
>>> 4 nodes, each with 11 HDDs, 1 SSD, and a 10 Gbit network.
>>> 
>>> The cluster was empty, a fresh install. We filled it with data (small
>>> blocks) using RGW.
>>> 
>>> The cluster is now used only for testing, so no client was using it during
>>> the admin operations mentioned below.
>>> 
>>> After a while (7 TB of data / 40M objects uploaded) we decided to increase
>>> pg_num from 128 to 256 to spread the data better. To speed up this
>>> operation, I set
>>> 
>>>  ceph config set mgr target_max_misplaced_ratio 1
>>> 
>>> so that the whole cluster would rebalance as quickly as it can.
>>> 
>>> I have 3 issues/questions below:
>>> 
>>> 1)
>>> 
>>> I noticed that the manual increase from 128 to 256 caused approx. 6 OSDs
>>> to restart, logging:
>>> 
>>> heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7f8c84b8b700' had suicide timed out after 150
>>> 
>>> After a while the OSDs came back, so I continued with my tests.
>>> 
>>> My question: was increasing the number of PGs with target_max_misplaced_ratio
>>> set to the maximum too much for those OSDs? Is it not recommended to do it
>>> this way? I had no problem with a similar increase before, but the cluster
>>> configuration was slightly different and it was running Luminous.
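>>>
>>> If the ratio was the problem, would the safer approach have been to leave the
>>> mgr throttle at its default and let it step pgp_num up gradually? Roughly like
>>> this (0.05 is the default ratio as far as I know, and the pool name here is
>>> only an example):
>>>
>>>  ceph config set mgr target_max_misplaced_ratio 0.05
>>>  ceph osd pool set default.rgw.buckets.data pg_num 256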
>>> 
>>> 2)
>>> 
>>> The rebuild was still slow, so I increased the number of backfills
>>> 
>>>  ceph tell osd.*  injectargs "--osd-max-backfills 10"
>>> 
>>> and reduced recovery sleep time
>>> 
>>>  ceph tell osd.*  injectargs "--osd-recovery-sleep-hdd 0.01"
>>> 
>>> After a few hours I noticed that some of my OSDs were restarted during
>>> recovery. In the log I can see:
>>> 
>>> ...
>>> 
>>> 2020-03-21 06:41:28.343 7fe1f8bee700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe1da154700' had timed out after 15
>>> 2020-03-21 06:41:28.343 7fe1f8bee700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe1da154700' had timed out after 15
>>> 2020-03-21 06:41:36.780 7fe1da154700 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fe1da154700' had timed out after 15
>>> 2020-03-21 06:41:36.888 7fe1e7769700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.7 down, but it is still running
>>> 2020-03-21 06:41:36.888 7fe1e7769700 0 log_channel(cluster) log [DBG] : map e3574 wrongly marked me down at e3573
>>> 2020-03-21 06:41:36.888 7fe1e7769700 1 osd.7 3574 start_waiting_for_healthy
>>> 
>>> I watched the network graphs, and network utilization was low during
>>> recovery (the 10 Gbit links were not saturated).
>>> 
>>> So can a lot of IOPS on an OSD also cause heartbeat operations to time out?
>>> I thought the OSD uses separate threads and that slow HDD operations do not
>>> affect heartbeats to the other OSDs and the MONs, but it looks like that is
>>> not true.
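>>>
>>> For reference, I assume these are the relevant timeouts (readable through the
>>> admin socket; the comments are the defaults as I understand them):
>>>
>>>  ceph daemon osd.7 config get osd_op_thread_timeout          # 15 s, the "timed out after 15" above
>>>  ceph daemon osd.7 config get osd_op_thread_suicide_timeout  # 150 s, the suicide timeout from issue 1
>>>  ceph daemon osd.7 config get osd_heartbeat_grace            # 20 s grace before an OSD is reported down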
>>> 
>>> 3)
>>> 
>>> After the OSD was wrongly marked down, I can see that the cluster has
>>> degraded objects. There were no degraded objects before that.
>>> 
>>>  Degraded data redundancy: 251754/117225048 objects degraded (0.215%), 8 pgs degraded, 8 pgs undersized
>>> 
>>> Does that mean this OSD disconnection caused data to become degraded? How is
>>> that possible when no OSD was actually lost? The data should still be on that
>>> OSD, and after peering everything should be OK. With Luminous I had no such
>>> problem: after the OSD came back up, the degraded objects were recovered/found
>>> and the cluster was healthy again within seconds.
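>>>
>>> Would it be reasonable to set the nodown/noout flags for the next heavy
>>> rebalance and simply watch the degraded PGs recover? Something like this (I am
>>> not sure it is the recommended way):
>>>
>>>  ceph osd set nodown       # keep flapping OSDs from being marked down
>>>  ceph osd set noout        # avoid OSDs being marked out and moving even more data
>>>  ceph pg ls degraded       # list the PGs currently reported degraded
>>>  ceph osd unset nodown; ceph osd unset noout    # once recovery settles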
>>> 
>>> Thank you very much for any additional info. I can perform any additional
>>> tests you recommend, because the cluster is currently used only for testing.
>>> 
>>> With regards
>>> Jan Pekar
>>> 
>>> -- 
>>> ============
>>> Ing. Jan Pekař
>>> jan.pe...@imatic.cz
>>> ----
>>> Imatic | Jagellonská 14 | Praha 3 | 130 00
>>> http://www.imatic.cz | +420326555326
>>> ============
>>> --
>>> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
