Hello Frank,
Thank you for your help.
Ceph is the main storage for our OpenStack. We have 64 compute nodes (Ceph
clients), 36 Ceph hosts (on both the client and cluster networks) and 3 mons,
so roughly 140 ARP entries (the Ceph hosts appear once per network).
Our ARP cache thresholds are the defaults, i.e. 128/512/1024. As 140 < 512, the
defaults should work (I will keep an eye on the ARP cache size over time, however).
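For reference, this is roughly how I check it (a quick sketch, nothing here is
specific to our setup):
# count the IPv4 neighbour entries currently known to the kernel
ip -4 neigh show | wc -l
# current thresholds (defaults: 128/512/1024)
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3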
We tried the settings below two weeks ago (we thought they would improve our
network) but things got worse!
net.core.rmem_max = 134217728     (for a 10 Gbps link with low latency)
net.core.wmem_max = 134217728     (for a 10 Gbps link with low latency)
net.core.netdev_max_backlog = 300000
net.core.somaxconn = 2000
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_rmem = 4096 87380 134217728     (for a 10 Gbps link with low latency)
net.ipv4.tcp_wmem = 4096 87380 134217728     (for a 10 Gbps link with low latency)
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_fack = 0
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_max_syn_backlog = 30000
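For completeness, to roll such settings back we would do something like this
(a sketch, assuming the values live in a dedicated sysctl file; the file name
below is just an example):
# disable the custom tuning file and reload the remaining sysctl configuration
mv /etc/sysctl.d/90-net-tuning.conf /etc/sysctl.d/90-net-tuning.conf.disabled
sysctl --system
# values already applied at runtime keep their current setting until they are
# reset explicitly or the host is rebooted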

The client and cluster networks both use a 9000 MTU. Each OSD host has two
LACP bonds: 2x10 Gbps for the client network and 2x10 Gbps for the cluster
network. The client network is a single level-2 LAN, and the same goes for the
cluster network.
As I said, we didn't see any significant error counters on the switches or the servers.
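To rule out an MTU or bonding problem on these links, the kind of checks we run
is roughly this (a sketch; the peer address and the bond interface name are
just examples):
# verify that jumbo frames pass end-to-end without fragmentation
# (8972 = 9000 bytes MTU - 20 bytes IPv4 header - 8 bytes ICMP header)
ping -M do -s 8972 -c 3 192.168.4.41
# check LACP state and the transmit hash policy of the bond
grep -E 'MII Status|Transmit Hash Policy|Aggregator ID' /proc/net/bonding/bond0
# confirm the MTU actually configured on the bond
ip link show bond0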

Vincent


On Fri, 29 Nov 2019 at 09:30, Frank Schilder <fr...@dtu.dk> wrote:
>
> How large is your arp cache? We have seen ceph dropping connections as soon 
> as the level-2 network (direct neighbours) is larger than the arp cache. We 
> adjusted the following settings:
>
> # Increase ARP cache size to accommodate large level-2 client network.
> net.ipv4.neigh.default.gc_thresh1 = 1024
> net.ipv4.neigh.default.gc_thresh2 = 2048
> net.ipv4.neigh.default.gc_thresh3 = 4096
>
> Another important group of parameters for TCP connections seems to be these, 
> with our values:
>
> ## Increase the number of incoming connections. The value may need to be raised
> for bursts of requests; default is 128
> net.core.somaxconn = 2048
> ## Increase number of incoming connections backlog, default is 1000
> net.core.netdev_max_backlog = 50000
> ## Maximum number of remembered connection requests, default is 128
> net.ipv4.tcp_max_syn_backlog = 30000
>
> With this, we got rid of dropped connections in a cluster of 20 Ceph nodes
> and ca. 550 client nodes, accounting for about 1500 active Ceph clients
> (1400 CephFS and 170 RBD images).
>
> Best regards,
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Vincent Godin <vince.ml...@gmail.com>
> Sent: 27 November 2019 20:11:23
> To: Anthony D'Atri; ceph-users@ceph.io; Ceph Development
> Subject: [ceph-users] Re: mimic 13.2.6 too much broken connexions
>
> If it was a network issue, the counters should explode (as I said, with a log
> level of 5 on the messenger we observed more than 80,000 lossy channels per
> minute), but nothing abnormal shows up in the counters (on the switches or
> the servers).
> On the switches: no drops, no CRC errors, no packet loss, only some output
> discards, but not enough to be significant. On the server NICs (via
> ethtool -S), nothing stands out.
> And as I said, another Mimic cluster with different hardware shows the same
> behavior.
> Ceph uses connection pools from host to host, but how does it check the
> availability of these connections over time?
> And as the network doesn't seem to be at fault, what can explain these
> broken channels?
>
> On Wed, 27 Nov 2019 at 19:05, Anthony D'Atri <a...@dreamsnake.net> wrote:
> >
> > Are you bonding NIC ports? If so, do you have the correct hash policy
> > defined? Have you looked at the *switch* side for packet loss, CRC errors,
> > etc.? What you report could be consistent with this. Since the host
> > interface for a given connection will vary by the bond hash, some OSD
> > connections will use one port and some the other. So if one port has
> > switch-side errors, or is blackholed on the switch, you could see some
> > heartbeating impacted but not others.
> >
> > Also make sure you have the optimal reporters value.
> >
> > > On Nov 27, 2019, at 7:31 AM, Vincent Godin <vince.ml...@gmail.com> wrote:
> > >
> > > Since I submitted the mail below a few days ago, we have found some clues.
> > > We observed a lot of lossy connections like:
> > > ceph-osd.9.log:2019-11-27 11:03:49.369 7f6bb77d0700  0 --
> > > 192.168.4.181:6818/2281415 >> 192.168.4.41:0/1962809518
> > > conn(0x563979a9f600 :6818   s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
> > > pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy)
> > > channel (new one lossy=1)
> > > We raised the messenger log level to 5/5 and observed, for the whole
> > > cluster, more than 80,000 lossy connections per minute!
> > > We lowered "ms_tcp_read_timeout" from 900 to 60 seconds and then saw no
> > > more lossy connections in the logs and no more failed health checks.
> > > It's just a workaround, but there is a real problem with these broken
> > > sessions, and it leads to two points:
> > > - Ceph takes too much time to detect broken sessions and should recycle
> > > them quicker!
> > > - What is the reason for these broken sessions?
> > >
> > > We have another Mimic cluster on different hardware and observed the
> > > same behavior: lots of lossy sessions, slow ops and so on.
> > > The symptoms are the same:
> > > - some OSDs on one host get no response from another OSD on a different
> > > host
> > > - after some time, slow ops are detected
> > > - sometimes it leads to blocked I/O
> > > - after about 15 minutes the problem vanishes
> > >
> > > -----------
> > >
> > > Help needed with diagnosis: heartbeat_failed
> > >
> > > We are seeing strange behavior on our Mimic 13.2.6 cluster. At any
> > > time, and without any load, some OSDs become unreachable from only
> > > some hosts. It lasts 10 minutes and then the problem vanishes.
> > > It's not always the same OSDs and the same hosts. There is no network
> > > failure on any of the hosts (because only some OSDs become unreachable),
> > > nor any disk freeze, as far as we can see in our Grafana dashboards.
> > > The log messages are:
> > > first msg :
> > > 2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481
> > > heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back
> > > 2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff
> > > 2019-11-24 09:19:23.293436)
> > > last msg:
> > > 2019-11-24 09:30:33.735 7f632354f700 -1 osd.591 146481
> > > heartbeat_check: no reply from 192.168.6.123:6828 osd.600 since back
> > > 2019-11-24 09:27:05.269330 front 2019-11-24 09:30:33.214874 (cutoff
> > > 2019-11-24 09:30:13.736517)
> > > During this time, 3 hosts were involved: host-18, host-20 and host-30:
> > > host-30 is the only one that can't see OSDs 346, 356 and 352 on host-18
> > > host-30 is the only one that can't see OSDs 387 and 394 on host-20
> > > host-18 is the only one that can't see OSDs 583, 585, 591 and 597 on
> > > host-30
> > > We can't see any strange behavior on hosts 18, 20 and 30 in our
> > > node-exporter data during this time.
> > > Any ideas or advice?
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
