[ceph-users] Re: Networking Idea/Question
On 3/17/21 7:44 AM, Janne Johansson wrote:
> Den ons 17 mars 2021 kl 02:04 skrev Tony Liu :
>> What's the purpose of "cluster" network, simply increasing total
>> bandwidth or for some isolations?
>
> Not having client traffic (which only occurs on the public network) fight over bandwidth with OSD<->OSD traffic (replication and recovery). Nowadays this isn't as much of an issue, which is why most have stopped recommending split networks and just go with bonding the interfaces and letting whichever side currently needs a lot of bandwidth get it.

It does have another drawback: when the public network is up but one cluster-network port(-channel) is down (or one of the routers for the cluster network), clients can do IO to all OSDs, but not all OSDs can communicate with each other. You will get slow ops and OSDs flagging each other down, and that might not be obvious to spot.

Gr. Stefan

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Networking Idea/Question
Den ons 17 mars 2021 kl 02:04 skrev Tony Liu :
> What's the purpose of "cluster" network, simply increasing total
> bandwidth or for some isolations?

Not having client traffic (which only occurs on the public network) fight over bandwidth with OSD<->OSD traffic (replication and recovery). Nowadays this isn't as much of an issue, which is why most have stopped recommending split networks and just go with bonding the interfaces and letting whichever side currently needs a lot of bandwidth get it.

> For example,
> 1 network on 1 bonding with 2 x 40GB ports
> vs.
> 2 networks on 2 bonding each with 2 x 20GB ports
>
> They have the same total bandwidth 80GB, so they will support
> the same performance, right?

Depending on how the bonding is done, the latter will never let any single TCP stream go past 20Gbit, but it will support several parallel streams to different destinations that together exceed 20Gbit. Also, faster ethernet has slightly smaller latencies, so 1x100 would have less latency than 4x25, and so on. It's not always this simple, but in general it holds.

--
May the most significant bit of your life be positive.
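[Editorial note: a toy model of the per-stream limit described above. With layer3+4 transmit hashing, each flow's addresses and ports hash to exactly one bond member, so a single TCP stream can never exceed one link's speed, while many flows spread across the links. This is a simplified sketch of the idea, not the exact Linux bonding driver formula; the IPs and ports are hypothetical.]

```python
import ipaddress

def layer34_hash(src_ip: str, dst_ip: str, sport: int, dport: int, n_links: int) -> int:
    """Simplified flow hash in the spirit of bonding's xmit_hash_policy
    layer3+4 (not the exact kernel formula): fold IPs and ports into a
    small value, then pick a slave index."""
    h = sport ^ dport
    h ^= int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % n_links

# One TCP stream = one 5-tuple = always the same link: it can never use
# more than that single link's bandwidth, no matter how many slaves exist.
one_stream = {layer34_hash("10.0.0.1", "10.0.0.2", 40001, 6800, 2) for _ in range(100)}
assert len(one_stream) == 1

# Many flows (different source ports) spread across both links, so their
# aggregate can exceed a single link's capacity.
many_flows = {layer34_hash("10.0.0.1", "10.0.0.2", p, 6800, 2) for p in range(40000, 41000)}
print(sorted(many_flows))  # both link indices appear
```

This is exactly why "2 x 20GB bonded" is not equivalent to "1 x 40GB" for any individual client stream.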
[ceph-users] Re: Networking Idea/Question
"but you may see significant performance improvement with a second "cluster" network in a large cluster."

"does not usually have a significant impact on overall performance."

The above two statements look contradictory to me and cause confusion. What's the purpose of the "cluster" network, simply increasing total bandwidth, or some kind of isolation?

For example,
1 network on 1 bonding with 2 x 40GB ports
vs.
2 networks on 2 bonding each with 2 x 20GB ports

They have the same total bandwidth 80GB, so they will support the same performance, right?

Thanks!
Tony

> -----Original Message-----
> From: Andrew Walker-Brown
> Sent: Tuesday, March 16, 2021 9:18 AM
> To: Tony Liu ; Stefan Kooman ; Dave Hall ; ceph-users
> Subject: RE: [ceph-users] Re: Networking Idea/Question
>
> https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
>
> From: Tony Liu <mailto:tonyliu0...@hotmail.com>
> Sent: 16 March 2021 16:16
> To: Stefan Kooman <mailto:ste...@bit.nl> ; Dave Hall <mailto:kdh...@binghamton.edu> ; ceph-users <mailto:ceph-users@ceph.io>
> Subject: [ceph-users] Re: Networking Idea/Question
>
> > -----Original Message-----
> > From: Stefan Kooman
> > Sent: Tuesday, March 16, 2021 4:10 AM
> > To: Dave Hall ; ceph-users
> > Subject: [ceph-users] Re: Networking Idea/Question
> >
> > On 3/15/21 5:34 PM, Dave Hall wrote:
> > > Hello,
> > >
> > > If anybody out there has tried this or thought about it, I'd like to
> > > know...
> > >
> > > I've been thinking about ways to squeeze as much performance as
> > > possible from the NICs on a Ceph OSD node. The nodes in our cluster
> > > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > > Currently, one port is assigned to the front-side network, and one to
> > > the back-side network. However, there are times when the traffic on
> > > one side or the other is more intense and might benefit from a bit
> > > more bandwidth.
> >
> > What is (are) the reason(s) to choose a separate cluster and public
> > network?
>
> That used to be the recommendation, to separate client traffic and cluster traffic. I heard that is no longer the case in the latest recommendations. It would be good if someone could point to the right link for such a recommendation.
>
> Thanks!
> Tony
[ceph-users] Re: Networking Idea/Question
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

From: Tony Liu <mailto:tonyliu0...@hotmail.com>
Sent: 16 March 2021 16:16
To: Stefan Kooman <mailto:ste...@bit.nl>; Dave Hall <mailto:kdh...@binghamton.edu>; ceph-users <mailto:ceph-users@ceph.io>
Subject: [ceph-users] Re: Networking Idea/Question

> -----Original Message-----
> From: Stefan Kooman
> Sent: Tuesday, March 16, 2021 4:10 AM
> To: Dave Hall ; ceph-users
> Subject: [ceph-users] Re: Networking Idea/Question
>
> On 3/15/21 5:34 PM, Dave Hall wrote:
> > Hello,
> >
> > If anybody out there has tried this or thought about it, I'd like to
> > know...
> >
> > I've been thinking about ways to squeeze as much performance as
> > possible from the NICs on a Ceph OSD node. The nodes in our cluster
> > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > Currently, one port is assigned to the front-side network, and one to
> > the back-side network. However, there are times when the traffic on
> > one side or the other is more intense and might benefit from a bit
> > more bandwidth.
>
> What is (are) the reason(s) to choose a separate cluster and public
> network?

That used to be the recommendation, to separate client traffic and cluster traffic. I heard that is no longer the case in the latest recommendations. It would be good if someone could point to the right link for such a recommendation.

Thanks!
Tony
[ceph-users] Re: Networking Idea/Question
> -----Original Message-----
> From: Stefan Kooman
> Sent: Tuesday, March 16, 2021 4:10 AM
> To: Dave Hall ; ceph-users
> Subject: [ceph-users] Re: Networking Idea/Question
>
> On 3/15/21 5:34 PM, Dave Hall wrote:
> > Hello,
> >
> > If anybody out there has tried this or thought about it, I'd like to
> > know...
> >
> > I've been thinking about ways to squeeze as much performance as
> > possible from the NICs on a Ceph OSD node. The nodes in our cluster
> > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > Currently, one port is assigned to the front-side network, and one to
> > the back-side network. However, there are times when the traffic on
> > one side or the other is more intense and might benefit from a bit
> > more bandwidth.
>
> What is (are) the reason(s) to choose a separate cluster and public
> network?

That used to be the recommendation, to separate client traffic and cluster traffic. I heard that is no longer the case in the latest recommendations. It would be good if someone could point to the right link for such a recommendation.

Thanks!
Tony
[ceph-users] Re: Networking Idea/Question
On 3/15/21 5:34 PM, Dave Hall wrote:
> Hello,
>
> If anybody out there has tried this or thought about it, I'd like to know...
>
> I've been thinking about ways to squeeze as much performance as possible from the NICs on a Ceph OSD node. The nodes in our cluster (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports. Currently, one port is assigned to the front-side network, and one to the back-side network. However, there are times when the traffic on one side or the other is more intense and might benefit from a bit more bandwidth.

What is (are) the reason(s) to choose a separate cluster and public network?

Gr. Stefan
[ceph-users] Re: Networking Idea/Question
Burkhard,

I woke up with the same conclusion - LACP load balancing can break down when the traffic traverses a router, since the IP headers have the router as the destination address and thus the Ethernet header has the same two MAC addresses. (I think that in a pure layer 2 fabric the MAC addresses vary enough to produce reasonable - not perfect - LACP load balancing.) Then add VXLAN, SDN, and other newer networking technologies and it all gets even more confusing.

But I always come back to the starter cluster, likely a proof-of-concept demonstration, that might be built with left-over parts. Networking is frequently an afterthought. In this case node-level traffic management - weighted fair queueing - could make all the difference.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On Tue, Mar 16, 2021 at 4:20 AM Burkhard Linke < burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 16.03.21 03:40, Dave Hall wrote:
> > Andrew,
> >
> > I agree that the choice of hash function is important for LACP. My
> > thinking has always been to stay down in layers 2 and 3. With enough
> > hosts it seems likely that traffic would be split close to evenly.
> > Heads or tails - 50% of the time you're right. TCP ports should also
> > be nearly equally split, but listening ports could introduce some
> > asymmetry.
>
> Just a comment on the hashing methods. The LACP spec does not include
> layer3+4 hashing, so running it is somewhat outside of the spec.
>
> The main reason for it being present is the fact that LACP load
> balancing does not work well in the case of routing. If all your clients are
> in a different network reachable via a gateway, all your traffic will be
> directed to the MAC address of the gateway. As a result, all that traffic
> will use a single link only.
>
> Also keep in mind that these hashing methods only affect the traffic that
> originates from the corresponding system. In the case of a Ceph host, only the
> traffic sent from the host is controlled by it; the traffic from the
> switch to the host uses the switch's hashing setting.
>
> We use layer 3+4 hashing on all bare-metal hosts (including Ceph hosts)
> and all switches, and traffic is roughly evenly distributed between the
> links.
>
> Regards,
> Burkhard
[ceph-users] Re: Networking Idea/Question
Hi,

On 16.03.21 03:40, Dave Hall wrote:
> Andrew,
>
> I agree that the choice of hash function is important for LACP. My thinking has always been to stay down in layers 2 and 3. With enough hosts it seems likely that traffic would be split close to evenly. Heads or tails - 50% of the time you're right. TCP ports should also be nearly equally split, but listening ports could introduce some asymmetry.

Just a comment on the hashing methods. The LACP spec does not include layer3+4 hashing, so running it is somewhat outside of the spec.

The main reason for it being present is the fact that LACP load balancing does not work well in the case of routing. If all your clients are in a different network reachable via a gateway, all your traffic will be directed to the MAC address of the gateway. As a result, all that traffic will use a single link only.

Also keep in mind that these hashing methods only affect the traffic that originates from the corresponding system. In the case of a Ceph host, only the traffic sent from the host is controlled by it; the traffic from the switch to the host uses the switch's hashing setting.

We use layer 3+4 hashing on all bare-metal hosts (including Ceph hosts) and all switches, and traffic is roughly evenly distributed between the links.

Regards,
Burkhard
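[Editorial note: the routed-traffic problem described above can be seen with a toy layer2 hash. A common scheme XORs the last octet of the source and destination MACs; when every frame is addressed to the same gateway MAC, the hash is constant and all traffic rides one slave. A sketch under those assumptions, with made-up MAC addresses and not the exact kernel formula:]

```python
def layer2_hash(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Toy layer2 transmit hash: XOR the last octet of each MAC and pick
    a slave index (roughly the spirit of bonding's default layer2 policy)."""
    s = int(src_mac.split(":")[-1], 16)
    d = int(dst_mac.split(":")[-1], 16)
    return (s ^ d) % n_links

host_mac = "52:54:00:aa:bb:01"      # hypothetical bond MAC
gateway_mac = "52:54:00:aa:bb:fe"   # hypothetical router MAC

# 1000 different remote clients, but every frame is addressed to the
# gateway's MAC -- the hash never changes, so one link carries everything.
routed = {layer2_hash(host_mac, gateway_mac, 2) for _ in range(1000)}
assert len(routed) == 1

# On a flat layer-2 network the destination MACs differ per client,
# so flows spread across the slaves.
local = {layer2_hash(host_mac, f"52:54:00:cc:dd:{i:02x}", 2) for i in range(16)}
print(sorted(local))  # → [0, 1]
```

Layer3+4 hashing avoids the collapse because IPs and ports still differ per flow even when the next-hop MAC is constant.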
[ceph-users] Re: Networking Idea/Question
Andrew,

I agree that the choice of hash function is important for LACP. My thinking has always been to stay down in layers 2 and 3. With enough hosts it seems likely that traffic would be split close to evenly. Heads or tails - 50% of the time you're right. TCP ports should also be nearly equally split, but listening ports could introduce some asymmetry.

What I'm concerned about is the next level up: with the client network and the cluster network (Marc's terms are more descriptive) on the same NICs/switch ports, with or without LACP and LAGs, it seems possible that at times the bandwidth consumed by cluster traffic could overwhelm and starve the client traffic. Or the other way around, which would be worse if the cluster nodes can't communicate on their 'private' network to keep the cluster consistent. These overloads could happen in the packet queues in the NIC drivers, or maybe in the switch fabric.

Maybe these starvation scenarios aren't that likely in clusters with 10GB networking. Maybe it's hard to fill up a 10GB pipe, much less two. But it could happen with 1GB NICs, even in LAGs of 4 or 6 ports, and eventually it will be possible with faster NVMe drives to easily fill a 10GB pipe.

So, what could we do with some of the 'exotic' queuing mechanisms available in Linux to keep the balance - to assure that the lesser category can transmit proportionally? (And is 'proportional' the right answer, or should one side get a slight advantage?)

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On 3/15/2021 12:48 PM, Andrew Walker-Brown wrote:
> Dave
>
> That's the way our cluster is set up. It's relatively small: 5 hosts, 12 OSDs. Each host has 2x10G with LACP to the switches. We've VLAN'd public/private networks.
>
> Making best use of the LACP lag will to a great extent come down to choosing the best hashing policy. At the moment we're using layer3+4 on the Linux config and switch configs. We're monitoring link utilisation to make sure the balancing is as close to equal as possible.
>
> Hope this helps
>
> A
>
> On 15 Mar 2021, at 16:39, Marc wrote:
>> I have client and cluster network on one 10gbit port (with different vlans). I think many smaller clusters do this ;)
>>
>>> I've been thinking about ways to squeeze as much performance as possible from the NICs on a Ceph OSD node. The nodes in our cluster (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports. Currently, one port is assigned to the front-side network, and one to the back-side network. However, there are times when the traffic on one side or the other is more intense and might benefit from a bit more bandwidth.
>>>
>>> The idea I had was to bond the two ports together and to run the back-side network in a tagged VLAN on the combined 20GB LACP port. In order to keep the balance and prevent starvation on either side, it would be necessary to apply some sort of weighted fair queuing mechanism via the 'tc' command. The idea is that if the client side isn't using up the full 10GB/node and there is a burst of re-balancing activity, the bandwidth consumed by the back-side traffic could swell to 15GB or more. Or vice versa.
>>>
>>> From what I have read and studied, these algorithms are fairly responsive to changes in load and would thus adjust rapidly if the demand from either side suddenly changed.
>>>
>>> Maybe this is a crazy idea, or maybe it's really cool. Your thoughts?
>>>
>>> Thanks.
>>>
>>> -Dave
>>>
>>> --
>>> Dave Hall
>>> Binghamton University
>>> kdh...@binghamton.edu
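[Editorial note: the 'weighted fair queueing via tc' idea discussed in this message maps naturally onto an HTB hierarchy: give each side a guaranteed rate and let either borrow up to the full link ceiling when the other is idle. The sketch below only generates candidate tc command lines; the device name, rates, and cluster-network subnet are hypothetical examples, not tested settings, and classifying by VLAN tag instead of destination subnet would need different filters.]

```python
def htb_share_commands(dev: str, total: str, guarantee: str, cluster_net: str) -> list[str]:
    """Generate tc commands for a two-class HTB setup: client and cluster
    traffic each guaranteed `guarantee`, each allowed to borrow up to
    `total` when the other side is idle."""
    return [
        # Root HTB qdisc; unclassified traffic defaults to the client class.
        f"tc qdisc add dev {dev} root handle 1: htb default 10",
        # Parent class capping everything at the link total.
        f"tc class add dev {dev} parent 1: classid 1:1 htb rate {total} ceil {total}",
        # Client traffic: guaranteed half, may borrow up to the total.
        f"tc class add dev {dev} parent 1:1 classid 1:10 htb rate {guarantee} ceil {total}",
        # Cluster traffic: same guarantee and ceiling.
        f"tc class add dev {dev} parent 1:1 classid 1:20 htb rate {guarantee} ceil {total}",
        # Steer traffic destined for the cluster network into class 1:20.
        f"tc filter add dev {dev} parent 1: protocol ip u32 "
        f"match ip dst {cluster_net} flowid 1:20",
    ]

# Hypothetical 2x10G bond: 20gbit total, 10gbit guaranteed per side.
for cmd in htb_share_commands("bond0", "20gbit", "10gbit", "192.168.2.0/24"):
    print(cmd)
```

HTB's rate/ceil borrowing gives exactly the behaviour Dave describes: an idle client side lets cluster traffic swell toward the full 20G, and the guarantee kicks back in as soon as client demand returns.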
[ceph-users] Re: Networking Idea/Question
Dave

That's the way our cluster is set up. It's relatively small: 5 hosts, 12 OSDs. Each host has 2x10G with LACP to the switches. We've VLAN'd public/private networks.

Making best use of the LACP lag will to a great extent come down to choosing the best hashing policy. At the moment we're using layer3+4 on the Linux config and switch configs. We're monitoring link utilisation to make sure the balancing is as close to equal as possible.

Hope this helps

A

On 15 Mar 2021, at 16:39, Marc wrote:
> I have client and cluster network on one 10gbit port (with different vlans). I think many smaller clusters do this ;)
>
>> I've been thinking about ways to squeeze as much performance as possible
>> from the NICs on a Ceph OSD node. The nodes in our cluster (6 x OSD, 3
>> x MGR/MON/MDS/RGW) currently have 2 x 10GB ports. Currently, one port
>> is assigned to the front-side network, and one to the back-side
>> network. However, there are times when the traffic on one side or the
>> other is more intense and might benefit from a bit more bandwidth.
>>
>> The idea I had was to bond the two ports together, and to run the
>> back-side network in a tagged VLAN on the combined 20GB LACP port. In
>> order to keep the balance and prevent starvation on either side, it
>> would be necessary to apply some sort of weighted fair queuing
>> mechanism via the 'tc' command. The idea is that if the client side
>> isn't using up the full 10GB/node and there is a burst of re-balancing
>> activity, the bandwidth consumed by the back-side traffic could swell to
>> 15GB or more. Or vice versa.
>>
>> From what I have read and studied, these algorithms are fairly
>> responsive to changes in load and would thus adjust rapidly if the
>> demand from either side suddenly changed.
>>
>> Maybe this is a crazy idea, or maybe it's really cool. Your thoughts?
>>
>> Thanks.
>>
>> -Dave
>>
>> --
>> Dave Hall
>> Binghamton University
>> kdh...@binghamton.edu
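[Editorial note: the bond-plus-VLANs setup described in this thread (LACP/802.3ad with layer3+4 hashing, tagged public and private networks) can be expressed as iproute2 commands. The sketch below only generates the candidate command lines; the interface names and VLAN IDs are hypothetical, and a real deployment would normally encode this in netplan/systemd-networkd/ifupdown instead.]

```python
def bond_vlan_commands(slaves: list[str], public_vlan: int, cluster_vlan: int) -> list[str]:
    """Generate iproute2 commands for an LACP (802.3ad) bond with
    layer3+4 transmit hashing and tagged public/cluster VLANs.
    Interface names and VLAN IDs are hypothetical examples."""
    cmds = [
        "ip link add bond0 type bond mode 802.3ad "
        "xmit_hash_policy layer3+4 miimon 100",
    ]
    for s in slaves:
        cmds.append(f"ip link set {s} down")         # a link must be down to be enslaved
        cmds.append(f"ip link set {s} master bond0")
    for vid in (public_vlan, cluster_vlan):
        # One tagged sub-interface per network on top of the bond.
        cmds.append(f"ip link add link bond0 name bond0.{vid} type vlan id {vid}")
    cmds.append("ip link set bond0 up")
    return cmds

for cmd in bond_vlan_commands(["eth0", "eth1"], 100, 200):
    print(cmd)
```

The bond0.100/bond0.200 sub-interfaces would then carry Ceph's public_network and cluster_network addresses respectively, while both VLANs share the full 2x10G aggregate.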