[ceph-users] Re: Networking Idea/Question

2021-03-17 Thread Stefan Kooman

On 3/17/21 7:44 AM, Janne Johansson wrote:

Den ons 17 mars 2021 kl 02:04 skrev Tony Liu :


What's the purpose of the "cluster" network: simply increasing total
bandwidth, or some kind of isolation?


Not having client traffic (that only occurs on the public network)
fight over bandwidth with OSD<->OSD traffic (replication and
recovery).

Nowadays this isn't as much of an issue, which is why most people have
stopped recommending split networks and instead just bond the interfaces
and let whichever side currently needs a lot of bandwidth get it.


It does have another drawback: when the public network is up but one
cluster network port(-channel) is down (or one of the routers for the
cluster network), clients can do IO to all OSDs, but not all OSDs can
communicate with each other. You will get slow ops and OSDs flagging
each other down, and that might not be obvious to spot.
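
A quick way to spot this (a rough sketch, assuming the cluster network
lives on a dedicated interface such as bond0.100 and that you know the
other OSD hosts' cluster-network addresses; names and IPs below are
placeholders):

  # From each OSD host, ping every other OSD host over the cluster-network
  # interface. A host that answers on the public network but not here is a
  # candidate for the "OSDs flag each other down" scenario.
  CLUSTER_IF=bond0.100                                    # placeholder
  for peer in 192.168.2.11 192.168.2.12 192.168.2.13; do  # placeholder IPs
      ping -c 1 -W 1 -I "$CLUSTER_IF" "$peer" >/dev/null \
          && echo "ok   $peer" || echo "FAIL $peer"
  done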


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-17 Thread Janne Johansson
Den ons 17 mars 2021 kl 02:04 skrev Tony Liu :

> What's the purpose of the "cluster" network: simply increasing total
> bandwidth, or some kind of isolation?

Not having client traffic (that only occurs on the public network)
fight over bandwidth with OSD<->OSD traffic (replication and
recovery).

Nowadays this isn't as much of an issue, which is why most people have
stopped recommending split networks and instead just bond the interfaces
and let whichever side currently needs a lot of bandwidth get it.

> For example,
> 1 network on 1 bond with 2 x 40Gb ports
> vs.
> 2 networks on 2 bonds, each with 2 x 20Gb ports
>
> They have the same total bandwidth of 80Gb, so they will support
> the same performance, right?

Depending on how the bonding is done, the latter will never have any
single TCP stream go past 20Gbit, but several parallel streams to
different destinations can together exceed 20Gbit. Also, faster
ethernet has slightly smaller latencies, so 1x100 would have less
latency than 4x25, and so on. It's not always this simple, but that
holds in general.
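
If you want to see how a given bond will spread streams, the kernel
exposes the mode and transmit hash policy (a sketch; bond0 is a
placeholder name):

  # Show the bonding mode and transmit hash policy for bond0.
  # With layer2 or layer2+3 hashing, all traffic between two given hosts
  # stays on one member link; layer3+4 also mixes in the TCP/UDP ports,
  # so several streams between the same pair of hosts can spread out.
  grep -E 'Bonding Mode|Hash Policy' /proc/net/bonding/bond0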

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Tony Liu
"but you may see significant performance improvement with a
second "cluster" network in a large cluster."

"does not usually have a significant impact on overall performance."

The above two statements look contradictory to me, which is confusing.

What's the purpose of the "cluster" network: simply increasing total
bandwidth, or some kind of isolation?

For example,
1 network on 1 bond with 2 x 40Gb ports
vs.
2 networks on 2 bonds, each with 2 x 20Gb ports

They have the same total bandwidth of 80Gb, so they will support
the same performance, right?


Thanks!
Tony
> -Original Message-
> From: Andrew Walker-Brown 
> Sent: Tuesday, March 16, 2021 9:18 AM
> To: Tony Liu ; Stefan Kooman ;
> Dave Hall ; ceph-users 
> Subject: RE: [ceph-users] Re: Networking Idea/Question
> 
> https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
> 
> 
> 
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986>  for
> Windows 10
> 
> 
> 
> From: Tony Liu <mailto:tonyliu0...@hotmail.com>
> Sent: 16 March 2021 16:16
> To: Stefan Kooman <mailto:ste...@bit.nl> ; Dave Hall
> <mailto:kdh...@binghamton.edu> ; ceph-users <mailto:ceph-users@ceph.io>
> Subject: [ceph-users] Re: Networking Idea/Question
> 
> 
> 
> > -----Original Message-----
> > From: Stefan Kooman 
> > Sent: Tuesday, March 16, 2021 4:10 AM
> > To: Dave Hall ; ceph-users 
> > Subject: [ceph-users] Re: Networking Idea/Question
> >
> > On 3/15/21 5:34 PM, Dave Hall wrote:
> > > Hello,
> > >
> > > If anybody out there has tried this or thought about it, I'd like to
> > > know...
> > >
> > > I've been thinking about ways to squeeze as much performance as
> > > possible from the NICs  on a Ceph OSD node.  The nodes in our
> cluster
> > > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > > Currently, one port is assigned to the front-side network, and one
> to
> > > the back-side network.  However, there are times when the traffic on
> > > one side or the other is more intense and might benefit from a bit
> > more bandwidth.
> >
> > What is (are) the reason(s) to choose a separate cluster and public
> > network?
> 
> Separating client traffic and cluster traffic used to be the
> recommendation. I heard that is no longer the case in the latest
> guidance. It would be good if someone could point to a link with the
> current recommendation.
> 
> 
> Thanks!
> Tony
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Andrew Walker-Brown
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
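
For reference, the split that page describes comes down to two settings
in ceph.conf (a minimal sketch; the subnets below are placeholders, not
taken from your cluster):

  # /etc/ceph/ceph.conf -- sketch only, substitute your own subnets
  [global]
      public_network  = 192.168.1.0/24   # client / "front-side" traffic
      cluster_network = 192.168.2.0/24   # OSD replication and recovery

If cluster_network is left unset, the OSDs simply use the public network
for replication and recovery traffic as well.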

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: Tony Liu<mailto:tonyliu0...@hotmail.com>
Sent: 16 March 2021 16:16
To: Stefan Kooman<mailto:ste...@bit.nl>; Dave 
Hall<mailto:kdh...@binghamton.edu>; ceph-users<mailto:ceph-users@ceph.io>
Subject: [ceph-users] Re: Networking Idea/Question

> -Original Message-
> From: Stefan Kooman 
> Sent: Tuesday, March 16, 2021 4:10 AM
> To: Dave Hall ; ceph-users 
> Subject: [ceph-users] Re: Networking Idea/Question
>
> On 3/15/21 5:34 PM, Dave Hall wrote:
> > Hello,
> >
> > If anybody out there has tried this or thought about it, I'd like to
> > know...
> >
> > I've been thinking about ways to squeeze as much performance as
> > possible from the NICs  on a Ceph OSD node.  The nodes in our cluster
> > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > Currently, one port is assigned to the front-side network, and one to
> > the back-side network.  However, there are times when the traffic on
> > one side or the other is more intense and might benefit from a bit
> more bandwidth.
>
> What is (are) the reason(s) to choose a separate cluster and public
> network?

Separating client traffic and cluster traffic used to be the
recommendation. I heard that is no longer the case in the latest
guidance. It would be good if someone could point to a link with the
current recommendation.


Thanks!
Tony

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Tony Liu
> -Original Message-
> From: Stefan Kooman 
> Sent: Tuesday, March 16, 2021 4:10 AM
> To: Dave Hall ; ceph-users 
> Subject: [ceph-users] Re: Networking Idea/Question
> 
> On 3/15/21 5:34 PM, Dave Hall wrote:
> > Hello,
> >
> > If anybody out there has tried this or thought about it, I'd like to
> > know...
> >
> > I've been thinking about ways to squeeze as much performance as
> > possible from the NICs  on a Ceph OSD node.  The nodes in our cluster
> > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > Currently, one port is assigned to the front-side network, and one to
> > the back-side network.  However, there are times when the traffic on
> > one side or the other is more intense and might benefit from a bit
> more bandwidth.
> 
> What is (are) the reason(s) to choose a separate cluster and public
> network?

Separating client traffic and cluster traffic used to be the
recommendation. I heard that is no longer the case in the latest
guidance. It would be good if someone could point to a link with the
current recommendation.


Thanks!
Tony

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Stefan Kooman

On 3/15/21 5:34 PM, Dave Hall wrote:

Hello,

If anybody out there has tried this or thought about it, I'd like to 
know...


I've been thinking about ways to squeeze as much performance as possible 
from the NICs  on a Ceph OSD node.  The nodes in our cluster (6 x OSD, 3 
x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.  Currently, one port 
is assigned to the front-side network, and one to the back-side 
network.  However, there are times when the traffic on one side or the 
other is more intense and might benefit from a bit more bandwidth.


What is (are) the reason(s) to choose a separate cluster and public network?

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Dave Hall
Burkhard,

I woke up with the same conclusion - LACP load balancing can break down
when the traffic traverses a router, since every frame is then addressed
to the router at layer 2 and the Ethernet headers carry the same two MAC
addresses for every flow.

(I think that in a pure layer 2 fabric the MAC addresses vary enough to
produce reasonable - not perfect - LACP load balancing.)
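
(To make that concrete, a toy sketch of why routed traffic collapses onto
one link under layer2 hashing; this is only an illustration, not the
exact kernel formula:)

  # Toy model of layer2 xmit hashing: slave = f(src MAC, dst MAC) mod N.
  # With routed traffic the destination MAC is always the router's, so
  # f() returns the same value for every flow and one link carries it all.
  SRC=0x0a; ROUTER=0x01; N_SLAVES=2       # placeholder last MAC octets
  echo $(( (SRC ^ ROUTER) % N_SLAVES ))   # same slave index for every flow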

Then add VXLAN, SDN, and other newer networking technologies and it all
gets even more confusing.

But I always come back to the starter cluster, likely a proof-of-concept
demonstration built with left-over parts, where networking is frequently
an afterthought.  In this case node-level traffic management -
weighted fair queueing - could make all the difference.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On Tue, Mar 16, 2021 at 4:20 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 16.03.21 03:40, Dave Hall wrote:
> > Andrew,
> >
> > I agree that the choice of hash function is important for LACP. My
> > thinking has always been to stay down in layers 2 and 3.  With enough
> > hosts it seems likely that traffic would be split close to evenly.
> > Heads or tails - 50% of the time you're right.  TCP ports should also
> > be nearly equally split, but listening ports could introduce some
> > asymmetry.
>
>
> Just a comment on the hashing methods. The LACP spec does not include
> layer3+4, so running it is somewhat outside of the spec.
>
> The main reason for it being present is the fact that LACP load
> balancing does not work well in case of routing. If all your clients are
> in a different network reachable via a gateway, all your traffic will be
> directed to the MAC address of the gateway. As a result all that traffic
> will use a single link only.
>
> Also keep in mind that these hashing methods only affect the traffic that
> originates from the corresponding system. In case of a ceph host only the
> traffic sent from the host is controlled by it; the traffic from the
> switch to the host uses the switch's hashing setting.
>
>
> We use layer 3+4 hashing on all baremetal hosts (including ceph hosts)
> and all switches, and traffic is roughly evenly distributed between the
> links.
>
>
> Regards,
>
> Burkhard
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Burkhard Linke

Hi,

On 16.03.21 03:40, Dave Hall wrote:

Andrew,

I agree that the choice of hash function is important for LACP. My 
thinking has always been to stay down in layers 2 and 3.  With enough 
hosts it seems likely that traffic would be split close to evenly.  
Heads or tails - 50% of the time you're right.  TCP ports should also 
be nearly equally split, but listening ports could introduce some 
asymmetry.



Just a comment on the hashing methods. The LACP spec does not include
layer3+4, so running it is somewhat outside of the spec.


The main reason for it being present is the fact that LACP load
balancing does not work well in case of routing. If all your clients are 
in a different network reachable via a gateway, all your traffic will be 
directed to the MAC address of the gateway. As a result all that traffic 
will use a single link only.


Also keep in mind that these hashing methods only affect the traffic that
originates from the corresponding system. In case of a ceph host only the
traffic sent from the host is controlled by it; the traffic from the 
switch to the host uses the switch's hashing setting.



We use layer 3+4 hashing on all baremetal hosts (including ceph hosts) 
and all switches, and traffic is roughly evenly distributed between the 
links.
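
For anyone wanting to reproduce that setup, a minimal iproute2 sketch
(interface names, VLAN IDs and addresses are placeholders; most distros
would express the same thing through netplan/ifupdown/NetworkManager):

  # Create an 802.3ad (LACP) bond with layer3+4 transmit hashing, enslave
  # the two physical ports, and add tagged VLANs for public and cluster.
  ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4
  ip link set eth0 down; ip link set eth0 master bond0
  ip link set eth1 down; ip link set eth1 master bond0
  ip link set bond0 up
  ip link add link bond0 name bond0.100 type vlan id 100   # public network
  ip link add link bond0 name bond0.200 type vlan id 200   # cluster network
  # The switch side must run LACP too, and its own hashing setting decides
  # how traffic towards the host is balanced.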



Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Dave Hall

Andrew,

I agree that the choice of hash function is important for LACP. My 
thinking has always been to stay down in layers 2 and 3.  With enough 
hosts it seems likely that traffic would be split close to evenly.  
Heads or tails - 50% of the time you're right.  TCP ports should also be 
nearly equally split, but listening ports could introduce some asymmetry.


What I'm concerned about is the next level up:  With the client network 
and the cluster network (Marc's terms are more descriptive) on the same 
NICs/Switch Ports, with or without LACP and LAGs, it seems possible that 
at times the bandwidth consumed by cluster traffic could overwhelm and 
starve the client traffic. Or the other way around, which would be worse 
if the cluster nodes can't communicate on their 'private' network to 
keep the cluster consistent.  These overloads could  happen in the 
packet queues in the NIC drivers, or maybe in the switch fabric.


Maybe these starvation scenarios aren't that likely in clusters with 
10GB networking.  Maybe it's hard to fill up a 10GB pipe, much less 
two.  But it could happen with 1GB NICs, even in LAGs of 4 or 6 ports, 
and eventually it will be possible with faster NVMe drives to easily 
fill a 10GB pipe.


So, what could we do with some of the 'exotic' queuing mechanisms 
available in Linux to keep the balance - to assure that the lesser 
category can transmit proportionally?  (And is 'proportional' the right 
answer, or should one side get a slight advantage?)
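
To make the question concrete, here is roughly what I had in mind with
tc: an HTB root on the bond with one class per side, each guaranteed half
of the link but allowed to borrow up to the full rate.  A sketch only -
interface names, VLAN IDs and rates are placeholders and I haven't tested
this:

  # Root HTB qdisc on the bond; unclassified traffic falls into class 1:10.
  tc qdisc add dev bond0 root handle 1: htb default 10
  tc class add dev bond0 parent 1:  classid 1:1  htb rate 20gbit ceil 20gbit
  # Client (public) traffic: guaranteed 10gbit, may borrow up to 20gbit.
  tc class add dev bond0 parent 1:1 classid 1:10 htb rate 10gbit ceil 20gbit
  # Cluster (replication/recovery) traffic: same guarantee, same ceiling.
  tc class add dev bond0 parent 1:1 classid 1:20 htb rate 10gbit ceil 20gbit
  # Steer the cluster VLAN (id 200 here) into its class; everything else
  # stays in the default class 1:10.
  tc filter add dev bond0 parent 1: protocol 802.1Q flower vlan_id 200 classid 1:20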


-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
On 3/15/2021 12:48 PM, Andrew Walker-Brown wrote:


Dave

That’s the way our cluster is set up. It’s relatively small: 5 hosts, 12 OSDs.

Each host has 2x10G with LACP to the switches.  We’ve vlan’d public/private 
networks.

Making best use of the LACP LAG will largely come down to choosing the
best hashing policy.  At the moment we’re using layer3+4 in both the
Linux config and the switch configs.  We’re monitoring link utilisation
to make sure the balancing is as close to equal as possible.

Hope this helps

A

Sent from my iPhone

On 15 Mar 2021, at 16:39, Marc  wrote:

I have client and cluster network on one 10gbit port (with different vlans).
I think many smaller clusters do this ;)


I've been thinking about ways to squeeze as much performance as possible
from the NICs  on a Ceph OSD node.  The nodes in our cluster (6 x OSD, 3
x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.  Currently, one port
is assigned to the front-side network, and one to the back-side
network.  However, there are times when the traffic on one side or the
other is more intense and might benefit from a bit more bandwidth.

The idea I had was to bond the two ports together, and to run the
back-side network in a tagged VLAN on the combined 20GB LACP port.  In
order to keep the balance and prevent starvation from either side it
would be necessary to apply some sort of a weighted fair queuing
mechanism via the 'tc' command.  The idea is that if the client side
isn't using up the full 10GB/node, and there is a burst of re-balancing
activity, the bandwidth consumed by the back-side traffic could swell to
15GB or more.   Or vice versa.

 From what I have read and studied, these algorithms are fairly
responsive to changes in load and would thus adjust rapidly if the
demand from either side suddenly changed.

Maybe this is a crazy idea, or maybe it's really cool.  Your thoughts?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-15 Thread Andrew Walker-Brown
Dave

That’s the way our cluster is set up. It’s relatively small: 5 hosts, 12 OSDs.

Each host has 2x10G with LACP to the switches.  We’ve vlan’d public/private 
networks. 

Making best use of the LACP LAG will largely come down to choosing the
best hashing policy.  At the moment we’re using layer3+4 in both the
Linux config and the switch configs.  We’re monitoring link utilisation
to make sure the balancing is as close to equal as possible.
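
If it helps, the per-member counters are easy to watch from the host
itself (a quick sketch; interface and bond names are placeholders):

  # Show statistics for every interface enslaved to bond0; roughly equal
  # RX/TX growth across the members means the hash policy is balancing well.
  watch -n 5 'ip -s link show master bond0'
  # Per-slave LACP state is also visible in /proc/net/bonding/bond0.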

Hope this helps

A

Sent from my iPhone

On 15 Mar 2021, at 16:39, Marc  wrote:

I have client and cluster network on one 10gbit port (with different vlans). 
I think many smaller clusters do this ;)

> 
> I've been thinking about ways to squeeze as much performance as possible
> from the NICs  on a Ceph OSD node.  The nodes in our cluster (6 x OSD, 3
> x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.  Currently, one port
> is assigned to the front-side network, and one to the back-side
> network.  However, there are times when the traffic on one side or the
> other is more intense and might benefit from a bit more bandwidth.
> 
> The idea I had was to bond the two ports together, and to run the
> back-side network in a tagged VLAN on the combined 20GB LACP port.  In
> order to keep the balance and prevent starvation from either side it
> would be necessary to apply some sort of a weighted fair queuing
> mechanism via the 'tc' command.  The idea is that if the client side
> isn't using up the full 10GB/node, and there is a burst of re-balancing
> activity, the bandwidth consumed by the back-side traffic could swell to
> 15GB or more.   Or vice versa.
> 
> From what I have read and studied, these algorithms are fairly
> responsive to changes in load and would thus adjust rapidly if the
> demand from either side suddenly changed.
> 
> Maybe this is a crazy idea, or maybe it's really cool.  Your thoughts?
> 
> Thanks.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io