Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Kyle Bader
 The area I'm currently investigating is how to configure the
 networking. To avoid a SPOF I'd like to have redundant switches for
 both the public network and the internal network, most likely running
 at 10Gb. I'm considering splitting the nodes in to two separate racks
 and connecting each half to its own switch, and then trunk the
 switches together to allow the two halves of the cluster to see each
 other. The idea being that if a single switch fails I'd only lose half
 of the cluster.

This is fine if you are using a replication factor of 2. With a
replication factor of 3 and osd pool default min size set to 2, you
would need 2/3 of the cluster surviving to keep I/O going.
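
For reference, a minimal sketch of those replication settings in
ceph.conf (these set the defaults for new pools; an existing pool can be
changed later with "ceph osd pool set <pool> min_size 2"):

    [global]
        osd pool default size = 3      # replicas kept per object
        osd pool default min size = 2  # replicas required to keep serving I/O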

 My question is about configuring the public network. If it's all one
 subnet then the clients consuming the Ceph resources can't have both
 links active, so they'd be configured in an active/standby role. But
 this results in quite heavy usage of the trunk between the two
 switches when a client accesses nodes on the other switch than the one
 they're actively connected to.

The linux bonding driver supports several strategies for teaming network
adapters on L2 networks.
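
As a concrete sketch (assuming Debian/Ubuntu with the ifenslave package;
interface names and addresses are placeholders), an active-backup bond
needs no special switch support and will fail over between two
independent switches:

    auto bond0
    iface bond0 inet static
        address 10.0.1.10
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode active-backup
        bond-miimon 100
        bond-primary eth0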

 So, can I configure multiple public networks? I think so, based on the
 documentation, but I'm not completely sure. Can I have one half of the
 cluster on one subnet, and the other half on another? And then the
 client machine can have interfaces in different subnets and do the
 right thing with both interfaces to talk to all the nodes. This seems
 like a fairly simple solution that avoids a SPOF in Ceph or the network
 layer.

You can have multiple networks for both the public and cluster sides;
the only restriction is that all subnets of a given type fall within the
same supernet. For example:

10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2
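
In ceph.conf that example would look something like this (a sketch; only
the supernets are listed, reaching the per-rack subnets is left to
ordinary routing):

    [global]
        public network = 10.0.0.0/16
        cluster network = 10.1.0.0/16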

 Or maybe I'm missing an alternative that would be better? I'm aiming
 for something that keeps things as simple as possible while meeting
 the redundancy requirements.

 As an aside, there's a similar issue on the cluster network side with
 heavy traffic on the trunk between the two cluster switches. But I
 can't see that's avoidable, and presumably it's something people just
 have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in
each rack (a sketch of such a rule follows the list below), which means
every write is going to cross the trunk. Other traffic that you will see
on the trunk:

* OSDs gossiping with one another
* OSD/monitor traffic where an OSD is connected to a monitor in the
  adjacent rack (map updates, heartbeats)
* OSD/Client traffic where the OSD and client are in adjacent racks
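
As a sketch, a CRUSH rule that spreads replicas across racks could look
like this (assuming the CRUSH map defines rack buckets under a root
named "default"; you need at least as many racks as replicas for it to
place them all):

    rule replicated_across_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        # pick one OSD from a host in each distinct rack
        step chooseleaf firstn 0 type rack
        step emit
    }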

If you use all four 40GbE uplinks (common on 10GbE ToR switches), your
cluster-level bandwidth is oversubscribed 4:1. To lower the
oversubscription you are going to have to steal some of the other 48
ports: 12 for 2:1, or 24 for a non-blocking fabric. Given the number of
nodes you have/plan to have, you will be using 6-12 links per switch for
OSD nodes, leaving you with 12-18 links for clients on a non-blocking
fabric, 24-30 at 2:1, and 36-42 at 4:1.

-- 

Kyle


Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Tim Bishop
Hi Wido,

Thanks for the reply.

On Fri, Dec 20, 2013 at 08:14:13AM +0100, Wido den Hollander wrote:
 On 12/18/2013 09:39 PM, Tim Bishop wrote:
  I'm investigating and planning a new Ceph cluster starting with 6
  nodes with currently planned growth to 12 nodes over a few years. Each
  node will probably contain 4 OSDs, maybe 6.
 
  The area I'm currently investigating is how to configure the
  networking. To avoid a SPOF I'd like to have redundant switches for
  both the public network and the internal network, most likely running
  at 10Gb. I'm considering splitting the nodes in to two separate racks
  and connecting each half to its own switch, and then trunk the
  switches together to allow the two halves of the cluster to see each
  other. The idea being that if a single switch fails I'd only lose half
  of the cluster.
 
 Why not three switches in total and use VLANs on the switches to 
 separate public/cluster traffic?
 
 This way you can configure the CRUSH map to have one replica go to each 
 switch so that when you lose a switch you still have two replicas 
 available.
 
 Saves you a lot of switches and makes the network simpler.

I was planning to use VLANs to separate the public and cluster traffic
on the same switches.

Two switches cost less than three :-) On a slightly larger cluster it
might make more sense to go up to three (or even more) switches, but I'm
not sure the extra cost is worth it at this scale. So the plan is two
switches, with half of the cluster connected to each.

  (I'm not touching on the required third MON in a separate location and
  the CRUSH rules to make sure data is correctly replicated - I'm happy
  with the setup there)
 
  To allow consumers of Ceph to see the full cluster they'd be directly
  connected to both switches. I could have another layer of switches for
  them and interlinks between them, but I'm not sure it's worth it on
  this sort of scale.
 
  My question is about configuring the public network. If it's all one
  subnet then the clients consuming the Ceph resources can't have both
  links active, so they'd be configured in an active/standby role. But
  this results in quite heavy usage of the trunk between the two
  switches when a client accesses nodes on the other switch than the one
  they're actively connected to.
 
 
 Why can't the clients have both links active? You could use LACP? Some 
 switches support MLAG to span LACP trunks over two switches.
 
 Or use some intelligent bonding mode in the Linux kernel.

I've only ever used LACP to the same switch, and I hadn't realised there
were options for spanning LACP links across multiple switches. Thanks
for the information there.
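
From what I understand, the LACP variant only changes the mode-related
options in the bond definition (a sketch, assuming Debian/Ubuntu
ifenslave with placeholder names and addresses), with the switch pair
presenting both ports as a single LAG via MLAG or stacking:

    auto bond0
    iface bond0 inet static
        address 10.0.1.10
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode 802.3ad
        bond-miimon 100
        bond-lacp-rate fast
        bond-xmit-hash-policy layer3+4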

  So, can I configure multiple public networks? I think so, based on the
  documentation, but I'm not completely sure. Can I have one half of the
  cluster on one subnet, and the other half on another? And then the
  client machine can have interfaces in different subnets and do the
  right thing with both interfaces to talk to all the nodes. This seems
  like a fairly simple solution that avoids a SPOF in Ceph or the network
  layer.
 
 There is no restriction on the IPs of the OSDs. All they need is a Layer 
 3 route to the WHOLE cluster and monitors.
 
 It doesn't have to be a Layer 2 network; everything can simply be
 Layer 3. You just have to make sure all the nodes can reach each other.

Thanks, that makes sense and makes planning simpler. I suppose it's
logical really... in a HUGE cluster you'd probably have all manner of
networks spread around the datacenter.

  Or maybe I'm missing an alternative that would be better? I'm aiming
  for something that keeps things as simple as possible while meeting
  the redundancy requirements.
 
          client
             |
             |
        core switch
         /   |   \
       /     |     \
     /       |       \
  switch1 switch2 switch3
     |       |       |
    OSD     OSD     OSD
 
 
 You could build something like that. That would be fairly simple.

Isn't the core switch in that diagram a SPOF? Or is it presumed to
already be a redundant setup?

 Keep in mind that you can always lose a switch and still keep I/O going.
 
 Wido

Thanks for your help. You answered my main point about IP addressing on
the public side, and gave me some other stuff to think about.

Tim.

  As an aside, there's a similar issue on the cluster network side with
  heavy traffic on the trunk between the two cluster switches. But I
  can't see that's avoidable, and presumably it's something people just
  have to deal with in larger Ceph installations?
 
  Finally, this is all theoretical planning to try and avoid designing
  in bottlenecks at the outset. I don't have any concrete ideas of
  loading so in practice none of it may be an issue.

Re: [ceph-users] Ceph network topology with redundant switches

2013-12-19 Thread Wido den Hollander

On 12/18/2013 09:39 PM, Tim Bishop wrote:

Hi all,

I'm investigating and planning a new Ceph cluster starting with 6
nodes with currently planned growth to 12 nodes over a few years. Each
node will probably contain 4 OSDs, maybe 6.

The area I'm currently investigating is how to configure the
networking. To avoid a SPOF I'd like to have redundant switches for
both the public network and the internal network, most likely running
at 10Gb. I'm considering splitting the nodes in to two separate racks
and connecting each half to its own switch, and then trunk the
switches together to allow the two halves of the cluster to see each
other. The idea being that if a single switch fails I'd only lose half
of the cluster.



Why not three switches in total and use VLANs on the switches to 
separate public/cluster traffic?


This way you can configure the CRUSH map to have one replica go to each 
switch so that when you lose a switch you still have two replicas 
available.


Saves you a lot of switches and makes the network simpler.


(I'm not touching on the required third MON in a separate location and
the CRUSH rules to make sure data is correctly replicated - I'm happy
with the setup there)

To allow consumers of Ceph to see the full cluster they'd be directly
connected to both switches. I could have another layer of switches for
them and interlinks between them, but I'm not sure it's worth it on
this sort of scale.

My question is about configuring the public network. If it's all one
subnet then the clients consuming the Ceph resources can't have both
links active, so they'd be configured in an active/standby role. But
this results in quite heavy usage of the trunk between the two
switches when a client accesses nodes on the other switch than the one
they're actively connected to.



Why can't the clients have both links active? You could use LACP? Some 
switches support MLAG to span LACP trunks over two switches.


Or use some intelligent bonding mode in the Linux kernel.


So, can I configure multiple public networks? I think so, based on the
documentation, but I'm not completely sure. Can I have one half of the
cluster on one subnet, and the other half on another? And then the
client machine can have interfaces in different subnets and do the
right thing with both interfaces to talk to all the nodes. This seems
like a fairly simple solution that avoids a SPOF in Ceph or the network
layer.


There is no restriction on the IPs of the OSDs. All they need is a Layer 
3 route to the WHOLE cluster and monitors.


It doesn't have to be a Layer 2 network; everything can simply be
Layer 3. You just have to make sure all the nodes can reach each other.
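
As a sketch with placeholder addresses (say 10.0.1.0/24 and 10.0.2.0/24
for the public side of racks 1 and 2, 10.1.1.0/24 and 10.1.2.0/24 for
the cluster side), a rack 1 node only needs routes like:

    ip route add 10.0.2.0/24 via 10.0.1.1    # public net, rack 2 via rack 1 gateway
    ip route add 10.1.2.0/24 via 10.1.1.1    # cluster net, rack 2 via rack 1 gateway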




Or maybe I'm missing an alternative that would be better? I'm aiming
for something that keeps things as simple as possible while meeting
the redundancy requirements.


         client
            |
            |
       core switch
        /   |   \
      /     |     \
    /       |       \
 switch1 switch2 switch3
    |       |       |
   OSD     OSD     OSD


You could build something like that. That would be fairly simple.

Keep in mind that you can always lose a switch and still keep I/O going.

Wido


As an aside, there's a similar issue on the cluster network side with
heavy traffic on the trunk between the two cluster switches. But I
can't see that's avoidable, and presumably it's something people just
have to deal with in larger Ceph installations?

Finally, this is all theoretical planning to try and avoid designing
in bottlenecks at the outset. I don't have any concrete ideas of
loading so in practice none of it may be an issue.

Thanks for your thoughts.

Tim.




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


[ceph-users] Ceph network topology with redundant switches

2013-12-18 Thread Tim Bishop
Hi all,

I'm investigating and planning a new Ceph cluster starting with 6
nodes with currently planned growth to 12 nodes over a few years. Each
node will probably contain 4 OSDs, maybe 6.

The area I'm currently investigating is how to configure the
networking. To avoid a SPOF I'd like to have redundant switches for
both the public network and the internal network, most likely running
at 10Gb. I'm considering splitting the nodes in to two separate racks
and connecting each half to its own switch, and then trunk the
switches together to allow the two halves of the cluster to see each
other. The idea being that if a single switch fails I'd only lose half
of the cluster.

(I'm not touching on the required third MON in a separate location and
the CRUSH rules to make sure data is correctly replicated - I'm happy
with the setup there)

To allow consumers of Ceph to see the full cluster they'd be directly
connected to both switches. I could have another layer of switches for
them and interlinks between them, but I'm not sure it's worth it on
this sort of scale.

My question is about configuring the public network. If it's all one
subnet then the clients consuming the Ceph resources can't have both
links active, so they'd be configured in an active/standby role. But
this results in quite heavy usage of the trunk between the two
switches when a client accesses nodes on the other switch than the one
they're actively connected to.

So, can I configure multiple public networks? I think so, based on the
documentation, but I'm not completely sure. Can I have one half of the
cluster on one subnet, and the other half on another? And then the
client machine can have interfaces in different subnets and do the
right thing with both interfaces to talk to all the nodes. This seems
like a fairly simple solution that avoids a SPOF in Ceph or the network
layer.

Or maybe I'm missing an alternative that would be better? I'm aiming
for something that keeps things as simple as possible while meeting
the redundancy requirements.

As an aside, there's a similar issue on the cluster network side with
heavy traffic on the trunk between the two cluster switches. But I
can't see that's avoidable, and presumably it's something people just
have to deal with in larger Ceph installations?

Finally, this is all theoretical planning to try and avoid designing
in bottlenecks at the outset. I don't have any concrete ideas of
loading so in practice none of it may be an issue.

Thanks for your thoughts.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
