Thought I'd share details of my setup, as I am effectively achieving this
(i.e. making monitors accessible over multiple interfaces) with IP routing,
as follows:
My Ceph hosts each have a /32 IP address on a loopback interface, and
that is the IP address that all their Ceph daemons are bound to. In
ceph.conf I do that by setting all the following values to the host's
loopback IP: "mon addr" in the [mon.x] sections; "cluster addr" and
"public addr" in the [mds.x] sections; and "cluster addr", "public
addr", and "osd heartbeat addr" in the [osd.x] sections. Then I use IP
routing to ensure that the hosts can all reach each other's loopback
IPs, and that clients can reach them too, over the relevant networks.
This also allows inter-host traffic to fail over to using alternate
paths if the normal path is down for some reason.
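To make the binding concrete, here is a minimal sketch of what one host's ceph.conf sections might look like under this scheme. The host name, section names, and the loopback address 10.255.0.1 are hypothetical, not taken from my actual config:

```ini
; Sketch for one host (node1), assuming 10.255.0.1/32 is its loopback IP.
[mon.node1]
    host = node1
    mon addr = 10.255.0.1

[mds.node1]
    host = node1
    cluster addr = 10.255.0.1
    public addr = 10.255.0.1

[osd.0]
    host = node1
    cluster addr = 10.255.0.1
    public addr = 10.255.0.1
    osd heartbeat addr = 10.255.0.1
```

With every daemon bound to the loopback IP, which physical interface the traffic enters or leaves on is purely a routing decision, which is what makes the multi-path setup below possible.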
To put this in context, my cluster is just a small 3-node cluster, and I
have a pair of layer 3 switches, with networking arranged like this:
node1:
has an 8 Gbps point-to-point routed link to node2 (and uses this link to
communicate with node2 under normal circumstances)
has an 8 Gbps point-to-point routed link to node3 (and uses this link to
communicate with node3 under normal circumstances)
has a 2 Gbps routed link to the primary layer 3 switch (clients which
are not other nodes in the cluster use this link to communicate with
node1 under normal circumstances, and traffic to nodes 2 and 3 will
switch to using this if their respective point-to-point links go down)
has a 1 Gbps routed link to the secondary layer 3 switch (this is not
used under normal circumstances - only if the 2 Gbps link goes down)
node2:
has an 8 Gbps point-to-point routed link to node3 (and uses this link to
communicate with node3 under normal circumstances)
has an 8 Gbps point-to-point routed link to node1 (and uses this link to
communicate with node1 under normal circumstances)
has a 2 Gbps routed link to the primary layer 3 switch (clients which
are not other nodes in the cluster use this link to communicate with
node2 under normal circumstances, and traffic to nodes 3 and 1 will
switch to using this if their respective point-to-point links go down)
has a 1 Gbps routed link to the secondary layer 3 switch (this is not
used under normal circumstances - only if the 2 Gbps link goes down)
node3:
has an 8 Gbps point-to-point routed link to node1 (and uses this link to
communicate with node1 under normal circumstances)
has an 8 Gbps point-to-point routed link to node2 (and uses this link to
communicate with node2 under normal circumstances)
has a 2 Gbps routed link to the primary layer 3 switch (clients which
are not other nodes in the cluster use this link to communicate with
node3 under normal circumstances, and traffic to nodes 1 and 2 will
switch to using this if their respective point-to-point links go down)
has a 1 Gbps routed link to the secondary layer 3 switch (this is not
used under normal circumstances - only if the 2 Gbps link goes down)
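The per-node routing described above can be expressed as static routes with different metrics, so the kernel prefers the point-to-point link and falls back automatically when a link loses carrier (the kernel drops routes on a downed interface, letting the next-best metric take over). This is an illustrative sketch for node1's routes towards node2; all addresses and interface names are hypothetical:

```shell
# Preferred path to node2's loopback: the 8 Gbps point-to-point link.
ip route add 10.255.0.2/32 via 192.168.12.2 dev p2p-node2 metric 10

# Fallback via the primary layer 3 switch (2 Gbps uplink).
ip route add 10.255.0.2/32 via 192.168.1.254 dev uplink1 metric 20

# Last resort via the secondary layer 3 switch (1 Gbps uplink).
ip route add 10.255.0.2/32 via 192.168.2.254 dev uplink2 metric 30
```

Equivalent sets of routes would exist on each node for each of the other nodes' loopback IPs.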
I have avoided using a proper routing protocol for this, as failover
still works automatically when links go down even with static routes. I
do, however, also have scripts running on the hosts that detect when the
device at the other end of a link is not pingable even though the link
is up, and dynamically remove/insert the routes as necessary in that
situation. But adapting this approach to a larger cluster, where
point-to-point links between all hosts aren't viable, might well warrant
the use of a routing protocol.
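The watchdog scripts I mentioned amount to something like the following per-link sketch. It is not my actual script, just an illustration of the idea; the peer loopback, next-hop address, and interface name are all placeholders:

```shell
#!/bin/sh
# Per-link watchdog sketch: withdraw the preferred route when the far
# end stops answering pings even though the link itself is still up.
PEER=10.255.0.2   # node2's loopback IP (hypothetical)
VIA=192.168.12.2  # far end of the point-to-point link (hypothetical)
DEV=p2p-node2     # interface for the point-to-point link (hypothetical)

if ping -c 3 -W 1 "$VIA" >/dev/null 2>&1; then
    # Far end answers: ensure the preferred route is in place.
    ip route replace "$PEER/32" via "$VIA" dev "$DEV" metric 10
else
    # Link is up but the far end is dead: remove the route so traffic
    # falls back to the higher-metric route via the layer 3 switch.
    ip route del "$PEER/32" via "$VIA" dev "$DEV" 2>/dev/null
fi
```

Run periodically (e.g. from cron or a loop), this covers the failure mode that static routes alone miss: a peer that is unreachable while the physical link still shows carrier.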
The end result is that I have more control over where the different
kinds of traffic go, and I can mess around with the networking without
any effect on the cluster.
Alex
On 20/12/2014 5:23 AM, Craig Lewis wrote:
On Fri, Dec 19, 2014 at 6:19 PM, Francois Lafont <flafdiv...@free.fr> wrote:
So, indeed, I have to use routing *or* maybe create 2 monitors
by server like this:
[mon.node1-public1]
host = ceph-node1
mon addr = 10.0.1.1
[mon.node1-public2]
host = ceph-node1
mon addr = 10.0.2.1
# etc...
But, in this case, the working directories of mon.node1-public1
and mon.node1-public2 will be in the same disk (I have no
choice). Is it a problem? Are monitors big consumers of I/O disk?
Interesting idea. While you will have an even number of monitors,
you'll still have an odd number of failure domains. I'm not sure if
it'll work though... make sure you test having the leader on both
networks. It might cause problems if the leader is on the 10.0.1.0/24
network?
Monitors can be big consumers of disk IO if there is a lot of cluster
activity. Monitors record all of the cluster changes in LevelDB, and
send copies to all of the daemons. There have been posts to the ML
about people running out of disk IOps on the monitors, and the
problems that causes. The bigger the cluster, the more IOps. As long
as you monitor and alert on your monitor disk IOps, I don't think it
would be a problem.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com