[ceph-users] failure of public network kills connectivity

2016-01-05 Thread Adrian Imboden

Hi

I recently set up a small ceph cluster at home for testing and private 
purposes.
It works really great, but I have a problem that may come from my 
small-size configuration.


All nodes are running Ubuntu 14.04 and ceph infernalis 9.2.0.

I have two networks as recommended:
cluster network: 10.10.128.0/24
public network: 10.10.48.0/22

All ip addresses are configured statically.

The behaviour I see is that when I run the rados benchmark (e.g. "rados bench -p data 300 write"), both the public and the cluster network are used to transmit the data (about 50%/50%).
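
(In case it is useful: one way to watch how the traffic splits across the two interfaces while the benchmark runs; nothing Ceph-specific, just the standard kernel counters and sysstat:)

# raw per-interface byte counters, refreshed every second
watch -n 1 'cat /proc/net/dev'

# or, with the sysstat package installed, live per-interface rates
sar -n DEV 1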


When I disconnect the cluster from the public network, the connection 
between the OSDs is lost, while the monitors keep seeing each other:
HEALTH_WARN 129 pgs degraded; 127 pgs stale; 129 pgs undersized; 
recovery 1885/7316 objects degraded (25.765%); 6/8 in osds are down
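
(A quick way to see exactly which OSDs are marked down at that point; these are just the standard status commands:)

ceph health detail    # lists the degraded/stale PGs and names the down OSDs
ceph osd tree         # shows the up/down state per OSD and host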



What I expect is that only the cluster network is used when a Ceph node itself reads or writes data.
Furthermore, I expected that a failure of the public network would not affect the connectivity between the nodes themselves.


What am I misunderstanding, or what am I configuring the wrong way?

I plan to run KVM on these same nodes alongside the storage cluster, as it is only a small setup. That's the reason why I am a little concerned about this behaviour.


This is how it is set up:
|- node1 (cluster: 10.10.128.1, public: 10.10.49.1)
|  |- osd
|  |- osd
|  |- mon
|
|- node2 (cluster: 10.10.128.2, public: 10.10.49.2)
|  |- osd
|  |- osd
|  |- mon
|
|- node3 (cluster: 10.10.128.3, public: 10.10.49.3)
|  |- osd
|  |- osd
|  |- mon
|
|- node4 (cluster: 10.10.128.4, public: 10.10.49.4)
|  |- osd
|  |- osd
|

This is my ceph config:

[global]
auth supported = cephx

fsid = 64599def-5741-4bda-8ce5-31a85af884bb
mon initial members = node1 node3 node2 node4
mon host = 10.10.128.1 10.10.128.3 10.10.128.2 10.10.128.4
public network = 10.10.48.0/22
cluster network = 10.10.128.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[mon.node1]
host = node1
mon data = /var/lib/ceph/mon/ceph-node1/

[mon.node3]
host = node3
mon data = /var/lib/ceph/mon/ceph-node3/

[mon.node2]
host = node2
mon data = /var/lib/ceph/mon/ceph-node2/

[mon.node4]
host = node4
mon data = /var/lib/ceph/mon/ceph-node4/


Thank you very much

Greetings
Adrian


Re: [ceph-users] failure of public network kills connectivity

2016-01-05 Thread Wido den Hollander
On 01/05/2016 07:59 PM, Adrian Imboden wrote:
> Hi
> 
> I recently set up a small ceph cluster at home for testing and private
> purposes.
> It works really great, but I have a problem that may come from my
> small-size configuration.
> 
> All nodes are running Ubuntu 14.04 and ceph infernalis 9.2.0.
> 
> I have two networks as recommended:
> cluster network: 10.10.128.0/24
> public network: 10.10.48.0/22
> 
> All ip addresses are configured statically.
> 
> The behaviour I see is that when I run the rados benchmark
> (e.g. "rados bench -p data 300 write"),
> both the public and the cluster network are used to transmit the data
> (about 50%/50%).
> 
> When I disconnect the cluster from the public network, the connection
> between the OSDs is lost, while the monitors keep seeing each other:
> HEALTH_WARN 129 pgs degraded; 127 pgs stale; 129 pgs undersized;
> recovery 1885/7316 objects degraded (25.765%); 6/8 in osds are down
> 

The cluster network is only used for replication and recovery traffic
between OSDs. The monitors are not present on the cluster network; they
live only on the public network.
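
Note, for example, that your "mon host" line points at the cluster-subnet
addresses (10.10.128.x). If the monitors are supposed to live on the public
network, that line would normally list the public addresses instead, roughly
(a sketch based on the addresses from your layout, untested):

mon host = 10.10.49.1 10.10.49.2 10.10.49.3 10.10.49.4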

I usually don't see a good reason to use a cluster network, since it
only adds another failure domain.

If you have a single network for Ceph, it works just fine. In most of the
cases I see, bandwidth is not the problem; latency usually is.

My advice: do not over-engineer, and stick with a single network. It makes
life a lot easier.
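
For example, a single-network version of your config is just the same
[global] section with the cluster network line left out (or commented out),
so Ceph puts all traffic on the public network:

[global]
...
public network = 10.10.48.0/22
# cluster network = 10.10.128.0/24   <- simply drop this line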

Wido


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on