Hi,

I'm playing with our new Ceph cluster, and it seems that Ceph does not handle
a maxed-out cluster network gracefully.

I had some nodes "flapping" every few minutes when pushing a lot of traffic to
them, so I decided to set the noup and nodown flags as described in the docs:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
After that, the setup actually breaks: Ceph starts complaining about slow
requests and the cluster stops processing traffic altogether.
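For reference, both flags are cluster-wide and are toggled with `ceph osd set`
and `ceph osd unset`. While nodown is set, OSDs that stop answering heartbeats
are not marked down, and while noup is set, down OSDs are not marked up again.
A minimal sketch follows; the `flags_on`/`flags_off` helpers and the `CEPH`
override (handy for a dry run with `CEPH=echo`) are illustrative only and not
part of Ceph itself.

```shell
# Hypothetical helpers around the standard ceph CLI.
# CEPH defaults to the real binary; set CEPH=echo to print the commands instead.
CEPH=${CEPH:-ceph}

# Stop the monitors from changing OSD up/down state (what the docs suggest for flapping).
flags_on()  { $CEPH osd set noup && $CEPH osd set nodown; }

# Restore normal up/down handling.
flags_off() { $CEPH osd unset noup && $CEPH osd unset nodown; }
```

Usage: `flags_on` before the benchmark, `flags_off` afterwards.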

ceph -w shows the following:
2013-11-20 08:02:20.031412 osd.4 [WRN] slow request 120.991605 seconds old, received at 2013-11-20 08:00:19.039748: osd_op(client.4650.0:46 benchmark_data_fqdn_hostname_9016_object45 [write 0~4194304] 3.a11ea1e6 e158) v4 currently waiting for subops from [17,26]
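In a warning like this, the primary (osd.4) is still waiting on the replica
OSDs listed in "waiting for subops from" (17 and 26), which points at the
hosts and links worth checking. A hypothetical sed helper to summarise such
lines while watching `ceph -w`; it assumes each warning arrives on one line in
exactly the format shown above (other Ceph versions may word it differently).

```shell
# Hypothetical filter: print "<osd> age=<seconds>s waiting_on=<replica ids>"
# for each slow-request warning; all other lines are dropped (-n without p).
slow_request_summary() {
  sed -n 's/.* \(osd\.[0-9]*\) \[WRN] slow request \([0-9.]*\) seconds old,.*waiting for subops from \[\([0-9,]*\)].*/\1 age=\2s waiting_on=\3/p'
}
```

Usage: `ceph -w | slow_request_summary`.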

When I unset noup and nodown, things start working again.
So for now I am inclined to just accept the flapping nodes: apart from some
brief flapping showing up in the Ceph log, things do keep working.
(Also, this is rados bench; real traffic might well be I/O limited rather
than network limited.)

Suggestions?

Thx,
Robert
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
