Hi,

I’m running a production Ceph cluster with 3 monitors and 10 OSD nodes. The 
cluster hosts RBD volumes for OpenStack VMs. My data pool uses a k=6, m=2 
erasure code profile, with a 3-replica metadata pool in front of it. The 
cluster normally runs well, but we recently had a short outage that triggered 
unexpected behaviour.
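
For context, this is roughly how I can inspect the profile and pool settings; 
the profile and pool names below are placeholders rather than my exact ones:

    ceph osd erasure-code-profile get ec-6-2            # the k=6 m=2 profile
    ceph osd pool get volumes-data erasure_code_profile  # profile used by the data pool
    ceph osd pool get volumes-data min_size              # shards required for a PG to serve I/O
    ceph osd pool get volumes-metadata size              # 3-copy metadata pool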

I’ve always been under the impression that Ceph would keep working properly 
even if nodes went down. I tested this several months ago with the same 
configuration and it worked fine as long as no more than 2 nodes went down. 
This time, however, the first monitor as well as two OSD nodes went down. As a 
result, OpenStack VMs were able to mount their RBD volumes but unable to read 
from them, even after the cluster had recovered, with the following health 
message: Reduced data availability: 599 pgs inactive, 599 pgs incomplete.
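
For reference, these are the sort of commands I can use to dig into the stuck 
PGs; the PG id below is only an example:

    ceph health detail             # lists the inactive/incomplete PGs
    ceph pg dump_stuck inactive    # shows the stuck PGs and the OSDs they map to
    ceph pg 2.1f query             # example PG id; shows why that PG is incomplete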

I believe the cluster should have continued to work properly despite the 
outage, so what could have prevented it from functioning? Is it because there 
were only two monitors remaining? Or is it related to that reduced data 
availability message? In that case, is my erasure coding configuration 
suitable for that number of nodes?
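
To be concrete about what I’m asking, this is roughly how I’d check the 
monitor quorum and the data pool’s min_size (pool name is again a placeholder):

    ceph quorum_status                       # two of three monitors should still form a quorum
    ceph osd pool get volumes-data min_size  # I suspect this, relative to the 6 surviving
                                             # shards per PG, is what kept the PGs inactive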

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.



