Also, your min_size is set to 2, which means at least 2 copies of your data must be up for a PG to accept I/O. You do not want a min_size of 1. With min_size of 1, a single surviving copy can keep receiving writes; if that copy then goes down as well, nothing stops one of the other 2 copies (which know nothing about those latest writes) from coming back up first and serving stale data. At that point your data is effectively corrupt, because the client cannot know which state of the data it is actually seeing.
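For example, a quick sketch using the standard pool commands (pool name taken from the cluster quoted below, not run against it) of how to check and pin this per pool:

$ ceph osd pool get cephfs_data min_size      # show the current value
$ ceph osd pool set cephfs_data min_size 2    # keep it at 2; do not drop it to 1

With size 3 / min_size 2 a PG keeps serving I/O with one copy missing, but it pauses (rather than risk divergent writes) as soon as only one copy is left.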
On Fri, Jun 2, 2017 at 10:34 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:
> You only have 3 OSDs, hence with one down you only have 2 left for a
> replication factor of 3.
>
> No spare OSD to place the 3rd copy on; if you were to add a 4th node the
> issue would be resolved.
>
> ,Ashley
>
> On 2 Jun 2017, at 10:31 PM, Oleg Obleukhov <leoleov...@gmail.com> wrote:
>
> Hello,
> I am playing around with ceph (ceph version 10.2.7
> (50e863e0f4bc8f4b9e31156de690d765af245185)) on Debian Jessie and I built a
> test setup:
>
> $ ceph osd tree
> ID WEIGHT  TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 0.01497 root default
> -2 0.00499     host af-staging-ceph01
>  0 0.00499         osd.0                    up  1.00000          1.00000
> -3 0.00499     host af-staging-ceph02
>  1 0.00499         osd.1                    up  1.00000          1.00000
> -4 0.00499     host af-staging-ceph03
>  2 0.00499         osd.2                    up  1.00000          1.00000
>
> So I have 3 OSDs on 3 servers.
> I also created 2 pools:
>
> ceph osd dump | grep 'replicated size'
> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 32 pgp_num 32 last_change 33 flags hashpspool
> crash_replay_interval 45 stripe_width 0
> pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 32 pgp_num 32 last_change 31 flags hashpspool
> stripe_width 0
>
> Now I am testing failover and kill one of the servers:
>
> ceph osd tree
> ID WEIGHT  TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 0.01497 root default
> -2 0.00499     host af-staging-ceph01
>  0 0.00499         osd.0                    up  1.00000          1.00000
> -3 0.00499     host af-staging-ceph02
>  1 0.00499         osd.1                  down  1.00000          1.00000
> -4 0.00499     host af-staging-ceph03
>  2 0.00499         osd.2                    up  1.00000          1.00000
>
> And now it is stuck in the recovery state:
>
> ceph -s
>     cluster 6b5ff07a-7232-4840-b486-6b7906248de7
>      health HEALTH_WARN
>             64 pgs degraded
>             18 pgs stuck unclean
>             64 pgs undersized
>             recovery 21/63 objects degraded (33.333%)
>             1/3 in osds are down
>             1 mons down, quorum 0,2 af-staging-ceph01,af-staging-ceph03
>      monmap e1: 3 mons at {af-staging-ceph01=10.36.0.121:6789/0,af-staging-ceph02=10.36.0.122:6789/0,af-staging-ceph03=10.36.0.123:6789/0}
>             election epoch 38, quorum 0,2 af-staging-ceph01,af-staging-ceph03
>       fsmap e29: 1/1/1 up {0=af-staging-ceph03.crm.ig.local=up:active}, 2 up:standby
>      osdmap e78: 3 osds: 2 up, 3 in; 64 remapped pgs
>             flags sortbitwise,require_jewel_osds
>       pgmap v334: 64 pgs, 2 pools, 47129 bytes data, 21 objects
>             122 MB used, 15204 MB / 15326 MB avail
>             21/63 objects degraded (33.333%)
>                   64 active+undersized+degraded
>
> And if I kill one more node I lose access to the mounted file system on the client.
> Normally I would expect the replica factor to be respected and Ceph to
> create the missing copies of the degraded PGs.
>
> I tried rebuilding the crush map and it looks like this, but this did
> not help:
>
> rule replicated_ruleset {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type osd
>     step emit
> }
>
> # end crush map
>
> Would very much appreciate help,
> Thank you very much in advance,
> Oleg.
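For reference, a minimal sketch of how the quoted situation can be confirmed from the command line (standard Jewel-era commands, not re-run against this particular cluster):

$ ceph health detail                     # lists the undersized/degraded PGs
$ ceph pg dump_stuck undersized          # PGs that cannot reach 3 copies
$ ceph osd pool get cephfs_data size     # pool wants 3 replicas...
$ ceph osd stat                          # ...but only 2 OSDs are up

With size 3, three OSDs and one of them down, there is no spare OSD left to recover onto, so the PGs stay active+undersized+degraded until a 4th OSD (or the failed one) comes back.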