Let's see if I got this: 4-host cluster, size=3, min_size=2, 2 hosts fail. Are all of the following accurate?
a. An RBD image is split into lots of objects, parts of which will probably exist on all 4 hosts.

b. Some objects will have 2 of their 3 replicas on OSDs in the 2 offline hosts.

c. Reads can continue from the single online OSD, even in PGs that happened to have 2 of their 3 OSDs offline.

d. Writes hang for PGs that have 2 offline OSDs because those PGs can't meet the min_size=2 constraint.

e. Rebalancing does not occur because, with only 2 hosts online, there is no way for CRUSH to meet the size=3 constraint (one replica per host) even if it were to rebalance.

f. I/O can be restored by setting min_size=1 (commands for this and for (g) are sketched after this list).

g. Alternatively, I/O can be restored by setting size=2, which would kick off rebalancing and restore I/O as the PGs come into compliance with the size=2 constraint.

h. If I instead have a cluster with 10 hosts, size=3 and min_size=2, and 2 hosts fail, some PGs would have only 1 OSD online, but rebalancing would start immediately since CRUSH can still honor the size=3 constraint by rebalancing. This means more nodes make for a more reliable cluster.

i. If I wanted to force CRUSH to bring I/O back online with size=3 and min_size=2 but only 2 hosts online, I could remove the host bucket from the crushmap. CRUSH would then rebalance, but some PGs would likely end up with 3 OSDs all on the same host. (This is theory. I promise not to do any such thing to a production system ;)
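For my own notes, here are the commands I believe (f) and (g) boil down to. A sketch I haven't run; the pool name 'rbd' is taken from the thread below.

Inspect the current constraints:

    # ceph osd pool get rbd size
    # ceph osd pool get rbd min_size

For (f), allow I/O to resume from a single surviving replica:

    # ceph osd pool set rbd min_size 1

For (g), lower the replica count so that 2 hosts can satisfy it:

    # ceph osd pool set rbd size 2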
Thanks
--
Adam Carheden

On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> If you had set min_size to 1 you would not have seen the writes pause. A
> min_size of 1 is dangerous though, because it means you are 1 hard disk
> failure away from losing the objects within that placement group
> entirely. A min_size of 2 is generally considered the minimum you want,
> but many people ignore that advice, and some wish they hadn't.
>
> On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carhe...@ucar.edu> wrote:
>
> > Thanks everyone for the replies. Very informative. However, should I
> > have expected writes to pause if I'd had min_size set to 1 instead of 2?
> >
> > And yes, I was under the false impression that my RBD device was a
> > single object. That explains what all those other things are on a test
> > cluster where I only created a single object!
> >
> > --
> > Adam Carheden
> >
> > On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > > This is because of the min_size specification. I would bet you have it
> > > set at 2 (which is good).
> > >
> > > ceph osd pool get rbd min_size
> > >
> > > With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
> > > from each host) results in some of the objects having only 1 replica.
> > > min_size dictates that I/O freezes for those objects until min_size is
> > > achieved.
> > > http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> > >
> > > I can't tell if you're under the impression that your RBD device is a
> > > single object. It is not. It is chunked up into many objects and spread
> > > throughout the cluster, as Kjetil mentioned earlier.
> > >
> > > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
> > > > will get you a "prefix", which then gets you on to
> > > > rbd_header.<prefix>. rbd_header.<prefix> contains block size,
> > > > striping, etc. The actual data-bearing objects will be named
> > > > something like rbd_data.<prefix>.%016x.
> > > >
> > > > Example: vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
> > > > <block size> of that image will be named
> > > > rbd_data.86ce2ae8944a.0000000000000000, the second <block size> will
> > > > be rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances are
> > > > that one of these objects is mapped to a PG which has both host3 and
> > > > host4 among its replicas.
> > > >
> > > > An RBD image will end up scattered across most/all OSDs of the pool
> > > > it's in.
> > > >
> > > > Cheers,
> > > > -KJ
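(Interjecting below Kjetil's explanation: if I follow the naming scheme, I should be able to confirm it on my test cluster along these lines. A sketch I haven't run, reusing the 86ce2ae8944a prefix from his example; the `rbd info` field name may vary by version.)

Confirm the data-object prefix for the image:

    # rbd info vm-100-disk-1 | grep block_name_prefix

List a few of the image's data objects:

    # rados -p rbd ls | grep '^rbd_data.86ce2ae8944a' | head

See which PG and OSDs the first chunk maps to:

    # ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000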
> > > > On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carhe...@ucar.edu> wrote:
> > > >
> > > > > I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
> > > > > running on hosts 1, 2 and 3. It has a single replicated pool of size
> > > > > 3. I have a VM with its hard drive replicated to OSDs 11 (host3),
> > > > > 5 (host1) and 3 (host2).
> > > > >
> > > > > I can 'fail' any one host by disabling the SAN network interface, and
> > > > > the VM keeps running with a simple slowdown in I/O performance, just
> > > > > as expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on
> > > > > the VM. (i.e. `df` never completes, etc.) The monitors on hosts 1 and
> > > > > 2 still have quorum, so that shouldn't be an issue. The placement
> > > > > group still has 2 of its 3 replicas online.
> > > > >
> > > > > Why does I/O hang even though host4 isn't running a monitor and
> > > > > doesn't have anything to do with my VM's hard drive?
> > > > >
> > > > > Size?
> > > > > # ceph osd pool get rbd size
> > > > > size: 3
> > > > >
> > > > > Where's rbd_id.vm-100-disk-1?
> > > > > # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object rbd_id.vm-100-disk-1 /tmp/map
> > > > > got osdmap epoch 1043
> > > > > osdmaptool: osdmap file '/tmp/map'
> > > > > object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
> > > > >
> > > > > # ceph osd tree
> > > > > ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > > > > -1 8.06160 root default
> > > > > -7 5.50308     room A
> > > > > -3 1.88754         host host1
> > > > >  4 0.40369             osd.4        up  1.00000          1.00000
> > > > >  5 0.40369             osd.5        up  1.00000          1.00000
> > > > >  6 0.54008             osd.6        up  1.00000          1.00000
> > > > >  7 0.54008             osd.7        up  1.00000          1.00000
> > > > > -2 3.61554         host host2
> > > > >  0 0.90388             osd.0        up  1.00000          1.00000
> > > > >  1 0.90388             osd.1        up  1.00000          1.00000
> > > > >  2 0.90388             osd.2        up  1.00000          1.00000
> > > > >  3 0.90388             osd.3        up  1.00000          1.00000
> > > > > -6 2.55852     room B
> > > > > -4 1.75114         host host3
> > > > >  8 0.40369             osd.8        up  1.00000          1.00000
> > > > >  9 0.40369             osd.9        up  1.00000          1.00000
> > > > > 10 0.40369             osd.10       up  1.00000          1.00000
> > > > > 11 0.54008             osd.11       up  1.00000          1.00000
> > > > > -5 0.80737         host host4
> > > > > 12 0.40369             osd.12       up  1.00000          1.00000
> > > > > 13 0.40369             osd.13       up  1.00000          1.00000
> > > > >
> > > > > --
> > > > > Adam Carheden
> > > >
> > > > --
> > > > Kjetil Joergensen <kje...@medallia.com>
> > > > SRE, Medallia Inc
> > > > Phone: +1 (650) 739-6580
>
> --
> Respectfully,
>
> Wes Dillingham
> wes_dilling...@harvard.edu
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com