[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.
>> Note: "step chose" was selected by creating the crush rule with ceph on pool >> creation. If the default should be "step choseleaf" (with OSD buckets), then >> the automatic crush rule generation in ceph ought to be fixed for EC >> profiles. > Interesting. Which exact command was used to create the pool? I can reproduce. By default with "host" failure domain, the resulting rule will "chooseleaf indep host". But if you create an ec profile with crush-failure-domain=osd, then resulting rules will "choose indep osd". We should open a tracker for this. Either "choose indep osd" and "chooseleaf indep osd" should be give the same result, or the pool creation should use "chooseleaf indep osd" in this case. -- dan On Tue, Aug 30, 2022 at 1:43 PM Dan van der Ster wrote: > > > Note: "step chose" was selected by creating the crush rule with ceph on > > pool creation. If the default should be "step choseleaf" (with OSD > > buckets), then the automatic crush rule generation in ceph ought to be > > fixed for EC profiles. > > Interesting. Which exact command was used to create the pool? > > > These experiments indicate that there is a very weird behaviour > > implemented, I would actually call this a serious bug. > > I don't think this is a bug. Each of your attempts with different > _tries values changed the max iterations of the various loops in > crush. Since this takes crush on different "paths" to find a valid > OSD, the output is going to be different. > > > The resulting mapping should be independent of the maximum number of trials > > No this is wrong.. the "tunables" change the mapping. The important > thing is that every node + client in the cluster agrees on the mapping > -- and indeed since they all use the same tunables, including the > values for *_tries, they will all agree on the up/acting set. > > Cheers, Dan > > On Tue, Aug 30, 2022 at 1:10 PM Frank Schilder wrote: > > > > Hi Dan, > > > > thanks a lot for looking into this. I can't entirely reproduce your > > results. Maybe we are using different versions and there was a change? I'm > > testing with the octopus 16.2.16 image: quay.io/ceph/ceph:v15.2.16. > > > > Note: "step chose" was selected by creating the crush rule with ceph on > > pool creation. If the default should be "step choseleaf" (with OSD > > buckets), then the automatic crush rule generation in ceph ought to be > > fixed for EC profiles. > > > > My results with the same experiments as you did, I can partly confirm and > > partly I see oddness that I would consider a bug (reported at the very end): > > > > rule fs-data { > > id 1 > > type erasure > > min_size 3 > > max_size 6 > > step take default > > step choose indep 0 type osd > > step emit > > } > > > > # osdmaptool --test-map-pg 4.1c osdmap.bin > > osdmaptool: osdmap file 'osdmap.bin' > > parsed '4.1c' -> 4.1c > > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) > > acting ([6,1,4,5,3,1], p6) > > > > rule fs-data { > > id 1 > > type erasure > > min_size 3 > > max_size 6 > > step take default > > step chooseleaf indep 0 type osd > > step emit > > } > > > > # osdmaptool --test-map-pg 4.1c osdmap.bin > > osdmaptool: osdmap file 'osdmap.bin' > > parsed '4.1c' -> 4.1c > > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], > > p6) > > > > So far, so good. 
Now the oddness: > > > > rule fs-data { > > id 1 > > type erasure > > min_size 3 > > max_size 6 > > step set_chooseleaf_tries 5 > > step set_choose_tries 100 > > step take default > > step chooseleaf indep 0 type osd > > step emit > > } > > > > # osdmaptool --test-map-pg 4.1c osdmap.bin > > osdmaptool: osdmap file 'osdmap.bin' > > parsed '4.1c' -> 4.1c > > 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], > > p6) > > > > How can this be different?? I thought crush returns on the first successful > > mapping. This ought to be identical to the previous one. It gets even more > > weird: > > > > rule fs-data { > > id 1 > > type erasure > > min_size 3 > > max_size 6 > > step set_chooseleaf_tries 50 > > step set_choose_tries 200 > > step take default > > step chooseleaf indep 0 type osd > > step emit > > } > > > > # osdmaptool --test-map-pg 4.1c osdmap.bin > > osdmaptool: osdmap file 'osdmap.bin' > > parsed '4.1c' -> 4.1c > > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting > > ([6,1,4,5,3,1], p6) > > > > What We increase the maximum number of trials for searching and we > > end up with an invalid mapping?? > > > > These experiments indicate that there is a very weird behaviour > > implemented, I would actually call this a serious bug. The resulting > > mapping should be independent of the maximum number of trials
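For anyone who wants to reproduce the rule-generation behaviour Dan describes, a minimal sketch on a test cluster could look like the following; the profile and pool names (ec42osd, ec-test) are placeholders, not names used in this thread:

# ceph osd erasure-code-profile set ec42osd k=4 m=2 crush-failure-domain=osd
# ceph osd pool create ec-test 32 32 erasure ec42osd
# ceph osd crush rule dump ec-test        # inspect the auto-generated rule

With crush-failure-domain=osd the generated rule is reported above to use "choose indep ... type osd", whereas the default host failure domain produces a "chooseleaf indep ... type host" step.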
[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.
> Note: "step chose" was selected by creating the crush rule with ceph on pool > creation. If the default should be "step choseleaf" (with OSD buckets), then > the automatic crush rule generation in ceph ought to be fixed for EC profiles. Interesting. Which exact command was used to create the pool? > These experiments indicate that there is a very weird behaviour implemented, > I would actually call this a serious bug. I don't think this is a bug. Each of your attempts with different _tries values changed the max iterations of the various loops in crush. Since this takes crush on different "paths" to find a valid OSD, the output is going to be different. > The resulting mapping should be independent of the maximum number of trials No this is wrong.. the "tunables" change the mapping. The important thing is that every node + client in the cluster agrees on the mapping -- and indeed since they all use the same tunables, including the values for *_tries, they will all agree on the up/acting set. Cheers, Dan On Tue, Aug 30, 2022 at 1:10 PM Frank Schilder wrote: > > Hi Dan, > > thanks a lot for looking into this. I can't entirely reproduce your results. > Maybe we are using different versions and there was a change? I'm testing > with the octopus 16.2.16 image: quay.io/ceph/ceph:v15.2.16. > > Note: "step chose" was selected by creating the crush rule with ceph on pool > creation. If the default should be "step choseleaf" (with OSD buckets), then > the automatic crush rule generation in ceph ought to be fixed for EC profiles. > > My results with the same experiments as you did, I can partly confirm and > partly I see oddness that I would consider a bug (reported at the very end): > > rule fs-data { > id 1 > type erasure > min_size 3 > max_size 6 > step take default > step choose indep 0 type osd > step emit > } > > # osdmaptool --test-map-pg 4.1c osdmap.bin > osdmaptool: osdmap file 'osdmap.bin' > parsed '4.1c' -> 4.1c > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) acting > ([6,1,4,5,3,1], p6) > > rule fs-data { > id 1 > type erasure > min_size 3 > max_size 6 > step take default > step chooseleaf indep 0 type osd > step emit > } > > # osdmaptool --test-map-pg 4.1c osdmap.bin > osdmaptool: osdmap file 'osdmap.bin' > parsed '4.1c' -> 4.1c > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6) > > So far, so good. Now the oddness: > > rule fs-data { > id 1 > type erasure > min_size 3 > max_size 6 > step set_chooseleaf_tries 5 > step set_choose_tries 100 > step take default > step chooseleaf indep 0 type osd > step emit > } > > # osdmaptool --test-map-pg 4.1c osdmap.bin > osdmaptool: osdmap file 'osdmap.bin' > parsed '4.1c' -> 4.1c > 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6) > > How can this be different?? I thought crush returns on the first successful > mapping. This ought to be identical to the previous one. It gets even more > weird: > > rule fs-data { > id 1 > type erasure > min_size 3 > max_size 6 > step set_chooseleaf_tries 50 > step set_choose_tries 200 > step take default > step chooseleaf indep 0 type osd > step emit > } > > # osdmaptool --test-map-pg 4.1c osdmap.bin > osdmaptool: osdmap file 'osdmap.bin' > parsed '4.1c' -> 4.1c > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting > ([6,1,4,5,3,1], p6) > > What We increase the maximum number of trials for searching and we > end up with an invalid mapping?? 
> > These experiments indicate that there is a very weird behaviour implemented, > I would actually call this a serious bug. The resulting mapping should be > independent of the maximum number of trials (if I understood the crush > algorithm correctly). In any case, a valid mapping should never be replaced > in favour of an invalid one (containing a down+out OSD). > > For now there is a happy end on my test cluster: > > # ceph pg dump pgs_brief | grep 4.1c > dumped pgs_brief > 4.1c active+remapped+backfilling [6,1,4,5,3,8] 6 > [6,1,4,5,3,1] 6 > > Please look into the extremely odd behaviour reported above. I'm quite > confident that this is unintended if not dangerous behaviour and should be > corrected. I'm willing to file a tracker item with the data above. I'm > actually wondering if this might be related to > https://tracker.ceph.com/issues/56995 . > > Thanks for tracking this down and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Dan van der Ster > Sent: 30 August 2022 12:16:37 > To: Frank Schilder > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] Bug in crush algorithm? 1
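For reference, the offline round-trip Frank is using to test rule changes against a saved osdmap is roughly the following (file names are just examples; --import-crush rewrites the osdmap file in place, so work on a copy):

# ceph osd getmap -o osdmap.bin
# osdmaptool osdmap.bin --export-crush crush.map
# crushtool -d crush.map -o crush.txt
# (edit the rule in crush.txt)
# crushtool -c crush.txt -o crush.map.new
# cp osdmap.bin osdmap.test
# osdmaptool osdmap.test --import-crush crush.map.new
# osdmaptool --test-map-pg 4.1c osdmap.test

This only changes the local files; nothing is injected into the cluster until the edited crush map is explicitly set there.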
[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.
BTW, the defaults for _tries seem to work too:

# diff -u crush.txt crush.txt2
--- crush.txt   2022-08-30 11:27:41.941836374 +0200
+++ crush.txt2  2022-08-30 11:55:45.601891010 +0200
@@ -90,10 +90,10 @@
         type erasure
         min_size 3
         max_size 6
-        step set_chooseleaf_tries 50
-        step set_choose_tries 200
+        step set_chooseleaf_tries 5
+        step set_choose_tries 100
         step take default
-        step choose indep 0 type osd
+        step chooseleaf indep 0 type osd
         step emit
 }

# osdmaptool --test-map-pg 4.1c osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)

-- dan

On Tue, Aug 30, 2022 at 11:50 AM Dan van der Ster wrote:
>
> BTW, I vaguely recalled seeing this before. Yup, found it:
> https://tracker.ceph.com/issues/55169
>
> On Tue, Aug 30, 2022 at 11:46 AM Dan van der Ster wrote:
> >
> > > 2. osd.7 is destroyed but still "up" in the osdmap.
> >
> > Oops, you can ignore this point -- this was an observation I had while
> > playing with the osdmap -- your osdmap.bin has osd.7 down correctly.
> >
> > In case you're curious, here was what confused me:
> >
> > # osdmaptool osdmap.bin2 --mark-up-in --mark-out 7 --dump plain
> > osd.7 up out weight 0 up_from 3846 up_thru 3853 down_at 3855
> > last_clean_interval [0,0)
> > [v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
> > [v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
> > destroyed,exists,up
> >
> > Just ignore this ...
> >
> > -- dan
> >
> > On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster wrote:
> > >
> > > Hi Frank,
> > >
> > > I suspect this is a combination of issues.
> > > 1. You have "choose" instead of "chooseleaf" in rule 1.
> > > 2. osd.7 is destroyed but still "up" in the osdmap.
> > > 3. The _tries settings in rule 1 are not helping.
> > >
> > > Here are my tests:
> > >
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > > osdmaptool: osdmap file 'osdmap.bin'
> > > parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> > > acting ([6,1,4,5,3,1], p6)
> > >
> > > ^^ This is what you observe now.
> > >
> > > # diff -u crush.txt crush.txt2
> > > --- crush.txt   2022-08-30 11:27:41.941836374 +0200
> > > +++ crush.txt2  2022-08-30 11:31:29.631491424 +0200
> > > @@ -93,7 +93,7 @@
> > >          step set_chooseleaf_tries 50
> > >          step set_choose_tries 200
> > >          step take default
> > > -        step choose indep 0 type osd
> > > +        step chooseleaf indep 0 type osd
> > >          step emit
> > >  }
> > > # crushtool -c crush.txt2 -o crush.map2
> > > # cp osdmap.bin osdmap.bin2
> > > # osdmaptool --import-crush crush.map2 osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > > osdmaptool: imported 1166 byte crush map from crush.map2
> > > osdmaptool: writing epoch 4990 to osdmap.bin2
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > > parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> > > ([6,1,4,5,3,1], p6)
> > >
> > > ^^ The mapping is now "correct" in that it doesn't duplicate the
> > > mapping to osd.1. However it tries to use osd.7 which is destroyed but
> > > up.
> > >
> > > You might be able to fix that by fully marking osd.7 out.
> > > I can also get a good mapping by removing the *_tries settings from
> > > rule 1:
> > >
> > > # diff -u crush.txt crush.txt2
> > > --- crush.txt   2022-08-30 11:27:41.941836374 +0200
> > > +++ crush.txt2  2022-08-30 11:38:14.068102835 +0200
> > > @@ -90,10 +90,8 @@
> > >          type erasure
> > >          min_size 3
> > >          max_size 6
> > > -        step set_chooseleaf_tries 50
> > > -        step set_choose_tries 200
> > >          step take default
> > > -        step choose indep 0 type osd
> > > +        step chooseleaf indep 0 type osd
> > >          step emit
> > >  }
> > > ...
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > > parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting
> > > ([6,1,4,5,3,1], p6)
> > >
> > > Note that I didn't need to adjust the reweights:
> > >
> > > # osdmaptool osdmap.bin2 --tree
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > > ID CLASS WEIGHT  TYPE NAME         STATUS     REWEIGHT PRI-AFF
> > > -1       2.44798 root default
> > > -7       0.81599     host tceph-01
> > >  0   hdd 0.27199         osd.0     up          0.87999 1.0
> > >  3   hdd 0.27199         osd.3     up          0.98000 1.0
> > >  6   hdd 0.27199         osd.6     up          0.92999 1.0
> > > -3       0.81599     host tceph-02
> > >  2   hdd 0.27199         osd.2     up          0.95999 1.0
> > >  4   hdd 0.27199         osd.4     up          0.8     1.0
> > >  8   hdd 0.27199         osd.8     up          0.8     1.0
> > > -5       0.81599     host tceph-03
> > >  1   hdd 0.27199         osd.1     up          0.8     1.0
> > >  5   hdd 0.27199         osd.5     up          1.0     1.0
> > >  7   hdd 0.27199
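To check a candidate rule against all inputs rather than one PG at a time, crushtool's test mode can be used; a sketch, assuming rule id 1 and k+m=6 as in the rules quoted above, and the file names used in this thread:

# crushtool -i crush.map2 --test --rule 1 --num-rep 6 --show-bad-mappings
# osdmaptool osdmap.bin2 --test-map-pgs --pool 4

--show-bad-mappings prints only the inputs for which crush could not find 6 distinct OSDs, so empty output is the result you want.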
[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.
BTW, I vaguely recalled seeing this before. Yup, found it:
https://tracker.ceph.com/issues/55169

On Tue, Aug 30, 2022 at 11:46 AM Dan van der Ster wrote:
>
> > 2. osd.7 is destroyed but still "up" in the osdmap.
>
> Oops, you can ignore this point -- this was an observation I had while
> playing with the osdmap -- your osdmap.bin has osd.7 down correctly.
>
> In case you're curious, here was what confused me:
>
> # osdmaptool osdmap.bin2 --mark-up-in --mark-out 7 --dump plain
> osd.7 up out weight 0 up_from 3846 up_thru 3853 down_at 3855
> last_clean_interval [0,0)
> [v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
> [v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
> destroyed,exists,up
>
> Just ignore this ...
>
> -- dan
>
> On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster wrote:
> >
> > Hi Frank,
> >
> > I suspect this is a combination of issues.
> > 1. You have "choose" instead of "chooseleaf" in rule 1.
> > 2. osd.7 is destroyed but still "up" in the osdmap.
> > 3. The _tries settings in rule 1 are not helping.
> >
> > Here are my tests:
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> > acting ([6,1,4,5,3,1], p6)
> >
> > ^^ This is what you observe now.
> >
> > # diff -u crush.txt crush.txt2
> > --- crush.txt   2022-08-30 11:27:41.941836374 +0200
> > +++ crush.txt2  2022-08-30 11:31:29.631491424 +0200
> > @@ -93,7 +93,7 @@
> >          step set_chooseleaf_tries 50
> >          step set_choose_tries 200
> >          step take default
> > -        step choose indep 0 type osd
> > +        step chooseleaf indep 0 type osd
> >          step emit
> >  }
> > # crushtool -c crush.txt2 -o crush.map2
> > # cp osdmap.bin osdmap.bin2
> > # osdmaptool --import-crush crush.map2 osdmap.bin2
> > osdmaptool: osdmap file 'osdmap.bin2'
> > osdmaptool: imported 1166 byte crush map from crush.map2
> > osdmaptool: writing epoch 4990 to osdmap.bin2
> > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > osdmaptool: osdmap file 'osdmap.bin2'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> > ([6,1,4,5,3,1], p6)
> >
> > ^^ The mapping is now "correct" in that it doesn't duplicate the
> > mapping to osd.1. However it tries to use osd.7 which is destroyed but
> > up.
> >
> > You might be able to fix that by fully marking osd.7 out.
> > I can also get a good mapping by removing the *_tries settings from rule 1:
> >
> > # diff -u crush.txt crush.txt2
> > --- crush.txt   2022-08-30 11:27:41.941836374 +0200
> > +++ crush.txt2  2022-08-30 11:38:14.068102835 +0200
> > @@ -90,10 +90,8 @@
> >          type erasure
> >          min_size 3
> >          max_size 6
> > -        step set_chooseleaf_tries 50
> > -        step set_choose_tries 200
> >          step take default
> > -        step choose indep 0 type osd
> > +        step chooseleaf indep 0 type osd
> >          step emit
> >  }
> > ...
> > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > osdmaptool: osdmap file 'osdmap.bin2'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
> >
> > Note that I didn't need to adjust the reweights:
> >
> > # osdmaptool osdmap.bin2 --tree
> > osdmaptool: osdmap file 'osdmap.bin2'
> > ID CLASS WEIGHT  TYPE NAME         STATUS     REWEIGHT PRI-AFF
> > -1       2.44798 root default
> > -7       0.81599     host tceph-01
> >  0   hdd 0.27199         osd.0     up          0.87999 1.0
> >  3   hdd 0.27199         osd.3     up          0.98000 1.0
> >  6   hdd 0.27199         osd.6     up          0.92999 1.0
> > -3       0.81599     host tceph-02
> >  2   hdd 0.27199         osd.2     up          0.95999 1.0
> >  4   hdd 0.27199         osd.4     up          0.8     1.0
> >  8   hdd 0.27199         osd.8     up          0.8     1.0
> > -5       0.81599     host tceph-03
> >  1   hdd 0.27199         osd.1     up          0.8     1.0
> >  5   hdd 0.27199         osd.5     up          1.0     1.0
> >  7   hdd 0.27199         osd.7     destroyed   0       1.0
> >
> > Does this work in real life?
> >
> > Cheers, Dan
> >
> > On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder wrote:
> > >
> > > Hi Dan,
> > >
> > > please find attached (only 7K, so I hope it goes through).
> > > md5sum=1504652f1b95802a9f2fe3725bf1336e
> > >
> > > I was playing a bit around with the crush map and found out the following:
> > >
> > > 1) Setting all re-weights to 1 does produce valid mappings. However, it
> > > will lead to large imbalances and is impractical in operations.
> > >
> > > 2) Doing something as simple/stupid as the following also results in
> > > valid mappings without having to change the weights:
> > >
> > > rule fs-data {
> > >         id 1
> > >         type erasure
> > >         min_size 3
> > >         max_size 6
> > >         step set_chooseleaf_tries 50
> > >         step set_choose_tries 200
> > >         step take default
> > >         step choosel
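If osd.7 keeps being drawn even though it is destroyed, one option beyond the rule changes discussed in this thread is to remove it from the crush map entirely, once it is certain the OSD will not come back; a sketch, to be adapted to the actual cluster:

# ceph osd out 7
# ceph osd purge 7 --yes-i-really-mean-it   # removes osd.7 from the crush map and the osdmap

After that, the destroyed OSD no longer participates in the placement calculation at all.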
[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.
> 2. osd.7 is destroyed but still "up" in the osdmap.

Oops, you can ignore this point -- this was an observation I had while
playing with the osdmap -- your osdmap.bin has osd.7 down correctly.

In case you're curious, here was what confused me:

# osdmaptool osdmap.bin2 --mark-up-in --mark-out 7 --dump plain
osd.7 up out weight 0 up_from 3846 up_thru 3853 down_at 3855
last_clean_interval [0,0)
[v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
[v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
destroyed,exists,up

Just ignore this ...

-- dan

On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster wrote:
>
> Hi Frank,
>
> I suspect this is a combination of issues.
> 1. You have "choose" instead of "chooseleaf" in rule 1.
> 2. osd.7 is destroyed but still "up" in the osdmap.
> 3. The _tries settings in rule 1 are not helping.
>
> Here are my tests:
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
> parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> acting ([6,1,4,5,3,1], p6)
>
> ^^ This is what you observe now.
>
> # diff -u crush.txt crush.txt2
> --- crush.txt   2022-08-30 11:27:41.941836374 +0200
> +++ crush.txt2  2022-08-30 11:31:29.631491424 +0200
> @@ -93,7 +93,7 @@
>          step set_chooseleaf_tries 50
>          step set_choose_tries 200
>          step take default
> -        step choose indep 0 type osd
> +        step chooseleaf indep 0 type osd
>          step emit
>  }
> # crushtool -c crush.txt2 -o crush.map2
> # cp osdmap.bin osdmap.bin2
> # osdmaptool --import-crush crush.map2 osdmap.bin2
> osdmaptool: osdmap file 'osdmap.bin2'
> osdmaptool: imported 1166 byte crush map from crush.map2
> osdmaptool: writing epoch 4990 to osdmap.bin2
> # osdmaptool --test-map-pg 4.1c osdmap.bin2
> osdmaptool: osdmap file 'osdmap.bin2'
> parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> ([6,1,4,5,3,1], p6)
>
> ^^ The mapping is now "correct" in that it doesn't duplicate the
> mapping to osd.1. However it tries to use osd.7 which is destroyed but
> up.
>
> You might be able to fix that by fully marking osd.7 out.
> I can also get a good mapping by removing the *_tries settings from rule 1:
>
> # diff -u crush.txt crush.txt2
> --- crush.txt   2022-08-30 11:27:41.941836374 +0200
> +++ crush.txt2  2022-08-30 11:38:14.068102835 +0200
> @@ -90,10 +90,8 @@
>          type erasure
>          min_size 3
>          max_size 6
> -        step set_chooseleaf_tries 50
> -        step set_choose_tries 200
>          step take default
> -        step choose indep 0 type osd
> +        step chooseleaf indep 0 type osd
>          step emit
>  }
> ...
> # osdmaptool --test-map-pg 4.1c osdmap.bin2
> osdmaptool: osdmap file 'osdmap.bin2'
> parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
>
> Note that I didn't need to adjust the reweights:
>
> # osdmaptool osdmap.bin2 --tree
> osdmaptool: osdmap file 'osdmap.bin2'
> ID CLASS WEIGHT  TYPE NAME         STATUS     REWEIGHT PRI-AFF
> -1       2.44798 root default
> -7       0.81599     host tceph-01
>  0   hdd 0.27199         osd.0     up          0.87999 1.0
>  3   hdd 0.27199         osd.3     up          0.98000 1.0
>  6   hdd 0.27199         osd.6     up          0.92999 1.0
> -3       0.81599     host tceph-02
>  2   hdd 0.27199         osd.2     up          0.95999 1.0
>  4   hdd 0.27199         osd.4     up          0.8     1.0
>  8   hdd 0.27199         osd.8     up          0.8     1.0
> -5       0.81599     host tceph-03
>  1   hdd 0.27199         osd.1     up          0.8     1.0
>  5   hdd 0.27199         osd.5     up          1.0     1.0
>  7   hdd 0.27199         osd.7     destroyed   0       1.0
>
> Does this work in real life?
>
> Cheers, Dan
>
> On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder wrote:
> >
> > Hi Dan,
> >
> > please find attached (only 7K, so I hope it goes through).
> > md5sum=1504652f1b95802a9f2fe3725bf1336e
> >
> > I was playing a bit around with the crush map and found out the following:
> >
> > 1) Setting all re-weights to 1 does produce valid mappings. However, it
> > will lead to large imbalances and is impractical in operations.
> >
> > 2) Doing something as simple/stupid as the following also results in valid
> > mappings without having to change the weights:
> >
> > rule fs-data {
> >         id 1
> >         type erasure
> >         min_size 3
> >         max_size 6
> >         step set_chooseleaf_tries 50
> >         step set_choose_tries 200
> >         step take default
> >         step chooseleaf indep 3 type host
> >         step emit
> >         step take default
> >         step chooseleaf indep -3 type host
> >         step emit
> > }
> >
> > rule fs-data {
> >         id 1
> >         type erasure
> >         min_size 3
> >         max_size 6
> >         step set_chooseleaf_tries 50
> >         step set_choose_tries 200
> >         step take default
> >         step choose indep 3 type osd
> >         step emit
[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.
Hi Frank,

I suspect this is a combination of issues.
1. You have "choose" instead of "chooseleaf" in rule 1.
2. osd.7 is destroyed but still "up" in the osdmap.
3. The _tries settings in rule 1 are not helping.

Here are my tests:

# osdmaptool --test-map-pg 4.1c osdmap.bin
osdmaptool: osdmap file 'osdmap.bin'
parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
acting ([6,1,4,5,3,1], p6)

^^ This is what you observe now.

# diff -u crush.txt crush.txt2
--- crush.txt   2022-08-30 11:27:41.941836374 +0200
+++ crush.txt2  2022-08-30 11:31:29.631491424 +0200
@@ -93,7 +93,7 @@
         step set_chooseleaf_tries 50
         step set_choose_tries 200
         step take default
-        step choose indep 0 type osd
+        step chooseleaf indep 0 type osd
         step emit
 }
# crushtool -c crush.txt2 -o crush.map2
# cp osdmap.bin osdmap.bin2
# osdmaptool --import-crush crush.map2 osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
osdmaptool: imported 1166 byte crush map from crush.map2
osdmaptool: writing epoch 4990 to osdmap.bin2
# osdmaptool --test-map-pg 4.1c osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
([6,1,4,5,3,1], p6)

^^ The mapping is now "correct" in that it doesn't duplicate the
mapping to osd.1. However it tries to use osd.7 which is destroyed but
up.

You might be able to fix that by fully marking osd.7 out.
I can also get a good mapping by removing the *_tries settings from rule 1:

# diff -u crush.txt crush.txt2
--- crush.txt   2022-08-30 11:27:41.941836374 +0200
+++ crush.txt2  2022-08-30 11:38:14.068102835 +0200
@@ -90,10 +90,8 @@
         type erasure
         min_size 3
         max_size 6
-        step set_chooseleaf_tries 50
-        step set_choose_tries 200
         step take default
-        step choose indep 0 type osd
+        step chooseleaf indep 0 type osd
         step emit
 }
...
# osdmaptool --test-map-pg 4.1c osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)

Note that I didn't need to adjust the reweights:

# osdmaptool osdmap.bin2 --tree
osdmaptool: osdmap file 'osdmap.bin2'
ID CLASS WEIGHT  TYPE NAME         STATUS     REWEIGHT PRI-AFF
-1       2.44798 root default
-7       0.81599     host tceph-01
 0   hdd 0.27199         osd.0     up          0.87999 1.0
 3   hdd 0.27199         osd.3     up          0.98000 1.0
 6   hdd 0.27199         osd.6     up          0.92999 1.0
-3       0.81599     host tceph-02
 2   hdd 0.27199         osd.2     up          0.95999 1.0
 4   hdd 0.27199         osd.4     up          0.8     1.0
 8   hdd 0.27199         osd.8     up          0.8     1.0
-5       0.81599     host tceph-03
 1   hdd 0.27199         osd.1     up          0.8     1.0
 5   hdd 0.27199         osd.5     up          1.0     1.0
 7   hdd 0.27199         osd.7     destroyed   0       1.0

Does this work in real life?

Cheers, Dan

On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder wrote:
>
> Hi Dan,
>
> please find attached (only 7K, so I hope it goes through).
> md5sum=1504652f1b95802a9f2fe3725bf1336e
>
> I was playing a bit around with the crush map and found out the following:
>
> 1) Setting all re-weights to 1 does produce valid mappings. However, it will
> lead to large imbalances and is impractical in operations.
>
> 2) Doing something as simple/stupid as the following also results in valid
> mappings without having to change the weights:
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 50
>         step set_choose_tries 200
>         step take default
>         step chooseleaf indep 3 type host
>         step emit
>         step take default
>         step chooseleaf indep -3 type host
>         step emit
> }
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 50
>         step set_choose_tries 200
>         step take default
>         step choose indep 3 type osd
>         step emit
>         step take default
>         step choose indep -3 type osd
>         step emit
> }
>
> Of course, now the current weights are probably unsuitable as everything
> moves around. It probably also takes a lot more total tries to get rid of
> mappings with duplicate OSDs.
>
> I probably have to read the code to understand how drawing straws from 8
> different buckets with non-zero probabilities can lead to an infinite
> sequence of failed attempts of getting 6 different ones. There seems to be a
> hard-coded tunable that turns seemingly infinite into finite somehow.
>
> The first modified rule will probably lead to better distribution of load,
> but bad distribution of data if a disk goes down (considering the tiny host-
> and disk numbers). The second rule seems to be almost as good or bad as the
> default one (step choose indep 0 type osd), except that it does
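The hard-coded limit Frank suspects is most likely the choose_total_tries tunable (50 in the default tunable profiles), which bounds how many times crush retries before giving up on a position. It can be inspected, for example, with:

# ceph osd crush show-tunables
# crushtool -d crush.map -o /dev/stdout | head   # the decompiled map starts with the tunable lines

This is a sketch of where to look, not a claim about which value was active on Frank's test cluster.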
[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.
Hi Frank,

CRUSH can only find 5 OSDs, given your current tree, rule, and reweights.
This is why there is a NONE in the UP set for shard 6.
But in ACTING we see that it is refusing to remove shard 6 from osd.1 --
that is the only copy of that shard, so in this case it's helping you
rather than deleting the shard altogether.

ACTING == what the OSDs are serving now.
UP == where CRUSH wants to place the shards.

I suspect that this is a case of CRUSH tunables + your reweights putting
CRUSH in a corner case of not finding 6 OSDs for that particular PG.
If you set the reweights all back to 1, it probably finds 6 OSDs?

Cheers, Dan

On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder wrote:
>
> Hi all,
>
> I'm investigating a problem with a degenerated PG on an octopus 15.2.16 test
> cluster. It has 3 hosts x 3 OSDs and a 4+2 EC pool with failure domain OSD.
> After simulating a disk failure by removing an OSD and letting the cluster
> recover (all under load), I end up with a PG with the same OSD allocated
> twice:
>
> PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
>
> OSD 1 is allocated twice. How is this even possible?
>
> Here the OSD tree:
>
> ID CLASS WEIGHT  TYPE NAME         STATUS     REWEIGHT PRI-AFF
> -1       2.44798 root default
> -7       0.81599     host tceph-01
>  0   hdd 0.27199         osd.0     up          0.87999 1.0
>  3   hdd 0.27199         osd.3     up          0.98000 1.0
>  6   hdd 0.27199         osd.6     up          0.92999 1.0
> -3       0.81599     host tceph-02
>  2   hdd 0.27199         osd.2     up          0.95999 1.0
>  4   hdd 0.27199         osd.4     up          0.8     1.0
>  8   hdd 0.27199         osd.8     up          0.8     1.0
> -5       0.81599     host tceph-03
>  1   hdd 0.27199         osd.1     up          0.8     1.0
>  5   hdd 0.27199         osd.5     up          1.0     1.0
>  7   hdd 0.27199         osd.7     destroyed   0       1.0
>
> I tried already to change some tunables thinking about
> https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon,
> but giving up too soon is obviously not the problem. It is accepting a wrong
> mapping.
>
> Is there a way out of this? Clearly this is calling for trouble if not data
> loss and should not happen at all.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
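To test Dan's suggestion, the reweights can be reset one OSD at a time and the mapping re-checked; a minimal sketch using the OSD ids from the tree above:

# ceph osd reweight 1 1.0
# ceph pg map 4.1c        # shows the up and acting sets after the change

ceph osd reweight only changes the override reweight (the REWEIGHT column), not the crush weight, so it can be rolled back just as easily.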