> Note: "step chose" was selected by creating the crush rule with ceph on pool 
> creation. If the default should be "step choseleaf" (with OSD buckets), then 
> the automatic crush rule generation in ceph ought to be fixed for EC profiles.

Interesting. Which exact command was used to create the pool?

> These experiments indicate that there is a very weird behaviour implemented;
> I would actually call this a serious bug.

I don't think this is a bug. Each of your attempts with different
_tries values changed the max iterations of the various loops in
crush. Since this takes crush on different "paths" to find a valid
OSD, the output is going to be different.
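
The same effect is easy to see with crushtool by baking different values
of the choose_total_tries tunable into the compiled map and comparing the
test output. A rough, untested sketch (using a crush map extracted from
your osdmap):

# crushtool -i crush.map --set-choose-total-tries 50  -o crush.t50
# crushtool -i crush.map --set-choose-total-tries 250 -o crush.t250
# crushtool -i crush.t50  --test --rule 1 --num-rep 6 --show-mappings > t50.txt
# crushtool -i crush.t250 --test --rule 1 --num-rep 6 --show-mappings > t250.txt
# diff t50.txt t250.txt

Some inputs will map differently between the two, which is exactly the
point: the mapping is a function of those values.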

> The resulting mapping should be independent of the maximum number of trials

No, this is wrong: the "tunables" change the mapping. The important
thing is that every node + client in the cluster agrees on the mapping
-- and indeed since they all use the same tunables, including the
values for *_tries, they will all agree on the up/acting set.
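
You can double-check what the cluster has agreed on with:

# ceph osd crush show-tunables

The per-rule "step set_choose_tries" / "step set_chooseleaf_tries" steps
override those defaults for that rule only, but since they live in the
crush map itself, every daemon and client still sees the same values.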

Cheers, Dan

On Tue, Aug 30, 2022 at 1:10 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> thanks a lot for looking into this. I can't entirely reproduce your results. 
> Maybe we are using different versions and there was a change? I'm testing 
> with the octopus 15.2.16 image: quay.io/ceph/ceph:v15.2.16.
>
> Note: "step chose" was selected by creating the crush rule with ceph on pool 
> creation. If the default should be "step choseleaf" (with OSD buckets), then 
> the automatic crush rule generation in ceph ought to be fixed for EC profiles.
>
> Running the same experiments as you did, I can partly confirm your results,
> and partly I see oddness that I would consider a bug (reported at the very
> end):
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step take default
>         step choose indep 0 type osd
>         step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) acting 
> ([6,1,4,5,3,1], p6)
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step take default
>         step chooseleaf indep 0 type osd
>         step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
>
> So far, so good. Now the oddness:
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default
>         step chooseleaf indep 0 type osd
>         step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)
>
> How can this be different?? I thought crush returns on the first successful
> mapping. This ought to be identical to the previous one. It gets even
> weirder:
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 50
>         step set_choose_tries 200
>         step take default
>         step chooseleaf indep 0 type osd
>         step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting 
> ([6,1,4,5,3,1], p6)
>
> Whaaaaat???? We increase the maximum number of trials for searching and we 
> end up with an invalid mapping??
>
> These experiments indicate that there is a very weird behaviour implemented;
> I would actually call this a serious bug. The resulting mapping should be
> independent of the maximum number of trials (if I understood the crush
> algorithm correctly). In any case, a valid mapping should never be discarded
> in favour of an invalid one (containing a down+out OSD).
>
> For now there is a happy ending on my test cluster:
>
> # ceph pg dump pgs_brief | grep 4.1c
> dumped pgs_brief
> 4.1c     active+remapped+backfilling  [6,1,4,5,3,8]           6  
> [6,1,4,5,3,1]               6
>
> Please look into the extremely odd behaviour reported above. I'm quite 
> confident that this is unintended if not dangerous behaviour and should be 
> corrected. I'm willing to file a tracker item with the data above. I'm 
> actually wondering if this might be related to 
> https://tracker.ceph.com/issues/56995 .
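>
> For the tracker item, the reproduction steps would be roughly the following
> (a sketch of what I did; file names are just what I used locally):
>
> # ceph osd getmap -o osdmap.bin
> # osdmaptool osdmap.bin --export-crush crush.map
> # crushtool -d crush.map -o crush.txt
>   (edit the set_*_tries lines in rule fs-data, then:)
> # crushtool -c crush.txt -o crush.map.new
> # osdmaptool --import-crush crush.map.new osdmap.bin
> # osdmaptool --test-map-pg 4.1c osdmap.bin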
>
> Thanks for tracking this down and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dvand...@gmail.com>
> Sent: 30 August 2022 12:16:37
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the same OSD 
> twice.
>
> BTW, the defaults for _tries seem to work too:
>
>
> # diff -u crush.txt crush.txt2
> --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> +++ crush.txt2 2022-08-30 11:55:45.601891010 +0200
> @@ -90,10 +90,10 @@
>   type erasure
>   min_size 3
>   max_size 6
> - step set_chooseleaf_tries 50
> - step set_choose_tries 200
> + step set_chooseleaf_tries 5
> + step set_choose_tries 100
>   step take default
> - step choose indep 0 type osd
> + step chooseleaf indep 0 type osd
>   step emit
>  }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin2
> osdmaptool: osdmap file 'osdmap.bin2'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)
>
>
> -- dan
>
> On Tue, Aug 30, 2022 at 11:50 AM Dan van der Ster <dvand...@gmail.com> wrote:
> >
> > BTW, I vaguely recalled seeing this before. Yup, found it:
> > https://tracker.ceph.com/issues/55169
> >
> > On Tue, Aug 30, 2022 at 11:46 AM Dan van der Ster <dvand...@gmail.com> 
> > wrote:
> > >
> > > > 2. osd.7 is destroyed but still "up" in the osdmap.
> > >
> > > Oops, you can ignore this point -- this was an observation I had while
> > > playing with the osdmap -- your osdmap.bin has osd.7 down correctly.
> > >
> > > In case you're curious, here was what confused me:
> > >
> > > # osdmaptool osdmap.bin2  --mark-up-in --mark-out 7 --dump plain
> > > osd.7 up   out weight 0 up_from 3846 up_thru 3853 down_at 3855
> > > last_clean_interval [0,0)
> > > [v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
> > > [v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
> > > destroyed,exists,up
> > >
> > > Just ignore this ...
> > >
> > >
> > >
> > > -- dan
> > >
> > > On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster <dvand...@gmail.com> 
> > > wrote:
> > > >
> > > > Hi Frank,
> > > >
> > > > I suspect this is a combination of issues.
> > > > 1. You have "choose" instead of "chooseleaf" in rule 1.
> > > > 2. osd.7 is destroyed but still "up" in the osdmap.
> > > > 3. The _tries settings in rule 1 are not helping.
> > > >
> > > > Here are my tests:
> > > >
> > > > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > > > osdmaptool: osdmap file 'osdmap.bin'
> > > >  parsed '4.1c' -> 4.1c
> > > > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> > > > acting ([6,1,4,5,3,1], p6)
> > > >
> > > > ^^ This is what you observe now.
> > > >
> > > > # diff -u crush.txt crush.txt2
> > > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > > > +++ crush.txt2 2022-08-30 11:31:29.631491424 +0200
> > > > @@ -93,7 +93,7 @@
> > > >   step set_chooseleaf_tries 50
> > > >   step set_choose_tries 200
> > > >   step take default
> > > > - step choose indep 0 type osd
> > > > + step chooseleaf indep 0 type osd
> > > >   step emit
> > > >  }
> > > > # crushtool -c crush.txt2 -o crush.map2
> > > > # cp osdmap.bin osdmap.bin2
> > > > # osdmaptool --import-crush crush.map2 osdmap.bin2
> > > > osdmaptool: osdmap file 'osdmap.bin2'
> > > > osdmaptool: imported 1166 byte crush map from crush.map2
> > > > osdmaptool: writing epoch 4990 to osdmap.bin2
> > > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > > osdmaptool: osdmap file 'osdmap.bin2'
> > > >  parsed '4.1c' -> 4.1c
> > > > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> > > > ([6,1,4,5,3,1], p6)
> > > >
> > > > ^^ The mapping is now "correct" in that it doesn't duplicate the
> > > > mapping to osd.1. However it tries to use osd.7 which is destroyed but
> > > > up.
> > > >
> > > > You might be able to fix that by fully marking osd.7 out.
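> > > > An untested sketch of that (note that "purge" also removes osd.7 from
> > > > the crush map entirely, which changes the weights):
> > > >
> > > > # ceph osd out 7
> > > > # ceph osd purge 7 --yes-i-really-mean-it
> > > >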
> > > > I can also get a good mapping by removing the *_tries settings from 
> > > > rule 1:
> > > >
> > > > # diff -u crush.txt crush.txt2
> > > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > > > +++ crush.txt2 2022-08-30 11:38:14.068102835 +0200
> > > > @@ -90,10 +90,8 @@
> > > >   type erasure
> > > >   min_size 3
> > > >   max_size 6
> > > > - step set_chooseleaf_tries 50
> > > > - step set_choose_tries 200
> > > >   step take default
> > > > - step choose indep 0 type osd
> > > > + step chooseleaf indep 0 type osd
> > > >   step emit
> > > >  }
> > > > ...
> > > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > > osdmaptool: osdmap file 'osdmap.bin2'
> > > >  parsed '4.1c' -> 4.1c
> > > > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting 
> > > > ([6,1,4,5,3,1], p6)
> > > >
> > > > Note that I didn't need to adjust the reweights:
> > > >
> > > > # osdmaptool osdmap.bin2 --tree
> > > > osdmaptool: osdmap file 'osdmap.bin2'
> > > > ID CLASS WEIGHT  TYPE NAME         STATUS    REWEIGHT PRI-AFF
> > > > -1       2.44798 root default
> > > > -7       0.81599     host tceph-01
> > > >  0   hdd 0.27199         osd.0            up  0.87999 1.00000
> > > >  3   hdd 0.27199         osd.3            up  0.98000 1.00000
> > > >  6   hdd 0.27199         osd.6            up  0.92999 1.00000
> > > > -3       0.81599     host tceph-02
> > > >  2   hdd 0.27199         osd.2            up  0.95999 1.00000
> > > >  4   hdd 0.27199         osd.4            up  0.89999 1.00000
> > > >  8   hdd 0.27199         osd.8            up  0.89999 1.00000
> > > > -5       0.81599     host tceph-03
> > > >  1   hdd 0.27199         osd.1            up  0.89999 1.00000
> > > >  5   hdd 0.27199         osd.5            up  1.00000 1.00000
> > > >  7   hdd 0.27199         osd.7     destroyed        0 1.00000
> > > >
> > > >
> > > > Does this work in real life?
> > > >
> > > > Cheers, Dan
> > > >
> > > >
> > > > On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder <fr...@dtu.dk> wrote:
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > please find attached (only 7K, so I hope it goes through). 
> > > > > md5sum=1504652f1b95802a9f2fe3725bf1336e
> > > > >
> > > > > I was playing around a bit with the crush map and found the
> > > > > following:
> > > > >
> > > > > 1) Setting all re-weights to 1 does produce valid mappings. However, 
> > > > > it will lead to large imbalances and is impractical in operations.
> > > > >
> > > > > 2) Doing something as simple/stupid as the following also results in 
> > > > > valid mappings without having to change the weights:
> > > > >
> > > > > rule fs-data {
> > > > >         id 1
> > > > >         type erasure
> > > > >         min_size 3
> > > > >         max_size 6
> > > > >         step set_chooseleaf_tries 50
> > > > >         step set_choose_tries 200
> > > > >         step take default
> > > > >         step chooseleaf indep 3 type host
> > > > >         step emit
> > > > >         step take default
> > > > >         step chooseleaf indep -3 type host
> > > > >         step emit
> > > > > }
> > > > >
> > > > > rule fs-data {
> > > > >         id 1
> > > > >         type erasure
> > > > >         min_size 3
> > > > >         max_size 6
> > > > >         step set_chooseleaf_tries 50
> > > > >         step set_choose_tries 200
> > > > >         step take default
> > > > >         step choose indep 3 type osd
> > > > >         step emit
> > > > >         step take default
> > > > >         step choose indep -3 type osd
> > > > >         step emit
> > > > > }
> > > > >
> > > > > Of course, the current weights are now probably unsuitable, as
> > > > > everything moves around. It probably also takes a lot more total tries
> > > > > to get rid of mappings with duplicate OSDs.
> > > > >
> > > > > I probably have to read the code to understand how drawing straws
> > > > > from 8 different buckets with non-zero probabilities can lead to an
> > > > > infinite sequence of failed attempts at getting 6 different ones.
> > > > > There seems to be a hard-coded tunable that turns seemingly infinite
> > > > > into finite somehow.
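> > > > >
> > > > > The first place I would look is the choose_total_tries value at the
> > > > > top of the decompiled crush map; in my map it is the default of 50,
> > > > > which can be seen with something like:
> > > > >
> > > > > # crushtool -d crush.map -o crush.txt
> > > > > # grep tunable crush.txt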
> > > > >
> > > > > The first modified rule will probably lead to better distribution of 
> > > > > load, but bad distribution of data if a disk goes down (considering 
> > > > > the tiny host and disk counts). The second rule seems to be almost
> > > > > as good or bad as the default one (step choose indep 0 type osd), 
> > > > > except that it does produce valid mappings where the default rule 
> > > > > fails.
> > > > >
> > > > > I will hold off on changing the rule in the hope that you find a more
> > > > > elegant solution to this riddle.
> > > > >
> > > > > Best regards,
> > > > > =================
> > > > > Frank Schilder
> > > > > AIT Risø Campus
> > > > > Bygning 109, rum S14
> > > > >
> > > > > ________________________________________
> > > > > From: Dan van der Ster <dvand...@gmail.com>
> > > > > Sent: 29 August 2022 19:13
> > > > > To: Frank Schilder
> > > > > Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the same 
> > > > > OSD twice.
> > > > >
> > > > > Hi Frank,
> > > > >
> > > > > Could you share the osdmap so I can try to solve this riddle?
> > > > >
> > > > > Cheers , Dan
> > > > >
> > > > >
> > > > > On Mon, Aug 29, 2022, 17:26 Frank Schilder 
> > > > > <fr...@dtu.dk<mailto:fr...@dtu.dk>> wrote:
> > > > > Hi Dan,
> > > > >
> > > > > thanks for your answer. I'm not really convinced that we hit a corner
> > > > > case here, and even if it is one, it seems quite relevant for
> > > > > production clusters. The usual way to get a valid mapping is to
> > > > > increase the number of tries. I increased the following maximum
> > > > > numbers of tries, which I would expect to produce a mapping for all PGs:
> > > > >
> > > > > # diff map-now.txt map-new.txt
> > > > > 4c4
> > > > > < tunable choose_total_tries 50
> > > > > ---
> > > > > > tunable choose_total_tries 250
> > > > > 93,94c93,94
> > > > > <       step set_chooseleaf_tries 5
> > > > > <       step set_choose_tries 100
> > > > > ---
> > > > > >       step set_chooseleaf_tries 50
> > > > > >       step set_choose_tries 200
> > > > >
> > > > > When I test the map with crushtool it does not report bad mappings. 
> > > > > Am I looking at the wrong tunables to increase? It should be possible 
> > > > > to get valid mappings without having to modify the re-weights.
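> > > > >
> > > > > For reference, the test I ran was roughly this (a sketch from memory;
> > > > > rule id 1 is the fs-data rule):
> > > > >
> > > > > # crushtool -c map-new.txt -o map-new.bin
> > > > > # crushtool -i map-new.bin --test --rule 1 --num-rep 6 --show-bad-mappings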
> > > > >
> > > > > Thanks again for your help!
> > > > > =================
> > > > > Frank Schilder
> > > > > AIT Risø Campus
> > > > > Bygning 109, rum S14
> > > > >
> > > > > ________________________________________
> > > > > From: Dan van der Ster <dvand...@gmail.com<mailto:dvand...@gmail.com>>
> > > > > Sent: 29 August 2022 16:52:52
> > > > > To: Frank Schilder
> > > > > Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> > > > > Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the same 
> > > > > OSD twice.
> > > > >
> > > > > Hi Frank,
> > > > >
> > > > > CRUSH can only find 5 OSDs, given your current tree, rule, and
> > > > > reweights. This is why there is a NONE in the UP set for shard 6.
> > > > > But in ACTING we see that it is refusing to remove shard 6 from osd.1
> > > > > -- that is the only copy of that shard, so in this case it's helping
> > > > > you rather than deleting the shard altogether.
> > > > > ACTING == what the OSDs are serving now.
> > > > > UP == where CRUSH wants to place the shards.
> > > > >
> > > > > I suspect that this is a case of CRUSH tunables + your reweights
> > > > > putting CRUSH in a corner case of not finding 6 OSDs for that
> > > > > particular PG.
> > > > > If you set the reweights all back to 1, it probably finds 6 OSDs?
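> > > > >
> > > > > Untested, but on a test cluster something like this would reset them
> > > > > (osd.7 is destroyed, so skip it):
> > > > >
> > > > > # for i in 0 1 2 3 4 5 6 8; do ceph osd reweight $i 1.0; done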
> > > > >
> > > > > Cheers, Dan
> > > > >
> > > > >
> > > > > On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder 
> > > > > <fr...@dtu.dk<mailto:fr...@dtu.dk>> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I'm investigating a problem with a degraded PG on an octopus
> > > > > > 15.2.16 test cluster. It has 3 hosts x 3 OSDs and a 4+2 EC pool with
> > > > > > failure domain OSD. After simulating a disk failure by removing an
> > > > > > OSD and letting the cluster recover (all under load), I end up with a
> > > > > > PG with the same OSD allocated twice:
> > > > > >
> > > > > > PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
> > > > > >
> > > > > > OSD 1 is allocated twice. How is this even possible?
> > > > > >
> > > > > > Here the OSD tree:
> > > > > >
> > > > > > ID  CLASS  WEIGHT   TYPE NAME          STATUS     REWEIGHT  PRI-AFF
> > > > > > -1         2.44798  root default
> > > > > > -7         0.81599      host tceph-01
> > > > > >  0    hdd  0.27199          osd.0             up   0.87999  1.00000
> > > > > >  3    hdd  0.27199          osd.3             up   0.98000  1.00000
> > > > > >  6    hdd  0.27199          osd.6             up   0.92999  1.00000
> > > > > > -3         0.81599      host tceph-02
> > > > > >  2    hdd  0.27199          osd.2             up   0.95999  1.00000
> > > > > >  4    hdd  0.27199          osd.4             up   0.89999  1.00000
> > > > > >  8    hdd  0.27199          osd.8             up   0.89999  1.00000
> > > > > > -5         0.81599      host tceph-03
> > > > > >  1    hdd  0.27199          osd.1             up   0.89999  1.00000
> > > > > >  5    hdd  0.27199          osd.5             up   1.00000  1.00000
> > > > > >  7    hdd  0.27199          osd.7      destroyed         0  1.00000
> > > > > >
> > > > > > I already tried changing some tunables, thinking of
> > > > > > https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon,
> > > > > > but giving up too soon is obviously not the problem. It is
> > > > > > accepting a wrong mapping.
> > > > > >
> > > > > > Is there a way out of this? Clearly this is asking for trouble, if
> > > > > > not data loss, and should not happen at all.
> > > > > >
> > > > > > Best regards,
> > > > > > =================
> > > > > > Frank Schilder
> > > > > > AIT Risø Campus
> > > > > > Bygning 109, rum S14
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list -- 
> > > > > > ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> > > > > > To unsubscribe send an email to 
> > > > > > ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
