[ceph-users] Re: OSDs down after reweight

2022-11-15 Thread Frank Schilder
Here is how this looks on a test cluster:

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         2.44707  root default
-3         0.81569      host tceph-01
 0    hdd  0.27190          osd.0          up   1.0       1.0
 2    hdd  0.27190          osd.2          up   1.0       1.0
 4    hdd  0.27190          osd.4          up   1.0       1.0
-7         0.81569      host tceph-02
 6    hdd  0.27190          osd.6          up   1.0       1.0
 7    hdd  0.27190          osd.7          up   1.0       1.0
 8    hdd  0.27190          osd.8          up   1.0       1.0
-5         0.81569      host tceph-03
 1    hdd  0.27190          osd.1          up   1.0       1.0
 3    hdd  0.27190          osd.3          up   1.0       1.0
 5    hdd  0.27190          osd.5          up   1.0       1.0

# ceph pg dump pgs_brief | head -8
PG_STAT  STATE         UP             UP_PRIMARY  ACTING         ACTING_PRIMARY
3.7e     active+clean  [6,0,2,5,3,7]           6  [6,0,2,5,3,7]               6
2.7f     active+clean  [7,5,2]                 7  [7,5,2]                     7
2.7e     active+clean  [0,1,8]                 0  [0,1,8]                     0
3.7c     active+clean  [6,5,0,7,2,8]           6  [6,5,0,7,2,8]               6
2.7d     active+clean  [0,8,3]                 0  [0,8,3]                     0
3.7d     active+clean  [7,0,3,8,1,2]           7  [7,0,3,8,1,2]               7


After osd reweight to 0.5:

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         2.44707  root default
-3         0.81569      host tceph-01
 0    hdd  0.27190          osd.0          up   0.5       1.0
 2    hdd  0.27190          osd.2          up   0.5       1.0
 4    hdd  0.27190          osd.4          up   0.5       1.0
-7         0.81569      host tceph-02
 6    hdd  0.27190          osd.6          up   0.5       1.0
 7    hdd  0.27190          osd.7          up   0.5       1.0
 8    hdd  0.27190          osd.8          up   0.5       1.0
-5         0.81569      host tceph-03
 1    hdd  0.27190          osd.1          up   0.5       1.0
 3    hdd  0.27190          osd.3          up   0.5       1.0
 5    hdd  0.27190          osd.5          up   0.5       1.0

# ceph pg dump pgs_brief | head -8
PG_STAT  STATE                          UP                                        UP_PRIMARY  ACTING         ACTING_PRIMARY
3.7e     active+remapped+backfill_wait  [6,0,4,5,1,2147483647]                             6  [6,0,2,5,3,7]               6
2.7f     active+clean                   [7,5,2]                                            7  [7,5,2]                     7
3.7f     active+remapped+backfill_wait  [1,2147483647,7,8,2147483647,2]                    1  [0,5,4,8,6,2]               0
2.7e     active+remapped+backfill_wait  [5,4,8]                                            5  [1,8,0]                     1
3.7c     active+remapped+backfill_wait  [2147483647,1,0,2147483647,2147483647,8]           1  [6,5,0,7,2,8]               6
2.7d     active+remapped+backfill_wait  [0,3,6]                                            0  [0,3,8]                     0
3.7d     active+remapped+backfill_wait  [2147483647,0,3,6,4,2]                             0  [7,0,3,8,1,2]               7


After osd crush reweight to 0.5*0.27190:

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         1.22346  root default
-3         0.40782      host tceph-01
 0    hdd  0.13594          osd.0          up   1.0       1.0
 2    hdd  0.13594          osd.2          up   1.0       1.0
 4    hdd  0.13594          osd.4          up   1.0       1.0
-7         0.40782      host tceph-02
 6    hdd  0.13594          osd.6          up   1.0       1.0
 7    hdd  0.13594          osd.7          up   1.0       1.0
 8    hdd  0.13594          osd.8          up   1.0       1.0
-5         0.40782      host tceph-03
 1    hdd  0.13594          osd.1          up   1.0       1.0
 3    hdd  0.13594          osd.3          up   1.0       1.0
 5    hdd  0.13594          osd.5          up   1.0       1.0

# ceph pg dump pgs_brief | head -8
PG_STAT  STATE         UP             UP_PRIMARY  ACTING         ACTING_PRIMARY
3.7e     active+clean  [6,0,2,5,3,7]           6  [6,0,2,5,3,7]               6
2.7f     active+clean  [7,5,2]                 7  [7,5,2]                     7
3.7f     active+clean  [0,5,4,8,6,2]           0  [0,5,4,8,6,2]               0
2.7e     active+clean  [0,1,8]                 0  [0,1,8]                     0
3.7c     active+clean  [6,5,0,7,2,8]           6  [6,5,0,7,2,8]               6
2.7d     active+clean  [0,8,3]                 0  [0,8,3]                     0
3.7da
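
For reference, the weight changes above were applied roughly like this (a sketch, not a copy of my shell history; osd ids 0-8 as in the tree above):

  # 1) override weight ("reweight") set to 0.5, crush weights untouched
  for i in $(seq 0 8); do ceph osd reweight $i 0.5; done

  # reset the overrides before the second test
  for i in $(seq 0 8); do ceph osd reweight $i 1.0; done

  # 2) crush weights halved (0.5*0.27190 = 0.13594), overrides left at 1.0
  for i in $(seq 0 8); do ceph osd crush reweight osd.$i 0.13594; done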

[ceph-users] Re: OSDs down after reweight

2022-11-15 Thread Etienne Menguy
Hi,
You probably caused a large rebalance and overloaded your slow HDDs. But all OSDs are up in what you are sharing.

Also, I see you changed both the weight and the reweight values; is that what you wanted?

Étienne

> -Original Message-
> From: Frank Schilder 
> Sent: mardi 15 novembre 2022 10:38
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: OSDs down after reweight
> 

[ceph-users] Re: OSDs down after reweight

2022-11-15 Thread Frank Schilder
I think you misinterpreted the reason I posted the output from our test cluster. Firstly, there is zero load on the test cluster, and the OSDs stayed up here because of that. They went down on the production cluster after I performed a similar stunt, and I'm not going to do that again just to show a ceph status with down OSDs.

What I'm after with the test-cluster output is this: why are the mappings with crush-weight=0.5*0.27190 and reweight=1.0 not identical to the mappings with crush-weight=0.27190 and reweight=0.5? According to how I understand the documentation (https://docs.ceph.com/en/pacific/rados/operations/control/#osd-subsystem , search for "Set the override weight (reweight)"), the CRUSH algorithm looks at OSDs with a weight computed as crush-weight*reweight, which implies that the results should be identical. However, this is obviously not the case.
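
To quantify the difference, one can snapshot the UP sets before and after a weight change and diff them. A rough sketch (the awk field numbers assume the pgs_brief column order shown above, and the file names are just examples):

  ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $1, $3}' | sort > /tmp/up.before
  # ... apply the weight change, wait for peering ...
  ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $1, $3}' | sort > /tmp/up.after
  # number of PGs whose UP set changed:
  diff /tmp/up.before /tmp/up.after | grep -c '^>'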

When you look at the large number of broken mappings in the crush-weight=0.27190, reweight=0.5 case, it is not very surprising to see slow ops and OSDs going down under load; it is a completely dysfunctional set of PG mappings.

What I'm trying to say here is that the OSDs going down are almost certainly a result of the failed mappings produced when using reweight. This should not happen; it smells like a really bad bug.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Etienne Menguy 
Sent: 15 November 2022 10:45:19
To: Frank Schilder; ceph-users@ceph.io
Subject: RE: OSDs down after reweight

Hi,
You probably caused a large rebalance and overloaded your slow HDDs. But all OSDs are up in what you are sharing.

Also, I see you changed both the weight and the reweight values; is that what you wanted?

Étienne

> -Original Message-
> From: Frank Schilder 
> Sent: mardi 15 novembre 2022 10:38
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: OSDs down after reweight
>

[ceph-users] Re: OSDs down after reweight

2022-11-15 Thread Dan van der Ster
Hi Frank,

Just a guess, but I wonder if, for small values, rounding/precision starts to impact the placement like you observed.

Do you see the same issue if you reweight to 2x the original?
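
If you want to rule out rounding, you could also compare what you set with what is actually stored, along these lines (a rough sketch; the paths are just examples):

  # override weights ("reweight") as stored in the osdmap
  ceph osd dump | grep '^osd\.'
  # crush weights as stored in the crush map
  ceph osd getcrushmap -o /tmp/cm && crushtool -d /tmp/cm -o /tmp/cm.txt
  grep 'item osd\.' /tmp/cm.txt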

-- Dan

On Tue, Nov 15, 2022 at 10:09 AM Frank Schilder  wrote:
>
> Hi all,
>
> I re-weighted all OSDs in a pool down from 1.0 to the same value 0.052 (see 
> reason below). After this, all hell broke loose. OSDs were marked down, slow 
> OPS all over the place and the MDSes started complaining about slow 
> ops/requests. Basically all PGs were remapped. After setting all re-weights 
> back to 1.0 the situation went back to normal.
>
> Expected behaviour: No (!!!) PGs are remapped and everything continues to 
> work. Why did things go down?
>
> More details: We have 24 OSDs with weight=1.74699 in a pool. I wanted to add 
> OSDs with weight=0.09099 in such a way that the small OSDs receive 
> approximately the same number of PGs as the large ones. Setting a re-weight 
> factor of 0.052 for the large ones should achieve just that: 
> 1.74699*0.052=0.09084. So, the procedure was:
>
> - ceph osd reweight osd.N 0.052 for all OSDs in that pool
> - add the small disks and re-balance
>
> I would expect that the crush mapping is invariant under a uniform change of 
> weight. That is, if I apply the same relative weight-change to all OSDs 
> (new_weight=old_weight*common_factor) in a pool, the mappings should be 
> preserved. However, this is not what I observed. How is it possible that PG 
> mappings change if the relative weight of all OSDs to each other stays the 
> same (the probabilities of picking an OSD are unchanged over all OSDs)?
>
> Thanks for any hints.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14


[ceph-users] Re: OSDs down after reweight

2022-11-15 Thread Frank Schilder
Hi Dan,

I guess you answered before my e-mails with the output from a test cluster arrived. I gave an example with reweight=0.5, with similarly disastrous results. It looks like applying reweights in the crush map is seriously broken. If I understand the intention of the reweight correctly, then

effective-weight = crush-weight * reweight,

but it is clearly not implemented this way. Please take a look at the specific remapping examples I posted from a test cluster, with effective weights 0.5*1 and 1*0.5.
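
One way to check this without touching a cluster would be to simulate both weightings offline against the compiled crush map. A sketch only; in particular, the --weight option of crushtool --test is taken from its usage text and should be double-checked, and --rule/--num-rep need to match the pool being tested:

  ceph osd getcrushmap -o /tmp/cm

  # Case A: original crush weights, per-device override 0.5 for osd.0-8
  # (--weight is supposed to set the 0..1 device weight used in the simulation)
  crushtool -i /tmp/cm --test --show-mappings --rule 0 --num-rep 3 \
      --min-x 0 --max-x 127 \
      $(for i in $(seq 0 8); do echo --weight $i 0.5; done) > /tmp/map.override

  # Case B: crush weights halved, overrides left at 1.0
  crushtool -d /tmp/cm -o /tmp/cm.txt
  # edit /tmp/cm.txt and halve every "item osd.N weight ..." entry, then:
  crushtool -c /tmp/cm.txt -o /tmp/cm.half
  crushtool -i /tmp/cm.half --test --show-mappings --rule 0 --num-rep 3 \
      --min-x 0 --max-x 127 > /tmp/map.half

  # if effective-weight = crush-weight * reweight held exactly, these would match
  diff /tmp/map.override /tmp/map.half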

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 15 November 2022 11:23:44
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] OSDs down after reweight

Hi Frank,

Just a guess, but I wonder if, for small values, rounding/precision starts to impact the placement like you observed.

Do you see the same issue if you reweight to 2x the original?

-- Dan
