[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard



On 25-03-2024 23:07, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty 
OSDs. How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backfills set at 3 just makes no sense to me, 
but alas.


It seems I can just increase osd_max_backfills even further to get the 
numbers I want, so that will do. Thank you all for taking the time to 
look at this.


It's a huge change and 42% of your data needs to be moved.
And this move is not only to the new OSDs but also between the existing 
OSDs, but they are busy with backfilling so they have no free backfill 
reservations.

I do recommend this document by Joshua Baergen at Digital Ocean that 
explains backfilling, the problems with it, and their solution, a tool 
called pgremapper.


Forgot the link
https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf


Thanks again. It seems the explanation for the low number of concurrent 
backfills is then simply that PGs in backfill_wait can hold partial reservations.
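
For anyone hitting the same thing: the reservations a single OSD holds or 
has queued can be inspected via its admin socket, something like this (run 
on the host holding the OSD, or inside "cephadm shell --name osd.539"; 
this is from memory, so verify the command exists on your release):

ceph daemon osd.539 dump_recovery_reservations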


Mvh.

Torkil

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard



On 25-03-2024 22:58, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty OSDs. 
How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backfills set at 3 just makes no sense to me, 
but alas.


It seems I can just increase osd_max_backfills even further to get the 
numbers I want, so that will do. Thank you all for taking the time to 
look at this.


It's a huge change and 42% of your data needs to be moved.
And this move is not only to the new OSDs but also between the existing 
OSDs, but they are busy with backfilling so they have no free backfill 
reservations.


If I have 60 backfills going on, that would be 60 read reservations and 
60 write reservations, if I understand it correctly. The only way I can 
see that getting stuck at 60 backfills with osd_max_backfills = 3 is if 
those 60 reservations are tied up on just 20 OSDs that are the only ones 
being read from or written to, with all other OSDs waiting on those.

I do recommend this document by Joshua Baergen at Digital Ocean that 
explains backfilling, the problems with it, and their solution, a tool 
called pgremapper.


Thanks, I'll take a look at that =)

Mvh.

Torkil


--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Kai Stian Olstad

On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty 
OSDs. How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backfills set at 3 just makes no sense to me, 
but alas.


It seems I can just increase osd_max_backfills even further to get 
the numbers I want, so that will do. Thank you all for taking the 
time to look at this.


It's a huge change and 42% of your data needs to be moved.
And this move is not only to the new OSDs but also between the existing OSDs,
but they are busy with backfilling so they have no free backfill reservations.

I do recommend this document by Joshua Baergen at Digital Ocean that explains
backfilling, the problems with it, and their solution, a tool called 
pgremapper.


Forgot the link
https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf

--
Kai Stian Olstad


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Kai Stian Olstad

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty OSDs. 
How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backfills set at 3 just makes no sense to me, 
but alas.


It seems I can just increase osd_max_backfills even further to get the 
numbers I want, so that will do. Thank you all for taking the time to 
look at this.


It's a huge change and 42% of your data needs to be moved.
And this move is not only to the new OSDs but also between the existing OSDs,
but they are busy with backfilling so they have no free backfill reservations.

I do recommend this document by Joshua Baergen at Digital Ocean that explains
backfilling, the problems with it, and their solution, a tool called 
pgremapper.
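
If I remember the pgremapper README correctly, the usual way to tame a 
change like this is roughly the following (a sketch from memory, so flag 
names may differ; check the documentation before running anything):

pgremapper cancel-backfill --yes
# This upmaps remapped PGs back to the OSDs they currently sit on,
# cancelling the pending backfill, so you can then remove the upmaps in
# small batches (or let the balancer do it) and backfill at a controlled pace.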

--
Kai Stian Olstad


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard
Neither downing nor restarting the OSD cleared the bogus blocked_by. I 
guess it makes no sense to look further at blocked_by as the cause when 
the data can't be trusted and there is no obvious smoking gun like a few 
OSDs blocking everything.


My tally came to 412 out of 539 OSDs showing up in a blocked_by list and 
that is about every OSD with data prior to adding ~100 empty OSDs. How 
400 read targets and 100 write targets can only equal ~60 backfills with 
osd_max_backfills set at 3 just makes no sense to me, but alas.


It seems I can just increase osd_max_backfills even further to get the 
numbers I want, so that will do. Thank you all for taking the time to 
look at this.
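
For the record, with mclock that is two settings, roughly like this (the 
exact value is a matter of taste):

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 6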


Mvh.

Torkil

On 25-03-2024 20:44, Anthony D'Atri wrote:

First try "ceph osd down 89"


On Mar 25, 2024, at 15:37, Alexander E. Patrakov  wrote:

On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:




On 24/03/2024 01:14, Torkil Svensgaard wrote:

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else interesting.

I also looked again at the state of PG 37.1. It is known what blocks
the backfill of this PG; please search for "blocked_by." However, this
is just one data point, which is insufficient for any conclusions. Try
looking at other PGs. Is there anything too common in the non-empty
"blocked_by" blocks?


I'll take a look at that tomorrow, perhaps we can script something
meaningful.


Hi Alexander

While working on a script querying all PGs and making a list of all OSDs
found in a blocked_by list, and how many times for each, I discovered
something odd about pool 38:

"
[root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
OSDs blocking other OSDs:




All PGs in the pool are active+clean so why are there any blocked_by at
all? One example attached.


I don't know. In any case, it doesn't match the "one OSD blocks them
all" scenario that I was looking for. I think this is something bogus
that can probably be cleared in your example by restarting osd.89
(i.e, the one being blocked).



Mvh.

Torkil


I think we have to look for patterns in other ways, too. One tool that
produces good visualizations is TheJJ balancer. Although it is called
a "balancer," it can also visualize the ongoing backfills.

The tool is available at
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt


Output attached.

Thanks again.

Mvh.

Torkil


On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
wrote:


Hi Alex

New query output attached after restarting both OSDs. OSD 237 is no
longer mentioned but it unfortunately made no difference for the number
of backfills which went 59->62->62.

Mvh.

Torkil

On 23-03-2024 22:26, Alexander E. Patrakov wrote:

Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG again after the OSD restart.

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
wrote:




On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.


Thank you for taking the time =)


What happens if you increase the osd_max_backfills setting
temporarily?


We already had the mclock override option in place and I re-enabled our 
babysitter script, which sets osd_max_backfills per OSD to 1-3 depending 
on how full they are. Active backfills went from 16 to 53, which is 
probably because the default osd_max_backfills for mclock is 1.

I think 53 is still a low number of active backfills given the large
percentage misplaced.


It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.


A few samples attached.


Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Anthony D'Atri
First try "ceph osd down 89"

> On Mar 25, 2024, at 15:37, Alexander E. Patrakov  wrote:
> 
> On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:
>> 
>> 
>> 
>> On 24/03/2024 01:14, Torkil Svensgaard wrote:
>>> On 24-03-2024 00:31, Alexander E. Patrakov wrote:
 Hi Torkil,
>>> 
>>> Hi Alexander
>>> 
 Thanks for the update. Even though the improvement is small, it is
 still an improvement, consistent with the osd_max_backfills value, and
 it proves that there are still unsolved peering issues.
 
 I have looked at both the old and the new state of the PG, but could
 not find anything else interesting.
 
 I also looked again at the state of PG 37.1. It is known what blocks
 the backfill of this PG; please search for "blocked_by." However, this
 is just one data point, which is insufficient for any conclusions. Try
 looking at other PGs. Is there anything too common in the non-empty
 "blocked_by" blocks?
>>> 
>>> I'll take a look at that tomorrow, perhaps we can script something
>>> meaningful.
>> 
>> Hi Alexander
>> 
>> While working on a script querying all PGs and making a list of all OSDs
>> found in a blocked_by list, and how many times for each, I discovered
>> something odd about pool 38:
>> 
>> "
>> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
>> OSDs blocking other OSDs:
> 
> 
>> All PGs in the pool are active+clean so why are there any blocked_by at
>> all? One example attached.
> 
> I don't know. In any case, it doesn't match the "one OSD blocks them
> all" scenario that I was looking for. I think this is something bogus
> that can probably be cleared in your example by restarting osd.89
> (i.e, the one being blocked).
> 
>> 
>> Mvh.
>> 
>> Torkil
>> 
 I think we have to look for patterns in other ways, too. One tool that
 produces good visualizations is TheJJ balancer. Although it is called
 a "balancer," it can also visualize the ongoing backfills.
 
 The tool is available at
 https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
 
 Run it as follows:
 
 ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
>>> 
>>> Output attached.
>>> 
>>> Thanks again.
>>> 
>>> Mvh.
>>> 
>>> Torkil
>>> 
 On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
 wrote:
> 
> Hi Alex
> 
> New query output attached after restarting both OSDs. OSD 237 is no
> longer mentioned but it unfortunately made no difference for the number
> of backfills which went 59->62->62.
> 
> Mvh.
> 
> Torkil
> 
> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
>> Hi Torkil,
>> 
>> I have looked at the files that you attached. They were helpful: pool
>> 11 is problematic, it complains about degraded objects for no obvious
>> reason. I think that is the blocker.
>> 
>> I also noted that you mentioned peering problems, and I suspect that
>> they are not completely resolved. As a somewhat-irrational move, to
>> confirm this theory, you can restart osd.237 (it is mentioned at the
>> end of query.11.fff.txt, although I don't understand why it is there)
>> and then osd.298 (it is the primary for that pg) and see if any
>> additional backfills are unblocked after that. Also, please re-query
>> that PG again after the OSD restart.
>> 
>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
>> wrote:
>>> 
>>> 
>>> 
>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
 Hi Torkil,
>>> 
>>> Hi Alexander
>>> 
 I have looked at the CRUSH rules, and the equivalent rules work on my
 test cluster. So this cannot be the cause of the blockage.
>>> 
>>> Thank you for taking the time =)
>>> 
 What happens if you increase the osd_max_backfills setting
 temporarily?
>>> 
>>> We already had the mclock override option in place and I re-enabled
>>> our
> >> babysitter script which sets osd_max_backfills per OSD to 1-3 depending
>>> on how full they are. Active backfills went from 16 to 53 which is
>>> probably because default osd_max_backfills for mclock is 1.
>>> 
>>> I think 53 is still a low number of active backfills given the large
>>> percentage misplaced.
>>> 
 It may be a good idea to investigate a few of the stalled PGs. Please
 run commands similar to this one:
 
 ceph pg 37.0 query > query.37.0.txt
 ceph pg 37.1 query > query.37.1.txt
 ...
 and the same for the other affected pools.
>>> 
>>> A few samples attached.
>>> 
 Still, I must say that some of your rules are actually unsafe.
 
 The 4+2 rule as used by rbd_ec_data will not survive a
 datacenter-offline incident. Namely, for each PG, it chooses OSDs
 from
 two hosts in each datacenter, so 6 OSDs total. When 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Alexander E. Patrakov
On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:
>
>
>
> On 24/03/2024 01:14, Torkil Svensgaard wrote:
> > On 24-03-2024 00:31, Alexander E. Patrakov wrote:
> >> Hi Torkil,
> >
> > Hi Alexander
> >
> >> Thanks for the update. Even though the improvement is small, it is
> >> still an improvement, consistent with the osd_max_backfills value, and
> >> it proves that there are still unsolved peering issues.
> >>
> >> I have looked at both the old and the new state of the PG, but could
> >> not find anything else interesting.
> >>
> >> I also looked again at the state of PG 37.1. It is known what blocks
> >> the backfill of this PG; please search for "blocked_by." However, this
> >> is just one data point, which is insufficient for any conclusions. Try
> >> looking at other PGs. Is there anything too common in the non-empty
> >> "blocked_by" blocks?
> >
> > I'll take a look at that tomorrow, perhaps we can script something
> > meaningful.
>
> Hi Alexander
>
> While working on a script querying all PGs and making a list of all OSDs
> found in a blocked_by list, and how many times for each, I discovered
> something odd about pool 38:
>
> "
> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
> OSDs blocking other OSDs:


> All PGs in the pool are active+clean so why are there any blocked_by at
> all? One example attached.

I don't know. In any case, it doesn't match the "one OSD blocks them
all" scenario that I was looking for. I think this is something bogus
that can probably be cleared in your example by restarting osd.89
(i.e, the one being blocked).

>
> Mvh.
>
> Torkil
>
> >> I think we have to look for patterns in other ways, too. One tool that
> >> produces good visualizations is TheJJ balancer. Although it is called
> >> a "balancer," it can also visualize the ongoing backfills.
> >>
> >> The tool is available at
> >> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> >>
> >> Run it as follows:
> >>
> >> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
> >
> > Output attached.
> >
> > Thanks again.
> >
> > Mvh.
> >
> > Torkil
> >
> >> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
> >> wrote:
> >>>
> >>> Hi Alex
> >>>
> >>> New query output attached after restarting both OSDs. OSD 237 is no
> >>> longer mentioned but it unfortunately made no difference for the number
> >>> of backfills which went 59->62->62.
> >>>
> >>> Mvh.
> >>>
> >>> Torkil
> >>>
> >>> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
>  Hi Torkil,
> 
>  I have looked at the files that you attached. They were helpful: pool
>  11 is problematic, it complains about degraded objects for no obvious
>  reason. I think that is the blocker.
> 
>  I also noted that you mentioned peering problems, and I suspect that
>  they are not completely resolved. As a somewhat-irrational move, to
>  confirm this theory, you can restart osd.237 (it is mentioned at the
>  end of query.11.fff.txt, although I don't understand why it is there)
>  and then osd.298 (it is the primary for that pg) and see if any
>  additional backfills are unblocked after that. Also, please re-query
>  that PG again after the OSD restart.
> 
>  On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
>  wrote:
> >
> >
> >
> > On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >> Hi Torkil,
> >
> > Hi Alexander
> >
> >> I have looked at the CRUSH rules, and the equivalent rules work on my
> >> test cluster. So this cannot be the cause of the blockage.
> >
> > Thank you for taking the time =)
> >
> >> What happens if you increase the osd_max_backfills setting
> >> temporarily?
> >
> > We already had the mclock override option in place and I re-enabled
> > our
> > babysitter script which sets osd_max_backfills per OSD to 1-3 depending
> > on how full they are. Active backfills went from 16 to 53 which is
> > probably because default osd_max_backfills for mclock is 1.
> >
> > I think 53 is still a low number of active backfills given the large
> > percentage misplaced.
> >
> >> It may be a good idea to investigate a few of the stalled PGs. Please
> >> run commands similar to this one:
> >>
> >> ceph pg 37.0 query > query.37.0.txt
> >> ceph pg 37.1 query > query.37.1.txt
> >> ...
> >> and the same for the other affected pools.
> >
> > A few samples attached.
> >
> >> Still, I must say that some of your rules are actually unsafe.
> >>
> >> The 4+2 rule as used by rbd_ec_data will not survive a
> >> datacenter-offline incident. Namely, for each PG, it chooses OSDs
> >> from
> >> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> >> offline, you will, therefore, have only 4 OSDs up, which is exactly
> >> the number of data chunks. However, the pool requires min_size 5, so
> 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard



On 24/03/2024 01:14, Torkil Svensgaard wrote:

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else interesting.

I also looked again at the state of PG 37.1. It is known what blocks
the backfill of this PG; please search for "blocked_by." However, this
is just one data point, which is insufficient for any conclusions. Try
looking at other PGs. Is there anything too common in the non-empty
"blocked_by" blocks?


I'll take a look at that tomorrow, perhaps we can script something 
meaningful.


Hi Alexander

While working on a script querying all PGs and making a list of all OSDs 
found in a blocked_by list, and how many times for each, I discovered 
something odd about pool 38:


"
[root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
OSDs blocking other OSDs:
OSD 425: 5 instance(s)
OSD 426: 6 instance(s)
OSD 34: 7 instance(s)
OSD 36: 5 instance(s)
OSD 146: 3 instance(s)
OSD 6: 2 instance(s)
OSD 5: 8 instance(s)
OSD 131: 7 instance(s)
OSD 4: 9 instance(s)
OSD 3: 5 instance(s)
OSD 2: 5 instance(s)
OSD 1: 2 instance(s)
OSD 0: 4 instance(s)
OSD 167: 1 instance(s)
OSD 168: 3 instance(s)
OSD 450: 2 instance(s)
OSD 46: 6 instance(s)
OSD 154: 3 instance(s)
OSD 156: 2 instance(s)
OSD 90: 2 instance(s)
OSD 227: 4 instance(s)
OSD 10: 4 instance(s)
OSD 15: 6 instance(s)
OSD 449: 4 instance(s)
OSD 192: 2 instance(s)
OSD 67: 3 instance(s)
"

All PGs in the pool are active+clean so why are there any blocked_by at 
all? One example attached.
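
For context, the script boils down to something like this (a sketch, not 
the verbatim script; it tallies every OSD that shows up in a blocked_by 
list across the PGs of a pool, here taking a pool name rather than an id):

"
#!/bin/bash
pool="$1"
echo "OSDs blocking other OSDs:"
for pg in $(ceph pg ls-by-pool "$pool" | awk 'NR>1 && $1 ~ /^[0-9]+\./ {print $1}'); do
  # each query is JSON; pull out every element of every blocked_by array
  ceph pg "$pg" query | jq -r '.. | .blocked_by? // empty | .[]'
done | sort -n | uniq -c | awk '{print "OSD " $2 ": " $1 " instance(s)"}'
"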


Mvh.

Torkil


I think we have to look for patterns in other ways, too. One tool that
produces good visualizations is TheJJ balancer. Although it is called
a "balancer," it can also visualize the ongoing backfills.

The tool is available at
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt


Output attached.

Thanks again.

Mvh.

Torkil

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard  
wrote:


Hi Alex

New query output attached after restarting both OSDs. OSD 237 is no
longer mentioned but it unfortunately made no difference for the number
of backfills which went 59->62->62.

Mvh.

Torkil

On 23-03-2024 22:26, Alexander E. Patrakov wrote:

Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG again after the OSD restart.

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard  
wrote:




On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.


Thank you for taking the time =)

What happens if you increase the osd_max_backfills setting 
temporarily?


We already had the mclock override option in place and I re-enabled our 
babysitter script, which sets osd_max_backfills per OSD to 1-3 depending
on how full they are. Active backfills went from 16 to 53 which is
probably because default osd_max_backfills for mclock is 1.

I think 53 is still a low number of active backfills given the large
percentage misplaced.


It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.


A few samples attached.


Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you will, therefore, have only 4 OSDs up, which is exactly
the number of data chunks. However, the pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. However, please don't
set min_size to 4 - then, any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter which
went offline would be useless because they do not correspond to the
updated shards written by the clients.


Thanks for the explanation. This is an 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-24 Thread Torkil Svensgaard

On 24-03-2024 13:41, Tyler Stachecki wrote:

On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard  wrote:


Hi

... Using mclock with high_recovery_ops profile.

What is the bottleneck here? I would have expected a huge number of
simultaneous backfills. Backfill reservation logjam?



mClock is very buggy in my experience and frequently leads to issues like
this. Try using regular backfill and see if the problem goes away.


Hi Tyler

Just tried switching to wpq, same thing.
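
For reference, the scheduler switch itself is just a config change plus an 
OSD restart, since osd_op_queue is only read at startup; along the lines of:

ceph config set osd osd_op_queue wpq
# ...then restart the OSDs (ceph orch / systemctl) for it to take effect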

I'm inclined to think it must be a read reservation logjam of some sort, 
given that increasing osd_max_backfills had an immediate effect and we 
have 4 empty hosts as main write targets. Here's the output for one such 
OSD from the script Alexander linked:


"
osd.539: gimpy  =>0 0.0B <=42 1.54T (Δ1.54T)   drive=9.0% 358.72G/3.90T crush=9.0% 358.72G/3.90T
  <-11.54f  waiting   44.8G  539<-421  1070 of 230320, 0.5%
  <-37.538  waiting   22.3G  539<-121  2819 of 556087, 0.5%
  <-11.507  waiting   45.1G  539<-61   912 of 139227, 0.7%
  <-37.450  waiting   22.3G  539<-220  1235 of 632776, 0.2%
  <-11.458  waiting   45.3G  539<-121  178 of 279150, 0.1%
  <-37.47c  waiting   22.2G  539<-83   2281 of 634472, 0.4%
  <-37.434  waiting   22.1G  539<-78   9496 of 316052, 3.0%
  <-11.3f3  waiting   44.9G  539<-109  2375 of 231055, 1.0%
  <-37.3d3  waiting   22.0G  539<-73   2144 of 316508, 0.7%
  <-37.3c5  waiting   22.2G  539<-83   313880 of 313880, 100.0%
  <-11.3c1  waiting   44.8G  539<-223  93878 of 230270, 40.8%
  <-37.3a4  waiting   21.9G  539<-85   4604 of 315504, 1.5%
  <-11.344  waiting   44.5G  539<-63   728 of 182876, 0.4%
  <-6.1ca   waiting  100.9G  539<-443  36076 of 56270, 64.1%
  <-4.1a2   waiting  157.6G  539<-218  508 of 91456, 0.6%
  <-37.1ba  waiting   22.2G  539<-64   316848 of 316848, 100.0%
  <-37.84   waiting   22.0G  539<-33   4380 of 237633, 1.8%
  <-37.ad   waiting   22.2G  539<-77   6730 of 396635, 1.7%
  <-37.36   waiting   22.1G  539<-47   2170 of 395955, 0.5%
  <-11.b9   waiting   45.1G  539<-223  0 of 231940, 0.0%
  <-37.11c  waiting   22.1G  539<-33   9952 of 316448, 3.1%
  <-11.144  waiting   45.1G  539<-207  528 of 278094, 0.2%
  <-37.2ae  waiting   22.1G  539<-224  2565 of 712539, 0.4%
  <-37.285  waiting   22.0G  539<-65   441 of 315336, 0.1%
  <-37.2ef  waiting   22.0G  539<-414  2124 of 475410, 0.4%
  <-37.674  waiting   22.0G  539<-56   60 of 236511, 0.0%
  <-37.655  waiting   22.3G  539<-143  237316 of 237381, 100.0%
  <-11.6b0  waiting   44.9G  539<-282  1131 of 277122, 0.4%
  <-37.71a  waiting   22.2G  539<-49   82865 of 315684, 26.2%
  <-11.789  waiting   45.0G  539<-196  736 of 277584, 0.3%
  <-11.7cf  waiting   44.8G  539<-127  143 of 276582, 0.1%
  <-11.7f2  waiting   45.2G  539<-272  145857 of 185680, 78.6%
  <-37.7dd  waiting   22.0G  539<-72   0 of 393475, 0.0%
  <-37.7d9  waiting   22.2G  539<-37   930 of 237831, 0.4%
  <-11.7fb  waiting   45.2G  539<-78   1062 of 279042, 0.4%
  <-37.7d2  waiting   22.0G  539<-71   2409 of 631368, 0.4%
  <-11.8db  waiting   44.9G  539<-84   108 of 277182, 0.0%
  <-11.9b6  waiting   44.8G  539<-74   772 of 184432, 0.4%
  <-11.b0b  waiting   45.0G  539<-166  2569 of 231430, 1.1%
  <-11.d42  waiting   45.2G  539<-118  15428 of 46429, 33.2%
  <-11.d5f  waiting   44.8G  539<-64   4 of 184356, 0.0%
  <-11.d98  waiting   45.1G  539<-418  0 of 278568, 0.0%
"

All waiting for something.

Mvh.

Torkil


Tyler






--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-24 Thread Tyler Stachecki
On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard  wrote:

> Hi
>
> ... Using mclock with high_recovery_ops profile.
>
> What is the bottleneck here? I would have expected a huge number of
> simultaneous backfills. Backfill reservation logjam?
>

mClock is very buggy in my experience and frequently leads to issues like
this. Try using regular backfill and see if the problem goes away.

Tyler

>


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else interesting.

I also looked again at the state of PG 37.1. It is known what blocks
the backfill of this PG; please search for "blocked_by." However, this
is just one data point, which is insufficient for any conclusions. Try
looking at other PGs. Is there anything too common in the non-empty
"blocked_by" blocks?

I think we have to look for patterns in other ways, too. One tool that
produces good visualizations is TheJJ balancer. Although it is called
a "balancer," it can also visualize the ongoing backfills.

The tool is available at
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt
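
For example, to fetch it and get a quick feel for how many backfills are 
stuck waiting (assuming Python 3 is available on the host):

curl -LO https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
python3 placementoptimizer.py showremapped --by-osd | tee remapped.txt
grep -c waiting remapped.txt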

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard  wrote:
>
> Hi Alex
>
> New query output attached after restarting both OSDs. OSD 237 is no
> longer mentioned but it unfortunately made no difference for the number
> of backfills which went 59->62->62.
>
> Mvh.
>
> Torkil
>
> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
> > Hi Torkil,
> >
> > I have looked at the files that you attached. They were helpful: pool
> > 11 is problematic, it complains about degraded objects for no obvious
> > reason. I think that is the blocker.
> >
> > I also noted that you mentioned peering problems, and I suspect that
> > they are not completely resolved. As a somewhat-irrational move, to
> > confirm this theory, you can restart osd.237 (it is mentioned at the
> > end of query.11.fff.txt, although I don't understand why it is there)
> > and then osd.298 (it is the primary for that pg) and see if any
> > additional backfills are unblocked after that. Also, please re-query
> > that PG again after the OSD restart.
> >
> > On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard  wrote:
> >>
> >>
> >>
> >> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >>> Hi Torkil,
> >>
> >> Hi Alexander
> >>
> >>> I have looked at the CRUSH rules, and the equivalent rules work on my
> >>> test cluster. So this cannot be the cause of the blockage.
> >>
> >> Thank you for taking the time =)
> >>
> >>> What happens if you increase the osd_max_backfills setting temporarily?
> >>
> >> We already had the mclock override option in place and I re-enabled our
> >> babysitter script which sets osd_max_backfills per OSD to 1-3 depending
> >> on how full they are. Active backfills went from 16 to 53 which is
> >> probably because default osd_max_backfills for mclock is 1.
> >>
> >> I think 53 is still a low number of active backfills given the large
> >> percentage misplaced.
> >>
> >>> It may be a good idea to investigate a few of the stalled PGs. Please
> >>> run commands similar to this one:
> >>>
> >>> ceph pg 37.0 query > query.37.0.txt
> >>> ceph pg 37.1 query > query.37.1.txt
> >>> ...
> >>> and the same for the other affected pools.
> >>
> >> A few samples attached.
> >>
> >>> Still, I must say that some of your rules are actually unsafe.
> >>>
> >>> The 4+2 rule as used by rbd_ec_data will not survive a
> >>> datacenter-offline incident. Namely, for each PG, it chooses OSDs from
> >>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> >>> offline, you will, therefore, have only 4 OSDs up, which is exactly
> >>> the number of data chunks. However, the pool requires min_size 5, so
> >>> all PGs will be inactive (to prevent data corruption) and will stay
> >>> inactive until the datacenter comes up again. However, please don't
> >>> set min_size to 4 - then, any additional incident (like a defective
> >>> disk) will lead to data loss, and the shards in the datacenter which
> >>> went offline would be useless because they do not correspond to the
> >>> updated shards written by the clients.
> >>
> >> Thanks for the explanation. This is an old pool predating the 3 DC setup
> >> and we'll migrate the data to a 4+5 pool when we can.
> >>
> >>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
> >>> number of data chunks. See above why it is bad. Please set min_size to
> >>> 5.
> >>
> >> Thanks, that was a leftover for getting the PGs to peer (stuck at
> >> creating+incomplete) when we created the pool. It's back to 5 now.
> >>
> >>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
> >>> 100% active+clean.
> >>
> >> There is very little data in this pool, that is probably the main reason.
> >>
> >>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
> >>> have 300+ PGs, the observed maximum is 347. Please set it to 400.
> >>
> >> Copy that. Didn't seem to make a difference though, and we have
> >> osd_max_pg_per_osd_hard_ratio set to 5.00.
> >>
> >> Mvh.
> >>
> >> Torkil
> >>
> >>> On 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard

Hi Alex

New query output attached after restarting both OSDs. OSD 237 is no 
longer mentioned but it unfortunately made no difference for the number 
of backfills which went 59->62->62.


Mvh.

Torkil

On 23-03-2024 22:26, Alexander E. Patrakov wrote:

Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG again after the OSD restart.

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard  wrote:




On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.


Thank you for taking the time =)


What happens if you increase the osd_max_backfills setting temporarily?


We already had the mclock override option in place and I re-enabled our
babysitter script which sets osd_max_backfills per OSD to 1-3 depending
on how full they are. Active backfills went from 16 to 53 which is
probably because default osd_max_backfills for mclock is 1.

I think 53 is still a low number of active backfills given the large
percentage misplaced.
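
For reference, the babysitter script amounts to something like this (a 
sketch with made-up thresholds, not the actual script; it assumes the 
mclock override mentioned above is already in place):

"
#!/bin/bash
# fuller OSDs get fewer concurrent backfills
ceph osd df -f json | jq -r '.nodes[] | "\(.id) \(.utilization)"' |
while read -r id util; do
  if   (( $(echo "$util > 80" | bc -l) )); then max=1
  elif (( $(echo "$util > 60" | bc -l) )); then max=2
  else max=3
  fi
  ceph config set osd."$id" osd_max_backfills "$max"
done
"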


It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.


A few samples attached.


Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you will, therefore, have only 4 OSDs up, which is exactly
the number of data chunks. However, the pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. However, please don't
set min_size to 4 - then, any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter which
went offline would be useless because they do not correspond to the
updated shards written by the clients.


Thanks for the explanation. This is an old pool predating the 3 DC setup
and we'll migrate the data to a 4+5 pool when we can.


The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
number of data chunks. See above why it is bad. Please set min_size to
5.


Thanks, that was a leftover for getting the PGs to peer (stuck at
creating+incomplete) when we created the pool. It's back to 5 now.


The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
100% active+clean.


There is very little data in this pool, that is probably the main reason.


Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
have 300+ PGs, the observed maximum is 347. Please set it to 400.


Copy that. Didn't seem to make a difference though, and we have
osd_max_pg_per_osd_hard_ratio set to 5.00.

Mvh.

Torkil


On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard  wrote:




On 23-03-2024 19:05, Alexander E. Patrakov wrote:

Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:

ceph osd erasure-code-profile get prf-for-ec-data

...where "prf-for-ec-data" is the name that appears after the words
"erasure profile" in the "ceph osd pool ls detail" output.


[root@lazy ~]# ceph osd pool ls detail | grep erasure
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
fast_read 1 compression_algorithm snappy compression_mode aggressive
application rbd
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size
9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048
autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
compression_algorithm zstd compression_mode aggressive application cephfs
pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9
min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG again after the OSD restart.

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard  wrote:
>
>
>
> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> > Hi Torkil,
>
> Hi Alexander
>
> > I have looked at the CRUSH rules, and the equivalent rules work on my
> > test cluster. So this cannot be the cause of the blockage.
>
> Thank you for taking the time =)
>
> > What happens if you increase the osd_max_backfills setting temporarily?
>
> We already had the mclock override option in place and I re-enabled our
> > babysitter script which sets osd_max_backfills per OSD to 1-3 depending
> on how full they are. Active backfills went from 16 to 53 which is
> probably because default osd_max_backfills for mclock is 1.
>
> I think 53 is still a low number of active backfills given the large
> percentage misplaced.
>
> > It may be a good idea to investigate a few of the stalled PGs. Please
> > run commands similar to this one:
> >
> > ceph pg 37.0 query > query.37.0.txt
> > ceph pg 37.1 query > query.37.1.txt
> > ...
> > and the same for the other affected pools.
>
> A few samples attached.
>
> > Still, I must say that some of your rules are actually unsafe.
> >
> > The 4+2 rule as used by rbd_ec_data will not survive a
> > datacenter-offline incident. Namely, for each PG, it chooses OSDs from
> > two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> > offline, you will, therefore, have only 4 OSDs up, which is exactly
> > the number of data chunks. However, the pool requires min_size 5, so
> > all PGs will be inactive (to prevent data corruption) and will stay
> > inactive until the datacenter comes up again. However, please don't
> > set min_size to 4 - then, any additional incident (like a defective
> > disk) will lead to data loss, and the shards in the datacenter which
> > went offline would be useless because they do not correspond to the
> > updated shards written by the clients.
>
> Thanks for the explanation. This is an old pool predating the 3 DC setup
> and we'll migrate the data to a 4+5 pool when we can.
>
> > The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
> > number of data chunks. See above why it is bad. Please set min_size to
> > 5.
>
> Thanks, that was a leftover for getting the PGs to peer (stuck at
> creating+incomplete) when we created the pool. It's back to 5 now.
>
> > The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
> > 100% active+clean.
>
> There is very little data in this pool, that is probably the main reason.
>
> > Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
> > have 300+ PGs, the observed maximum is 347. Please set it to 400.
>
> Copy that. Didn't seem to make a difference though, and we have
> osd_max_pg_per_osd_hard_ratio set to 5.00.
>
> Mvh.
>
> Torkil
>
> > On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard  wrote:
> >>
> >>
> >>
> >> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> >>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself
> >>> is insufficient. For every erasure code profile mentioned in the
> >>> output, please also run something like this:
> >>>
> >>> ceph osd erasure-code-profile get prf-for-ec-data
> >>>
> >>> ...where "prf-for-ec-data" is the name that appears after the words
> >>> "erasure profile" in the "ceph osd pool ls detail" output.
> >>
> >> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> >> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
> >> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
> >> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
> >> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
> >> fast_read 1 compression_algorithm snappy compression_mode aggressive
> >> application rbd
> >> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size
> >> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048
> >> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
> >> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
> >> compression_algorithm zstd compression_mode aggressive application cephfs
> >> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9
> >> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
> >> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
> >> 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.

What happens if you increase the osd_max_backfills setting temporarily?

It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.
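
If there are too many to do by hand, a loop along these lines captures them 
all (a sketch; adjust the state filter as needed):

for pg in $(ceph pg ls backfill_wait backfilling | awk 'NR>1 && $1 ~ /^[0-9]+\./ {print $1}'); do
  ceph pg "$pg" query > "query.${pg}.txt"
done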

Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you will, therefore, have only 4 OSDs up, which is exactly
the number of data chunks. However, the pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. However, please don't
set min_size to 4 - then, any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter which
went offline would be useless because they do not correspond to the
updated shards written by the clients.

The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
number of data chunks. See above why it is bad. Please set min_size to
5.
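
That is a one-liner, e.g.:

ceph osd pool set cephfs.hdd.data min_size 5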

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
100% active+clean.

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
have 300+ PGs, the observed maximum is 347. Please set it to 400.
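
E.g. something like this (set globally so both the mons and the mgr pick 
it up):

ceph config set global mon_max_pg_per_osd 400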

On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard  wrote:
>
>
>
> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> > Sorry for replying to myself, but "ceph osd pool ls detail" by itself
> > is insufficient. For every erasure code profile mentioned in the
> > output, please also run something like this:
> >
> > ceph osd erasure-code-profile get prf-for-ec-data
> >
> > ...where "prf-for-ec-data" is the name that appears after the words
> > "erasure profile" in the "ceph osd pool ls detail" output.
>
> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
> fast_read 1 compression_algorithm snappy compression_mode aggressive
> application rbd
> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size
> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048
> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
> compression_algorithm zstd compression_mode aggressive application cephfs
> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9
> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
> compression_algorithm zstd compression_mode aggressive application rbd
>
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
> crush-device-class=hdd
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
> crush-device-class=hdd
> crush-failure-domain=datacenter
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=5
> plugin=jerasure
> technique=reed_sol_van
> w=8
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
> crush-device-class=ssd
> crush-failure-domain=datacenter
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=5
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> But as I understand it those profiles are only used to create the
> initial crush rule for the pool, and we have manually edited those along
> the way. Here are the 3 rules in use for the 3 EC pools:
>
> rule rbd_ec_data {
>  id 0
>  type erasure
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class hdd
>  step choose indep 0 type datacenter
>  step chooseleaf indep 2 type host
>  step emit
> }
> rule cephfs.hdd.data {
>  id 7
>  type erasure
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class hdd
>  step choose indep 0 type datacenter
>  step chooseleaf indep 3 type host
>  step emit
> }
> rule rbd.ssd.data {
>  id 8
>  type erasure
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class ssd
>  step choose indep 0 type datacenter
>  step chooseleaf 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard



On 23-03-2024 18:43, Alexander E. Patrakov wrote:

Hi Torkil,

Unfortunately, your files contain nothing obviously bad or suspicious,
except for two things: more PGs than usual and bad balance.

What's your "mon max pg per osd" setting?


[root@lazy ~]# ceph config get mon mon_max_pg_per_osd
250

Mvh.

Torkil


On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:


On 2024-03-23 17:54, Kai Stian Olstad wrote:

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:


The other output is too big for pastebin and I'm not familiar with
paste services, any suggestion for a preferred way to share such
output?


You can attached files to the mail here on the list.


Doh, for some reason I was sure attachments would be stripped. Thanks,
attached.

Mvh.

Torkil






--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard



On 23-03-2024 19:05, Alexander E. Patrakov wrote:

Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:

ceph osd erasure-code-profile get prf-for-ec-data

...where "prf-for-ec-data" is the name that appears after the words
"erasure profile" in the "ceph osd pool ls detail" output.


[root@lazy ~]# ceph osd pool ls detail | grep erasure
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 
crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 
autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags 
hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 
fast_read 1 compression_algorithm snappy compression_mode aggressive 
application rbd
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 
9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 
autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags 
hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 
compression_algorithm zstd compression_mode aggressive application cephfs
pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 
min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 
autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags 
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 
compression_algorithm zstd compression_mode aggressive application rbd


[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
crush-device-class=ssd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8

But as I understand it those profiles are only used to create the 
initial crush rule for the pool, and we have manually edited those along 
the way. Here are the 3 rules in use for the 3 EC pools:


rule rbd_ec_data {
id 0
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 0 type datacenter
step chooseleaf indep 2 type host
step emit
}
rule cephfs.hdd.data {
id 7
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 0 type datacenter
step chooseleaf indep 3 type host
step emit
}
rule rbd.ssd.data {
id 8
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class ssd
step choose indep 0 type datacenter
step chooseleaf indep 3 type host
step emit
}

These rules should first pick all 3 datacenters in the choose step and 
then either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5 
respectively.
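
A quick way to sanity-check that mapping against the live CRUSH map is 
crushtool, roughly:

"
ceph osd getcrushmap -o crushmap.bin
# rule 7 = cephfs.hdd.data with 9 shards (EC 4+5); expect no bad mappings
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings
# rule 0 = rbd_ec_data with 6 shards (EC 4+2)
crushtool -i crushmap.bin --test --rule 0 --num-rep 6 --show-bad-mappings
"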


Mvh.

Torkil


On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
 wrote:


Hi Torkil,

I take my previous response back.

You have an erasure-coded pool with nine shards but only three
datacenters. This, in general, cannot work. You need either nine
datacenters or a very custom CRUSH rule. The second option may not be
available if the current EC setup is already incompatible, as there is
no way to change the EC parameters.

It would help if you provided the output of "ceph osd pool ls detail".

On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
 wrote:


Hi Torkil,

Unfortunately, your files contain nothing obviously bad or suspicious,
except for two things: more PGs than usual and bad balance.

What's your "mon max pg per osd" setting?

On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:


On 2024-03-23 17:54, Kai Stian Olstad wrote:

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:


The other output is too big for pastebin and I'm not familiar with
paste services, any suggestion for a preferred way to share such
output?


You can attached files to the mail here on the list.


Doh, for some reason I was sure attachments would be stripped. Thanks,
attached.

Mvh.

Torkil




--
Alexander E. Patrakov




--
Alexander E. Patrakov






--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:

ceph osd erasure-code-profile get prf-for-ec-data

...where "prf-for-ec-data" is the name that appears after the words
"erasure profile" in the "ceph osd pool ls detail" output.

On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
 wrote:
>
> Hi Torkil,
>
> I take my previous response back.
>
> You have an erasure-coded pool with nine shards but only three
> datacenters. This, in general, cannot work. You need either nine
> datacenters or a very custom CRUSH rule. The second option may not be
> available if the current EC setup is already incompatible, as there is
> no way to change the EC parameters.
>
> It would help if you provided the output of "ceph osd pool ls detail".
>
> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
>  wrote:
> >
> > Hi Torkil,
> >
> > Unfortunately, your files contain nothing obviously bad or suspicious,
> > except for two things: more PGs than usual and bad balance.
> >
> > What's your "mon max pg per osd" setting?
> >
> > On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:
> > >
> > > On 2024-03-23 17:54, Kai Stian Olstad wrote:
> > > > On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> > > >>
> > > >> The other output is too big for pastebin and I'm not familiar with
> > > >> paste services; any suggestion for a preferred way to share such
> > > >> output?
> > > >
> > > > You can attach files to the mail here on the list.
> > >
> > > Doh, for some reason I was sure attachments would be stripped. Thanks,
> > > attached.
> > >
> > > Mvh.
> > >
> > > Torkil
> >
> >
> >
> > --
> > Alexander E. Patrakov
>
>
>
> --
> Alexander E. Patrakov



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

I take my previous response back.

You have an erasure-coded pool with nine shards but only three
datacenters. This, in general, cannot work. You need either nine
datacenters or a very custom CRUSH rule. The second option may not be
available if the current EC setup is already incompatible, as there is
no way to change the EC parameters.
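
As a rough, untested sketch (rule name and id are placeholders), such a 
custom rule would have to place three hosts in each of the three 
datacenters to cover the nine shards, along these lines:

rule ec_3dc_9shards {
        id 99
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}

With three shards per datacenter, losing a whole datacenter costs three 
shards, so a profile like k=4, m=5 still leaves the four shards needed 
to read the data.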

It would help if you provided the output of "ceph osd pool ls detail".

On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
 wrote:
>
> Hi Torkil,
>
> Unfortunately, your files contain nothing obviously bad or suspicious,
> except for two things: more PGs than usual and bad balance.
>
> What's your "mon max pg per osd" setting?
>
> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:
> >
> > On 2024-03-23 17:54, Kai Stian Olstad wrote:
> > > On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> > >>
> > >> The other output is too big for pastebin and I'm not familiar with
> > > >> paste services; any suggestion for a preferred way to share such
> > >> output?
> > >
> > > > You can attach files to the mail here on the list.
> >
> > Doh, for some reason I was sure attachments would be stripped. Thanks,
> > attached.
> >
> > Mvh.
> >
> > Torkil
>
>
>
> --
> Alexander E. Patrakov



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

Unfortunately, your files contain nothing obviously bad or suspicious,
except for two things: more PGs than usual and bad balance.

What's your "mon max pg per osd" setting?

On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:
>
> On 2024-03-23 17:54, Kai Stian Olstad wrote:
> > On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> >>
> >> The other output is too big for pastebin and I'm not familiar with
> >> paste services; any suggestion for a preferred way to share such
> >> output?
> >
> > You can attach files to the mail here on the list.
>
> Doh, for some reason I was sure attachments would be stripped. Thanks,
> attached.
>
> Mvh.
>
> Torkil



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Kai Stian Olstad

On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:


The other output is too big for pastebin and I'm not familiar with 
paste services; any suggestion for a preferred way to share such
output?


You can attach files to the mail here on the list.

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard



On 23-03-2024 10:44, Alexander E. Patrakov wrote:

Hello Torkil,


Hi Alexander


It would help if you provided the whole "ceph osd df tree" and "ceph
pg ls" outputs.


Of course, here's ceph osd df tree to start with:

https://pastebin.com/X50b2W0J

The other output is too big for pastebin and I'm not familiar with paste 
services; any suggestion for a preferred way to share such output?


Mvh.

Torkil


On Sat, Mar 23, 2024 at 4:26 PM Torkil Svensgaard  wrote:


Hi

We have this after adding some hosts and changing crush failure domain
to datacenter:

pgs: 1338512379/3162732055 objects misplaced (42.321%)
   5970 active+remapped+backfill_wait
   4853 active+clean
     11 active+remapped+backfilling

We have 3 datacenters each with 6 hosts and ~400 HDD OSDs with DB/WAL on
NVMe. Using mclock with high_recovery_ops profile.

What is the bottleneck here? I would have expected a huge number of
simultaneous backfills. Backfill reservation logjam?

Mvh.

Torkil

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io






--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hello Torkil,

It would help if you provided the whole "ceph osd df tree" and "ceph
pg ls" outputs.

On Sat, Mar 23, 2024 at 4:26 PM Torkil Svensgaard  wrote:
>
> Hi
>
> We have this after adding some hosts and changing crush failure domain
> to datacenter:
>
> pgs: 1338512379/3162732055 objects misplaced (42.321%)
>   5970 active+remapped+backfill_wait
>   4853 active+clean
>     11 active+remapped+backfilling
>
> We have 3 datacenters each with 6 hosts and ~400 HDD OSDs with DB/WAL on
> NVMe. Using mclock with high_recovery_ops profile.
>
> What is the bottleneck here? I would have expected a huge number of
> simultaneous backfills. Backfill reservation logjam?
>
> Mvh.
>
> Torkil
>
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io