[ceph-users] Re: reef 18.2.3 QE validation status

2024-04-12 Thread Casey Bodley
On Fri, Apr 12, 2024 at 2:38 PM Yuri Weinstein  wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/65393#note-1
> Release Notes - TBD
> LRC upgrade - TBD
>
> Seeking approvals/reviews for:
>
> smoke - infra issues, still trying, Laura PTL
>
> rados - Radek, Laura approved? Travis?  Nizamudeen?
>
> rgw - Casey approved?

rgw approved

> fs - Venky approved?
> orch - Adam King approved?
>
> krbd - Ilya approved
> powercycle - seems fs related, Venky, Brad PTL
>
> ceph-volume - will require
> https://github.com/ceph/ceph/pull/56857/commits/63fe3921638f1fb7fc065907a9e1a64700f8a600
> Guillaume is fixing it.
>
> TIA
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.3 QE validation status

2024-04-12 Thread Ilya Dryomov
On Fri, Apr 12, 2024 at 8:38 PM Yuri Weinstein  wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/65393#note-1
> Release Notes - TBD
> LRC upgrade - TBD
>
> Seeking approvals/reviews for:
>
> smoke - infra issues, still trying, Laura PTL
>
> rados - Radek, Laura approved? Travis?  Nizamudeen?
>
> rgw - Casey approved?
> fs - Venky approved?
> orch - Adam King approved?
>
> krbd - Ilya approved

Approved based on a rerun:

https://pulpito.ceph.com/dis-2024-04-12_20:04:30-krbd-reef-release-testing-default-smithi/

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Setting S3 bucket policies with multi-tenants

2024-04-12 Thread Wesley Dillingham
Did you actually get this working? I am trying to replicate your steps but
have not been successful with multi-tenancy.
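
For reference, this is roughly the policy shape I'm testing, assembled from
your notes below (tenant, user and bucket names are placeholders; I've also
added the object-level ARN alongside your Resource line, following the usual
S3 convention for GetObject/PutObject):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam::Tenant1:user/readwrite"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
    "Resource": [
      "arn:aws:s3:::backups",
      "arn:aws:s3:::backups/*"
    ]
  }]
}

Applied as the bucket owner with:
s3cmd --config s3cfg.manager setpolicy policy.json s3://backups/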

Respectfully,

*Wes Dillingham*
LinkedIn 
w...@wesdillingham.com




On Wed, Nov 1, 2023 at 12:52 PM Thomas Bennett  wrote:

> To update my own question, it would seem that Principal should be
> defined like this:
>
>- "Principal": {"AWS": ["arn:aws:iam::Tenant1:user/readwrite"]}
>
> And Resource should be:
> "Resource": [ "arn:aws:s3:::backups"]
>
> Is it worth having the docs updated -
> https://docs.ceph.com/en/quincy/radosgw/bucketpolicy/
> to indicate that usfolks in the example is the tenant name?
>
>
> On Wed, 1 Nov 2023 at 18:27, Thomas Bennett  wrote:
>
> > Hi,
> >
> > I'm running Ceph Quincy (17.2.6) with a rados-gateway. I have multiple
> > tenants, for example:
> >
> >- Tenant1$manager
> >- Tenant1$readwrite
> >
> > I would like to set a policy on a bucket (backups for example) owned by
> > *Tenant1$manager* to allow *Tenant1$readwrite* access to that bucket. I
> > can't find any documentation that discusses this scenario.
> >
> > Does anyone know how to specify the Principal and Resource sections of a
> > policy.json file? Or any other configuration that I might be missing?
> >
> > I've tried some variations on Principal and Resource, including and
> > excluding tenant information, but no luck yet.
> >
> >
> > For example:
> > {
> >   "Version": "2012-10-17",
> >   "Statement": [{
> > "Effect": "Allow",
> > "Principal": {"AWS": ["arn:aws:iam:::user/*Tenant1$readwrite*"]},
> > "Action": ["s3:ListBucket","s3:GetObject", ,"s3:PutObject"],
> > "Resource": [
> >   "arn:aws:s3:::*Tenant1/backups*"
> > ]
> >   }]
> > }
> >
> > I'm using s3cmd for testing, so:
> > s3cmd --config s3cfg.manager setpolicy policy.json s3://backups/
> > Returns:
> > s3://backups/: Policy updated
> >
> > And then testing:
> > s3cmd --config s3cfg.readwrite ls s3://backups/
> > ERROR: Access to bucket 'backups' was denied
> > ERROR: S3 error: 403 (AccessDenied)
> >
> > Thanks,
> > Tom
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] reef 18.2.3 QE validation status

2024-04-12 Thread Yuri Weinstein
Details of this release are summarized here:

https://tracker.ceph.com/issues/65393#note-1
Release Notes - TBD
LRC upgrade - TBD

Seeking approvals/reviews for:

smoke - infra issues, still trying, Laura PTL

rados - Radek, Laura approved? Travis?  Nizamudeen?

rgw - Casey approved?
fs - Venky approved?
orch - Adam King approved?

krbd - Ilya approved
powercycle - seems fs related, Venky, Brad PTL

ceph-volume - will require
https://github.com/ceph/ceph/pull/56857/commits/63fe3921638f1fb7fc065907a9e1a64700f8a600
Guillaume is fixing it.

TIA
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent

2024-04-12 Thread Anthony D'Atri
If you're using an Icinga active check that just looks for 

SMART overall-health self-assessment test result: PASSED

then it's not doing much for you.  That bivalue status can be shown for a drive 
that is decidedly an ex-parrot.  Gotta look at specific attributes, which is 
thorny since they aren't consistently implemented.  drivedb.h is a downright 
mess, which doesn't help.
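
Something along these lines gets you closer (a rough sketch; attribute names
and raw-value semantics vary by vendor, drive generation and drivedb version):

# Sketch: alert on the SMART attributes that usually matter, not the overall verdict
for attr in Reallocated_Sector_Ct Current_Pending_Sector Offline_Uncorrectable UDMA_CRC_Error_Count; do
  smartctl -A /dev/sda | awk -v a="$attr" '$2 == a && $10+0 > 0 {print a" = "$10}'
done
# NVMe drives report a different table entirely:
smartctl -a /dev/nvme0 | egrep -i 'critical warning|media and data integrity errors'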

> 
> 
> 
> 
> - Le 12 Avr 24, à 15:17, Albert Shih albert.s...@obspm.fr a écrit :
> 
>> Le 12/04/2024 à 12:56:12+0200, Frédéric Nass a écrit
>>> 
>> Hi,
>> 
>>> 
>>> Have you check the hardware status of the involved drives other than with
>>> smartctl? Like with the manufacturer's tools / WebUI (iDrac / perccli for 
>>> DELL
>>> hardware for example).
>> 
>> Yes, all my disks are «under» periodic check with smartctl + icinga.
> 
> Actually, I meant lower level tools (drive / server vendor tools).
> 
>> 
>>> If these tools don't report any media error (that is bad blocs on disks) 
>>> then
>>> you might just be facing the bit rot phenomenon. But this is very rare and
>>> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a
>>> professional poker player's lifetime. ;-)
>>> 
>>> If no media error is reported, then you might want to check and update the
>>> firmware of all drives.
>> 
>> You're perfectly right.
>> 
>> It's just a newbie error: I checked the «main» OSD of the PG (meaning the
>> first in the list) but forgot to check the others.
>> 
> 
> Ok.
> 
>> On one server I do indeed see some errors on a disk.
>> 
>> But strangely smartctl reports nothing. I will add a check with dmesg.
> 
> That's why I pointed you to the drive / server vendor tools earlier as 
> sometimes smartctl is missing the information you want.
> 
>> 
>>> 
>>> Once you figured it out, you may enable osd_scrub_auto_repair=true to have 
>>> these
>>> inconsistencies repaired automatically on deep-scrubbing, but make sure 
>>> you're
>>> using the alert module [1] so to at least get informed about the scrub 
>>> errors.
>> 
>> Thanks. I will look into it; we already have icinga2 on site, so I use
>> icinga2 to check the cluster.
>> 
>> Is there a list of what the alert module is going to check?
> 
> Basically the module checks for ceph status (ceph -s) changes.
> 
> https://github.com/ceph/ceph/blob/main/src/pybind/mgr/alerts/module.py
> 
> Regards,
> Frédéric.
> 
>> 
>> 
>> Regards
>> 
>> JAS
>> --
>> Albert SHIH 嶺 
>> France
>> Heure locale/Local time:
>> ven. 12 avril 2024 15:13:13 CEST
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent

2024-04-12 Thread Frédéric Nass


- Le 12 Avr 24, à 15:17, Albert Shih albert.s...@obspm.fr a écrit :

> Le 12/04/2024 à 12:56:12+0200, Frédéric Nass a écrit
>> 
> Hi,
> 
>> 
>> Have you check the hardware status of the involved drives other than with
>> smartctl? Like with the manufacturer's tools / WebUI (iDrac / perccli for 
>> DELL
>> hardware for example).
> 
> Yes, all my disks are «under» periodic check with smartctl + icinga.

Actually, I meant lower level tools (drive / server vendor tools).

> 
>> If these tools don't report any media error (that is bad blocs on disks) then
>> you might just be facing the bit rot phenomenon. But this is very rare and
>> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a
>> professional poker player's lifetime. ;-)
>> 
>> If no media error is reported, then you might want to check and update the
>> firmware of all drives.
> 
> You're perfectly right.
> 
> It's just a newbie error: I checked the «main» OSD of the PG (meaning the
> first in the list) but forgot to check the others.
> 

Ok.

> On one server I do indeed see some errors on a disk.
> 
> But strangely smartctl reports nothing. I will add a check with dmesg.

That's why I pointed you to the drive / server vendor tools earlier as 
sometimes smartctl is missing the information you want.

> 
>> 
>> Once you figured it out, you may enable osd_scrub_auto_repair=true to have 
>> these
>> inconsistencies repaired automatically on deep-scrubbing, but make sure 
>> you're
>> using the alert module [1] so to at least get informed about the scrub 
>> errors.
> 
> Thanks. I will look into it; we already have icinga2 on site, so I use
> icinga2 to check the cluster.
> 
> Is there a list of what the alert module is going to check?

Basically the module checks for ceph status (ceph -s) changes.

https://github.com/ceph/ceph/blob/main/src/pybind/mgr/alerts/module.py

Regards,
Frédéric.

> 
> 
> Regards
> 
> JAS
> --
> Albert SHIH 嶺 
> France
> Heure locale/Local time:
> ven. 12 avril 2024 15:13:13 CEST
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Anthony D'Atri
One can up the ratios temporarily but it's all too easy to forget to reduce 
them later, or think that it's okay to run all the time with reduced headroom.

Until a host blows up and you don't have enough space to recover into.
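
For the record, the knobs in question (defaults in the comments; if you do
raise them, write down a date to put them back):

ceph osd set-nearfull-ratio 0.90      # default 0.85
ceph osd set-backfillfull-ratio 0.92  # default 0.90
ceph osd set-full-ratio 0.96          # default 0.95
ceph osd dump | grep -i ratio         # verify, and remember to revert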

> On Apr 12, 2024, at 05:01, Frédéric Nass  
> wrote:
> 
> 
> Oh, and yeah, considering "The fullest OSD is already at 85% usage" best move 
> for now would be to add new hardware/OSDs (to avoid reaching the backfill too 
> full limit), prior to start the splitting PGs before or after enabling upmap 
> balancer depending on how the PGs got rebalanced (well enough or not) after 
> adding new OSDs.
> 
> BTW, what ceph version is this? You should make sure you're running v16.2.11+ 
> or v17.2.4+ before splitting PGs to avoid this nasty bug: 
> https://tracker.ceph.com/issues/53729
> 
> Cheers,
> Frédéric.
> 
> - Le 12 Avr 24, à 10:41, Frédéric Nass frederic.n...@univ-lorraine.fr a 
> écrit :
> 
>> Hello Eugen,
>> 
>> Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon 
>> osd.0
>> config show | grep osd_op_queue)
>> 
>> If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
>> real
>> impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
>> before doing that.
>> If mClock scheduler then you might want to use a specific mClock profile as
>> suggested by Gregory (as osd_recovery_sleep* are not considered when using
>> mClock).
>> 
>> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
>> cluster only has 240, increasing osd_max_backfills to any values higher than
>> 2-3 will not help much with the recovery/backfilling speed.
>> 
>> All the way, you'll have to be patient. :-)
>> 
>> Cheers,
>> Frédéric.
>> 
>> - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :
>> 
>>> Thank you for input!
>>> We started the split with max_backfills = 1 and watched for a few
>>> minutes, then gradually increased it to 8. Now it's backfilling with
>>> around 180 MB/s, not really much but since client impact has to be
>>> avoided if possible, we decided to let that run for a couple of hours.
>>> Then reevaluate the situation and maybe increase the backfills a bit
>>> more.
>>> 
>>> Thanks!
>>> 
>>> Zitat von Gregory Orange :
>>> 
 We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
 with NVME RocksDB, used exclusively for RGWs, holding about 60b
 objects. We are splitting for the same reason as you - improved
 balance. We also thought long and hard before we began, concerned
 about impact, stability etc.
 
 We set target_max_misplaced_ratio to 0.1% initially, so we could
 retain some control and stop it again fairly quickly if we weren't
 happy with the behaviour. It also serves to limit the performance
 impact on the cluster, but unfortunately it also makes the whole
 process slower.
 
 We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
 issues with the cluster. We could go higher, but are not in a rush
 at this point. Sometimes nearfull osd warnings get high and MAX
 AVAIL on the data pool in `ceph df` gets low enough that we want to
 interrupt it. So, we set pg_num to whatever the current value is
 (ceph osd pool ls detail), and let it stabilise. Then the balancer
 gets to work once the misplaced objects drop below the ratio, and
 things balance out. Nearfull osds drop usually to zero, and MAX
 AVAIL goes up again.
 
 The above behaviour is because while they share the same threshold
 setting, the autoscaler only runs every minute, and it won't run
 when misplaced are over the threshold. Meanwhile, checks for the
 next PG to split happen much more frequently, so the balancer never
 wins that race.
 
 
 We didn't know how long to expect it all to take, but decided that
 any improvement in PG size was worth starting. We now estimate it
 will take another 2-3 weeks to complete, for a total of 4-5 weeks
 total.
 
 We have lost a drive or two during the process, and of course
 degraded objects went up, and more backfilling work got going. We
 paused splits for at least one of those, to make sure the degraded
 objects were sorted out as quick as possible. We can't be sure it
 went any faster though - there's always a long tail on that sort of
 thing.
 
 Inconsistent objects are found at least a couple of times a week,
 and to get them repairing we disable scrubs, wait until they're
 stopped, then set the repair going and reenable scrubs. I don't know
 if this is special to the current higher splitting load, but we
 haven't noticed it before.
 
 HTH,
 Greg.
 
 
 On 10/4/24 14:42, Eugen Block wrote:
> Thank you, Janne.
> I believe the default 5% target_max_misplaced_ratio would work as
> well, we've had good experience with that in the past, without the

[ceph-users] Re: PG inconsistent

2024-04-12 Thread Albert Shih
Le 12/04/2024 à 12:56:12+0200, Frédéric Nass a écrit
> 
Hi, 

> 
> Have you check the hardware status of the involved drives other than with 
> smartctl? Like with the manufacturer's tools / WebUI (iDrac / perccli for 
> DELL hardware for example).

Yes, all my disks are «under» periodic check with smartctl + icinga.

> If these tools don't report any media error (that is bad blocs on disks) then 
> you might just be facing the bit rot phenomenon. But this is very rare and 
> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a 
> professional poker player's lifetime. ;-)
> 
> If no media error is reported, then you might want to check and update the 
> firmware of all drives.

You're perfectly right.

It's just a newbie error: I checked the «main» OSD of the PG (meaning the
first in the list) but forgot to check the others.

On one server I do indeed see some errors on a disk.

But strangely smartctl reports nothing. I will add a check with dmesg.
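
Probably something simple like this to start with (just a sketch):

dmesg -T | egrep -i 'i/o error|medium error|blk_update_request|uncorrect'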

> 
> Once you figured it out, you may enable osd_scrub_auto_repair=true to have 
> these inconsistencies repaired automatically on deep-scrubbing, but make sure 
> you're using the alert module [1] so to at least get informed about the scrub 
> errors.

Thanks. I will look into it; we already have icinga2 on site, so I use
icinga2 to check the cluster.

Is there a list of what the alert module is going to check?


Regards

JAS
-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
ven. 12 avril 2024 15:13:13 CEST
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Eugen Block

Thanks for chiming in.
They are on version 16.2.13 (I was already made aware of the bug you  
mentioned, thanks!) with wpq.
Until now I haven't got an emergency call so I assume everything is  
calm (I hope). New hardware has been ordered but it will take a couple  
of weeks until it's delivered, installed and integrated, that's why we  
decided to take action now.

I'll update the thread when I know more.

Thanks again!
Eugen

Zitat von Frédéric Nass :

Oh, and yeah, considering "The fullest OSD is already at 85% usage"  
best move for now would be to add new hardware/OSDs (to avoid  
reaching the backfill too full limit), prior to start the splitting  
PGs before or after enabling upmap balancer depending on how the PGs  
got rebalanced (well enough or not) after adding new OSDs.


BTW, what ceph version is this? You should make sure you're running  
v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:  
https://tracker.ceph.com/issues/53729


Cheers,
Frédéric.

- Le 12 Avr 24, à 10:41, Frédéric Nass  
frederic.n...@univ-lorraine.fr a écrit :



Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph  
daemon osd.0

config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they  
do have a real

impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
before doing that.
If mClock scheduler then you might want to use a specific mClock profile as
suggested by Gregory (as osd_recovery_sleep* are not considered when using
mClock).

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
cluster only has 240, increasing osd_max_backfills to any values higher than
2-3 will not help much with the recovery/backfilling speed.

All the way, you'll have to be patient. :-)

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :


Thank you for input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling with
around 180 MB/s, not really much but since client impact has to be
avoided if possible, we decided to let that run for a couple of hours.
Then reevaluate the situation and maybe increase the backfills a bit
more.

Thanks!

Zitat von Gregory Orange :


We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
with NVME RocksDB, used exclusively for RGWs, holding about 60b
objects. We are splitting for the same reason as you - improved
balance. We also thought long and hard before we began, concerned
about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could
retain some control and stop it again fairly quickly if we weren't
happy with the behaviour. It also serves to limit the performance
impact on the cluster, but unfortunately it also makes the whole
process slower.

We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
issues with the cluster. We could go higher, but are not in a rush
at this point. Sometimes nearfull osd warnings get high and MAX
AVAIL on the data pool in `ceph df` gets low enough that we want to
interrupt it. So, we set pg_num to whatever the current value is
(ceph osd pool ls detail), and let it stabilise. Then the balancer
gets to work once the misplaced objects drop below the ratio, and
things balance out. Nearfull osds drop usually to zero, and MAX
AVAIL goes up again.

The above behaviour is because while they share the same threshold
setting, the autoscaler only runs every minute, and it won't run
when misplaced are over the threshold. Meanwhile, checks for the
next PG to split happen much more frequently, so the balancer never
wins that race.


We didn't know how long to expect it all to take, but decided that
any improvement in PG size was worth starting. We now estimate it
will take another 2-3 weeks to complete, for a total of 4-5 weeks
total.

We have lost a drive or two during the process, and of course
degraded objects went up, and more backfilling work got going. We
paused splits for at least one of those, to make sure the degraded
objects were sorted out as quick as possible. We can't be sure it
went any faster though - there's always a long tail on that sort of
thing.

Inconsistent objects are found at least a couple of times a week,
and to get them repairing we disable scrubs, wait until they're
stopped, then set the repair going and reenable scrubs. I don't know
if this is special to the current higher splitting load, but we
haven't noticed it before.

HTH,
Greg.


On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as
well, we've had good experience with that in the past, without the
autoscaler. I just haven't dealt with such large PGs, I've been
warning them for two years (when the PGs were only almost half this
size) and now they finally started to listen. Well, they would
still ignore it if it wouldn't impact all kinds of 

[ceph-users] Re: PG inconsistent

2024-04-12 Thread Wesley Dillingham
check your ceph.log on the mons for "stat mismatch" and grep for the PG in
question for potentially more information.

Additionally "rados list-inconsistent-obj {pgid}" will often show which OSD
and objects are implicated for the inconsistency. If the acting set has
changed since the scrub (for example an osd is removed or failed) in which
the inconsistency was found this data wont be there any longer and you
would need to deep-scrub the PG again to get that information.
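
For example (the pgid 2.1a is made up here; note the ceph.log path differs
under cephadm, where it lives below /var/log/ceph/<fsid>/):

ceph health detail | grep inconsistent
rados list-inconsistent-obj 2.1a --format=json-pretty
grep 'stat mismatch' /var/log/ceph/ceph.log | grep '2\.1a'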

Respectfully,

*Wes Dillingham*
LinkedIn 
w...@wesdillingham.com




On Fri, Apr 12, 2024 at 6:56 AM Frédéric Nass <
frederic.n...@univ-lorraine.fr> wrote:

>
> Hello Albert,
>
> Have you check the hardware status of the involved drives other than with
> smartctl? Like with the manufacturer's tools / WebUI (iDrac / perccli for
> DELL hardware for example).
> If these tools don't report any media error (that is bad blocs on disks)
> then you might just be facing the bit rot phenomenon. But this is very rare
> and should happen in a sysadmin's lifetime as often as a Royal Flush hand
> in a professional poker player's lifetime. ;-)
>
> If no media error is reported, then you might want to check and update the
> firmware of all drives.
>
> Once you figured it out, you may enable osd_scrub_auto_repair=true to have
> these inconsistencies repaired automatically on deep-scrubbing, but make
> sure you're using the alert module [1] so to at least get informed about
> the scrub errors.
>
> Regards,
> Frédéric.
>
> [1] https://docs.ceph.com/en/latest/mgr/alerts/
>
> - Le 12 Avr 24, à 11:59, Albert Shih albert.s...@obspm.fr a écrit :
>
> > Hi everyone.
> >
> > I got a warning with
> >
> > root@cthulhu1:/etc/ceph# ceph -s
> >  cluster:
> >id: 9c5bb196-c212-11ee-84f3-c3f2beae892d
> >health: HEALTH_ERR
> >1 scrub errors
> >Possible data damage: 1 pg inconsistent
> >
> > So I found the PG with the issue, and launched a pg repair (still waiting)
> >
> > But I tried to find out «why», so I checked all the OSDs related to this PG and
> > didn't find anything: no error from the OSD daemon, no errors from smartctl, no
> > error in the kernel messages.
> >
> > So I would just like to know if that's «normal» or if I should dig deeper.
> >
> > JAS
> > --
> > Albert SHIH 嶺 
> > France
> > Heure locale/Local time:
> > ven. 12 avril 2024 11:51:37 CEST
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent

2024-04-12 Thread Frédéric Nass

Hello Albert,

Have you checked the hardware status of the involved drives with anything other than
smartctl? Like with the manufacturer's tools / WebUI (iDRAC / perccli for DELL
hardware, for example).
If these tools don't report any media errors (that is, bad blocks on disks) then
you might just be facing the bit rot phenomenon. But this is very rare and
should happen in a sysadmin's lifetime about as often as a Royal Flush in a
professional poker player's lifetime. ;-)

If no media error is reported, then you might want to check and update the 
firmware of all drives.

Once you've figured it out, you may enable osd_scrub_auto_repair=true to have
these inconsistencies repaired automatically on deep-scrubbing, but make sure
you're using the alert module [1] so as to at least get informed about the scrub
errors.
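
Roughly like this (module options taken from the doc in [1]; adjust the SMTP
details to your environment):

ceph config set osd osd_scrub_auto_repair true
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host smtp.example.com
ceph config set mgr mgr/alerts/smtp_destination ceph-admins@example.com
ceph config set mgr mgr/alerts/smtp_sender ceph@example.com
ceph config set mgr mgr/alerts/interval "5m"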

Regards,
Frédéric.

[1] https://docs.ceph.com/en/latest/mgr/alerts/

- Le 12 Avr 24, à 11:59, Albert Shih albert.s...@obspm.fr a écrit :

> Hi everyone.
> 
> I got a warning with
> 
> root@cthulhu1:/etc/ceph# ceph -s
>  cluster:
>id: 9c5bb196-c212-11ee-84f3-c3f2beae892d
>health: HEALTH_ERR
>1 scrub errors
>Possible data damage: 1 pg inconsistent
> 
> So I found the PG with the issue, and launched a pg repair (still waiting)
> 
> But I tried to find out «why», so I checked all the OSDs related to this PG and
> didn't find anything: no error from the OSD daemon, no errors from smartctl, no
> error in the kernel messages.
> 
> So I would just like to know if that's «normal» or if I should dig deeper.
> 
> JAS
> --
> Albert SHIH 嶺 
> France
> Heure locale/Local time:
> ven. 12 avril 2024 11:51:37 CEST
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG inconsistent

2024-04-12 Thread Albert Shih
Hi everyone. 

I got a warning with 

root@cthulhu1:/etc/ceph# ceph -s
  cluster:
id: 9c5bb196-c212-11ee-84f3-c3f2beae892d
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent

So I found the PG with the issue, and launched a pg repair (still waiting).
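
(For the record, roughly what I ran, with placeholders for the actual pool and pgid:)

ceph health detail | grep inconsistent
rados list-inconsistent-pg <pool-name>
ceph pg repair <pgid>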

But I tried to find out «why», so I checked all the OSDs related to this PG and
didn't find anything: no error from the OSD daemon, no errors from smartctl, no
error in the kernel messages.

So I would just like to know if that's «normal» or if I should dig deeper.

JAS
-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
ven. 12 avril 2024 11:51:37 CEST
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Frédéric Nass

Oh, and yeah, considering "The fullest OSD is already at 85% usage", the best move
for now would be to add new hardware/OSDs (to avoid reaching the backfill-too-full
limit) prior to starting the PG splits, and to enable the upmap balancer before or
after the splits depending on how well (or not) the PGs get rebalanced after
adding the new OSDs.

BTW, what ceph version is this? You should make sure you're running v16.2.11+ 
or v17.2.4+ before splitting PGs to avoid this nasty bug: 
https://tracker.ceph.com/issues/53729
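
Quick sketch of the checks/steps (the pool name and pg_num are just examples):

ceph versions                        # every daemon should be >= 16.2.11 / 17.2.4
ceph balancer mode upmap
ceph balancer on
ceph osd pool set <pool> pg_num 4096 # then let splitting + balancing proceed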

Cheers,
Frédéric.

- Le 12 Avr 24, à 10:41, Frédéric Nass frederic.n...@univ-lorraine.fr a 
écrit :

> Hello Eugen,
> 
> Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon 
> osd.0
> config show | grep osd_op_queue)
> 
> If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
> real
> impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
> before doing that.
> If mClock scheduler then you might want to use a specific mClock profile as
> suggested by Gregory (as osd_recovery_sleep* are not considered when using
> mClock).
> 
> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
> cluster only has 240, increasing osd_max_backfills to any values higher than
> 2-3 will not help much with the recovery/backfilling speed.
> 
> All the way, you'll have to be patient. :-)
> 
> Cheers,
> Frédéric.
> 
> - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :
> 
>> Thank you for input!
>> We started the split with max_backfills = 1 and watched for a few
>> minutes, then gradually increased it to 8. Now it's backfilling with
>> around 180 MB/s, not really much but since client impact has to be
>> avoided if possible, we decided to let that run for a couple of hours.
>> Then reevaluate the situation and maybe increase the backfills a bit
>> more.
>> 
>> Thanks!
>> 
>> Zitat von Gregory Orange :
>> 
>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>>> objects. We are splitting for the same reason as you - improved
>>> balance. We also thought long and hard before we began, concerned
>>> about impact, stability etc.
>>>
>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>> retain some control and stop it again fairly quickly if we weren't
>>> happy with the behaviour. It also serves to limit the performance
>>> impact on the cluster, but unfortunately it also makes the whole
>>> process slower.
>>>
>>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>>> issues with the cluster. We could go higher, but are not in a rush
>>> at this point. Sometimes nearfull osd warnings get high and MAX
>>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>>> interrupt it. So, we set pg_num to whatever the current value is
>>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>>> gets to work once the misplaced objects drop below the ratio, and
>>> things balance out. Nearfull osds drop usually to zero, and MAX
>>> AVAIL goes up again.
>>>
>>> The above behaviour is because while they share the same threshold
>>> setting, the autoscaler only runs every minute, and it won't run
>>> when misplaced are over the threshold. Meanwhile, checks for the
>>> next PG to split happen much more frequently, so the balancer never
>>> wins that race.
>>>
>>>
>>> We didn't know how long to expect it all to take, but decided that
>>> any improvement in PG size was worth starting. We now estimate it
>>> will take another 2-3 weeks to complete, for a total of 4-5 weeks
>>> total.
>>>
>>> We have lost a drive or two during the process, and of course
>>> degraded objects went up, and more backfilling work got going. We
>>> paused splits for at least one of those, to make sure the degraded
>>> objects were sorted out as quick as possible. We can't be sure it
>>> went any faster though - there's always a long tail on that sort of
>>> thing.
>>>
>>> Inconsistent objects are found at least a couple of times a week,
>>> and to get them repairing we disable scrubs, wait until they're
>>> stopped, then set the repair going and reenable scrubs. I don't know
>>> if this is special to the current higher splitting load, but we
>>> haven't noticed it before.
>>>
>>> HTH,
>>> Greg.
>>>
>>>
>>> On 10/4/24 14:42, Eugen Block wrote:
 Thank you, Janne.
 I believe the default 5% target_max_misplaced_ratio would work as
 well, we've had good experience with that in the past, without the
 autoscaler. I just haven't dealt with such large PGs, I've been
 warning them for two years (when the PGs were only almost half this
 size) and now they finally started to listen. Well, they would
 still ignore it if it wouldn't impact all kinds of things now. ;-)

 Thanks,
 Eugen

 Zitat von Janne Johansson :

> Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :
>> I'm 

[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Frédéric Nass

Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon osd.0 
config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
real impact on the recovery/backfilling speed. Just lower osd_max_backfills to 
1 before doing that.
If mClock scheduler then you might want to use a specific mClock profile as 
suggested by Gregory (as osd_recovery_sleep* are not considered when using 
mClock).
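
For example, something like this (values are just starting points; defaults
noted are from recent releases):

# WPQ:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_sleep_hdd 0.05     # default 0.1
ceph config set osd osd_recovery_sleep_hybrid 0.01  # default 0.025
# mClock:
ceph config set osd osd_mclock_profile high_recovery_ops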

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this 
cluster only has 240, increasing osd_max_backfills to any values higher than 
2-3 will not help much with the recovery/backfilling speed.

Either way, you'll have to be patient. :-)

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :

> Thank you for input!
> We started the split with max_backfills = 1 and watched for a few
> minutes, then gradually increased it to 8. Now it's backfilling with
> around 180 MB/s, not really much but since client impact has to be
> avoided if possible, we decided to let that run for a couple of hours.
> Then reevaluate the situation and maybe increase the backfills a bit
> more.
> 
> Thanks!
> 
> Zitat von Gregory Orange :
> 
>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>> objects. We are splitting for the same reason as you - improved
>> balance. We also thought long and hard before we began, concerned
>> about impact, stability etc.
>>
>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>> retain some control and stop it again fairly quickly if we weren't
>> happy with the behaviour. It also serves to limit the performance
>> impact on the cluster, but unfortunately it also makes the whole
>> process slower.
>>
>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>> issues with the cluster. We could go higher, but are not in a rush
>> at this point. Sometimes nearfull osd warnings get high and MAX
>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>> interrupt it. So, we set pg_num to whatever the current value is
>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>> gets to work once the misplaced objects drop below the ratio, and
>> things balance out. Nearfull osds drop usually to zero, and MAX
>> AVAIL goes up again.
>>
>> The above behaviour is because while they share the same threshold
>> setting, the autoscaler only runs every minute, and it won't run
>> when misplaced are over the threshold. Meanwhile, checks for the
>> next PG to split happen much more frequently, so the balancer never
>> wins that race.
>>
>>
>> We didn't know how long to expect it all to take, but decided that
>> any improvement in PG size was worth starting. We now estimate it
>> will take another 2-3 weeks to complete, for a total of 4-5 weeks
>> total.
>>
>> We have lost a drive or two during the process, and of course
>> degraded objects went up, and more backfilling work got going. We
>> paused splits for at least one of those, to make sure the degraded
>> objects were sorted out as quick as possible. We can't be sure it
>> went any faster though - there's always a long tail on that sort of
>> thing.
>>
>> Inconsistent objects are found at least a couple of times a week,
>> and to get them repairing we disable scrubs, wait until they're
>> stopped, then set the repair going and reenable scrubs. I don't know
>> if this is special to the current higher splitting load, but we
>> haven't noticed it before.
>>
>> HTH,
>> Greg.
>>
>>
>> On 10/4/24 14:42, Eugen Block wrote:
>>> Thank you, Janne.
>>> I believe the default 5% target_max_misplaced_ratio would work as
>>> well, we've had good experience with that in the past, without the
>>> autoscaler. I just haven't dealt with such large PGs, I've been
>>> warning them for two years (when the PGs were only almost half this
>>> size) and now they finally started to listen. Well, they would
>>> still ignore it if it wouldn't impact all kinds of things now. ;-)
>>>
>>> Thanks,
>>> Eugen
>>>
>>> Zitat von Janne Johansson :
>>>
 Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :
> I'm trying to estimate the possible impact when large PGs are
> splitted. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES OMAP_BYTES*  OMAP_KEYS*  LOG
> DISK_LOG    UP
> 86.3ff    277708  414403098409    0   0  3092
> 3092
> [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

 If you ask for small increases of pg_num, it will only split that many
 PGs at a time, so while there will be a lot of data movement, (50% due
 to half of the data needs to go to another newly made PG, and on top
 of that, PGs per OSD will change, but also the balancing can now work
 better) it will not be affecting the whole cluster if you increase
 with