[ceph-users] Re: [RGW] Resharding in multi-zonegroup env cause data loss

2024-07-30 Thread Szabo, Istvan (Agoda)
I think multisite resharding only works from Quincy onwards.






From: Huy Nguyen 
Sent: Wednesday, July 31, 2024 10:51 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] [RGW] Resharding in multi-zonegroup env cause data loss



Hi,

My lab setup two Ceph clusters as 2 zonegroups in a realm. Ceph version 16.2.13

As far as I know, if a bucket is created in the secondary zonegroup, the master zonegroup
only knows about its metadata, not its data. So after the reshard command is run
on the master (to avoid inconsistent metadata), the bucket is synced back to the
secondary, which means it ends up with no data, just like on the master.

Is this a bug, or am I doing something wrong? Is there another way to reshard a bucket
in a non-master zonegroup?
Thanks


[ceph-users] Scheduled cluster maintenance tasks

2024-07-17 Thread Szabo, Istvan (Agoda)
Hi,

I'm trying to figure out the trigger times for scrubbing and garbage
collection, but the config options related to these operations are not
straightforward.

I'm still chasing my daily laggy PG with slow ops that shows up around the same time,
and what I've found is that some cleanup job is triggered around then. However, since
scrubbing can run around the clock (0-24) based on the config, GC runs in theory hourly,
and I'm not sure about the lifecycle schedule either, I don't know which task it actually is.

I haven't really found proper metrics for these jobs either.

I've tried Frank S's scrub report, but it has no info about schedules; it has
information about intervals, how many PGs were scrubbed, etc.

If there is proper documentation about this, please share it with me.

Thank you in advance.
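
For reference, these are the config options and commands I believe are relevant (a rough sketch; option names may vary by release):

# scrub time window and intervals
ceph config get osd osd_scrub_begin_hour
ceph config get osd osd_scrub_end_hour
ceph config get osd osd_deep_scrub_interval
# RGW garbage collection cadence and current backlog
ceph config get client.rgw rgw_gc_processor_period
radosgw-admin gc list --include-all | head
# lifecycle work window and per-bucket LC status
ceph config get client.rgw rgw_lifecycle_work_time
radosgw-admin lc list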




[ceph-users] Re: Large omap in index pool even if properly sharded and not "OVER"

2024-07-16 Thread Szabo, Istvan (Agoda)
Finally this article was the solution:
https://access.redhat.com/solutions/6450561


The main point is to trim the bilog:

radosgw-admin bilog trim --bucket="bucket-name" --bucket-id="bucket-id"

Then scrub, done! Happy day!
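
In case anyone wants to script it, a rough sketch (bucket name, bucket ID and the PG id are placeholders; depending on the release the trim may need to be repeated until the bilog list comes back empty, and the grep just matches a field that appears in each entry):

# repeat the trim until no bilog entries remain (placeholder names)
while radosgw-admin bilog list --bucket="bucket-name" --bucket-id="bucket-id" --max-entries=1 | grep -q op_id; do
    radosgw-admin bilog trim --bucket="bucket-name" --bucket-id="bucket-id"
done
# then deep-scrub the PG that reported the large omap object
ceph pg deep-scrub 26.427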

From: Frédéric Nass 
Sent: Monday, July 15, 2024 8:22 PM
To: Szabo, Istvan (Agoda) 
Cc: Richard Bade ; Casey Bodley ; Ceph 
Users 
Subject: Re: [ceph-users] Re: Large omap in index pool even if properly sharded 
and not "OVER"


Right.

This procedure would recatalog any lost S3 objects (out of the 40TB) and would 
allow you to delete them using S3 afterwards if that's what you want.
Note that I don't think it handles different versions of S3 objects, if any 
exist, so you might still end up with orphaned data in the RADOS pool.

If you no longer have any interest in this bucket, you could simply purge the 
bucket with all its data, then use 'rados ls' to list any orphan objects whose 
names begin with the bucket prefix (make sure you saved this information before 
deleting the bucket), and finally use 'rados rm' to remove them.
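
Something along these lines, as an untested sketch (the pool name is an assumption; double-check the marker before removing anything):

# save the bucket marker while the bucket still exists (placeholder names)
radosgw-admin bucket stats --bucket=bucket-name | grep '"marker"'
# after purging the bucket, list leftover RADOS objects whose names start with that marker
rados -p default.rgw.buckets.data ls | grep '^MARKER_' > orphans.txt
# review the list, then remove the orphans
while read -r obj; do rados -p default.rgw.buckets.data rm "$obj"; done < orphans.txt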

Regards,
Frédéric.

- Le 15 Juil 24, à 5:30, Istvan Szabo, Agoda  a 
écrit :
Hi,

But this doesn't clean up, right? It just restores what was lost.


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---




From: Frédéric Nass 
Sent: Friday, July 12, 2024 6:52 PM
To: Richard Bade ; Szabo, Istvan (Agoda) 

Cc: Casey Bodley ; Ceph Users 
Subject: Re: [ceph-users] Re: Large omap in index pool even if properly sharded 
and not "OVER"



- Le 11 Juil 24, à 0:23, Richard Bade hitr...@gmail.com a écrit :

> Hi Casey,
> Thanks for that info on the bilog. I'm in a similar situation with
> large omap objects and we have also had to reshard buckets on
> multisite losing the index on the secondary.
> We also now have a lot of buckets with sync disable so I wanted to
> check that it's always safe to trim the bilog on buckets with sync
> disabled?
> I can see some stale entries with "completed" state and a timestamp of
> a number of months ago but also some that say pending and have no
> timestamp.
>
> Istvan, I can also possibly help with your orphaned 40TB on the secondary 
> zone.
> Each object has the bucket marker in its name. If you do a `rados -p
> {pool_name} ls` and find all the ones that start with the bucket
> marker (found with `radosgw-admin bucket stats
> --bucket={bucket_name}`) then you can do one of two things:
> 1, `rados rm` the object
> 2, restore the index with info from the object itself
>- create a dummy index template (use `radosgw-admin bi get` on a
> known good index to get the structure)
>- grab the etag from the object xattribs and use this and the name
> in the template (`rados -p {pool} getxattr {objname} user.rgw.etag`)
>- use ` radosgw-admin bi put` to create the index
>- use `radosgw-admin bucket check --check-objects --fix
> --bucket={bucket_name}` to fix up the bucket object count and object
> sizes at the end

One could also use `radosgw-admin object reindex --bucket {bucket_name}` to 
scan the data pool for objects that belong to a given bucket and add those 
objects back to the bucket index.

Same logic as rgw-restore-bucket-index [1][2] script that has proven to be 
successful in recovering bucket indexes destroyed by resharding [3].

Regards,
Frédéric.

[1] https://docs.ceph.com/en/latest/man/8/rgw-restore-bucket-index/
[2] https://github.com/ceph/ceph/blob/main/src/rgw/rgw-restore-bucket-index
[3] https://github.com/ceph/ceph/pull/50329

>
> This process takes quite some time and I can't say if it's 100%
> perfect but it enabled us to get to a state where we could delete the
> buckets and clean up the objects.
> I hope this helps.
>
> Regards,
> Richard
>
> On Thu, 11 Jul 2024 at 01:25, Casey Bodley  wrote:
>>
>> On Tue, Jul 9, 2024 at 12:41 PM Szabo, Istvan (Agoda)
>>  wrote:
>> >
>> > Hi Casey,
>> >
>> > 1.
>> > Regarding versioning, the user doesn't use verisoning it if I'm not 
>> > mistaken:
>> > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
>> >
>> > 2.
>> > Regarding multiparts, if it would have multipart thrash, it would be listed
>> > here:
>> >

[ceph-users] Re: Large omap in index pool even if properly sharded and not "OVER"

2024-07-14 Thread Szabo, Istvan (Agoda)
Hi,

But this doesn't clean up, right? It just restores what was lost.


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---




From: Frédéric Nass 
Sent: Friday, July 12, 2024 6:52 PM
To: Richard Bade ; Szabo, Istvan (Agoda) 

Cc: Casey Bodley ; Ceph Users 
Subject: Re: [ceph-users] Re: Large omap in index pool even if properly sharded 
and not "OVER"



- Le 11 Juil 24, à 0:23, Richard Bade hitr...@gmail.com a écrit :

> Hi Casey,
> Thanks for that info on the bilog. I'm in a similar situation with
> large omap objects and we have also had to reshard buckets on
> multisite losing the index on the secondary.
> We also now have a lot of buckets with sync disable so I wanted to
> check that it's always safe to trim the bilog on buckets with sync
> disabled?
> I can see some stale entries with "completed" state and a timestamp of
> a number of months ago but also some that say pending and have no
> timestamp.
>
> Istvan, I can also possibly help with your orphaned 40TB on the secondary 
> zone.
> Each object has the bucket marker in its name. If you do a `rados -p
> {pool_name} ls` and find all the ones that start with the bucket
> marker (found with `radosgw-admin bucket stats
> --bucket={bucket_name}`) then you can do one of two things:
> 1, `rados rm` the object
> 2, restore the index with info from the object itself
>- create a dummy index template (use `radosgw-admin bi get` on a
> known good index to get the structure)
>- grab the etag from the object xattribs and use this and the name
> in the template (`rados -p {pool} getxattr {objname} user.rgw.etag`)
>- use ` radosgw-admin bi put` to create the index
>- use `radosgw-admin bucket check --check-objects --fix
> --bucket={bucket_name}` to fix up the bucket object count and object
> sizes at the end

One could also use `radosgw-admin object reindex --bucket {bucket_name}` to 
scan the data pool for objects that belong to a given bucket and add those 
objects back to the bucket index.

Same logic as rgw-restore-bucket-index [1][2] script that has proven to be 
successful in recovering bucket indexes destroyed by resharding [3].

Regards,
Frédéric.

[1] https://docs.ceph.com/en/latest/man/8/rgw-restore-bucket-index/
[2] https://github.com/ceph/ceph/blob/main/src/rgw/rgw-restore-bucket-index
[3] https://github.com/ceph/ceph/pull/50329

>
> This process takes quite some time and I can't say if it's 100%
> perfect but it enabled us to get to a state where we could delete the
> buckets and clean up the objects.
> I hope this helps.
>
> Regards,
> Richard
>
> On Thu, 11 Jul 2024 at 01:25, Casey Bodley  wrote:
>>
>> On Tue, Jul 9, 2024 at 12:41 PM Szabo, Istvan (Agoda)
>>  wrote:
>> >
>> > Hi Casey,
>> >
>> > 1.
>> > Regarding versioning, the user doesn't use verisoning it if I'm not 
>> > mistaken:
>> > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
>> >
>> > 2.
>> > Regarding multiparts, if it would have multipart thrash, it would be listed
>> > here:
>> > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
>> > as a rgw.multimeta under the usage, right?
>> >
>> > 3.
>> > Regarding the multisite idea, this bucket has been a multisite bucket last 
>> > year
>> > but we had to reshard (accepting to loose the replica on the 2nd site and 
>> > just
>> > keep it in the master site) and that time as expected it has disappeared
>> > completely from the 2nd site (I guess the 40TB thrash still there but can't
>> > really find it how to clean  ). Now it is a single site bucket.
>> > Also it is the index pool, multisite logs should go to the rgw.log pool
>> > shouldn't it?
>>
>> some replication logs are in the log pool, but the per-object logs are
>> stored in the bucket index objects. you can inspect these with
>> `radosgw-admin bilog list --bucket=X`. by default, that will only list
>> --max-entries=1000. you can add --shard-id=Y to look at specific
>> 'large omap' objects
>>
>> even if your single-site bucket doesn't exist on the secondary zone,
>> changes on the primary zone are probably still generating these bilog
>> ent

[ceph-users] Re: Large omap in index pool even if properly sharded and not "OVER"

2024-07-11 Thread Szabo, Istvan (Agoda)
Hi Casey,

Regarding the multisite part: when we resharded the bucket, I had already completely
disabled sync and removed the bucket from it beforehand, step by step (disabled the sync,
removed the pipe and everything), and finally updated the period, so I'm fairly sure it is
not syncing. So I think we can focus on the master zone only, and once I check the bilogs
on the secondary site, I can try to trim there.
The 2nd zone shouldn't cause large omap in the master zone if bucket sync is already
disabled, I guess.

The second thing you suggested is more interesting. When you say shard ID, taking this
entry as an example (.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151), the 151
is the shard you mean, which in the CLI would be, I guess:
radosgw-admin bilog list --bucket=bucketname --shard-id=151
(We use Octopus.)

And should we then go through this huge list, object by object, with the app owner to
identify what is no longer there and not used anymore? Not easy, for sure. Is there any
proper way to do this? We've run bucket check with --fix, but that didn't help.

Thank you


From: Casey Bodley 
Sent: Wednesday, July 10, 2024 8:24 PM
To: Szabo, Istvan (Agoda) 
Cc: Eugen Block ; Ceph Users 
Subject: Re: [ceph-users] Re: Large omap in index pool even if properly sharded 
and not "OVER"



On Tue, Jul 9, 2024 at 12:41 PM Szabo, Istvan (Agoda)
 wrote:
>
> Hi Casey,
>
> 1.
> Regarding versioning, the user doesn't use verisoning it if I'm not mistaken:
> https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
>
> 2.
> Regarding multiparts, if it would have multipart thrash, it would be listed 
> here:
> https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
> as a rgw.multimeta under the usage, right?
>
> 3.
> Regarding the multisite idea, this bucket has been a multisite bucket last 
> year but we had to reshard (accepting to loose the replica on the 2nd site 
> and just keep it in the master site) and that time as expected it has 
> disappeared completely from the 2nd site (I guess the 40TB thrash still there 
> but can't really find it how to clean  ). Now it is a single site bucket.
> Also it is the index pool, multisite logs should go to the rgw.log pool 
> shouldn't it?

some replication logs are in the log pool, but the per-object logs are
stored in the bucket index objects. you can inspect these with
`radosgw-admin bilog list --bucket=X`. by default, that will only list
--max-entries=1000. you can add --shard-id=Y to look at specific
'large omap' objects

even if your single-site bucket doesn't exist on the secondary zone,
changes on the primary zone are probably still generating these bilog
entries. you would need to do something like `radosgw-admin bucket
sync disable --bucket=X` to make it stop. because you don't expect
these changes to replicate, it's safe to delete any of this bucket's
bilog entries with `radosgw-admin bilog trim --end-marker 9
--bucket=X`. depending on ceph version, you may need to run this trim
command in a loop until the `bilog list` output is empty

radosgw does eventually trim bilogs in the background after they're
processed, but the secondary zone isn't processing them in this case

>
> Thank you
>
>
> ________
> From: Casey Bodley 
> Sent: Tuesday, July 9, 2024 10:39 PM
> To: Szabo, Istvan (Agoda) 
> Cc: Eugen Block ; ceph-users@ceph.io 
> Subject: Re: [ceph-users] Re: Large omap in index pool even if properly 
> sharded and not "OVER"
>
>
> in general, these omap entries should be evenly spread over the
> bucket's index shard objects. but there are two features that may
> cause entries to clump on a single shard:
>
> 1. for versioned buckets, multiple versions of the same object name
> map to the same index shard. this can become an issue if an
> application is repeatedly overwriting an object without cleaning up
> old versions. lifecycle rules can help to manage these noncurrent
> versions
>
> 2. during a multipart upload, all of the parts are tracked on the same
> index shard as the final object name. if applications are leaving a
> lot of incomplete multipart uploads behind (especially if they target
> the same object name) this can lead to similar clumping. the S3 api
> has operations to list and abort incomplete multipart uploads, along
> with lifecycle rules to automate their cleanup
>
> separately, multisite clusters use these same in

[ceph-users] Re: Large omap in index pool even if properly sharded and not "OVER"

2024-07-09 Thread Szabo, Istvan (Agoda)
Hi Casey,

1.
Regarding versioning, the user doesn't use versioning, if I'm not mistaken:
https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt

2.
Regarding multiparts: if it had multipart trash, it would be listed
here:
https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
as rgw.multimeta under the usage, right?

3.
Regarding the multisite idea: this bucket was a multisite bucket last year, but we had to
reshard it (accepting that we would lose the replica on the 2nd site and just keep it in
the master site), and at that time, as expected, it disappeared completely from the 2nd
site (I guess the 40TB of trash is still there, but I can't really find out how to clean
it up). Now it is a single-site bucket.
Also, this is the index pool; multisite logs should go to the rgw.log pool,
shouldn't they?


Thank you


From: Casey Bodley 
Sent: Tuesday, July 9, 2024 10:39 PM
To: Szabo, Istvan (Agoda) 
Cc: Eugen Block ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Large omap in index pool even if properly sharded 
and not "OVER"



in general, these omap entries should be evenly spread over the
bucket's index shard objects. but there are two features that may
cause entries to clump on a single shard:

1. for versioned buckets, multiple versions of the same object name
map to the same index shard. this can become an issue if an
application is repeatedly overwriting an object without cleaning up
old versions. lifecycle rules can help to manage these noncurrent
versions

2. during a multipart upload, all of the parts are tracked on the same
index shard as the final object name. if applications are leaving a
lot of incomplete multipart uploads behind (especially if they target
the same object name) this can lead to similar clumping. the S3 api
has operations to list and abort incomplete multipart uploads, along
with lifecycle rules to automate their cleanup

separately, multisite clusters use these same index shards to store
replication logs. if sync gets far enough behind, these log entries
can also lead to large omap warnings

On Tue, Jul 9, 2024 at 10:25 AM Szabo, Istvan (Agoda)
 wrote:
>
> It's the same bucket:
> https://gist.github.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d
>
>
> 
> From: Eugen Block 
> Sent: Tuesday, July 9, 2024 8:03 PM
> To: Szabo, Istvan (Agoda) 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] Re: Large omap in index pool even if properly 
> sharded and not "OVER"
>
>
> Are those three different buckets? Could you share the stats for each of them?
>
> radosgw-admin bucket stats --bucket=
>
> Zitat von "Szabo, Istvan (Agoda)" :
>
> > Hello,
> >
> > Yeah, still:
> >
> > the .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
> > 290005
> >
> > and the
> > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726 | wc -l
> > 289378
> >
> > And just make me happy more I have one more
> > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.6 | wc -l
> > 181588
> >
> > This is my crush tree (I'm using host based crush rule)
> > https://gist.githubusercontent.com/Badb0yBadb0y/9bea911701184a51575619bc99cca94d/raw/e5e4a918d327769bb874aaed279a8428fd7150d5/gistfile1.txt
> >
> > I'm thinking could that be the issue that host 2s13-15 has less nvme
> > osd (however size wise same as in the other 12 host where have 8x
> > nvme osd) than the others?
> > But the pgs are located like this:
> >
> > pg26.427
> > osd.261 host8
> > osd.488 host13
> > osd.276 host4
> >
> > pg26.606
> > osd.443 host12
> > osd.197 host8
> > osd.524 host14
> >
> > pg26.78c
> > osd.89 host7
> > osd.406 host11
> > osd.254 host6
> >
> > If pg26.78c wouldn't be here I'd say 100% the nvme osd distribution
> > based on host is the issue, however this pg is not located on any of
> > the 4x nvme osd nodes 
> >
> > Ty
> >
> > 
> > From: Eugen Block 
> > Sent: Tuesday, July 9, 2024 6:02 PM
> > To: ceph-users@ceph.io 
> > Subject: [ceph-users] Re: Large omap in index pool even if properly
> > sharded and not "OVER"
> >

[ceph-users] Re: Large omap in index pool even if properly sharded and not "OVER"

2024-07-09 Thread Szabo, Istvan (Agoda)
It's the same bucket:
https://gist.github.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d



From: Eugen Block 
Sent: Tuesday, July 9, 2024 8:03 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Large omap in index pool even if properly sharded 
and not "OVER"



Are those three different buckets? Could you share the stats for each of them?

radosgw-admin bucket stats --bucket=

Zitat von "Szabo, Istvan (Agoda)" :

> Hello,
>
> Yeah, still:
>
> the .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
> 290005
>
> and the
> .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726 | wc -l
> 289378
>
> And just make me happy more I have one more
> .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.6 | wc -l
> 181588
>
> This is my crush tree (I'm using host based crush rule)
> https://gist.githubusercontent.com/Badb0yBadb0y/9bea911701184a51575619bc99cca94d/raw/e5e4a918d327769bb874aaed279a8428fd7150d5/gistfile1.txt
>
> I'm thinking could that be the issue that host 2s13-15 has less nvme
> osd (however size wise same as in the other 12 host where have 8x
> nvme osd) than the others?
> But the pgs are located like this:
>
> pg26.427
> osd.261 host8
> osd.488 host13
> osd.276 host4
>
> pg26.606
> osd.443 host12
> osd.197 host8
> osd.524 host14
>
> pg26.78c
> osd.89 host7
> osd.406 host11
> osd.254 host6
>
> If pg26.78c wouldn't be here I'd say 100% the nvme osd distribution
> based on host is the issue, however this pg is not located on any of
> the 4x nvme osd nodes 
>
> Ty
>
> 
> From: Eugen Block 
> Sent: Tuesday, July 9, 2024 6:02 PM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] Re: Large omap in index pool even if properly
> sharded and not "OVER"
>
>
> Hi,
>
> the number of shards looks fine, maybe this was just a temporary
> burst? Did you check if the rados objects in the index pool still have
> more than 200k omap objects? I would try someting like
>
> rados -p  listomapkeys
> .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
>
>
> Zitat von "Szabo, Istvan (Agoda)" :
>
>> Hi,
>>
>> I have a pretty big bucket which sharded with 1999 shard so in
>> theory can hold close to 200m objects (199.900.000).
>> Currently it has 54m objects.
>>
>> Bucket limit check looks also good:
>>  "bucket": ""xyz,
>>  "tenant": "",
>>  "num_objects": 53619489,
>>  "num_shards": 1999,
>>  "objects_per_shard": 26823,
>>  "fill_status": "OK"
>>
>> This is the bucket id:
>> "id": "9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1"
>>
>> This is the log lines:
>> 2024-06-27T10:41:05.679870+0700 osd.261 (osd.261) 9643 : cluster
>> [WRN] Large omap object found. Object:
>> 26:e433e65c:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151:head
>>  PG: 26.3a67cc27 (26.427) Key count: 236919 Size
>> (bytes):
>> 89969920
>>
>> 2024-06-27T10:43:35.557835+0700 osd.89 (osd.89) 9000 : cluster [WRN]
>> Large omap object found. Object:
>> 26:31ff4df1:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726:head
>>  PG: 26.8fb2ff8c (26.78c) Key count: 236495 Size
>> (bytes):
>> 95560458
>>
>> Tried to deep scrub the affected pgs, tried to deep-scrub the
>> mentioned osds in the log but didn't help.
>> Why? What I'm missing?
>>
>> Thank you in advance for your help.
>>

[ceph-users] Re: Large omap in index pool even if properly sharded and not "OVER"

2024-07-09 Thread Szabo, Istvan (Agoda)
Hello,

Yeah, still:

the .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
290005

and the
.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726 | wc -l
289378

And to make me even happier, I have one more:
.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.6 | wc -l
181588

This is my crush tree (I'm using host based crush rule)
https://gist.githubusercontent.com/Badb0yBadb0y/9bea911701184a51575619bc99cca94d/raw/e5e4a918d327769bb874aaed279a8428fd7150d5/gistfile1.txt

I'm wondering whether the issue could be that hosts 2s13-15 have fewer NVMe OSDs than the
others (although capacity-wise they are the same as the other 12 hosts, which have 8x
NVMe OSDs each)?
But the PGs are located like this:
But the pgs are located like this:

pg26.427
osd.261 host8
osd.488 host13
osd.276 host4

pg26.606
osd.443 host12
osd.197 host8
osd.524 host14

pg26.78c
osd.89 host7
osd.406 host11
osd.254 host6

If pg 26.78c weren't in this list, I'd say the per-host NVMe OSD distribution is 100% the
issue; however, this PG is not located on any of the 4x NVMe OSD nodes.

Ty


From: Eugen Block 
Sent: Tuesday, July 9, 2024 6:02 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Large omap in index pool even if properly sharded and 
not "OVER"



Hi,

the number of shards looks fine, maybe this was just a temporary
burst? Did you check if the rados objects in the index pool still have
more than 200k omap keys? I would try something like

rados -p  listomapkeys
.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
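
To check all shards of that bucket at once rather than one by one, something like this could work (a sketch; the pool name is an assumption):

# count omap keys per index shard and show the largest ones
for shard in $(rados -p default.rgw.buckets.index ls | grep '^\.dir\.9213182a-14ba-48ad-bde9-289a1c0c0de8\.2479481907\.1\.'); do
    echo "$(rados -p default.rgw.buckets.index listomapkeys "$shard" | wc -l) $shard"
done | sort -rn | head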


Zitat von "Szabo, Istvan (Agoda)" :

> Hi,
>
> I have a pretty big bucket which sharded with 1999 shard so in
> theory can hold close to 200m objects (199.900.000).
> Currently it has 54m objects.
>
> Bucket limit check looks also good:
>  "bucket": ""xyz,
>  "tenant": "",
>  "num_objects": 53619489,
>  "num_shards": 1999,
>  "objects_per_shard": 26823,
>  "fill_status": "OK"
>
> This is the bucket id:
> "id": "9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1"
>
> This is the log lines:
> 2024-06-27T10:41:05.679870+0700 osd.261 (osd.261) 9643 : cluster
> [WRN] Large omap object found. Object:
> 26:e433e65c:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151:head 
> PG: 26.3a67cc27 (26.427) Key count: 236919 Size (bytes):
> 89969920
>
> 2024-06-27T10:43:35.557835+0700 osd.89 (osd.89) 9000 : cluster [WRN]
> Large omap object found. Object:
> 26:31ff4df1:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726:head 
> PG: 26.8fb2ff8c (26.78c) Key count: 236495 Size (bytes):
> 95560458
>
> Tried to deep scrub the affected pgs, tried to deep-scrub the
> mentioned osds in the log but didn't help.
> Why? What I'm missing?
>
> Thank you in advance for your help.
>


[ceph-users] Large omap in index pool even if properly sharded and not "OVER"

2024-06-27 Thread Szabo, Istvan (Agoda)
Hi,

I have a pretty big bucket which is sharded with 1999 shards, so in theory it can hold
close to 200M objects (199,900,000).
Currently it has 54m objects.

Bucket limit check looks also good:
 "bucket": ""xyz,
 "tenant": "",
 "num_objects": 53619489,
 "num_shards": 1999,
 "objects_per_shard": 26823,
 "fill_status": "OK"

This is the bucket id:
"id": "9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1"

These are the log lines:
2024-06-27T10:41:05.679870+0700 osd.261 (osd.261) 9643 : cluster [WRN] Large 
omap object found. Object: 
26:e433e65c:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151:head 
PG: 26.3a67cc27 (26.427) Key count: 236919 Size (bytes): 89969920

2024-06-27T10:43:35.557835+0700 osd.89 (osd.89) 9000 : cluster [WRN] Large omap 
object found. Object: 
26:31ff4df1:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726:head 
PG: 26.8fb2ff8c (26.78c) Key count: 236495 Size (bytes): 95560458

I tried to deep-scrub the affected PGs and the OSDs mentioned in the log, but it
didn't help.
Why? What am I missing?

Thank you in advance for your help.




[ceph-users] Daily slow ops around the same time on different osds

2024-06-18 Thread Szabo, Istvan (Agoda)
Hi,

For some reason, every day I get slow ops which affect the rgw.log pool the most
(which is pretty small: 32 PGs / 57k objects / 9.5 GB of data).

Could the issue be related to RocksDB tasks?

Some log lines:
...
2024-06-18T14:17:43.849+0700 7fd8002ea700  1 heartbeat_map reset_timeout 
'OSD::osd_op_tp thread 0x7fd8002ea700' had timed out after 15
2024-06-18T14:17:50.865+0700 7fd81a31e700 -1 osd.461 472771 get_health_metrics 
reporting 1 slow ops, oldest is osd_op(client.3349483990.0:462716190 22.11 
22:8deed178:::meta.log.08e85e9f-c16e-43f0-b88d-362f3b7ced2d.15:head [call 
log.list in=69b] snapc 0=[] ondisk+read+known_if_redirected e472771)
2024-06-18T14:17:50.865+0700 7fd81a31e700  0 log_channel(cluster) log [WRN] : 1 
slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'dc.rgw.log' : 1 
])
2024-06-18T14:19:36.040+0700 7fd81e5b4700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp
...
2024-06-18T14:19:51.908+0700 7fd81a31e700 -1 osd.461 472771 get_health_metrics 
reporting 2 slow ops, oldest is osd_op(client.3371539359.0:5095510 22.11 
22:8df1b9e1:::datalog.sync-status.shard.61c9d940-fde4-4bed-9389-edc8d7741817.111:
head [call lock.lock in=64b] snapc 0=[] ondisk+write+known_if_redirected 
e472771)
2024-06-18T14:19:51.908+0700 7fd81a31e700  0 log_channel(cluster) log [WRN] : 2 
slow requests (by type [ 'delayed' : 2 ] most affected pool [ 'dc.rgw.log' : 2 
])
2024-06-18T14:19:52.899+0700 7fd81a31e700 -1 osd.461 472771 get_health_metrics 
reporting 2 slow ops, oldest is osd_op(client.3371539359.0:5095510 22.11 
22:8df1b9e1:::datalog.sync-status.shard.61c9d940-fde4-4bed-9389-edc8d7741817.111:

I found this thread: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/FR5V466HBXGRVL3Z3RAFKUPQ2FGK2T53/
but I think it is not relevant for me.

We use only SSDs without a separate RocksDB device, so we don't have spillover.
The CPU is not overloaded (80% idle); however, it's interesting that in vmstat the
"r" column is pretty high, which indicates processes waiting for the CPU:

procs ---memory-- ---swap-- -io -system-- --cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa st
12  0  0 6205288 329803136 1563384400  2149  189400  5  3 
92  0  0
17  0  0 6156780 329827840 1563725200 125584 250416 646098 1965051 
10  8 82  0  0
21  0  0 6154636 329849504 1563611200 99320 245024 493324 1682377  
7  6 87  0  0
16  0  0 6144256 329869664 1563692400 87484 301968 623345 1993374  
8  7 84  0  0
19  0  0 6057012 329890080 1563793200 160444 303664 549194 1820562  
8  6 85  0  0

Any idea what this could indicate on Octopus 15.2.17 with Ubuntu 20.04?

ty
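
For what it's worth, a few things that could narrow this down (a sketch; osd.461 is just the OSD from the log above, and the ceph daemon commands have to run on its host):

# slowest recent ops on the affected OSD
ceph daemon osd.461 dump_historic_slow_ops
# RocksDB compaction counters
ceph daemon osd.461 perf dump | grep -i -A 3 compact
# trigger a manual compaction to see whether the daily stall correlates with compaction
ceph tell osd.461 compact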





[ceph-users] Separated multisite sync and user traffic, doable?

2024-06-13 Thread Szabo, Istvan (Agoda)
Hi,

Could it cause any issue if the endpoints defined in the zonegroup are not in the
endpoint list behind haproxy?
The question is mainly about the role of the endpoint servers in the zonegroup
list: is their role sync only, or something else as well?

This would be the scenario, could it work?

  *
I have 3 mon/mgr servers and 15 OSD nodes.
  *
The RGWs on the mon/mgr nodes would be in the zonegroup definition like this:

  "zones": [
    {
      "id": "61c9sdf40-fdsd-4sdd-9rty9-ed56jda41817",
      "name": "dc",
      "endpoints": [
        "http://mon1:8080",
        "http://mon2:8080",
        "http://mon3:8080"
      ],

  *
However, for user traffic I'd use an haproxy endpoint in front of the 15 OSD-node
RGWs (1x RGW per OSD node); a minimal haproxy sketch follows below.
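
Roughly what I have in mind for the user-facing side (a minimal haproxy sketch with placeholder names; the mon/mgr RGWs would stay out of this backend and only be listed as zonegroup endpoints for sync):

frontend s3_user_traffic
    bind *:443 ssl crt /etc/haproxy/certs/s3.pem
    default_backend rgw_osd_nodes

backend rgw_osd_nodes
    balance leastconn
    # the 15 OSD-node RGWs serve client traffic only (placeholder hostnames)
    server osd01 osd01:8080 check
    server osd02 osd02:8080 check
    # ... up to osd15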

Ty




[ceph-users] How radosgw considers that the file upload is done?

2024-06-12 Thread Szabo, Istvan (Agoda)
Hi,

I wonder how radosgw knows that a transaction is done and that the connection between
the client and the gateway didn't break.

Let's look at one request:

2024-06-12T16:26:03.386+0700 7fa34c7f0700  1 beast: 0x7fa5bc776750: 1.1.1.1 - - 
[2024-06-12T16:26:03.386063+0700] "PUT /bucket/0/2/966394.delta HTTP/1.1" 200 
238 - "User-Agent: APN/1.0 Hortonworks/1.0 HDP/3.1.0.0-78, Hadoop 3.2.2, 
aws-sdk-java/1.11.563 Linux/5.15.0-101-generic 
OpenJDK_64-Bit_Server_VM/11.0.18+10-post-Debian-1deb10u1 java/11.0.18 
scala/2.12.15 vendor/Debian 
com.amazonaws.services.s3.transfer.TransferManager/1.11.563" -
2024-06-12T16:26:03.386+0700 7fa4e9ffb700  1 == req done req=0x7fa5a4572750 
op status=0 http_status=200 latency=737ns ==

What I can see here is the
req done
op status=0

I guess that if the connection broke between the client and the gateway, the request
would still be logged as done, but what is op status? Is that the field I'm actually
looking for? Would it maybe have a different value if the connection broke?

Thank you




[ceph-users] Best practice regarding rgw scaling

2024-05-23 Thread Szabo, Istvan (Agoda)
Hi,

I wonder what the best practice is for scaling RGW: increase the thread count or
spin up more gateways?


  *
Let's say I have 21,000 connections on my haproxy.
  *
I have 3 physical gateway servers, so each of them needs to serve 7,000
connections.

This means that with a 512-thread pool, each server needs roughly 14 gateway
instances (7000 / 512 ≈ 13.7), about 42 in the cluster,
or
3 gateways with 8192 RGW threads each?
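
For reference, the thread knob I mean is rgw_thread_pool_size; a minimal ceph.conf sketch (instance name and values are only examples):

[client.rgw.gw1]
    rgw_frontends = beast port=8080
    rgw_thread_pool_size = 512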

Thank you




[ceph-users] Multisite: metadata behind on shards

2024-05-12 Thread Szabo, Istvan (Agoda)
Hi,

I wonder what the mechanism behind the sync is, because I need to restart all the
gateways on the remote sites every 2 days to keep them in sync. (Octopus 15.2.7)
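
For reference, the shard backlog shows up in the standard sync status output on the remote site; this is what I compare before and after the restarts:

radosgw-admin sync status
radosgw-admin metadata sync status
radosgw-admin sync error list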

Thank you




[ceph-users] Numa pinning best practices

2024-05-07 Thread Szabo, Istvan (Agoda)
Hi,

I haven't really found a proper description of how to pin OSDs to NUMA nodes in a
2-socket setup, only this:
https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#Ceph-Storage-Node-NUMA-Tuning


Does anybody have a good how-to on this topic?
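
A sketch of what I think the relevant pieces are (OSD id and NUMA node are just examples, please correct me if this is the wrong direction):

# show the NUMA affinity Ceph detected per OSD
ceph osd numa-status
# pin a specific OSD to NUMA node 0, then restart it so the pinning takes effect
ceph config set osd.12 osd_numa_node 0
systemctl restart ceph-osd@12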

Thank you




[ceph-users] NVME node disks maxed out during rebalance after adding to existing cluster

2024-05-02 Thread Szabo, Istvan (Agoda)
Hi,

I get slow heartbeats on both the front and back networks with the extra node added to
the cluster, and this occasionally causes slow ops and failed-OSD reports.

I'm extending our cluster with 3 additional servers configured somewhat differently
from the original 12.
Our cluster (latest Octopus) is an object-store cluster with 12 identical nodes
(8x 15.3 TB SSDs with 4 OSDs each, 512 GB of memory, 96 vCPU cores, ...) and it
hosts a 4:2 EC data pool.
The 3 new nodes (only 1 is done so far; the 2nd is in progress but hits this issue
during rebalance) have 8x 15.3 TB NVMe drives with 4 OSDs on each.

The NVMe drive specification: https://i.ibb.co/BVmLKnf/currentnvme.png


The old server SSD spec: https://i.ibb.co/dkD3VKx/oldssd.png

Iostat on new nvme: https://i.ibb.co/PF0hrVW/iostat.png

The rebalance is running with the slowest options (max_backfill / op priority / max
recovery = 1),
but it generates very high iowait; it seems the disks are not fast enough (but then
why didn't the previous node have this issue?).
Here are the metrics for the disk that is running the backfill/rebalance now
(FYI, we have 3.2B objects in the cluster):
https://i.ibb.co/LNXCRbj/disks.png

I wonder what I'm missing, or how this can happen.

Here you can see gigantic latencies, failed osds and slow ops:
https://i.ibb.co/Jn0sj9g/laten.png
Thank you for your help
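
One more brake that might help, as a sketch with example values (assuming osd_recovery_sleep_ssd applies to these flash OSDs):

# backfill/recovery are already at 1, shown for completeness
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
# add a small pause between recovery ops on flash OSDs (example value)
ceph config set osd osd_recovery_sleep_ssd 0.1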




[ceph-users] Re: RBD image metric

2024-04-04 Thread Szabo, Istvan (Agoda)
Hi,

Let's say thin-provisioned, and no, fast-diff and object-map are not enabled. As I
understand it, those are requirements to be able to use "du".


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---




From: Anthony D'Atri 
Sent: Thursday, April 4, 2024 3:19 AM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: [ceph-users] Re: RBD image metric



Depending on your Ceph release you might need to enable rbdstats.

Are you after provisioned, allocated, or both sizes?  Do you have object-map 
and fast-diff enabled?  They speed up `rbd du` massively.
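
If they aren't enabled yet, roughly this should do it (pool/image names are placeholders; note that object-map also requires exclusive-lock on the image):

# enable the features on an existing image, then rebuild the object map
rbd feature enable rbd/myimage object-map fast-diff
rbd object-map rebuild rbd/myimage
# per-image and pool-wide usage
rbd du rbd/myimage
rbd du -p rbd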

> On Apr 3, 2024, at 00:26, Szabo, Istvan (Agoda)  
> wrote:
>
> Hi,
>
> Trying to pull out some metrics from ceph about the rbd images sizes but 
> haven't found anything only pool related metrics.
>
> Wonder is there any metric about images or I need to create by myself to 
> collect it with some third party tool?
>
> Thank you
>


[ceph-users] RBD image metric

2024-04-02 Thread Szabo, Istvan (Agoda)
Hi,

I'm trying to pull some metrics out of Ceph about RBD image sizes, but I haven't
found anything except pool-related metrics.

I wonder, is there any metric about images, or do I need to collect it myself with
some third-party tool?

Thank you




[ceph-users] 1x port from bond down causes all osd down in a single machine

2024-03-26 Thread Szabo, Istvan (Agoda)
Hi,

I wonder what we are missing from the netplan configuration on Ubuntu that Ceph needs
in order to tolerate a link failure properly.
We are using this bond configuration on Ubuntu 20.04 with Octopus:


bond1:
  macaddress: x.x.x.x.x.50
  dhcp4: no
  dhcp6: no
  addresses:
- 192.168.199.7/24
  interfaces:
- ens2f0np0
- ens2f1np1
  mtu: 9000
  parameters:
mii-monitor-interval: 100
mode: 802.3ad
lacp-rate: fast
transmit-hash-policy: layer3+4



ens2f1np1 failed and caused slow ops and all OSDs down ... = disaster.

Any idea what is wrong with this bond config?
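
For what it's worth, the first things I'd check after such a failure (bond1 and the NIC name as in the config above):

# per-slave link state and LACP partner details
cat /proc/net/bonding/bond1
# kernel messages around the link flap
journalctl -k | grep -Ei 'bond1|ens2f1np1'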

Thank you




[ceph-users] Robust cephfs design/best practice

2024-03-15 Thread Szabo, Istvan (Agoda)
Hi,

I'd like to add CephFS to our production object-store/block-storage cluster, so I'd
like to collect hands-on experience (good to know / be careful / avoid, etc.) beyond
the Ceph documentation.

Thank you




[ceph-users] Re: Upgrading nautilus / centos7 to octopus / ubuntu 20.04. - Suggestions and hints?

2024-01-16 Thread Szabo, Istvan (Agoda)
Hi Goetz,

Which method did you finally choose?
We've done a successful migration from CentOS 8 to Ubuntu 20.04, but we have a CentOS 7
Nautilus cluster which we'd like to move to Ubuntu 20.04 / Octopus, same as you.

Has any of you tried to skip Rocky 8 in the flow?


Thank you


From: Boris Behrens 
Sent: Wednesday, August 2, 2023 1:24 AM
To: Götz Reinicke 
Cc: ceph-users@ceph.io 
Subject: [ceph-users] Re: Upgrading nautilus / centos7 to octopus / ubuntu 
20.04. - Suggestions and hints?



Hi Goetz,
I've done the same, and went to Octopus and to Ubuntu. It worked like a
charm and with pip, you can get the pecan library working. I think I did it
with this:
yum -y install python36-six.noarch python36-PyYAML.x86_64
pip3 install pecan werkzeug cherrypy

Worked very well, until we got hit by this bug:
https://tracker.ceph.com/issues/53729#note-65
Nautilus seem not to have tooling to detect it, and the fix is not
backported to octopus.

And because our clusters started to act badly after the octopus upgrade,
and we fast forwarded to pacific (untested emergency cluster upgrades are
okayish but ugly :D ).

And because of the bug, we went another route with the last cluster.
I reinstalled all hosts with ubuntu 18.04, then update straight to pacific,
and then upgrade to ubuntu 20.04.

Hope that helped.

Cheers
 Boris


Am Di., 1. Aug. 2023 um 20:06 Uhr schrieb Götz Reinicke <
goetz.reini...@filmakademie.de>:

> Hi,
>
> As I’v read and thought a lot about the migration as this is a bigger
> project, I was wondering if anyone has done that already and might share
> some notes or playbooks, because in all readings there where some parts
> missing or miss understandable to me.
>
> I do have some different approaches in mind, so may be you have some
> suggestions or hints.
>
> a) upgrade nautilus on centos 7 with the few missing features like
> dashboard and prometheus. After that migrate one node after an other to
> ubuntu 20.04 with octopus and than upgrade ceph to the recent stable
> version.
>
> b) migrate one node after an other to ubuntu 18.04 with nautilus and then
> upgrade to octupus and after that to ubuntu 20.04.
>
> or
>
> c) upgrade one node after an other to ubuntu 20.04 with octopus and join
> it to the cluster until all nodes are upgraded.
>
>
> For test I tried c) with a mon node, but adding that to the cluster fails
> with some failed state, still probing for the other mons. (I dont have the
> right log at hand right now.)
>
> So my questions are:
>
> a) What would be the best (most stable) migration path and
>
> b) is it in general possible to add a new octopus mon (not upgraded one)
> to a nautilus cluster, where the other mons are still on nautilus?
>
>
> I hope my thoughts and questions are understandable :)
>
> Thanks for any hint and suggestion. Best . Götz


--
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.


[ceph-users] Re: About ceph disk slowops effect to cluster

2024-01-12 Thread Szabo, Istvan (Agoda)
Is it better?


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---




From: Phong Tran Thanh 
Sent: Friday, January 12, 2024 3:32 PM
To: David Yang 
Cc: ceph-users@ceph.io 
Subject: [ceph-users] Re: About ceph disk slowops effect to cluster

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


I updated the config:
osd_mclock_profile=custom
osd_mclock_scheduler_background_recovery_lim=0.2
osd_mclock_scheduler_background_recovery_res=0.2
osd_mclock_scheduler_client_wgt=6
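
As an illustration, these values could be applied cluster-wide through the MON
config store roughly like this (the values are the ones listed above; whether
they suit a given cluster is an assumption, and on some releases the recovery
res/lim options only become tunable once the custom profile is active):

# switch the mclock scheduler to the custom profile first
ceph config set osd osd_mclock_profile custom
# then cap/reserve background recovery and weight client traffic higher
ceph config set osd osd_mclock_scheduler_background_recovery_lim 0.2
ceph config set osd osd_mclock_scheduler_background_recovery_res 0.2
ceph config set osd osd_mclock_scheduler_client_wgt 6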

On Fri, Jan 12, 2024 at 15:31 Phong Tran Thanh <tranphong...@gmail.com> wrote:

> Hi Yang and Anthony,
>
> I found the solution for this problem on 7200rpm HDD disks.
>
> When the cluster recovers, one or multiple disks get slow ops, which then
> affect the cluster; we can change these configurations, which may reduce
> IOPS during recovery.
> osd_mclock_profile=custom
> osd_mclock_scheduler_background_recovery_lim=0.2
> osd_mclock_scheduler_background_recovery_res=0.2
> osd_mclock_scheduler_client_wgt
>
>
> On Wed, Jan 10, 2024 at 11:22 David Yang  wrote:
>
>> The 2*10Gbps shared network seems to be full (1.9GB/s).
>> Is it possible to reduce part of the workload and wait for the cluster
>> to return to a healthy state?
>> Tip: Erasure coding needs to collect all data blocks when recovering
>> data, so it takes up a lot of network card bandwidth and processor
>> resources.
>>
>
>
> --
> Best regards,
>
> 
>
> *Tran Thanh Phong*
>
> Email: tranphong...@gmail.com
> Skype: tranphong079
>


--
Best regards,


*Tran Thanh Phong*

Email: tranphong...@gmail.com
Skype: tranphong079
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures

2024-01-09 Thread Szabo, Istvan (Agoda)
Hi,

I'm using this in the HAProxy HTTPS frontend config; it has worked well so
far:

stick-table type ip size 1m expire 10s store http_req_rate(10s)

tcp-request inspect-delay 10s
tcp-request content track-sc0 src
http-request deny deny_status 429 if { sc_http_req_rate(0) gt 1 }
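
For context, a minimal sketch of how those lines might sit inside a full
frontend section (the bind line, certificate path and backend name are
assumptions, not taken from the setup above):

frontend rgw_https
    bind :443 ssl crt /etc/haproxy/rgw.pem
    # per-source-IP table tracking the HTTP request rate over 10 seconds
    stick-table type ip size 1m expire 10s store http_req_rate(10s)
    tcp-request inspect-delay 10s
    tcp-request content track-sc0 src
    # reject clients above the allowed rate with 429 Too Many Requests
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 1 }
    default_backend rgw_backend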


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---




From: Christian Rohmann 
Sent: Tuesday, January 9, 2024 3:33 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: RGW rate-limiting or anti-hammering for (external) 
auth requests // Anti-DoS measures

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Happy New Year Ceph-Users!

With the holidays and people likely being away, I take the liberty to
bluntly BUMP this question about protecting RGW from DoS below:


On 22.12.23 10:24, Christian Rohmann wrote:
> Hey Ceph-Users,
>
>
> RGW does have options [1] to rate limit ops or bandwidth per bucket or
> user.
> But those only come into play when the request is authenticated.
>
> I'd like to also protect the authentication subsystem from malicious
> or invalid requests.
> So in case e.g. some EC2 credentials are not valid (anymore) and
> clients start hammering the RGW with those requests, I'd like to make
> it cheap to deal with those requests. Especially in case some external
> authentication like OpenStack Keystone [2] is used, valid access
> tokens are cached within the RGW. But requests with invalid
> credentials end up being sent at full rate to the external API [3] as
> there is no negative caching. And even if there was, that would only
> limit the external auth requests for the same set of invalid
> credentials, but it would surely reduce the load in that case:
>
> Since the HTTP request is blocking  
>
>
>> [...]
>> 2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to
>> https://keystone.example.com/v3/s3tokens
>> 2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request
>> mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
>> 2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http
>> request
>> 2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request
>> req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0
>> [...]
>
>
> this does not only stress the external authentication API (keystone in
> this case), but also blocks RGW threads for the duration of the
> external call.
>
> I am currently looking into using the capabilities of HAProxy to rate
> limit requests based on their resulting http-response [4]. So in
> essence to rate-limit or tarpit clients that "produce" a high number
> of 403 "InvalidAccessKeyId" responses. To have less collateral it
> might make sense to limit based on the presented credentials
> themselves. But this would require to extract and track HTTP headers
> or URL parameters (presigned URLs) [5] and to put them into tables.
>
>
> * What are your thoughts on the matter?
> * What kind of measures did you put in place?
> * Does it make sense to extend RGWs capabilities to deal with those
> cases itself?
> ** adding negative caching
> ** rate limits on concurrent external authentication requests (or is
> there a pool of connections for those requests?)
>
>
>
> Regards
>
>
> Christian
>
>
>
> [1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
> [2]
> https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
> [3]
> https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
> [4]
> https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
> [5]
> https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro
>
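
A rough, untested sketch of the response-based variant discussed above: track
clients in the frontend, count 403 responses against them in the backend, and
deny once the 403 rate gets too high (ports, table sizing and the threshold
are assumptions):

frontend rgw_fe
    bind :8080
    # per-client table storing a general-purpose counter and its 60s rate
    stick-table type ip size 1m expire 10m store gpc0,gpc0_rate(60s)
    http-request track-sc0 src
    # refuse clients that recently produced many 403s
    http-request deny deny_status 429 if { sc0_gpc0_rate gt 10 }
    default_backend rgw_be

backend rgw_be
    server rgw1 127.0.0.1:7480 check
    # count auth failures against the tracked client
    http-response sc-inc-gpc0(0) if { status 403 }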
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing 

[ceph-users] Re: increasing number of (deep) scrubs

2023-12-12 Thread Szabo, Istvan (Agoda)
Hi,

You are on octopus right?


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---




From: Frank Schilder 
Sent: Tuesday, December 12, 2023 7:33 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: increasing number of (deep) scrubs

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi all,

if you follow this thread, please see the update in "How to configure something 
like osd_deep_scrub_min_interval?" 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/).
 I found out how to tune the scrub machine and I posted a quick update in the 
other thread, because the solution was not to increase the number of scrubs, 
but to tune parameters.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: Monday, January 9, 2023 9:14 AM
To: Dan van der Ster
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Dan,

thanks for your answer. I don't have a problem with increasing osd_max_scrubs 
(=1 at the moment) as such. I would simply prefer a somewhat finer grained way 
of controlling scrubbing than just doubling or tripling it right away.

Some more info. These 2 pools are data pools for a large FS. Unfortunately, we 
have a large percentage of small files, which is a pain for recovery and 
seemingly also for deep scrubbing. Our OSDs are about 25% used and I had to 
increase the warning interval already to 2 weeks. With all the warning grace 
parameters this means that we manage to deep scrub everything about every 
month. I need to plan for 75% utilisation and a 3 months period is a bit far on 
the risky side.

Our data is to a large percentage cold data. Client reads will not do the check 
for us, we need to combat bit-rot pro-actively.

The reasons I'm interested in parameters initiating more scrubs while also 
converting more scrubs into deep scrubs are, that

1) scrubs seem to complete very fast. I almost never catch a PG in state 
"scrubbing", I usually only see "deep scrubbing".

2) I suspect the low deep-scrub count is due to a low number of deep-scrubs 
scheduled and not due to conflicting per-OSD deep scrub reservations. With the 
OSD count we have and the distribution over 12 servers I would expect at least 
a peak of 50% OSDs being active in scrubbing instead of the 25% peak I'm seeing 
now. It ought to be possible to schedule more PGs for deep scrub than actually 
are.

3) Every OSD having only 1 deep scrub active seems to have no measurable impact 
on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD it 
would already help a lot. Once this is working, I can eventually increase 
osd_max_scrubs when the OSDs fill up. For now I would just like that (deep) 
scrub scheduling looks a bit harder and schedules more eligible PGs per time 
unit.

If we can get deep scrubbing up to an average of 42PGs completing per hour with 
keeping osd_max_scrubs=1 to maintain current IO impact, we should be able to 
complete a deep scrub with 75% full OSDs in about 30 days. This is the current 
tail-time with 25% utilisation. I believe currently a deep scrub of a PG in 
these pools takes 2-3 hours. Its just a gut feeling from some repair and 
deep-scrub commands, I would need to check logs for more precise info.

Increasing osd_max_scrubs would then be a further and not the only option to 
push for more deep scrubbing. My expectation would be that values of 2-3 are 
fine due to the increasingly higher percentage of cold data for which no 
interference with client IO will happen.

Hope that makes sense and there is a way beyond bumping osd_max_scrubs to 
increase the number of scheduled and executed deep scrubs.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
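
As a starting point for this kind of tuning, a small sketch of how the
relevant knobs and the current deep-scrub activity can be inspected (the
option names exist upstream; which values are right for a given cluster is
not something this sketch can answer):

ceph config get osd osd_max_scrubs
ceph config get osd osd_deep_scrub_interval
ceph config get osd osd_scrub_max_interval
# how many PGs are deep scrubbing right now
ceph pg dump pgs_brief 2>/dev/null | grep -c 'scrubbing+deep'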


From: Dan van der Ster 
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Frank,

What is your current osd_max_scrubs, and why don't you want to increase it?
With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11
OSDs, so at a minimum it could take 3-4x the amount of time to scrub
the data than if those were replicated pools.
If you want the scrub to complete in time, you need to increase the
amount of scrub slots accordingly.

On the other hand, IMHO the 1-week deadline for deep scrubs is often
much too ambitious for large clusters -- increasing the scrub
intervals is one solution, or I find it simpler to increase
mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio
until you find a ratio 

[ceph-users] Re: Space reclaim doesn't happening in nautilus RBD pool

2023-12-05 Thread Szabo, Istvan (Agoda)
Hi,

Seems like the sparsify and manual fstrim are doing what they need to do.
When sparsifying the image, if the image has snapshots (let's say 3 snapshots),
you need to wait until it rotates all of them (removes them and creates a new
set instead).
I think it reclaims some of it too, but I guess that depends on the free space
on that filesystem/volume.
Those two commands together really reclaim space once the snapshots are rotated.

This article is interesting:
https://www.ibm.com/docs/en/storage-fusion/2.6?topic=resources-reclaiming-space-target-volumes
It does both sparsify/fstrim to achieve the reclaim.
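
A minimal sketch of that sequence, assuming the image is mapped via krbd and
mounted at /mnt/im (the device and mount point are assumptions):

rbd snap ls --all im/root   # confirm no snapshots (including trashed ones) remain
rbd sparsify im/root        # deallocate fully zeroed object extents
fstrim -v /mnt/im           # let the filesystem report unused blocks down to RBD
rbd du im/root              # usage should shrink once the old clones are trimmed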


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---




From: Ilya Dryomov 
Sent: Monday, December 4, 2023 6:10 PM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: Re: [ceph-users] Space reclaim doesn't happening in nautilus RBD pool

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi Istvan,

The number of objects in "im" pool (918.34k) doesn't line up with
"rbd du" output which says that only 2.2T are provisioned (that would
take roughly ~576k objects).  This usually occurs when there are object
clones caused by previous snapshots -- keep in mind that trimming
object clones after a snapshot is removed is an asynchronous process
and it can take a while.

Just to confirm, what is the output of "rbd info im/root",
"rbd snap ls --all im/root", "ceph df" (please recapture) and
"ceph osd pool ls detail" (only "im" pool is of interest)?

Thanks,

Ilya

On Fri, Dec 1, 2023 at 5:31 AM Szabo, Istvan (Agoda)
 wrote:
>
> Thrash empty.
>
> Istvan Szabo
> Staff Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
>
>
> ________
> From: Ilya Dryomov 
> Sent: Thursday, November 30, 2023 6:27 PM
> To: Szabo, Istvan (Agoda) 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Space reclaim doesn't happening in nautilus RBD pool
>
> Email received from the internet. If in doubt, don't click any link nor open 
> any attachment !
> 
>
> On Thu, Nov 30, 2023 at 8:25 AM Szabo, Istvan (Agoda)
>  wrote:
> >
> > Hi,
> >
> > Is there any config on Ceph that block/not perform space reclaim?
> > I test on one pool which has only one image 1.8 TiB in used.
> >
> >
> > rbd $p du im/root
> > warning: fast-diff map is not enabled for root. operation may be slow.
>> NAME  PROVISIONED  USED
> > root 2.2 TiB 1.8 TiB
> >
> >
> >
> > I already removed all snaphots and now pool has only one image alone.
> > I run both fstrim  over the filesystem (XFS) and try rbd sparsify im/root  
> > (don't know what it is exactly but it mentions to reclaim something)
> > It still shows the pool used 6.9 TiB which totally not make sense right? It 
> > should be up to 3.6 (1.8 * 2) according to its replica?
>
> Hi Istvan,
>
> Have you checked RBD trash?
>
> $ rbd trash ls -p im
>
> Thanks,
>
> Ilya
>
> 
>
> This message is confidential and is for the sole use of the intended 
> recipient(s). It may also be privileged or otherwise protected by copyright 
> or other legal rules. If you have received it by mistake please let us know 
> by reply email and delete it from your system. It is prohibited to copy this 
> message or disclose its content to anyone. Any confidentiality or privilege 
> is not waived or lost by any mistaken delivery or unauthorized disclosure of 
> the message. All messages sent to and from Agoda may be monitored to ensure 
> compliance with company policies, to protect the company's interests and to 
> remove potential malware. Electronic messages may be intercepted, amended, 
> lost or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to identify the index pool real usage?

2023-12-04 Thread Szabo, Istvan (Agoda)
Shouldn't these values be true to be able to do trimming?


"bdev_async_discard": "false",
"bdev_enable_discard": "false",



Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---




From: David C. 
Sent: Monday, December 4, 2023 3:44 PM
To: Szabo, Istvan (Agoda) 
Cc: Anthony D'Atri ; Ceph Users 
Subject: Re: [ceph-users] How to identify the index pool real usage?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !

Hi,

A flash system needs free space to work efficiently.

Hence my hypothesis that fully allocated disks need to be notified of free 
blocks (trim)


Cordialement,

David CASIER
____




Le lun. 4 déc. 2023 à 06:01, Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> a écrit :
With the nodes that have some free space in that namespace, we don't have the
issue, only with this one, which is weird.

From: Anthony D'Atri mailto:anthony.da...@gmail.com>>
Sent: Friday, December 1, 2023 10:53 PM
To: David C. mailto:david.cas...@aevoo.fr>>
Cc: Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>; Ceph Users 
mailto:ceph-users@ceph.io>>
Subject: Re: [ceph-users] How to identify the index pool real usage?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


>>
>> Today we had a big issue with slow ops on the nvme drives which holding
>> the index pool.
>>
>> Why the nvme shows full if on ceph is barely utilized? Which one I should
>> belive?
>>
>> When I check the ceph osd df it shows 10% usage of the osds (1x 2TB nvme
>> drive has 4x osds on it):

Why split each device into 4 very small OSDs?  You're losing a lot of capacity 
to overhead.

>>
>> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META  
>> AVAIL%USE   VAR   PGS  STATUS
>> 195   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   656 MiB  
>> 400 GiB  10.47  0.21   64  up
>> 252   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   845 MiB  
>> 401 GiB  10.35  0.21   64  up
>> 253   nvme  0.43660   1.0  447 GiB   46 GiB  229 MiB   45 GiB   662 MiB  
>> 401 GiB  10.26  0.21   66  up
>> 254   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.3 GiB  
>> 401 GiB  10.26  0.21   65  up
>> 255   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   1.2 GiB  
>> 400 GiB  10.58  0.21   64  up
>> 288   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.2 GiB  
>> 401 GiB  10.25  0.21   64  up
>> 289   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   641 MiB  
>> 401 GiB  10.33  0.21   64  up
>> 290   nvme  0.43660   1.0  447 GiB   45 GiB  229 MiB   44 GiB   668 MiB  
>> 402 GiB  10.14  0.21   65  up
>>
>> However in nvme list it says full:
>> Node SN   ModelNamespace Usage   
>>Format   FW Rev
>>   
>> --- -
>> --  

>> /dev/nvme0n1 90D0A00XTXTR KCD6XLUL1T92 1   1.92  TB 
>> /   1.92  TB512   B +  0 B   GPK6
>> /dev/nvme1n1 60P0A003TXTR KCD6XLUL1T92 1   1.92  TB 
>> /   1.92  TB512   B +  0 B   GPK6

That command isn't telling you what you think it is.  It has no awareness of 
actual data, it's looking at NVMe namespaces.

>>
>> With some other node the test was like:
>>
>>  *   if none of the disk full, no slow ops.
>>  *   If 1x disk full and the other not, has slow ops but not too much
>>  *   if none of the disk full, no slow ops.
>>
>> The full disks are very highly utilized during recovery and they are
>> holding back the operations from the other nvmes.
>>
>> What's the reason that even if the pgs are the same in the cluster +/-1
>> regarding space they are not equally utilized.
>>
>> Thank you
>>
>>
>>
>> 
>> This message is confidential and is for the sole use of the intended
>> recipient(s). It may also be privileged or otherwise protected by copyright
>> or other legal rules. If you have received it by mistake please 

[ceph-users] Re: How to identify the index pool real usage?

2023-12-03 Thread Szabo, Istvan (Agoda)
With the nodes that have some free space in that namespace, we don't have the
issue, only with this one, which is weird.

From: Anthony D'Atri 
Sent: Friday, December 1, 2023 10:53 PM
To: David C. 
Cc: Szabo, Istvan (Agoda) ; Ceph Users 

Subject: Re: [ceph-users] How to identify the index pool real usage?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


>>
>> Today we had a big issue with slow ops on the nvme drives which holding
>> the index pool.
>>
>> Why the nvme shows full if on ceph is barely utilized? Which one I should
>> belive?
>>
>> When I check the ceph osd df it shows 10% usage of the osds (1x 2TB nvme
>> drive has 4x osds on it):

Why split each device into 4 very small OSDs?  You're losing a lot of capacity 
to overhead.

>>
>> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META  
>> AVAIL%USE   VAR   PGS  STATUS
>> 195   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   656 MiB  
>> 400 GiB  10.47  0.21   64  up
>> 252   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   845 MiB  
>> 401 GiB  10.35  0.21   64  up
>> 253   nvme  0.43660   1.0  447 GiB   46 GiB  229 MiB   45 GiB   662 MiB  
>> 401 GiB  10.26  0.21   66  up
>> 254   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.3 GiB  
>> 401 GiB  10.26  0.21   65  up
>> 255   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   1.2 GiB  
>> 400 GiB  10.58  0.21   64  up
>> 288   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.2 GiB  
>> 401 GiB  10.25  0.21   64  up
>> 289   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   641 MiB  
>> 401 GiB  10.33  0.21   64  up
>> 290   nvme  0.43660   1.0  447 GiB   45 GiB  229 MiB   44 GiB   668 MiB  
>> 402 GiB  10.14  0.21   65  up
>>
>> However in nvme list it says full:
>> Node SN   ModelNamespace Usage   
>>Format   FW Rev
>>   
>> --- -
>> --  

>> /dev/nvme0n1 90D0A00XTXTR KCD6XLUL1T92 1   1.92  TB 
>> /   1.92  TB512   B +  0 B   GPK6
>> /dev/nvme1n1 60P0A003TXTR KCD6XLUL1T92 1   1.92  TB 
>> /   1.92  TB512   B +  0 B   GPK6

That command isn't telling you what you think it is.  It has no awareness of 
actual data, it's looking at NVMe namespaces.

>>
>> With some other node the test was like:
>>
>>  *   if none of the disk full, no slow ops.
>>  *   If 1x disk full and the other not, has slow ops but not too much
>>  *   if none of the disk full, no slow ops.
>>
>> The full disks are very highly utilized during recovery and they are
>> holding back the operations from the other nvmes.
>>
>> What's the reason that even if the pgs are the same in the cluster +/-1
>> regarding space they are not equally utilized.
>>
>> Thank you
>>
>>
>>
>> 
>> This message is confidential and is for the sole use of the intended
>> recipient(s). It may also be privileged or otherwise protected by copyright
>> or other legal rules. If you have received it by mistake please let us know
>> by reply email and delete it from your system. It is prohibited to copy
>> this message or disclose its content to anyone. Any confidentiality or
>> privilege is not waived or lost by any mistaken delivery or unauthorized
>> disclosure of the message. All messages sent to and from Agoda may be
>> monitored to ensure compliance with company policies, to protect the
>> company's interests and to remove potential malware. Electronic messages
>> may be intercepted, amended, lost or deleted, or contain viruses.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyo

[ceph-users] How to identify the index pool real usage?

2023-12-01 Thread Szabo, Istvan (Agoda)
Hi,

Today we had a big issue with slow ops on the nvme drives which hold the
index pool.

Why does the nvme show full when on ceph it is barely utilized? Which one
should I believe?

When I check the ceph osd df it shows 10% usage of the osds (1x 2TB nvme drive 
has 4x osds on it):

ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL    %USE   VAR   PGS  STATUS
195  nvme   0.43660   1.0      447 GiB   47 GiB  161 MiB  46 GiB  656 MiB  400 GiB  10.47  0.21   64  up
252  nvme   0.43660   1.0      447 GiB   46 GiB  161 MiB  45 GiB  845 MiB  401 GiB  10.35  0.21   64  up
253  nvme   0.43660   1.0      447 GiB   46 GiB  229 MiB  45 GiB  662 MiB  401 GiB  10.26  0.21   66  up
254  nvme   0.43660   1.0      447 GiB   46 GiB  161 MiB  44 GiB  1.3 GiB  401 GiB  10.26  0.21   65  up
255  nvme   0.43660   1.0      447 GiB   47 GiB  161 MiB  46 GiB  1.2 GiB  400 GiB  10.58  0.21   64  up
288  nvme   0.43660   1.0      447 GiB   46 GiB  161 MiB  44 GiB  1.2 GiB  401 GiB  10.25  0.21   64  up
289  nvme   0.43660   1.0      447 GiB   46 GiB  161 MiB  45 GiB  641 MiB  401 GiB  10.33  0.21   64  up
290  nvme   0.43660   1.0      447 GiB   45 GiB  229 MiB  44 GiB  668 MiB  402 GiB  10.14  0.21   65  up

However in nvme list it says full:
Node          SN            Model         Namespace  Usage              Format       FW Rev
------------  ------------  ------------  ---------  -----------------  -----------  ------
/dev/nvme0n1  90D0A00XTXTR  KCD6XLUL1T92  1          1.92 TB / 1.92 TB  512 B + 0 B  GPK6
/dev/nvme1n1  60P0A003TXTR  KCD6XLUL1T92  1          1.92 TB / 1.92 TB  512 B + 0 B  GPK6

With some other node the test was like:

  *   if none of the disk full, no slow ops.
  *   If 1x disk full and the other not, has slow ops but not too much
  *   if none of the disk full, no slow ops.

The full disks are very highly utilized during recovery and they are holding 
back the operations from the other nvmes.

What's the reason that, even though the PG counts are the same across the
cluster (+/-1), the OSDs are not equally utilized regarding space?

Thank you




This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Space reclaim doesn't happening in nautilus RBD pool

2023-11-30 Thread Szabo, Istvan (Agoda)
Thrash empty.


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---




From: Ilya Dryomov 
Sent: Thursday, November 30, 2023 6:27 PM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: Re: [ceph-users] Space reclaim doesn't happening in nautilus RBD pool

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


On Thu, Nov 30, 2023 at 8:25 AM Szabo, Istvan (Agoda)
 wrote:
>
> Hi,
>
> Is there any config on Ceph that block/not perform space reclaim?
> I test on one pool which has only one image 1.8 TiB in used.
>
>
> rbd $p du im/root
> warning: fast-diff map is not enabled for root. operation may be slow.
> NAME  PROVISIONED  USED
> root 2.2 TiB 1.8 TiB
>
>
>
> I already removed all snaphots and now pool has only one image alone.
> I run both fstrim  over the filesystem (XFS) and try rbd sparsify im/root  
> (don't know what it is exactly but it mentions to reclaim something)
> It still shows the pool used 6.9 TiB which totally not make sense right? It 
> should be up to 3.6 (1.8 * 2) according to its replica?

Hi Istvan,

Have you checked RBD trash?

$ rbd trash ls -p im

Thanks,

Ilya


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Space reclaim doesn't happening in nautilus RBD pool

2023-11-29 Thread Szabo, Istvan (Agoda)
Hi,

Is there any config on Ceph that blocks or prevents space reclaim?
I tested on one pool which has only one image, 1.8 TiB in use.


rbd $p du im/root
warning: fast-diff map is not enabled for root. operation may be slow.
NAME  PROVISIONED  USED
root 2.2 TiB 1.8 TiB
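
As an aside, the fast-diff warning above can be avoided by enabling the
object-map/fast-diff features on the image; a sketch, to be applied with care
on an in-use image:

rbd feature enable im/root object-map fast-diff   # assumes exclusive-lock is already enabled (the default)
rbd object-map rebuild im/root                    # build the map so rbd du no longer scans every object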



I already removed all snapshots and now the pool has only one image left.
I ran both fstrim over the filesystem (XFS) and tried rbd sparsify im/root
(I don't know exactly what it does, but it mentions reclaiming something).
It still shows the pool using 6.9 TiB, which totally doesn't make sense, right?
It should be at most 3.6 TiB (1.8 * 2) according to its replica count.



POOLS:
POOL  ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
im    19   32  3.5 TiB  918.34k  6.9 TiB   4.80     69 TiB            N/A       10 TiB  918.34k         0 B          0 B



I think some of the other pools have this issue too; we clean up a lot but the
space does not seem to be reclaimed.
I estimate more than 50 TiB should be reclaimable; the actual usage of this
cluster is much less than the currently reported number.

Thank you for your help.


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Experience with deduplication

2023-11-27 Thread Szabo, Istvan (Agoda)
Hi Developers,

What is the status of deduplication for objectstore? I see it only under the
dev area since octopus, even with the latest release.
https://docs.ceph.com/en/octopus/dev/deduplication/

Is it something that can be used in production?

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Previously synced bucket resharded after sync removed

2023-11-20 Thread Szabo, Istvan (Agoda)
Hi,

I had a multisite bucket which I removed from sync completely and then
resharded on the master zone, which was successful.

On the 2nd site I can't list anything inside that bucket anymore, which was
expected and is okay; the issue is how can I delete the data?
There was 50TB of data there which I'd like to clean up, but at the moment
everything shows 0: if I check the user space usage or the bucket space usage,
it's all 0. However I'm sure the data is still there, because the used space
in my cluster is still the same as before.

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw bucket usage metrics gone after created in a loop 64K buckets

2023-09-18 Thread Szabo, Istvan (Agoda)
I think this is related to my radosgw-exporter, not related to ceph, I'll 
report it in git, sorry for the noise.



From: Szabo, Istvan (Agoda) 
Sent: Monday, September 18, 2023 1:58 PM
To: Ceph Users 
Subject: [ceph-users] radosgw bucket usage metrics gone after created in a loop 
64K buckets

Hi,

Last week we created 64K buckets for a user to be able to properly shard
their huge amount of objects, and I can see that the "radosgw_usage_bucket"
metrics disappeared from 10pm that day, when the mass creation happened, in
our octopus 15.2.17 cluster.

In the logs I don't really see anything useful.

Is there any limitation that I might have hit?

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw bucket usage metrics gone after created in a loop 64K buckets

2023-09-18 Thread Szabo, Istvan (Agoda)
Hi,

Last week we created 64K buckets for a user to be able to properly shard
their huge amount of objects, and I can see that the "radosgw_usage_bucket"
metrics disappeared from 10pm that day, when the mass creation happened, in
our octopus 15.2.17 cluster.

In the logs I don't really see anything useful.

Is there any limitation that I might have hit?

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it safe to add different OS but same ceph version to the existing cluster?

2023-09-04 Thread Szabo, Istvan (Agoda)
Hi,

I've added ubuntu 20.04 nodes to my bare-metal-deployed ceph octopus 15.2.17
cluster next to the centos 8 nodes, and I see something interesting regarding
the disk usage: it is higher on ubuntu than on centos, while the cpu usage is
lower (in this picture you can see 4 nodes, each column is 1 node, the last
column is the ubuntu one):
https://i.ibb.co/Tk5Srk6/image-2023-09-04-09-55-52-311.png

Could this be because of the missing HPC tuned profile on ubuntu 20.04?
On ubuntu 20.04 there isn't any HPC tuned profile, so I've used the
latency-performance one, which is the base of the HPC profile:

This is the latency performance:

[main]
summary=Optimize for deterministic performance at the cost of increased power 
consumption
[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100
[sysctl]
kernel.sched_min_granularity_ns=1000
vm.dirty_ratio=10
vm.dirty_background_ratio=3
vm.swappiness=10
kernel.sched_migration_cost_ns=500

The hpc tuned profile has these additional values on centos and on ubuntu 22.04:
[main]
summary=Optimize for HPC compute workloads
description=Configures virtual memory, CPU governors, and network settings for 
HPC compute workloads.
include=latency-performance
[vm]
transparent_hugepages=always
[disk]
readahead=>4096
[sysctl]
vm.hugepages_treat_as_movable=0
vm.min_free_kbytes=135168
vm.zone_reclaim_mode=1
kernel.numa_balancing=0
net.core.busy_read=50
net.core.busy_poll=50
net.ipv4.tcp_fastopen=3
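
If the missing profile turns out to matter, one option is to recreate it on
ubuntu 20.04 as a custom tuned profile; a sketch (the profile name is an
assumption, the contents are the block listed above saved as tuned.conf):

mkdir -p /etc/tuned/hpc-compute
# put the [main]/[vm]/[disk]/[sysctl] block above into this file
vi /etc/tuned/hpc-compute/tuned.conf
tuned-adm profile hpc-compute
tuned-adm active    # verify the profile is applied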

If someone is very good with these kernel parameter values: do you see
something that might be related to the high disk utilization?

Thank you




From: Milind Changire 
Sent: Monday, August 7, 2023 11:38 PM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: Re: [ceph-users] Is it safe to add different OS but same ceph version 
to the existing cluster?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


On Mon, Aug 7, 2023 at 8:23 AM Szabo, Istvan (Agoda)
 wrote:
>
> Hi,
>
> I have an octopus cluster on the latest octopus version with mgr/mon/rgw/osds 
> on centos 8.
> Is it safe to add an ubuntu osd host with the same octopus version?
>
> Thank you

Well, the ceph source bits surely remain the same. The binary bits
could be different due to better compiler support on the newer OS
version.
So assuming the new ceph is deployed on the same hardware platform
things should be stable.
Also, assuming that relevant OS tunables and ceph features and config
options have been configured to match the older deployment, the new
ceph deployment should just work fine and as expected.
Saying all this, I'd still recommend to test out the move one node at
a time rather than executing a bulk move.
Making a list of types of devices and checking driver support on the
new OS would also be a prudent thing to do.



This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is there any way to fine tune peering/pg relocation/rebalance?

2023-08-30 Thread Szabo, Istvan (Agoda)
Seems like it was tested on nautilus, but I still see commits from last month
so I guess it is good with octopus.


From: Matt Vandermeulen 
Sent: Wednesday, August 30, 2023 12:44 AM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: Re: [ceph-users] Is there any way to fine tune peering/pg 
relocation/rebalance?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


We have had success using pgremapper[1] for this sort of thing, in both
index and data augments.

1. Set nobackfill, norebalance
2. Add OSDs
3. pgremapper cancel-backfill
4. Unset flags
5. Slowly loop `pgremapper undo-upmaps` at our desired rate, or allow
the balancer to do this work
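
Roughly, that sequence looks like this on the command line (pgremapper's exact
flags and the pacing of the undo step are assumptions, check the tool's help
before running it):

ceph osd set nobackfill
ceph osd set norebalance
# ... add the new OSDs / change the CRUSH map here ...
pgremapper cancel-backfill     # pin PGs to their current OSDs via upmap entries
ceph osd unset nobackfill
ceph osd unset norebalance
# then remove the temporary upmaps slowly (repeat at the desired rate),
# or leave them for the balancer to clean up
pgremapper undo-upmaps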

There's still going to be the hit from peering, especially when you
change the CRUSH map (before even bringing the OSDs in), but in general
peering has been very quick for us for the last few releases, and s3 has
enough overhead where it's not overly noticed.  We haven't done multiple
OSDs per device, however, and have plenty CPU power.

[1] https://github.com/digitalocean/pgremapper



On 2023-08-29 17:51, Szabo, Istvan (Agoda) wrote:
> Hello,
>
> Is there a way to somehow fine tune the rebalance even further than
> basic tuning steps when adding new osds?
> Today I've added some osd to the index pool and it generated many slow
> ops due to OSD op latency increase + read operation latency increase =
> high put get latency.
>
> https://ibb.co/album/9mN6GQ
>
> osd max backfill, max recovery, recovery ops priority are 1.
> 1 nvme drive has 4 osd, each osd has around 80pg.
>
> The steps how I add the osds:
>
>   1.  Set norebalance
>   2.  add the osds
>   3.  wait for peering
>   4.  unset rebalance
>
> It takes like 15-20 mins to became normal without interrupting the
> rebalance the user traffic.
>
> Thank you,
> Istvan
>
> 
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by
> copyright or other legal rules. If you have received it by mistake
> please let us know by reply email and delete it from your system. It is
> prohibited to copy this message or disclose its content to anyone. Any
> confidentiality or privilege is not waived or lost by any mistaken
> delivery or unauthorized disclosure of the message. All messages sent
> to and from Agoda may be monitored to ensure compliance with company
> policies, to protect the company's interests and to remove potential
> malware. Electronic messages may be intercepted, amended, lost or
> deleted, or contain viruses.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is there any way to fine tune peering/pg relocation/rebalance?

2023-08-30 Thread Szabo, Istvan (Agoda)
I'm using upmap with max deviation 1, maybe it is too aggressive?
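
For reference, a less aggressive deviation can be set on the balancer module
like this (5 is just an example value, not a recommendation):

ceph config set mgr mgr/balancer/upmap_max_deviation 5
ceph balancer status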


From: Louis Koo 
Sent: Wednesday, August 30, 2023 4:17 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Is there any way to fine tune peering/pg 
relocation/rebalance?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


The "osdmaptool"  can be used.

like this:
$ ceph osd getmap -o om
$ osdmaptool om --upmap out.txt --upmap-pool xxx --upmap-deviation 5 
--upmap-max 10
$ cat out.txt
$ source out.txt
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Is there any way to fine tune peering/pg relocation/rebalance?

2023-08-29 Thread Szabo, Istvan (Agoda)
Hello,

Is there a way to somehow fine tune the rebalance even further than basic 
tuning steps when adding new osds?
Today I've added some osds to the index pool and it generated many slow ops
due to OSD op latency increase + read operation latency increase = high
PUT/GET latency.

https://ibb.co/album/9mN6GQ

osd max backfill, max recovery, recovery ops priority are 1.
1 nvme drive has 4 osd, each osd has around 80pg.

The steps how I add the osds:

  1.  Set norebalance
  2.  add the osds
  3.  wait for peering
  4.  unset norebalance

It takes like 15-20 mins to become normal without the rebalance interrupting
the user traffic.

Thank you,
Istvan


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 64k buckets for 1 user

2023-08-07 Thread Szabo, Istvan (Agoda)
Hi,

We are in a transition where I'd like to ask my user, who stores 2B objects in
1 bucket, to split it up in some way.
Thinking about the future, we identified that to make it future proof and not
store a huge amount of objects in 1 bucket, we would need to create 65xxx buckets.

Is anybody aware of any issue with this number of buckets, please?
I guess it is better to split into multiple buckets rather than have one
gigantic bucket.

Thank you for the advice


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Is it safe to add different OS but same ceph version to the existing cluster?

2023-08-06 Thread Szabo, Istvan (Agoda)
Hi,

I have an octopus cluster on the latest octopus version with mgr/mon/rgw/osds 
on centos 8.
Is it safe to add an ubuntu osd host with the same octopus version?

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multisite sync - zone permission denied

2023-07-14 Thread Szabo, Istvan (Agoda)
Hi,

Have you had the issue where zones get permission denied?

failed to retrieve sync info: (13) Permission denied

It's a newly added zone and it uses the same sync user and credentials, but it
shows permission denied and I don't see any reason behind it.

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW dynamic resharding blocks write ops

2023-07-07 Thread Szabo, Istvan (Agoda)
I do manual resharding if needed, but try to pre-shard in advance.

I try to deal with the user and ask them before onboarding them whether they
need a bucket with more than a million objects (the default 11 shards) or
whether that is enough.
If they need more I pre-shard (to a prime-numbered shard count), if not then
we stay with the default 11.
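
A sketch of both paths (bucket name and shard counts are only examples; prime
shard counts such as 101 are a common choice):

# raise the default shard count used for newly created buckets
# (the config section name depends on how the RGW daemons are named)
ceph config set client.rgw rgw_override_bucket_index_max_shards 101
# or reshard an existing bucket manually during a quiet window
radosgw-admin bucket reshard --bucket=big-bucket --num-shards=101
radosgw-admin reshard status --bucket=big-bucket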

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2023. Jul 7., at 17:49, Eugen Block  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Okay, thanks for the comment. But does that mean that you never
reshard or do you manually reshard? Do you experience performance
degradation? Maybe I should also add that they have their index pool
on HDDs (with rocksdb on SSD), not sure how big the impact is during
resharding though.

Zitat von "Szabo, Istvan (Agoda)" :

I turned off :)

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2023. Jul 7., at 17:35, Eugen Block  wrote:

Email received from the internet. If in doubt, don't click any link
nor open any attachment !


Hi *,
last week I successfully upgraded a customer cluster from Nautilus to
Pacific, no real issues, their main use is RGW. A couple of hours
after most of the OSDs were upgraded (the RGWs were not yet) their
application software reported an error, it couldn't write to a bucket.
This error occurred again two days ago; in the RGW logs I found the
relevant messages that resharding was happening at that time. I'm
aware that this is nothing unusual, but I can't find anything helpful
how to prevent this except for deactivating dynamic resharding and
then manually do it during maintenance windows. We don't know yet if
there's really data missing after the bucket access has recovered or
not, that still needs to be investigated. Since Nautilus already had
dynamic resharding enabled, I wonder if they were just lucky until
now, for example resharding happened while no data was being written
to the buckets. Or if resharding just didn't happen until then, I have
no access to the cluster so I don't have any bucket stats available
right now. I found this thread [1] about an approach how to prevent
blocked IO but it's from 2019 and I don't know how far that got.

There are many users/operators on this list who use RGW more than me,
how do you deal with this? Are your clients better prepared for these
events? Any comments are appreciated!

Thanks,
Eugen

[1]
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/NG56XXAM5A4JONT4BGPQAZUTJAYMOSZ2/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended
recipient(s). It may also be privileged or otherwise protected by
copyright or other legal rules. If you have received it by mistake
please let us know by reply email and delete it from your system. It
is prohibited to copy this message or disclose its content to
anyone. Any confidentiality or privilege is not waived or lost by
any mistaken delivery or unauthorized disclosure of the message. All
messages sent to and from Agoda may be monitored to ensure
compliance with company policies, to protect the company's interests
and to remove potential malware. Electronic messages may be
intercepted, amended, lost or deleted, or contain viruses.





This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW dynamic resharding blocks write ops

2023-07-07 Thread Szabo, Istvan (Agoda)
I turned off :)

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. Jul 7., at 17:35, Eugen Block  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi *,
last week I successfully upgraded a customer cluster from Nautilus to
Pacific, no real issues, their main use is RGW. A couple of hours
after most of the OSDs were upgraded (the RGWs were not yet) their
application software reported an error, it couldn't write to a bucket.
This error occurred again two days ago; in the RGW logs I found the
relevant messages that resharding was happening at that time. I'm
aware that this is nothing unusual, but I can't find anything helpful
how to prevent this except for deactivating dynamic resharding and
then manually do it during maintenance windows. We don't know yet if
there's really data missing after the bucket access has recovered or
not, that still needs to be investigated. Since Nautilus already had
dynamic resharding enabled, I wonder if they were just lucky until
now, for example resharding happened while no data was being written
to the buckets. Or if resharding just didn't happen until then, I have
no access to the cluster so I don't have any bucket stats available
right now. I found this thread [1] about an approach how to prevent
blocked IO but it's from 2019 and I don't know how far that got.

There are many users/operators on this list who use RGW more than me,
how do you deal with this? Are your clients better prepared for these
events? Any comments are appreciated!

Thanks,
Eugen

[1]
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/NG56XXAM5A4JONT4BGPQAZUTJAYMOSZ2/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw hang under pressure

2023-06-25 Thread Szabo, Istvan (Agoda)
Hi,

Can you check the read and write latency of your OSDs?
Maybe it hangs because it's waiting for PGs that are under scrub
or busy with something else.
Also, with many small objects, don't rely on the PG autoscaler; it might not tell you to
increase the PG count even when it should.
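
A minimal sketch of how that can be checked (osd.12 is a placeholder id):

# cluster-wide view of per-OSD commit/apply latency
ceph osd perf
# read/write op latency counters of one OSD, taken on its host
ceph daemon osd.12 perf dump | grep -A 3 -E 'op_r_latency|op_w_latency'
# how many PGs are scrubbing right now
ceph pg dump pgs_brief 2>/dev/null | grep -c scrub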

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. Jun 23., at 19:12, Rok Jaklič  wrote:



We are experiencing something similar (slow GET responses) when sending, for example, 1k
delete requests in Ceph v16.2.13.

Rok

On Mon, Jun 12, 2023 at 7:16 PM grin  wrote:

Hello,

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
(stable)

There is a single (test) radosgw serving plenty of test traffic. When
under heavy req/s ("heavy" in a low sense, about 1k rq/s) it pretty
reliably hangs: low traffic threads seem to work (like handling occasional
PUTs) but GETs are completely nonresponsive, all attention seems to be
spent on futexes.

The effect is extremely similar to

https://ceph-users.ceph.narkive.com/I4uFVzH9/radosgw-civetweb-hangs-once-around-850-established-connections
(subject: Radosgw (civetweb) hangs once around)
except this is quincy so it's beast instead of civetweb. The effect is the
same as described there, except the cluster is way smaller (about 20-40
OSDs).

I observed that when I start radosgw -f with debug 20/20 it almost never
hangs, so my guess is some ugly race condition. However I am a bit clueless
how to actually debug it since debugging makes it go away. Debug 1
(default) with -d seems to hang after a while but it's not that simple to
induce, I'm still testing under 4/4.

Also I do not see much to configure about beast.

As to answer the question in the original (2016) thread:
- Debian stable
- no visible limits issue
- no obvious memory leak observed
- no other visible resource shortage
- strace says everyone's waiting on futexes, about 600-800 threads, apart
from the one serving occasional PUTs
- tcp port doesn't respond.

IRC didn't react. ;-)

Thanks,
Peter
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Transmit rate metric based per bucket

2023-06-19 Thread Szabo, Istvan (Agoda)
Hello,

I'd like to know whether there is a way to query some metrics/logs in Octopus (or in a
newer version, I'm interested for the future too) about the bandwidth used
per bucket for PUT/GET operations.
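
If the usage log is enabled on the gateways, radosgw-admin can report per-bucket
bytes sent/received and operation counts; a hedged sketch (user, bucket and dates
are placeholders, and the --bucket filter may not be available on very old releases):

ceph config set client.rgw rgw_enable_usage_log true
radosgw-admin usage show --uid=someuser --bucket=somebucket \
    --start-date=2023-06-01 --end-date=2023-06-19 --show-log-entries=false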

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bottleneck between loadbalancer and rgws

2023-06-14 Thread Szabo, Istvan (Agoda)
I'll try to increase it in my small cluster; let's see whether there is any improvement
there, thank you.

Any reason not to increase it if the node has enough memory?
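
A minimal sketch of how the limit can be inspected and raised at runtime
(client.rgw as the config section is an assumption about how the gateways are named):

ceph config get client.rgw rgw_max_concurrent_requests
ceph config set client.rgw rgw_max_concurrent_requests 2048
# the beast thread pool can be sized alongside it if needed
ceph config set client.rgw rgw_thread_pool_size 1024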

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Kai Stian Olstad 
Sent: Wednesday, June 14, 2023 9:02 PM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: Re: [ceph-users] Bottleneck between loadbalancer and rgws



On Wed, Jun 14, 2023 at 01:44:40PM +, Szabo, Istvan (Agoda) wrote:
>I have a dedicated loadbalancer pairs separated on 2x baremetal servers and 
>behind the haproxy balancers I have 3 mon/mgr/rgw nodes.
>Each rgw node has 2rgw on it so in the cluster altogether 6, (now I just added 
>one more so currently 9).
>
>Today I see pretty high GET latency in the cluster (3-4s) and seems like the 
>limitations are the gateways:
>https://i.ibb.co/ypXFL34/1.png
>In this netstat seems like maxed out the established connections around 2-3k. 
>When I've added one more gateway it increased.
>
>Seems like the gateway node or the gateway instance has some limitation. What 
>is the value which is around 1000,I haven't really found it and affect GET and 
>limit the connections on linux?

It could be rgw_max_concurrent_requests [1], which defaults to 1024.
I read somewhere that it should not be increased, but that it could be raised to
2048.
The recommended action, however, was to add more gateways instead.


[1] 
https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_max_concurrent_requests

--
Kai Stian Olstad


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bottleneck between loadbalancer and rgws

2023-06-14 Thread Szabo, Istvan (Agoda)
Hi,

I have a dedicated loadbalancer pair separated onto 2x bare-metal servers, and
behind the haproxy balancers I have 3 mon/mgr/rgw nodes.
Each RGW node has 2 RGWs on it, so 6 in the cluster altogether (now I just added
one more, so currently 9).

Today I see pretty high GET latency in the cluster (3-4s) and it seems like the
limitation is the gateways:
https://i.ibb.co/ypXFL34/1.png
In this netstat output it seems the established connections max out at around 2-3k.
When I added one more gateway, it increased.

It seems like the gateway node or the gateway instance has some limitation. What
is the value (around 1000) that I haven't really found which affects GETs and
limits the connections on Linux?

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Non cephadm cluster upgrade from octopus to quincy

2023-06-07 Thread Szabo, Istvan (Agoda)
Hi,

I can't find any documentation for this upgrade process. Is there anybody who
has already done it?
Does the normal apt-get upgrade method work?
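
A rough, hedged sketch of how a package-based upgrade usually looks on Ubuntu,
one node at a time in the mon -> mgr -> osd -> mds/rgw order (the repo file path
and the exact package set are assumptions about the setup):

ceph osd set noout
# on each node, switch the repo and upgrade the packages:
sudo sed -i 's/octopus/quincy/' /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get dist-upgrade
sudo systemctl restart ceph-mon.target     # mon nodes first
sudo systemctl restart ceph-mgr.target
sudo systemctl restart ceph-osd.target     # then the OSD nodes
ceph osd require-osd-release quincy        # once every OSD runs quincy
ceph osd unset noout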

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deleting millions of objects

2023-05-17 Thread Szabo, Istvan (Agoda)
If it works I'd be amazed. We have this slow and limited delete issue also.
What we've done is run multiple deletes on the same bucket from multiple servers
via s3cmd.
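
A hedged sketch of that pattern (bucket and prefixes are placeholders); each shell
or server works on a different prefix so the deletes run in parallel:

s3cmd del --recursive --force s3://mybucket/prefix-a/ &
s3cmd del --recursive --force s3://mybucket/prefix-b/ &
wait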

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. May 17., at 20:14, Joachim Kraftmayer - ceph ambassador 
 wrote:



Hi Rok,

try this:


rgw_delete_multi_obj_max_num - Max number of objects in a single
multi-object delete request
 (int, advanced)
 Default: 1000
 Can update at runtime: true
 Services: [rgw]


ceph config set <who> <key> <value>


WHO: client.<rgw instance name> or client.rgw

KEY: rgw_delete_multi_obj_max_num

VALUE: 1

Regards, Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

Am 17.05.23 um 14:24 schrieb Rok Jaklič:
thx.

I tried with:
ceph config set mon rgw_delete_multi_obj_max_num 1
ceph config set client rgw_delete_multi_obj_max_num 1
ceph config set global rgw_delete_multi_obj_max_num 1

but still only 1000 objects get deleted.

Is the target something different?

On Wed, May 17, 2023 at 11:58 AM Robert Hish 
wrote:

I think this is capped at 1000 by the config setting. I've used the aws
and s3cmd clients to delete more than 1000 objects at a time and it
works even with the config setting capped at 1000. But it is a bit slow.

#> ceph config help rgw_delete_multi_obj_max_num

rgw_delete_multi_obj_max_num - Max number of objects in a single multi-
object delete request
  (int, advanced)
  Default: 1000
  Can update at runtime: true
  Services: [rgw]

On Wed, 2023-05-17 at 10:51 +0200, Rok Jaklič wrote:
Hi,

I would like to delete millions of objects in RGW instance with:
mc rm --recursive --force ceph/archive/veeam

but it seems it allows only 1000 (or 1002 exactly) removals per
command.

How can I delete/remove all objects with some prefix?

Kind regards,
Rok
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus on Ubuntu 20.04.6 LTS with kernel 5

2023-05-15 Thread Szabo, Istvan (Agoda)
Hi,

Do Pacific and Quincy still support bare-metal deployed setups?

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Ilya Dryomov 
Sent: Thursday, May 11, 2023 3:39 PM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: Re: [ceph-users] Re: Octopus on Ubuntu 20.04.6 LTS with kernel 5



On Thu, May 11, 2023 at 7:13 AM Szabo, Istvan (Agoda)  
wrote:
>
> I can answer my question, even in the official ubuntu repo they are using by 
> default the octopus version so for sure it works with kernel 5.
>
> https://packages.ubuntu.com/focal/allpackages
>
>
> -Original Message-----
> From: Szabo, Istvan (Agoda) 
> Sent: Thursday, May 11, 2023 11:20 AM
> To: Ceph Users 
> Subject: [ceph-users] Octopus on Ubuntu 20.04.6 LTS with kernel 5
>
> Hi,
>
> In octopus documentation we can see kernel 4 as recommended, however we've 
> changed our test cluster yesterday from centos 7 / 8 to Ubuntu 20.04.6 LTS 
> with kernel 5.4.0-148 and seems working, I just want to make sure before I 
> move to prod there isn't any caveats.

Hi Istvan,

Note that on https://docs.ceph.com/en/octopus/start/os-recommendations/
it starts with:

> If you are using the kernel client to map RBD block devices or mount
> CephFS, the general advice is to use a “stable” or “longterm
> maintenance” kernel series provided by either http://kernel.org or
> your Linux distribution on any client hosts.

The recommendation for 4.x kernels follows that just as a precaution against 
folks opting to stick to something older.  If your distribution provides 5.x or 
6.x stable kernels, by all means use them!

A word of caution though: Octopus was EOLed last year.  Please consider 
upgrading your cluster to a supported release -- preferably Quincy since 
Pacific is scheduled to go EOL sometime this year too.

Thanks,

Ilya


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus on Ubuntu 20.04.6 LTS with kernel 5

2023-05-10 Thread Szabo, Istvan (Agoda)
I can answer my own question: even the official Ubuntu repo uses the
Octopus version by default, so it definitely works with kernel 5.

https://packages.ubuntu.com/focal/allpackages


-Original Message-
From: Szabo, Istvan (Agoda)  
Sent: Thursday, May 11, 2023 11:20 AM
To: Ceph Users 
Subject: [ceph-users] Octopus on Ubuntu 20.04.6 LTS with kernel 5

Hi,

In octopus documentation we can see kernel 4 as recommended, however we've 
changed our test cluster yesterday from centos 7 / 8 to Ubuntu 20.04.6 LTS with 
kernel 5.4.0-148 and seems working, I just want to make sure before I move to 
prod there isn't any caveats.

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Octopus on Ubuntu 20.04.6 LTS with kernel 5

2023-05-10 Thread Szabo, Istvan (Agoda)
Hi,

In the Octopus documentation kernel 4 is listed as recommended; however, we
changed our test cluster yesterday from CentOS 7/8 to Ubuntu 20.04.6 LTS with
kernel 5.4.0-148 and it seems to be working. I just want to make sure there
aren't any caveats before I move to prod.

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Os changed to Ubuntu, device class not shown

2023-05-08 Thread Szabo, Istvan (Agoda)
Hi,

We have an Octopus cluster that we want to move from CentOS to Ubuntu. After
activating all the OSDs, the device class is not shown in ceph osd tree.
However, ceph-volume list shows the crush device class :/

Should I just add it back, or is there a better way?
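
A minimal sketch of adding the class back by hand, assuming ceph-volume reports
the right one (the OSD id and class below are placeholders):

ceph osd crush rm-device-class osd.7      # only needed if a wrong class is already set
ceph osd crush set-device-class ssd osd.7
ceph osd tree | grep -w osd.7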



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bucket notification

2023-04-27 Thread Szabo, Istvan (Agoda)
Hi,

I think the sasl handshake is the issue:

On gateway:
2023-04-26T15:25:49.341+0700 7f7f04d21700  1 ERROR: failed to create push 
endpoint:  due to: pubsub endpoint configuration error: unknown schema in:
2023-04-26T15:25:49.341+0700 7f7f04d21700  5 req 245249540 0.00365s 
s3:delete_obj WARNING: publishing notification failed, with error: -22
2023-04-26T15:25:49.341+0700 7f7f04d21700  2 req 245249540 0.00365s 
s3:delete_obj completing
2023-04-26T15:25:49.341+0700 7f7f04d21700  2 req 245249540 0.00365s 
s3:delete_obj op status=0
2023-04-26T15:25:49.341+0700 7f7f04d21700  2 req 245249540 0.00365s 
s3:delete_obj http status=204

2023-04-26T15:44:31.978+0700 7f7fc3e9f700 20 notification: 'bulknotif' on 
topic: 'bulk-upload-tool-ceph-notifications' and bucket: 
'connectivity-bulk-upload-file-bucket' (unique topic: 
'bulknotif_bulk-upload-tool-ceph-notifications') apply to event of type: 
's3:ObjectCreated:Put'
2023-04-26T15:44:31.978+0700 7f7fc3e9f700  1 ERROR: failed to create push 
endpoint:  due to: pubsub endpoint configuration error: unknown schema in:
2023-04-26T15:44:31.978+0700 7f7fc3e9f700  5 req 245262095 0.00456s 
s3:put_obj WARNING: publishing notification failed, with error: -22
2023-04-26T15:44:31.978+0700 7f7fc3e9f700  2 req 245262095 0.00456s 
s3:put_obj completing

In kafka we see this:
[2023-04-26 15:45:20,921] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] 
Failed authentication with /xx.93.1 (Unexpected Kafka request of type METADATA 
during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2023-04-26 15:45:21,921] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] 
Failed authentication with /xx.93.1 (Unexpected Kafka request of type METADATA 
during SASL handshake.) (org.apache.kafka.common.network.Selector)
[2023-04-26 15:45:22,921] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] 
Failed authentication with /xx.93.1 (Unexpected Kafka request of type METADATA 
during SASL handshake.) (org.apache.kafka.common.network.Selector)
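
The empty string after "unknown schema in:" suggests the gateway ended up with an
empty or unparseable push-endpoint for that topic, so it may be worth checking
what was actually stored (a hedged sketch, using the topic name from below):

radosgw-admin topic list
radosgw-admin topic get --topic=butcen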


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2023. Apr 27., at 15:08, Yuval Lifshitz  wrote:



Hi Istvan,
Looks like you are using user/password and SSL on the communication channels 
between RGW and the Kafka broker.
Maybe the issue is around the certificate? could you please increase RGW debug 
logs to 20 and see if there are any kafka related errors there?

Yuval

On Tue, Apr 25, 2023 at 5:48 PM Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> wrote:
Hi,

I'm trying to set a kafka endpoint for bucket object create operation 
notifications but the notification is not created in kafka endpoint.
Settings seems to be fine because I can upload to the bucket objects when these 
settings are applied:

<NotificationConfiguration>
    <TopicConfiguration>
        <Id>bulknotif</Id>
        <Topic>arn:aws:sns:default::butcen</Topic>
        <Event>s3:ObjectCreated:*</Event>
        <Event>s3:ObjectRemoved:*</Event>
    </TopicConfiguration>
</NotificationConfiguration>

but it simply not created any message in kafka.

This is my topic creation post request:

https://xxx.local/?
Action=CreateTopic&
Name=butcen&
kafka-ack-level=broker&
use-ssl=true&
push-endpoint=kafka://ceph:pw@xxx.local:9093

Am I missing something or it's definitely kafka issue?

Thank you



___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bucket notification

2023-04-25 Thread Szabo, Istvan (Agoda)
Hi,

I'm trying to set up a Kafka endpoint for bucket object-create notifications,
but the notifications never arrive at the Kafka endpoint.
The settings seem to be fine, because I can upload objects to the bucket when these
settings are applied:

<NotificationConfiguration>
    <TopicConfiguration>
        <Id>bulknotif</Id>
        <Topic>arn:aws:sns:default::butcen</Topic>
        <Event>s3:ObjectCreated:*</Event>
        <Event>s3:ObjectRemoved:*</Event>
    </TopicConfiguration>
</NotificationConfiguration>

but it simply does not create any message in Kafka.

This is my topic creation post request:

https://xxx.local/?
Action=CreateTopic&
Name=butcen&
kafka-ack-level=broker&
use-ssl=true&
push-endpoint=kafka://ceph:pw@xxx.local:9093

Am I missing something, or is it definitely a Kafka issue?

Thank you



This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW access logs with bucket name

2023-03-30 Thread Szabo, Istvan (Agoda)
The HTTP requests in the beast logs contain the full URL beginning with the bucket
name, don't they?
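
For the ops-log route asked about further down, a hedged sketch of dumping
per-request records (which include the bucket name) to a unix socket; the socket
path and the client.rgw config section are assumptions:

ceph config set client.rgw rgw_enable_ops_log true
ceph config set client.rgw rgw_ops_log_socket_path /var/run/ceph/rgw-ops.sock
# read the JSON records on the RGW host
socat - UNIX-CONNECT:/var/run/ceph/rgw-ops.sock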

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. Mar 30., at 17:44, Boris Behrens  wrote:



Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?

Currently I am only able to know the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.

On Tue, 3 Jan 2023 at 10:25, Boris Behrens wrote:

Hi,
I am looking forward to move our logs from
/var/log/ceph/ceph-client...log to our logaggregator.

Is there a way to have the bucket name in the log file?

Or can I write the rgw_enable_ops_log into a file? Maybe I could work with
this.

Cheers and happy new year
Boris



--
The self-help group "UTF-8 problems" will meet this time, as an exception, in the
large hall.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Changing os to ubuntu from centos 8

2023-03-21 Thread Szabo, Istvan (Agoda)
Thank you, I'll take note and give it a try.

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Boris Behrens 
Sent: Tuesday, March 21, 2023 4:29 PM
To: Szabo, Istvan (Agoda) ; Ceph Users 

Cc: dietr...@internet-sicherheit.de; ji...@spets.org
Subject: Re: [ceph-users] Changing os to ubuntu from centos 8


Hi Istvan,

I'm currently making the move from CentOS 7 to Ubuntu 18.04 (we want to jump directly
from Nautilus to Pacific). When everything in the cluster is on the same version,
and that version is available on the new OS, you can just reinstall the hosts
with the new OS.

With the mons, I remove the current mon from the list while reinstalling and
recreate the mon afterwards, so I don't need to carry over any files. With the
OSD hosts I just set the cluster to "noout" and have the system down for 20
minutes, which is about the time I require to install the new OS and provision
all the configs. Afterwards I just start all the OSDs (ceph-volume lvm activate
--all) and wait for the cluster to become green again.
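
A minimal sketch of that per-host flow, under the assumption that /etc/ceph and
the OSD keyrings are restored after the reinstall:

ceph osd set noout
# ... reinstall the OS, install the ceph packages, restore /etc/ceph ...
ceph-volume lvm activate --all     # brings the existing OSDs back up from their LVs
ceph osd unset noout
ceph -s                            # wait for HEALTH_OK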

Cheers
 Boris

On Tue, 21 Mar 2023 at 08:54, Szabo, Istvan (Agoda)
<istvan.sz...@agoda.com> wrote:
Hi,

I'd like to change the os to ubuntu 20.04.5 from my bare metal deployed octopus 
15.2.14 on centos 8. On the first run I would go with octopus 15.2.17 just to 
not make big changes in the cluster.
I've found couple of threads on the mailing list but those were containerized 
(like: Re: Upgrade/migrate host operating system for ceph nodes (CentOS/Rocky) 
or  Re: Migrating CEPH OS looking for suggestions).

Wonder what is the proper steps for this kind of migration? Do we need to start 
with mgr or mon or rgw or osd?
Is it possible to reuse the osd with ceph-volume scan on the reinstalled 
machine?
I'd stay with baremetal deployment and even maybe with octopus but I'm curious 
your advice.

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>


--
The self-help group "UTF-8 problems" will meet this time, as an exception, in the
large hall.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Changing os to ubuntu from centos 8

2023-03-21 Thread Szabo, Istvan (Agoda)
Hi,

I'd like to change the OS to Ubuntu 20.04.5 on my bare-metal deployed Octopus
15.2.14 cluster running on CentOS 8. On the first run I would go with Octopus
15.2.17, just to avoid big changes in the cluster.
I've found a couple of threads on the mailing list, but those were containerized
(like: Re: Upgrade/migrate host operating system for ceph nodes (CentOS/Rocky)
or Re: Migrating CEPH OS looking for suggestions).

I wonder what the proper steps are for this kind of migration. Do we need to start
with mgr, mon, rgw or osd?
Is it possible to reuse the OSDs with ceph-volume scan on the reinstalled
machine?
I'd stay with the bare-metal deployment and maybe even with Octopus, but I'm curious
about your advice.

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Adding osds to each nodes

2023-02-08 Thread Szabo, Istvan (Agoda)
OK, it seems better to add all the disks host by host, waiting for the rebalance
between each of them, thx.
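
A minimal sketch of that host-by-host flow on a bare-metal deployment (device
names are placeholders):

ceph-volume lvm batch --osds-per-device 4 /dev/sdx   # matches the 4-OSDs-per-disk layout
watch ceph -s                                        # wait until backfill finishes
# then move on to the next host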

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2023. Feb 8., at 20:43, Eugen Block  wrote:



Hi,

this is a quite common question and multiple threads exist on this
topic, e.g. [1].

Regards,
Eugen

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36475.html


Zitat von "Szabo, Istvan (Agoda)" :

Hi,

What is the safest way to add disk(s) to each of the node in the cluster?
Should it be done 1 by 1 or can add all of them at once and let it rebalance?

My concern is that if add all in one due to host based EC code it
will block all the host.
The other side if I add 1 by 1,  one node will have more performance
and more osds than the others which is also not a good setup, so
wonder which is the safer way?

(have 9 nodes with host based EC 4:2, 1 disk is going to have 4osds)

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Adding osds to each nodes

2023-02-08 Thread Szabo, Istvan (Agoda)
Hi,

What is the safest way to add disk(s) to each of the nodes in the cluster?
Should it be done one by one, or can I add all of them at once and let it rebalance?

My concern is that if I add them all at once, the host-based EC rule means the
backfill will hit every host.
On the other side, if I add them one by one, one node will have more performance and
more OSDs than the others, which is also not a good setup, so I wonder which is the
safer way.

(I have 9 nodes with host-based EC 4:2; each disk is going to carry 4 OSDs.)

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG increase / data movement fine tuning

2023-02-06 Thread Szabo, Istvan (Agoda)
Hi,

I've increased the placement group count in my Octopus cluster, first on the index
pool, and it gave the users almost 2.5 hours of bad performance. I'm planning to
increase the data pool as well, but first I'd like to know whether there is any way to
make it smoother.

At the moment I have these values:

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1

But it seems like this still generates slow ops.

Should I turn off scrubbing, or is there any other way to make it even smoother?
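
A hedged sketch of what can help: pause scrubs during the change and let the mgr
step pgp_num in smaller increments (pool name and target values are placeholders):

ceph osd set noscrub
ceph osd set nodeep-scrub
# move a smaller fraction of misplaced data at a time (default is 0.05)
ceph config set mgr target_max_misplaced_ratio 0.01
ceph osd pool set data.pool pg_num 4096
# ...when the backfill has finished:
ceph osd unset noscrub
ceph osd unset nodeep-scrub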


Some information about the setup:

  *   I have 9 nodes, each of them has 2x nvme drives with 4osd on those and 
this is where the index pool lives.
  *   Currently has 2048 pg-s for the index pool

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Real memory usage of the osd(s)

2023-01-29 Thread Szabo, Istvan (Agoda)
Hello,

If bluefs_buffered_io is enabled, is there a way to know exactly how much
physical memory each OSD uses?

What I've found is dump_mempools, whose last entries are the following, but
are these bytes the real physical memory usage?

"total": {
"items": 60005205,
"bytes": 995781359

Also, is this value exposed as a metric anywhere? I haven't found one.
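
A hedged sketch for comparing the mempool accounting with what the process really
holds (osd.12 is a placeholder, and jq is assumed to be installed):

ceph daemon osd.12 dump_mempools | jq .mempool.total
ceph daemon osd.12 heap stats             # allocator (tcmalloc) view
ps -eo rss,args | grep '[c]eph-osd'       # plain RSS per OSD process from the OS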

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Szabo, Istvan (Agoda)
How is your pg distribution on your osd devices? Do you have enough assigned 
pgs?

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2023. Jan 27., at 23:30, Victor Rodriguez  wrote:



Ah yes, checked that too. Monitors and OSD's report with ceph config
show-with-defaults that bluefs_buffered_io is set to true as default
setting (it isn't overriden somewere).


On 1/27/23 17:15, Wesley Dillingham wrote:
I hit this issue once on a nautilus cluster and changed the OSD
parameter bluefs_buffered_io = true (was set at false). I believe the
default of this parameter was switched from false to true in release
14.2.20, however, perhaps you could still check what your osds are
configured with in regard to this config item.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez
 wrote:

   Hello,

   Asking for help with an issue. Maybe someone has a clue about what's
   going on.

   Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
   removed
   it. A bit later, nearly half of the PGs of the pool entered
   snaptrim and
   snaptrim_wait state, as expected. The problem is that such operations
   ran extremely slow and client I/O was nearly nothing, so all VMs
   in the
   cluster got stuck as they could not I/O to the storage. Taking and
   removing big snapshots is a normal operation that we do often and
   this
   is the first time I see this issue in any of my clusters.

   Disks are all Samsung PM1733 and network is 25G. It gives us
   plenty of
   performance for the use case and never had an issue with the hardware.

   Both disk I/O and network I/O was very low. Still, client I/O
   seemed to
   get queued forever. Disabling snaptrim (ceph osd set nosnaptrim)
   stops
   any active snaptrim operation and client I/O resumes back to normal.
   Enabling snaptrim again makes client I/O to almost halt again.

   I've been playing with some settings:

   ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
   ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
   ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
   ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'

   None really seemed to help. Also tried restarting OSD services.

   This cluster was upgraded from 14.2.x to 15.2.17 a couple of
   months. Is
   there any setting that must be changed which may cause this problem?

   I have scheduled a maintenance window, what should I look for to
   diagnose this problem?

   Any help is very appreciated. Thanks in advance.

   Victor


   ___
   ceph-users mailing list -- ceph-users@ceph.io
   To unsubscribe send an email to ceph-users-le...@ceph.io

--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Snap trimming best practice

2023-01-11 Thread Szabo, Istvan (Agoda)
Hi,

I wonder whether you have ever faced issues with snaptrimming if you follow the Ceph PG
allocation recommendation (100 PGs/OSD)?

We have a Nautilus cluster and we are afraid to increase the PGs of the pools,
because it seems that even with 4 OSDs per NVMe, the higher the PG number, the
slower the snaptrimming.

E.g.:

We have these pools:
Db1 pool size 64,504G with 512 PGs
Db2 pool size 92,242G with 256 PGs
Db2 snapshots remove faster than Db1.

For this reason our OSDs are very underutilized from a PG point of view;
each OSD holds at most 25 gigantic PGs, which makes all maintenance
very difficult due to backfillfull/OSD-full issues.

Any recommendations if you use this feature?

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] User migration between clusters

2023-01-09 Thread Szabo, Istvan (Agoda)
Hi,

Normally I use rclone to migrate buckets across clusters.
However, this time the user has close to 1000 buckets, so I wonder what would be
the best approach to do this other than bucket by bucket. Any ideas?
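
A hedged sketch of looping over every bucket of the user with rclone ("src" and
"dst" remotes, credentials and the flags are assumptions; bucket names with spaces
would need extra care):

for b in $(rclone lsd src: | awk '{print $NF}'); do
    rclone sync "src:${b}" "dst:${b}" --transfers 16 --s3-upload-concurrency 8
done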

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph mgr rgw module missing in quincy

2022-12-08 Thread Szabo, Istvan (Agoda)
Hi,

When I want to enable this module it is missing:

https://docs.ceph.com/en/quincy/mgr/rgw.html

I looked in the mgr module list, but it's not there.

What is the reason?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multi site alternative

2022-11-23 Thread Szabo, Istvan (Agoda)
Hi,

Due to the lack of documentation and issues with multisite bucket sync, I'm
looking for an alternative solution where I can put some SLA around the sync,
i.e. guarantee that a file will be available within x minutes.

Which solution are you using that works fine with a huge number of objects?

Cloud sync?
Rclone?
…

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitor server move across cages

2022-11-16 Thread Szabo, Istvan (Agoda)
Two of the six RGWs are also located on this host, but I'm not worried about that;
haproxy can balance to the other four. As for the mgr, I'll make sure the one on
this host is the standby.

Thank you

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2022. Nov 16., at 18:39, Janne Johansson  wrote:



Den ons 16 nov. 2022 kl 08:00 skrev Szabo, Istvan (Agoda)
:

Hi,

I have 3 mons in my cluster and I need to move to another cage one of them.
I guess it is not an issue to have one mon down for an hour, is it?

If nothing else changes (ie, no OSDs,rgws,mds's fall off the net), no
new mounts from new clients, you would be able to go without ALL mons
for an hour.

--
May the most significant bit of your life be positive.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Monitor server move across cages

2022-11-15 Thread Szabo, Istvan (Agoda)
Hi,

I have 3 mons in my cluster and I need to move to another cage one of them.
I guess it is not an issue to have one mon down for an hour, is it?

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] What is the reason of the rgw_user_quota_bucket_sync_interval and rgw_bucket_quota_ttl values?

2022-11-04 Thread Szabo, Istvan (Agoda)
Hi,

One of my users told me that they can upload more data to the bucket than the
limit allows. My question is mainly to the developers: what is the reason for the
defaults rgw_bucket_quota_ttl=600 and rgw_user_quota_bucket_sync_interval=180? I don't
want to set them to 0 before I know the reason.
With these settings, if the user has pretty high bandwidth, they can upload
terabytes of data before the 10-minute quota check kicks in.

I set the following values on a specific bucket:

"bucket_quota": {
"enabled": true,
"check_on_raw": false,
"max_size": 524288000,
"max_size_kb": 512000,
"max_objects": 1126400

But they can still upload 600 MB files as well.

I came across this article:
https://bugzilla.redhat.com/show_bug.cgi?id=1417775

It seems that if these values are set to 0:

"name": "rgw_bucket_quota_ttl",
"type": "int",
"level": "advanced",
"desc": "Bucket quota stats cache TTL",
"long_desc": "Length of time for bucket stats to be cached within RGW 
instance.",
"default": 600,

and

"name": "rgw_user_quota_bucket_sync_interval",
"type": "int",
"level": "advanced",
"desc": "User quota bucket sync interval",
"long_desc": "Time period for accumulating modified buckets before syncing 
these stats.",
"default": 180,

then uploads will be terminated as soon as the bucket limit is reached.
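
For reference, a hedged sketch of tightening the quota caching (client.rgw as the
config section is an assumption; depending on the release the RGWs may need a
restart to pick it up):

ceph config set client.rgw rgw_bucket_quota_ttl 0
ceph config set client.rgw rgw_user_quota_bucket_sync_interval 0
ceph config set client.rgw rgw_user_quota_sync_interval 0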

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Strange 50K slow ops incident

2022-11-03 Thread Szabo, Istvan (Agoda)
Are those connected to the same switches?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2022. Nov 3., at 17:34, Frank Schilder  wrote:



Hi all,

I just had a very weird incident on our production cluster. An OSD was 
reporting >50K slow ops. Upon further investigation I observed exceptionally 
high network traffic on 3 out of the 12 hosts in this OSD's pools, one of them 
was the host with the slow ops OSD (ceph-09); see the image here (bytes 
received): https://imgur.com/a/gPQDiq5. The incoming data bandwidth is about 
700MB/s (or a factor of 4) higher than on all other hosts. The strange thing is that this OSD is not part of any 3x-replicated pool. The 2 pools on this OSD are 
8+2 and 8+3 EC pools. Hence, this is neither user nor replication traffic.

It looks like 3 OSDs in that pool decided to have a private meeting and ignore 
everything around them.

My first attempt of recovery was:

ceph osd set norecover
ceph osd set norebalance
ceph osd out 669

And wait. Indeed, PGs peered and user IO bandwidth went up by a factor of 2. In 
addition, the slow ops count started falling. In the image, the execution of 
these commands is visible as the peak at 10:45. After about 3 minutes, the slow 
ops count was 0 and I set the OSD back to in and unset all flags. Nothing 
happened, the cluster just continued operating normally.
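
For completeness, the reverse of the above (setting the OSD back in and unsetting the flags) amounts to:

ceph osd in 669
ceph osd unset norebalance
ceph osd unset norecover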

Does anyone have an explanation for what I observed? It looks a lot like a 
large amount of fake traffic, 3 OSDs just sending packets in circles. During 
recovery, the OSD with 50K slow ops had nearly no disk IO, therefore I do not 
believe that this was actual IO. I rather suspect that it was internal 
communication going bonkers.

Since the impact is quite high it would be nice to have a pointer as to what 
might have happened.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Does Ceph support presigned url (like s3) for uploading?

2022-10-28 Thread Szabo, Istvan (Agoda)
Hi,

I found this old tracker issue, https://tracker.ceph.com/issues/23470, which I guess shows in some way that it is possible, but I haven't really found any documentation in Ceph on how to do it properly.
This is how it works with minio: 
https://min.io/docs/minio/linux/integrations/presigned-put-upload-via-browser.html
I'm looking for this in ceph.
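
For what it's worth, RGW speaks the standard S3 API, so presigned URLs generated by any S3 SDK or CLI pointed at the RGW endpoint should just work; there is nothing RGW-specific to configure. A rough sketch, with endpoint, bucket and object names as examples (note the AWS CLI only presigns GET URLs; a presigned PUT needs an SDK call such as boto3's generate_presigned_url for put_object):

# presigned GET URL via the AWS CLI against an RGW endpoint
aws --endpoint-url https://rgw.example.com:8080 s3 presign s3://mybucket/myobject --expires-in 3600

# once an SDK has produced a presigned PUT URL, the upload itself is a plain HTTP PUT
curl -T ./file.bin "https://rgw.example.com:8080/mybucket/myobject?X-Amz-Algorithm=...&X-Amz-Signature=..."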

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Rgw compression any experience?

2022-10-17 Thread Szabo, Istvan (Agoda)
Hi,

I’m looking into enabling object compression on my existing Ceph Octopus cluster.
Any feedback/experience is appreciated.
I’m also curious whether it can be enabled after the cluster is set up, or whether it needs to be configured from the beginning?
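
From what I have read so far (treat this as an assumption to be confirmed, not a statement of fact): compression is enabled per placement target in the zone, it can be turned on at any time, but it only applies to objects written after the change. Something like:

# enable inline compression on the default placement target (zone/placement/plugin are examples)
radosgw-admin zone placement modify --rgw-zone default --placement-id default-placement --compression zlib
# in a multisite setup, also commit the period; then restart the RGW daemons
radosgw-admin period update --commit
systemctl restart ceph-radosgw.target

# later, compare logical vs stored size per bucket
radosgw-admin bucket stats --bucket mybucket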

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Low space hindering backfill and 2 backfillfull osd(s)

2022-10-14 Thread Szabo, Istvan (Agoda)
Thank you very much for the detailed explanation. I will wait then; based on the current speed, about 5 more hours. Let's see.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Janne Johansson 
Sent: Friday, October 14, 2022 5:26 PM
To: Szabo, Istvan (Agoda) 
Cc: Ceph Users 
Subject: Re: [ceph-users] Low space hindering backfill and 2 backfillfull osd(s)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Den fre 14 okt. 2022 kl 12:10 skrev Szabo, Istvan (Agoda)
:
> I've added 5 more nodes to my cluster and got this issue.
> HEALTH_WARN 2 backfillfull osd(s); 17 pool(s) backfillfull; Low space
> hindering backfill (add storage if this doesn't resolve itself): 4 pgs 
> backfill_toofull OSD_BACKFILLFULL 2 backfillfull osd(s)
> osd.150 is backfill full
> osd.178 is backfill full
>
> I read in the mail list that I might need to increase the pg on the some pool 
> to have smaller pgs.
> Also read I might need to reweigt the mentioned full osd with 1.2 until it's 
> ok, then set back.
> Which would be the best solution?


It is not unusual to see "backfill_toofull", especially if the reason for 
expanding was that space was getting tight.

When you add new drives, a lot of PGs need to move, not only from "old OSDs to 
new" but in all possible directions.
As an example, if you had 16 PGs and three hosts (A,B and C), the PGs would end 
up something like:

A 1,4,7,10,13,16
B 2,5,8,11,14
C 3,6,9,12,15
(5-6 PGs per host)

Then you add host D and E, now it should become something like:

A 1,6,11,16
B 2,7,12
C 3,8,13
D 4,9,14
E 5,10,15
(3-4 PGs per host)

From here we can see that A will keep PG 1 and 16, B will keep PG 2, C keeps PG 3, but more or less ALL the other PGs will be moving about.
D and E will of course get PGs because they are added, but A will send PG 7 to 
host B, B will send PG 8 to host C, and so on.

If A,B and C are almost full and you add new OSDs (D and E), the cluster will 
try to schedule *all* the moves.

Of course pgs 4,5,9,10,14 and 15 can just start copying at any time since D and 
E are empty when they arrive, but the cluster will also ask A to send PG 7 to 
B, and B will try to send PG 8 to C, and if PG 7 makes B go past the backfill_full 
limit, or if PG 8 makes host C pass it, they will pause those moves with the 
state backfill_toofull and just have them being "misplaced"/"remapped".

In the meantime, the other moves are going to get handled, and sooner or later hosts B and C will have moved off so much data that PGs 7 and 8 can move to their correct places, but this might mean those will be among the last to move about.

The reality is not 100% as simple as this: the straw2 bucket placement algorithm tries to prevent parts of this, there might be cases where two of the old hosts would send PGs to each other, basically just swapping them around, and the fact that any PG is made up of EC k+m / #replica parts makes this explanation a bit too simple. But in broad terms, this is why you get "errors" when adding new empty drives, and it is perfectly OK: it will fix itself as soon as the other moves have created enough space for the queued-toofull moves to be performed without driving an OSD over the limits.
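
If you really need the queued moves to proceed sooner, you can also check and temporarily raise the backfillfull threshold (the value below is only an example, and it should be reverted once the data has moved):

# current ratios
ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'

# temporarily allow a bit more backfill headroom
ceph osd set-backfillfull-ratio 0.92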

--
May the most significant bit of your life be positive.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Low space hindering backfill and 2 backfillfull osd(s)

2022-10-14 Thread Szabo, Istvan (Agoda)
Hi,

I've added 5 more nodes to my cluster and got this issue.
HEALTH_WARN 2 backfillfull osd(s); 17 pool(s) backfillfull; Low space hindering 
backfill (add storage if this doesn't resolve itself): 4 pgs backfill_toofull
OSD_BACKFILLFULL 2 backfillfull osd(s)
osd.150 is backfill full
osd.178 is backfill full

I read on the mailing list that I might need to increase the PG count on some pools to get smaller PGs.
I also read that I might need to reweight the mentioned full OSD with 1.2 until it's OK, then set it back.
Which would be the best solution?

There is still 17% left to rebalance; should I leave it and wait until it finishes, or should I take some action?

  data:
pools:   17 pools, 2816 pgs
objects: 87.76M objects, 158 TiB
usage:   442 TiB used, 424 TiB / 866 TiB avail
pgs: 31046292/175526278 objects misplaced (17.688%)
 2235 active+clean
 543  active+remapped+backfill_wait
 29   active+remapped+backfilling
 6active+remapped+backfill_wait+backfill_toofull
 1active+remapped+backfill_toofull
 1active+clean+scrubbing+deep
 1active+clean+scrubbing

  io:
client:   760 MiB/s rd, 573 MiB/s wr, 26.24k op/s rd, 18.18k op/s wr
recovery: 10 GiB/s, 2.82k objects/s

These are the fullest OSDs (1 NVMe has 4 OSDs on it):

184  nvme 3.49269  1.0 3.5 TiB  2.9 TiB  2.9 TiB 145 MiB 5.5 GiB  643 GiB 82.01 1.61  26 up
208  nvme 3.49269  1.0 3.5 TiB  2.9 TiB  2.8 TiB 152 MiB 5.5 GiB  655 GiB 81.70 1.60  20 up
178  nvme 3.49269  1.0 3.5 TiB  2.7 TiB  2.7 TiB 134 MiB 5.4 GiB  769 GiB 78.48 1.54  20 up
164  nvme 3.49269  1.0 3.5 TiB  2.6 TiB  2.6 TiB 123 MiB 5.1 GiB  884 GiB 75.28 1.47  31 up
188  nvme 3.49269  1.0 3.5 TiB  2.6 TiB  2.6 TiB 143 MiB 5.2 GiB  902 GiB 74.79 1.46  20 up


Side note:

  *   Before, the cluster had only 4 nodes, with 8 drives per node.
  *   The added 5 nodes have 6 drives each; the plan was to move 1 NVMe out of each existing node and add it to the new ones, so the final setup would be 7 drives in each of the 9 nodes.

Thank you for your help and idea.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Szabo, Istvan (Agoda)
Finally, how is your PG distribution? How many PGs per disk?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Frank Schilder 
Sent: Friday, October 7, 2022 6:50 PM
To: Igor Fedotov ; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi all,

trying to respond to 4 past emails :)

We started using manual conversion and, if the conversion fails, it fails in the last step. So far, we have a failure on 1 out of 8 OSDs. The OSD can be repaired by running a compaction + another repair, which will complete the last step. It looks like we are just on the edge and can get away with double-compaction.
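
For the future reader, the "compaction + another repair" on a stopped OSD looks roughly like this (the OSD id is an example):

systemctl stop ceph-osd@16

# offline compaction of the OSD's RocksDB
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact

# run the repair again so the last conversion step completes
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-16

systemctl start ceph-osd@16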

For the interested future reader, we have subdivided 400G high-performance SSDs 
into 4x100G OSDs for our FS meta data pool. The increased concurrency improves 
performance a lot. But yes, we are on the edge. OMAP+META is almost 50%.

In our case, just merging 2x100 into 1x200 will probably not improve things as 
we will end up with an even more insane number of objects per PG than what we 
have already today. I will plan for having more OSDs for the meta-data pool 
available and also plan for having the infamous 60G temp space available with a 
bit more margin than what we have now.

Thanks to everyone who helped!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 13:21:29
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Frank,

there are no tools to defragment an OSD atm. The only way to defragment an OSD is to redeploy it...


Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:
> Hi Igor,
>
> sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
> de-fragment the OSD. It doesn't look like the fsck command does that. Is 
> there any such tool?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 07 October 2022 01:53:20
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus
>
> Hi Igor,
>
> I added a sample of OSDs on identical disks. The usage is quite well 
> balanced, so the numbers I included are representative. I don't believe that 
> we had one such extreme outlier. Maybe it ran full during conversion. Most of 
> the data is OMAP after all.
>
> I can't dump the free-dumps into paste bin, they are too large. Not sure if 
> you can access ceph-post-files. I will send you a tgz in a separate e-mail 
> directly to you.
>
>> And once again - do other non-starting OSDs show the same ENOSPC error?
>> Evidently I'm unable to make any generalization about the root cause
>> due to lack of the info...
> As I said before, I need more time to check this and give you the answer you 
> actually want. The stupid answer is they don't, because the other 3 are taken 
> down the moment 16 crashes and don't reach the same point. I need to take 
> them out of the grouped management and start them by hand, which I can do 
> tomorrow. I'm too tired now to play on our production system.
>
> The free-dumps are on their separate way. I included one for OSD 17 as well 
> (on the same disk).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 07 October 2022 01:19:44
> To: Frank Schilder; ceph-users@ceph.io
> Cc: Stefan Kooman
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> The log I inspected was for osd.16  so please share that OSD
> utilization... And honestly I trust allocator's stats more so it's
> rather CLI stats are incorrect if any. Anyway free dump should provide
> additional proofs..
>
> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause
> due to lack of the info...
>
>
> W.r.t fsck - you can try to run it - since fsck opens DB in read-pnly
> there are some chances it will work.
>
>
> Thanks,
>
> Igor
>
>
> On 10/7/2022 1:59 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> I suspect there is something wrong with the data reported. These OSDs are 
>> only 50-60% used. For example:
>>
>> ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
>> 29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
>>

[ceph-users] one pg periodically got inconsistent

2022-10-02 Thread Szabo, Istvan (Agoda)
Hi,

I have a PG which periodically gets inconsistent. Normally pg repair helps, but is there a way to avoid it?
In my opinion it comes from a larger data deletion on the weekend that cannot be handled.
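
For reference, the usual inspect-and-repair sequence I fall back to looks like this (the PG id is a placeholder):

# see which objects/shards the scrub flagged
rados list-inconsistent-obj <pgid> --format=json-pretty

# then queue a repair of that PG
ceph pg repair <pgid>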

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-23 Thread Szabo, Istvan (Agoda)
Good to know, thank you. So in that case, during recovery it is worth increasing those values, right?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Eugen Block 
Sent: Friday, September 23, 2022 1:19 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Any disadvantage to go above the 100pg/osd or 
4osd/disk?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi,

I can't speak from the developers' perspective, but we discussed this just 
recently internally and with a customer. We doubled the number of PGs on one of 
our customer's data pools from around 100 to 200 PGs/OSD (HDDs with rocksDB on 
SSDs). We're still waiting for the final conclusion if the performance has 
increased or not, but it seems to work as expected. We probably would double it 
again if the PG size/objects per PG would affect the performance again. You 
just need to be aware of the mon_max_pg_per_osd and 
osd_max_pg_per_osd_hard_ratio configs in case of recovery. Otherwise we don't 
see any real issue with 200 or 400 PGs/OSD if the nodes can handle it.
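
For reference, a rough sketch of checking and, if needed, raising those limits (the values are examples, not recommendations):

ceph config get mon mon_max_pg_per_osd
ceph config get osd osd_max_pg_per_osd_hard_ratio

# raise the ceiling so recovery/backfill cannot push OSDs over the PG limit
ceph config set global mon_max_pg_per_osd 400
ceph config set osd osd_max_pg_per_osd_hard_ratio 5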

Regards,
Eugen

Zitat von "Szabo, Istvan (Agoda)" :

> Hi,
>
> My question is, is there any technical limit to have 8osd/ssd and on
> each of them 100pg if the memory and cpu resource available (8gb
> memory/osd and 96vcore)?
> The iops and bandwidth on the disks are very low so I don’t see any
> issue to go with this.
>
> In my cluster I’m using 15.3TB ssds. We have more than 2 billions of
> objects in each of the 3 clusters.
> The bottleneck is the pg/osd so last time when my serious issue solved
> the solution was to bump the pg-s of the data pool the allowed maximum
> with 4:2 ec.
>
> I’m curious of the developers opinion also.
>
> Thank you,
> Istvan
>
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-19 Thread Szabo, Istvan (Agoda)
Sorry, 96 vcores is a typo; it's 2 vcores/OSD, but it can be 4 as well.

> 
> On 2022. Sep 19., at 19:50, Szabo, Istvan (Agoda)  
> wrote:
> 
> Hi,
> 
> My question is, is there any technical limit to have 8osd/ssd and on each of 
> them 100pg if the memory and cpu resource available (8gb memory/osd and 
> 96vcore)?
> The iops and bandwidth on the disks are very low so I don’t see any issue to 
> go with this.
> 
> In my cluster I’m using 15.3TB ssds. We have more than 2 billions of objects 
> in each of the 3 clusters.
> The bottleneck is the pg/osd so last time when my serious issue solved the 
> solution was to bump the pg-s of the data pool the allowed maximum with 4:2 
> ec.
> 
> I’m curious of the developers opinion also.
> 
> Thank you,
> Istvan
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-19 Thread Szabo, Istvan (Agoda)
Hi,

My question is: is there any technical limit to having 8 OSDs per SSD, with 100 PGs on each of them, if the memory and CPU resources are available (8 GB memory/OSD and 96 vcores)?
The IOPS and bandwidth on the disks are very low, so I don't see any issue with going this route.

In my cluster I'm using 15.3 TB SSDs. We have more than 2 billion objects in each of the 3 clusters.
The bottleneck is PGs per OSD, so the last time my serious issue was solved, the solution was to bump the PG count of the data pool to the allowed maximum with 4:2 EC.

I’m curious of the developers opinion also.

Thank you,
Istvan


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd iodepth for high-performance SSD OSDs

2021-10-26 Thread Szabo, Istvan (Agoda)
Isn’t 4 OSDs per SSD too much? Normally it is NVMe that is suitable for 4 OSDs, isn’t it?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 26., at 10:23, Frank Schilder  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


It looks like the bottleneck is the bstore_kv_sync thread, there seems to be 
only one running per OSD daemon independent of shard number. This would imply a 
rather low effective queue depth per OSD daemon. Are there ways to improve this other than deploying even more OSD daemons per drive?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 26 October 2021 09:41:44
To: ceph-users
Subject: [ceph-users] ceph-osd iodepth for high-performance SSD OSDs

Hi all,

we deployed a pool with high-performance SSDs and I'm testing aggregated 
performance. We seem to hit a bottleneck that is not caused by drive 
performance. My best guess at the moment is, that the effective iodepth of the 
OSD daemons is too low for these drives. I have 4 OSDs per drive and I vaguely 
remember that there are parameters to modify the degree of concurrency an OSD 
daemon uses to write to disk. Are these parameters the ones I'm looking for:

   "osd_op_num_shards": "0",
   "osd_op_num_shards_hdd": "5",
   "osd_op_num_shards_ssd": "8",
   "osd_op_num_threads_per_shard": "0",
   "osd_op_num_threads_per_shard_hdd": "1",
   "osd_op_num_threads_per_shard_ssd": "2",

How do these apply if I have these drives in a custom device class rbd_perf? 
Could I set, for example

ceph config set osd/class:rbd_perf osd_op_num_threads_per_shard 4

to increase concurrency on this particular device class only? Is it possible to 
increase the number of shards at run-time?
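
For what it's worth, my current understanding (an assumption, please correct me): the per-class mask is accepted by ceph config set, but the shard/thread counts are only read at OSD startup, so a restart is needed before they take effect. To check what a given OSD actually uses (osd.0 is an example):

ceph config set osd/class:rbd_perf osd_op_num_threads_per_shard 4
ceph config show osd.0 | grep osd_op_num
ceph daemon osd.0 config get osd_op_num_threads_per_shard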

Thanks for your help!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph performance optimization with SSDs

2021-10-22 Thread Szabo, Istvan (Agoda)
Be careful when you are designing: if you are planning to have billions of objects, you need more than the usual 2-4% of the data device's size for RocksDB+WAL to avoid spillover.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 22., at 14:47, Peter Sabaini  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


On 22.10.21 11:29, Mevludin Blazevic wrote:
Dear Ceph users,

I have a small Ceph cluster where each host consist of a small amount of SSDs 
and a larger number of HDDs. Is there a way to use the SSDs as performance 
optimization such as putting OSD Journals to the SSDs and/or using SSDs for 
caching?


Hi,

yes, SSDs can be put to good use as journal devices[0] for OSDs, or you could use them as caching devices for bcache[1]. This is actually a pretty widespread setup; there are docs for various scenarios[2][3].

But be aware that OSD journals are pretty write-intense, so be sure to use fast, reliable SSDs (or NVMes). I have seen OSD performance (esp. latency and jitter) actually worsen with (prosumer-grade) SSDs that could not support the sustained write load.

If at all in doubt, run tests before putting production on it.


[0] https://docs.ceph.com/en/latest/start/hardware-recommendations/
[1] https://bcache.evilpiepirate.org/
[2] https://charmhub.io/ceph-osd
[3] 
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/object_gateway_for_production_guide/using-nvme-with-lvm-optimally
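
As an illustration only (device names are examples, not a recommendation): with BlueStore the usual way to use a fast SSD/NVMe is to put the OSD's RocksDB/WAL on it when the OSD is created, e.g.:

# one HDD-backed OSD with its RocksDB/WAL on an NVMe partition
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

# or let ceph-volume carve the NVMe up for several HDD OSDs at once
ceph-volume lvm batch --bluestore /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1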

Best regards,
Mevludin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW/multisite sync traffic rps

2021-10-22 Thread Szabo, Istvan (Agoda)
I see the same issue (45k GET requests constantly as admin). My guess is that the primary site is writing the changes to the datalog and the secondary sites are pulling these logs as they change.
Do you have a user who is constantly uploading and deleting?
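
For what it's worth, this is how I check whether the polling corresponds to real sync work or is just the regular log polling/trimming cycle (the bucket name is an example):

radosgw-admin sync status
radosgw-admin mdlog status
radosgw-admin datalog status
radosgw-admin bucket sync status --bucket=mybucket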

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 22., at 10:46, Stefan Schueffler  
wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi,

i have a question on RGW/multisite. The sync traffic is running a lot of 
requests per second (around 1500), which seems to be high, especially compared 
to the actual volume of user/client-requests.

We have a rather simple multisite-setup with
- two ceph clusters (16.2.6), 1 realm, 1 zonegroup, and one zone on each side, 
one of them is the master zone.
- latency between those cluster around 0.3ms
- each cluster has 3 RGW/beast daemons running.
- a handful of buckets (around 20), and a check script which creates one bucket 
per second (and deletes it after validating the successful bucket creation).
- one of the buckets has a few million (smaller) objects, the others are (more 
or less) empty.
- from the client side, there are just a few requests per second (mostly PUT 
objects into the one larger bucket), writing a few kilobytes per second.
- roughly 5 GB in total disk size consumed currently, with the idea to increase 
the total consumption to a few TB over time.

Both clusters are in sync (after the initial full sync, they now do incremental 
sync). Although they do sync the new objects from cluster A (master, to which 
the clients connect to) to B, we see a lot of „internal“ sync requests in our 
monitoring: each rgw daemon does about 500 requests per second to a rgw daemon 
on cluster A, especially to "/admin/log?…", which leads to a total of 1500 
requests per second just for the sync, and this results in almost 60% cpu usage 
for the rgw/beast processes.

When stopping and restarting the rgw-instances on cluster-B, it first catches 
up with the delta, and as soon as it finishes, it starts to request in this 
endless loop "/admin/log…"

Is this amount of internal, sync-related requests normal and expected?

Thanks for any ideas how to debug / introspect this.

Best
Stefan

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Szabo, Istvan (Agoda)
Have you tried to repair the PG?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 20., at 9:04, Glaza  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi Everyone, I am in the process of upgrading Nautilus (14.2.22) to Octopus (15.2.14) on CentOS 7 (Mon/Mgr were additionally migrated to CentOS 8 beforehand). Each day I upgraded one host, and after all OSDs were up, I manually compacted them one by one. Today (8 hosts upgraded, 7 still to go) I started getting errors like "Possible data damage: 1 pg inconsistent". The first time, the acting set was [56,58,62], but I thought OK, in the osd.62 logs there are many lines like "osd.62 39892 class rgw_gc open got (1) Operation not permitted" - maybe RGW did not clean some omaps properly and Ceph did not notice it until a scrub happened. But now I have got acting [56,57,58] and none of these OSDs has those rgw_gc errors in the logs. All affected OSDs are Octopus 15.2.14 on NVMe, hosting the default.rgw.buckets.index pool. Does anyone have experience with this problem? Any help is appreciated.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph IO are interrupted when OSD goes down

2021-10-18 Thread Szabo, Istvan (Agoda)
Octopus 15.2.14?
I have exactly the same issue, and it is causing me production issues.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 18., at 12:01, Denis Polom  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi,

I have a EC pool with these settings:

crush-device-class= crush-failure-domain=host crush-root=default
jerasure-per-chunk-alignment=false k=10 m=2 plugin=jerasure
technique=reed_sol_van w=8

and my understanding is that if one of the OSDs goes down because of a read error or just flapping for some reason (mostly read errors and bad sectors in my case), client IO shouldn't be disturbed, because we have the other object shards and Ceph should manage it. But client IO is disturbed: the CephFS mount point becomes inaccessible on clients even though they mount CephFS against all 3 monitors.

It doesn't happen always, just sometimes. Is my understanding right that it can happen when a read error or flapping occurs on an active OSD?
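
One thing I plan to double-check, based on my (possibly wrong) understanding that an EC pool defaults to min_size = k+1, i.e. 11 here, so PGs go inactive as soon as two shards are unavailable at the same time (the pool name is an example):

ceph osd pool get cephfs_data min_size
ceph osd pool ls detail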


Thx!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Limit scrub impact

2021-10-16 Thread Szabo, Istvan (Agoda)
Hi,

During scrub I see slow ops like this:

osd.31 [WRN] slow request osd_op(client.115442393.0:263257613728.76s0 
28:6ed54dc8:::9213182a-14ba-48ad-bde9-289a1c0c0de8.6034919.1_%2fWHITELABEL-1%2fPAGETPYE-7%2fDEVICE-4%2fLANGUAGE-46%2fSUBTYPE-0%2f492210:head
 [create,setxattr user.rgw.idtag (57) in=71b,setxattr user.rgw.tail_tag (57) 
in=74b,writefull 0~36883 in=36883b,setxattr user.rgw.manifest (375) 
in=392b,setxattr user.rgw.acl (123) in=135b,setxattr user.rgw.content_type (10) 
in=31b,setxattr user.rgw.etag (32) in=45b,setxattr 
user.rgw.x-amz-meta-storagetimestamp (40) in=76b,call rgw.obj_store_pg_ver 
in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] 
ondisk+write+known_if_redirected e34043) initiated 
2021-10-16T20:01:19.846240+0700 currently started

I'm not sure why this takes RGW down; my guess is that it wants to write to that specific OSD, which is busy.
I saw a SUSE article (https://www.suse.com/support/kb/doc/?id=19684) where they say that if the system load average is above 0.5, it is worth setting something like this:

osd_max_scrubs=2
osd_scrub_load_threshold=3

My load average is around 1.5-1.7. What is not really clear to me: if the load average is actually higher and I allow more scrubs, it will make things even worse, won't it?

Or how can I limit scrubbing?
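
For reference, the knobs I am considering for throttling scrub instead of increasing it (the values are examples):

ceph config set osd osd_max_scrubs 1
ceph config set osd osd_scrub_sleep 0.1

# only scrub outside business hours, and only while the load is low enough
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
ceph config set osd osd_scrub_load_threshold 0.5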

Thx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metrics for object sizes

2021-10-14 Thread Szabo, Istvan (Agoda)
Actually it's a good idea; the metrics are already there, so it's easy to make the Grafana dashboard. Thank you 

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Christian Rohmann  
Sent: Thursday, October 14, 2021 3:08 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Metrics for object sizes

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


On 23/04/2021 03:53, Szabo, Istvan (Agoda) wrote:
> Objects inside RGW buckets like in couch base software they have their own 
> metrics and has this information.

Not as detailed as you would like, but how about using the bucket stats on 
bucket size and number of objects?
  $ radosgw-admin bucket stats --bucket mybucket


Doing bucket_size / number_of_objects gives you an average object size per bucket, and that certainly is an indication of buckets with rather small objects.
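
A quick sketch of that calculation, assuming jq is available and that the stats JSON exposes the usual usage["rgw.main"] fields (the bucket name is an example):

radosgw-admin bucket stats --bucket mybucket | jq '.usage["rgw.main"] | .size / .num_objects'   # average object size in bytes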



Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-13 Thread Szabo, Istvan (Agoda)
Is it possible to extend the block.db LV of that specific OSD with the lvextend command, or does it need some special BlueStore extend?
I want to extend that LV by the size of the spillover, compact it, and migrate afterwards.
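
My current plan, as a sketch (OSD id, VG/LV names and sizes are placeholders; since we use dmcrypt I assume the dm-crypt mapping on top of the LV also needs a cryptsetup resize before the expand):

systemctl stop ceph-osd@48

# grow the db LV by roughly the amount that spilled over
lvextend -L +20G /dev/ceph-db-vg/db-lv-48

# let BlueFS pick up the larger device, then verify
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-48
ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-48

systemctl start ceph-osd@48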

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
Sent: Tuesday, October 12, 2021 7:15 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

So things with migrations are clear at the moment, right? As I mentioned the 
migrate command in 15.2.14 has a bug which causes corrupted OSD if db->slow 
migration occurs on spilled over OSD. To work around that you might want to 
migrate slow to db first or try manual compaction. Please make sure there is no 
spilled-over data left after any of them via ceph-bluestore-tool's bluefs-bdev-sizes command before proceeding with the db->slow migrate...

just a side note - IMO it sounds a bit controversial that you're 
expecting/experiencing better performance without standalone DB and at the same 
time spillovers cause performance issues... Spillover means some data goes to 
main device (which you're trying to achieve by migrating as well) hence it 
would rather improve things... Or the root cause of your performace issues is 
different... Just want to share my thoughts - I don't have any better ideas 
about that so far...



Thanks,

Igor
On 10/12/2021 2:54 PM, Szabo, Istvan (Agoda) wrote:
I’m having 1 billions of objects in the cluster and we are still increasing and 
faced spillovers allover the clusters.
After 15-18 spilledover osds (out of the 42-50) the osds started to die, 
flapping.
Tried to compact manually the spilleovered ones, but didn’t help, however the 
not spilled osds less frequently crashed.
In our design 3 ssd was used 1 nvme for db+wal, but this nvme has 30k iops on 
random write, however the ssds behind this nvme have individually 67k so 
actually the SSDs are faster in write than the nvme which means our config 
suboptimal.

I’ve decided to update the cluster to 15.2.14 to be able to run this 
ceph-volume lvm migrate command and started to use it.

10-20% is the failed migration at the moment, 80-90% is successful.
I want to avoid this spillover in the future so I’ll use bare SSDs as osds 
without wal+db. At the moment my iowait decreased  a lot without nvme drives, I 
just hope didn’t do anything wrong with this migration right?

The failed ones I’m removing from the cluster and add it back after cleaned up.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov <mailto:igor.fedo...@croit.io>
Sent: Tuesday, October 12, 2021 6:45 PM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>; 胡 玮文 
<mailto:huw...@outlook.com>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


You mean you run migrate for these 72 OSDs and all of them aren't starting any 
more? Or you just upgraded them to Octopus and experiencing performance issues.

In the latter case and if you have enough space at DB device you might want to 
try to migrate data from slow to db first. Run fsck (just in case) and then 
migrate from DB/WAl back to slow.
Theoretically this should help in avoiding the before-mentioned bug. But  I 
haven't try that personally...

And this wouldn't fix the corrupted OSDs if any though...



Thanks,

Igor
On 10/12/2021 2:36 PM, Szabo, Istvan (Agoda) wrote:
Omg, I’ve already migrated 24x osds in each dc-s (altogether 72).
What should I do then? 12 left (altogether 36). In my case slow device is 
faster in random write iops than the one which is serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---



On 2021. Oct 12., at 13:21, Igor Fedotov 
<mailto:igor.fedo...@croit.io> wrote:
Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

you're bitten by

It's not fixed in 15.2.14. This has got a backport to upcoming Octopus
minor release. Please do not use 'migrate' command from

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
One more thing, what I’m doing at the moment (rough commands after the list):

Noout norebalance on 1 host
Stop all osd
Compact all the osds
Migrate the db 1 by 1
Start the osds 1 by 1
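
Roughly, per host this translates to the following (OSD id, FSID and VG/LV names are placeholders; the --from/--target form is the ceph-volume lvm migrate syntax I am using on 15.2.14):

ceph osd set noout
ceph osd set norebalance

systemctl stop ceph-osd@48
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-48 compact

# move the DB/WAL onto the main (block) device of this OSD
ceph-volume lvm migrate --osd-id 48 --osd-fsid <osd-fsid> --from db wal --target ceph-block-vg/block-lv-48

systemctl start ceph-osd@48

# after all OSDs on the host are done
ceph osd unset norebalance
ceph osd unset noout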

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Szabo, Istvan (Agoda)
Sent: Tuesday, October 12, 2021 6:54 PM
To: Igor Fedotov 
Cc: ceph-users@ceph.io; 胡 玮文 
Subject: RE: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

I’m having 1 billions of objects in the cluster and we are still increasing and 
faced spillovers allover the clusters.
After 15-18 spilledover osds (out of the 42-50) the osds started to die, 
flapping.
Tried to compact manually the spilleovered ones, but didn’t help, however the 
not spilled osds less frequently crashed.
In our design 3 ssd was used 1 nvme for db+wal, but this nvme has 30k iops on 
random write, however the ssds behind this nvme have individually 67k so 
actually the SSDs are faster in write than the nvme which means our config 
suboptimal.

I’ve decided to update the cluster to 15.2.14 to be able to run this 
ceph-volume lvm migrate command and started to use it.

10-20% is the failed migration at the moment, 80-90% is successful.
I want to avoid this spillover in the future so I’ll use bare SSDs as osds 
without wal+db. At the moment my iowait decreased  a lot without nvme drives, I 
just hope didn’t do anything wrong with this migration right?

The failed ones I’m removing from the cluster and add it back after cleaned up.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov mailto:igor.fedo...@croit.io>>
Sent: Tuesday, October 12, 2021 6:45 PM
To: Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>; 胡 玮文 
mailto:huw...@outlook.com>>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


You mean you run migrate for these 72 OSDs and all of them aren't starting any 
more? Or you just upgraded them to Octopus and experiencing performance issues.

In the latter case and if you have enough space at DB device you might want to 
try to migrate data from slow to db first. Run fsck (just in case) and then 
migrate from DB/WAl back to slow.
Theoretically this should help in avoiding the before-mentioned bug. But  I 
haven't try that personally...

And this wouldn't fix the corrupted OSDs if any though...



Thanks,

Igor
On 10/12/2021 2:36 PM, Szabo, Istvan (Agoda) wrote:
Omg, I’ve already migrated 24x osds in each dc-s (altogether 72).
What should I do then? 12 left (altogether 36). In my case slow device is 
faster in random write iops than the one which is serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2021. Oct 12., at 13:21, Igor Fedotov 
<mailto:igor.fedo...@croit.io> wrote:
Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

you're bitten by

It's not fixed in 15.2.14. This has got a backport to upcoming Octopus
minor release. Please do not use 'migrate' command from WAL/DB to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov <mailto:ifedo...@suse.de>
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>; Eugen Block 
<mailto:ebl...@nde.ag>; 胡 玮文 
<mailto:huw...@outlook.com>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
We have 1 billion objects in the cluster, the count is still increasing, and we have faced spillovers all over the clusters.
After 15-18 OSDs had spilled over (out of the 42-50), the OSDs started to die and flap.
I tried to compact the spilled-over ones manually, but it didn't help; the OSDs that had not spilled over crashed less frequently, though.
In our design, 3 SSDs shared 1 NVMe for DB+WAL, but this NVMe does 30k IOPS on random write while the SSDs behind it individually do 67k, so the SSDs are actually faster at writes than the NVMe, which means our config is suboptimal.

I’ve decided to update the cluster to 15.2.14 to be able to run this ceph-volume lvm migrate command, and I have started to use it.

At the moment 10-20% of the migrations fail and 80-90% succeed.
I want to avoid this spillover in the future, so I'll use bare SSDs as OSDs without a separate WAL+DB. At the moment my iowait has decreased a lot without the NVMe drives; I just hope I didn't do anything wrong with this migration, right?

The failed ones I'm removing from the cluster and adding back after they are cleaned up.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
Sent: Tuesday, October 12, 2021 6:45 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


You mean you run migrate for these 72 OSDs and all of them aren't starting any 
more? Or you just upgraded them to Octopus and experiencing performance issues.

In the latter case and if you have enough space at DB device you might want to 
try to migrate data from slow to db first. Run fsck (just in case) and then 
migrate from DB/WAl back to slow.
Theoretically this should help in avoiding the before-mentioned bug. But  I 
haven't try that personally...

And this wouldn't fix the corrupted OSDs if any though...



Thanks,

Igor
On 10/12/2021 2:36 PM, Szabo, Istvan (Agoda) wrote:
Omg, I’ve already migrated 24x osds in each dc-s (altogether 72).
What should I do then? 12 left (altogether 36). In my case slow device is 
faster in random write iops than the one which is serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---


On 2021. Oct 12., at 13:21, Igor Fedotov 
<mailto:igor.fedo...@croit.io> wrote:
Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

you're bitten by

It's not fixed in 15.2.14. This has got a backport to upcoming Octopus
minor release. Please do not use 'migrate' command from WAL/DB to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:

Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov <mailto:ifedo...@suse.de>
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>; Eugen Block 
<mailto:ebl...@nde.ag>; 胡 玮文 
<mailto:huw...@outlook.com>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
<mailto:ifedo...@suse.de><mailto:ifedo...@suse.de><mailto:ifedo...@suse.de>
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com>;
 胡 玮文 
<mailto:huw...@outloo

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
Omg, I’ve already migrated 24 OSDs in each DC (72 altogether).
What should I do then? 12 are left per DC (36 altogether). In my case the slow device is faster at random-write IOPS than the device which is serving it.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2021. Oct 12., at 13:21, Igor Fedotov  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Istvan,

you're bitten by https://github.com/ceph/ceph/pull/43140

It's not fixed in 15.2.14. This has got a backport to upcoming Octopus
minor release. Please do not use 'migrate' command from WAL/DB to slow
volume if some data is already present there...

Thanks,

Igor


On 10/12/2021 12:13 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

I’ve attached here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; Eugen Block ; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov <mailto:ifedo...@suse.de>
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com>; 胡 玮文 
<mailto:huw...@outlook.com>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>; Eugen Block 
<mailto:ebl...@nde.ag>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one which hasn’t been spilledover disk, still coredumped ☹
Is there any special thing that we need to do before we migrate db next to the 
block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi Igor,

I’ve attached it here, thank you in advance.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
Sent: Monday, October 11, 2021 10:40 PM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io; Eugen Block ; 胡 玮文 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


No,

that's just backtrace of the crash - I'd like to see the full OSD log from the 
process startup till the crash instead...
On 10/8/2021 4:02 PM, Szabo, Istvan (Agoda) wrote:
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this that you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov <mailto:ifedo...@suse.de>
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com>; 胡 玮文 
<mailto:huw...@outlook.com>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>; Eugen Block 
<mailto:ebl...@nde.ag>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one where the DB hasn't spilled over, and it still core dumped ☹
Is there anything special we need to do before we migrate the db next to the block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",

[ceph-users] Re: Metrics for object sizes

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi,

Just got the chance to have a look, but I see Lua scripting is new in Pacific ☹
I have Octopus 15.2.14; will it be backported, or is there no chance?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Yuval Lifshitz 
Sent: Tuesday, September 14, 2021 7:38 PM
To: Szabo, Istvan (Agoda) 
Cc: Wido den Hollander ; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Metrics for object sizes

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !

Hi Istvan,
Hope this is still relevant... but you may want to have a look at this example:

https://github.com/ceph/ceph/blob/master/examples/lua/prometheus_adapter.lua
https://github.com/ceph/ceph/blob/master/examples/lua/prometheus_adapter.md

where we log RGW object sizes to Prometheus.
It would be easy to change it so that it is per bucket rather than per operation type.

Yuval
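
For completeness, on Pacific the example script is loaded into RGW with radosgw-admin roughly like this (prometheus_adapter.lua is the example file from the links above; the context name follows the Pacific Lua scripting docs):

radosgw-admin script put --infile=./prometheus_adapter.lua --context=postRequest

On Octopus 15.2.x this subcommand is not available, hence the backport question above.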

On Fri, Apr 23, 2021 at 4:53 AM Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>> wrote:
Objects inside RGW buckets; Couchbase, for example, has its own metrics that expose this information.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com><mailto:istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>>
---

On 2021. Apr 22., at 14:00, Wido den Hollander 
mailto:w...@42on.com>> wrote:



On 21/04/2021 11:46, Szabo, Istvan (Agoda) wrote:
Hi,
Is there any cluster-wide metric regarding object sizes?
I'd like to collect some information about users and the object sizes
in their buckets.

Are you talking about RADOS objects or objects inside RGW buckets?

I think you are talking about RGW, but I just wanted to check.

Afaik this information is not available for both RADOS and RGW.

Do keep in mind that small objects are much more expensive than large objects.
The metadata overhead becomes costly and can even become problematic if you
have millions of tiny (few KB) objects.

Wido
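
A per-object size distribution indeed isn't exposed, but per-bucket totals are, so a rough average object size per bucket can be derived from the bucket stats (assuming radosgw-admin access; the bucket name is an example):

radosgw-admin bucket stats --bucket=mybucket | egrep '"num_objects"|"size_actual"'

Dividing size_actual by num_objects gives the average, which is often enough to spot buckets full of tiny objects.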


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Where is my free space?

2021-10-12 Thread Szabo, Istvan (Agoda)
I see. I'm using SSDs, so I guess it shouldn't be a problem, because
"bluestore_min_alloc_size": "0" is overridden by
"bluestore_min_alloc_size_ssd": "4096"?

-Original Message-
From: Stefan Kooman  
Sent: Tuesday, October 12, 2021 2:19 PM
To: Szabo, Istvan (Agoda) ; ceph-users@ceph.io
Subject: Re: [ceph-users] Where is my free space?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !
________

On 10/12/21 07:21, Szabo, Istvan (Agoda) wrote:
> Hi,
>
> 377 TiB is the total cluster size, the data pool is 4:2 EC with 66 TiB stored; how
> can the data pool be 60% used??!!

Space amplification? It depends on, among other things (like object size), the
min_alloc_size you use for the OSDs. See this thread [1] and this spreadsheet
[2].

Gr. Stefan

[1]:
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/NIVVTSR2YW22VELM4BW4S6NQUCS3T4XW/
[2]:
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
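
As a rough worked example of that amplification (assuming the default 4 KiB SSD min_alloc_size and ignoring metadata overhead): in a 4:2 EC pool every object is split into 4 data shards plus 2 coding shards, and each shard allocation is rounded up to min_alloc_size on its OSD. For a 4 KiB object each shard carries only ~1 KiB of payload but still occupies a full 4 KiB allocation unit:

  6 shards x 4 KiB = 24 KiB raw for a 4 KiB object, i.e. ~6x instead of the nominal 1.5x

Large objects approach the nominal (4+2)/4 = 1.5x overhead. In the cluster above the data pool holds ~1.21 G objects for 66 TiB stored, so the average object is only around 60 KiB, which means per-shard rounding adds a bit on top of 1.5x (and much more for the smaller-than-average objects).
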
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] get_health_metrics reporting slow ops and gw outage

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi,

Many of my OSDs are having this issue, which causes 10-15 ms OSD write operation
latency and more than 60 ms read operation latency.
This makes RGW wait on operations, and after a while the RGWs just restarted (all
of them in my cluster) and only became available again once the slow ops disappeared.

I see a similar issue but haven't really seen a solution anywhere:
https://tracker.ceph.com/issues/44184

I'm facing this issue in 2 of the 3 clusters in my multisite environment
(Octopus 15.2.14). Some background on where I'm facing it: before this I had many
flapping OSDs and even some unfound objects; I'm not sure whether that is related.

2021-10-12T09:59:45.542+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:46.583+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:47.581+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:48.551+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)
2021-10-12T09:59:49.592+0700 7fa0445a7700 -1 osd.46 32739 get_health_metrics 
reporting 205 slow ops, oldest is osd_op(client.115442393.0:1420913395 28.23s0 
28:c4b40264:::9213182a-14ba-48ad-bde9-289a1c0c0de8.29868038.12_geo%2fpoi%2f1718955%2f7fc1308d421939a23614908dda8ff659.jpg:head
 [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e32739)

Haven't really found anybody on the mailing list hitting this either :/
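
For what it's worth, the individual slow requests can usually be inspected on the affected OSD through the admin socket while this is happening (osd.46 as in the log above), which at least shows which phase they are stuck in:

ceph daemon osd.46 dump_ops_in_flight
ceph daemon osd.46 dump_historic_slow_ops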

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Where is my free space?

2021-10-12 Thread Szabo, Istvan (Agoda)
Hi,

377 TiB is the total cluster size, the data pool is 4:2 EC with 66 TiB stored; how can the data pool be 60% used??!!


Some output:
ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
nvme   12 TiB   11 TiB   128 MiB  1.2 TiB   9.81
ssd    377 TiB  269 TiB  100 TiB  108 TiB   28.65
TOTAL  389 TiB  280 TiB  100 TiB  109 TiB   28.06

--- POOLS ---
POOLID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED 
(DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY   USED 
COMPR  UNDER COMPR
device_health_metrics11   49 MiB  0 B   49 MiB   50   98 MiB
  0 B   98 MiB  0 73 TiB  N/AN/A  50 0 
B  0 B
.rgw.root2   32  1.1 MiB  1.1 MiB  4.5 KiB  159  3.9 MiB  
3.9 MiB   12 KiB  0 56 TiB  N/AN/A 159 
0 B  0 B
ash.rgw.log  6   32  1.8 GiB   46 KiB  1.8 GiB   73.83k  4.3 GiB  
4.4 MiB  4.3 GiB  0 59 TiB  N/AN/A  73.83k 
0 B  0 B
ash.rgw.control  7   32  2.9 KiB  0 B  2.9 KiB8  7.7 KiB
  0 B  7.7 KiB  0 56 TiB  N/AN/A   8 0 
B  0 B
ash.rgw.meta 88  554 KiB  531 KiB   23 KiB1.93k   22 MiB   
22 MiB   70 KiB  03.4 TiB  N/AN/A   1.93k 0 
B  0 B
ash.rgw.buckets.index   10  128  406 GiB  0 B  406 GiB   58.69k  1.2 TiB
  0 B  1.2 TiB  10.333.4 TiB  N/AN/A  58.69k 0 
B  0 B
ash.rgw.buckets.data11   32   66 TiB   66 TiB  0 B1.21G   86 TiB   
86 TiB  0 B  37.16111 TiB  N/AN/A   1.21G 0 
B  0 B
ash.rgw.buckets.non-ec  15   32  8.4 MiB653 B  8.4 MiB   22   23 MiB  
264 KiB   23 MiB  0 54 TiB  N/AN/A  22 
0 B  0 B




rados df
POOL_NAME  USED OBJECTS  CLONES  COPIES  
MISSING_ON_PRIMARY  UNFOUND   DEGRADED   RD_OPS   RD   WR_OPS   
WR  USED COMPR  UNDER COMPR
.rgw.root   3.9 MiB 159   0 477 
  00 60  8905420   20 GiB 8171   19 MiB 0 B 
 0 B
ash.rgw.buckets.data 86 TiB  1205539864   0  7233239184 
  00  904168110  36125678580  153 TiB  55724221429  174 TiB 0 B 
 0 B
ash.rgw.buckets.index   1.2 TiB   58688   0  176064 
  00  0  65848675184   62 TiB  10672532772  6.8 TiB 0 B 
 0 B
ash.rgw.buckets.non-ec   23 MiB  22   0  66 
  00  6  3999256  2.3 GiB  1369730  944 MiB 0 B 
 0 B
ash.rgw.control 7.7 KiB   8   0  24 
  00  30  0 B8  0 B 0 B 
 0 B
ash.rgw.log 4.3 GiB   73830   0  221490 
  00  39282  36922450608   34 TiB   5420884130  1.8 TiB 0 B 
 0 B
ash.rgw.meta 22 MiB1931   05793 
  00  0692302142  528 GiB  4274154  2.0 GiB 0 B 
 0 B
device_health_metrics98 MiB  50   0 150 
  00 5013588   40 MiB17758   46 MiB 0 B 
 0 B

total_objects1205674552
total_used   109 TiB
total_avail  280 TiB
total_space  389 TiB



4 OSDs are down because I'm migrating the DB to the block device.

ceph osd tree
ID   CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT  PRI-AFF
-1 398.17001  root default
-11  61.12257  host server01
24   nvme1.74660  osd.24up   1.0  1.0
  0ssd   14.84399  osd.0   down   1.0  1.0
10ssd   14.84399  osd.10  down   1.0  1.0
14ssd   14.84399  osd.14  down   1.0  1.0
20ssd   14.84399  osd.20  down   1.0  1.0
-5  61.12257  host server02
25   nvme1.74660  osd.25up   1.0  1.0
  1ssd   14.84399  osd.1 up   1.0  1.0
  7ssd   14.84399  osd.7 up   1.0  1.0
13ssd   14.84399  osd.13up   1.0  1.0
19ssd   14.84399  osd.19up   1.0  1.0
-9  61.12257  host server03
26   nvme1.74660  osd.26up   1.0  1.0
  3ssd   14.84399  osd.3 up   1.0  1.0
  9ssd   14.84399  osd.9 up   1.0  1.0
16ssd   14.84399  osd.16up   1.0  

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-08 Thread Szabo, Istvan (Agoda)
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this what you are looking for?
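
In case anyone wants to run the same check, a command along these lines should produce comparable output (run with the OSD stopped; the path is an example):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-48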

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) ; 胡 玮文 
Cc: ceph-users@ceph.io; Eugen Block 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one where the DB hasn't spilled over, and it still core dumped ☹
Is there anything special we need to do before we migrate the db next to the block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
"timestamp": "2021-10-05T13:31:28.513463Z",
    "utsname_hostname": "server-2s07",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
}
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: 胡 玮文 <mailto:huw...@outlook.com>
Sent: Monday, October 4, 2021 12:13 AM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com>; Igor Fedotov 
<mailto:ifedo...@suse.de>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>
Subject: 回复: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Szabo, Istvan (Agoda)
This "Unable to load table properties" is also interesting, right before the caught signal:

  -16> 2021-10-05T20:31:28.484+0700 7f310cce5f00 2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 247222 --- 
NotFound:


   -15> 2021-10-05T20:31:28.484+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 251966 --- 
NotFound:

   -14> 2021-10-05T20:31:28.484+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 247508 --- 
NotFound:

   -13> 2021-10-05T20:31:28.484+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 252237 --- 
NotFound:

   -12> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 249610 --- 
NotFound:

   -11> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 251798 --- 
NotFound:

   -10> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 251799 --- 
NotFound:

-9> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 252235 --- 
NotFound:

-8> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 252236 --- 
NotFound:

-7> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 244769 --- 
NotFound:

-6> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 242684 --- 
NotFound:

-5> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 241854 --- 
NotFound:

-4> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 241191 --- 
NotFound:

-3> 2021-10-05T20:31:28.492+0700 7f310cce5f00  4 rocksdb: 
[db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-241072 
succeeded,manifest_file_number is 241072, next_file_number is 252389, 
last_sequence is 5847989279, log_number is 252336,prev_log_number is 
0,max_column_family is 0,min_log_number_to_keep is 0

-2> 2021-10-05T20:31:28.492+0700 7f310cce5f00  4 rocksdb: 
[db/version_set.cc:3766] Column family [default] (ID 0), log number is 252336

-1> 2021-10-05T20:31:28.501+0700 7f310cce5f00  4 rocksdb: 
[db/db_impl.cc:390] Shutdown: canceling all background work
 0> 2021-10-05T20:31:28.512+0700 7f310cce5f00 -1 *** Caught signal 
(Aborted) **
 in thread 7f310cce5f00 thread_name:ceph-osd




Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
-----------

On 2021. Oct 5., at 17:19, Szabo, Istvan (Agoda)  wrote:


Hmm, I've removed it from the cluster and the data is rebalancing now; I'll do it with the next one ☹

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) ; 胡 玮文 
Cc: ceph-users@ceph.io; Eugen Block 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one where the DB hasn't spilled over, and it still core dumped ☹
Is there anything special we need to do before we migrate the db next to the block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
  

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Szabo, Istvan (Agoda)
Hmm, I've removed it from the cluster and the data is rebalancing now; I'll do it with the next one ☹

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) ; 胡 玮文 
Cc: ceph-users@ceph.io; Eugen Block 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one where the DB hasn't spilled over, and it still core dumped ☹
Is there anything special we need to do before we migrate the db next to the block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
"timestamp": "2021-10-05T13:31:28.513463Z",
    "utsname_hostname": "server-2s07",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
}
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: 胡 玮文 <mailto:huw...@outlook.com>
Sent: Monday, October 4, 2021 12:13 AM
To: Szabo, Istvan (Agoda) 
<mailto:istvan.sz...@agoda.com>; Igor Fedotov 
<mailto:ifedo...@suse.de>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>
Subject: 回复: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Szabo, Istvan (Agoda)
This one is in messages: https://justpaste.it/3x08z

bluefs_buffered_io is turned on by default in 15.2.14 Octopus, FYI.
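
The effective value can be verified per OSD or from the config database if needed (osd.46 is just an example id):

ceph daemon osd.46 config get bluefs_buffered_io
ceph config get osd bluefs_buffered_io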


Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Eugen Block  
Sent: Tuesday, October 5, 2021 9:52 PM
To: Szabo, Istvan (Agoda) 
Cc: 胡 玮文 ; Igor Fedotov ; 
ceph-users@ceph.io
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Do you see oom killers in dmesg on this host? This line indicates it:

  "(tcmalloc::allocate_full_cpp_throw_oom(unsigned
long)+0x146) [0x7f310b7d8c96]",
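
A quick way to check is grepping the kernel log on that host for the usual OOM killer messages, e.g.:

dmesg -T | egrep -i 'out of memory|oom-killer|killed process'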


Zitat von "Szabo, Istvan (Agoda)" :

> Hmm, tried another one where the DB hasn't spilled over, and it still core dumped ☹
> Is there anything special we need to do before we migrate the db next to the block?
> Our osds are using dmcrypt, is it an issue?
>
> {
> "backtrace": [
> "(()+0x12b20) [0x7f310aa49b20]",
> "(gsignal()+0x10f) [0x7f31096aa37f]",
> "(abort()+0x127) [0x7f3109694db5]",
> "(()+0x9009b) [0x7f310a06209b]",
> "(()+0x9653c) [0x7f310a06853c]",
> "(()+0x95559) [0x7f310a067559]",
> "(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
> "(()+0x10b03) [0x7f3109a48b03]",
> "(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
> "(__cxa_throw()+0x3b) [0x7f310a0687eb]",
> "(()+0x19fa4) [0x7f310b7b6fa4]",
> "(tcmalloc::allocate_full_cpp_throw_oom(unsigned
> long)+0x146) [0x7f310b7d8c96]",
> "(()+0x10d0f8e) [0x55ffa520df8e]",
> "(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
> "(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
> "(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a)
> [0x55ffa52efcca]",
> "(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88)
> [0x55ffa52f0568]",
> "(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
> "(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
> "(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
> "(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
> "(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
> "(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
> std::__cxx11::basic_string, 
> std::allocator > const&, 
> std::vector std::allocator > const&, 
> std::vector std::allocator >*, rocksdb::DB**,
> bool)+0x1089) [0x55ffa51a57e9]",
> "(RocksDBStore::do_open(std::ostream&, bool, bool, 
> std::vector std::allocator > const*)+0x14ca) 
> [0x55ffa51285ca]",
> "(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
> "(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
> "(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
> "(OSD::init()+0x380) [0x55ffa4753a70]",
> "(main()+0x47f1) [0x55ffa46a6901]",
> "(__libc_start_main()+0xf3) [0x7f3109696493]",
> "(_start()+0x2e) [0x55ffa46d4e3e]"
> ],
> "ceph_version": "15.2.14",
> "crash_id":
> "2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
> "entity_name": "osd.48",
> "os_id": "centos",
> "os_name": "CentOS Linux",
> "os_version": "8",
> "os_version_id": "8",
> "process_name": "ceph-osd",
> "stack_sig":
> "6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
> "timestamp": "2021-10-05T13:31:28.513463Z",
> "utsname_hostname": "server-2s07",
> "utsname_machine": "x86_64",
> "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
> "utsname_sysname": "Linux",
> "utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
> }
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com<mail
