[ceph-users] Re: Multisite: metadata behind on shards

2024-05-13 Thread Christian Rohmann

On 13.05.24 5:26 AM, Szabo, Istvan (Agoda) wrote:

I wonder what the mechanism behind the sync is, because I need to 
restart all the gateways every 2 days on the remote sites to keep them in 
sync. (Octopus 15.2.7)
We've also seen lots of those issues with stuck RGWs on earlier 
versions. But there have been quite a few fixes in this area ... e.g. 
https://tracker.ceph.com/issues/39657
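
In the meantime it might help to look at where the sync actually gets 
stuck before restarting the gateways. A rough sketch using the standard 
radosgw-admin commands (run on the affected secondary zone):

 # overall picture of metadata + data sync
 radosgw-admin sync status

 # per-shard state of the metadata sync in particular
 radosgw-admin metadata sync status

If shards stay behind there even though nothing changes on the primary, 
that would fit the stuck-RGW issues mentioned above.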



Is upgrading Ceph to a more recent version an option for you?



Regards


Christian



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.3 QE validation status

2024-04-19 Thread Christian Rohmann

On 18.04.24 8:13 PM, Laura Flores wrote:
Thanks for bringing this to our attention. The leads have decided that 
since this PR hasn't been merged to main yet and isn't approved, it 
will not go in v18.2.3, but it will be prioritized for v18.2.4.
I've already added the PR to the v18.2.4 milestone so it's sure to be 
picked up.


Thanks a bunch. If you miss the train, you miss the train - fair enough.
Nice to know there is another one going soon and that the bug is going to be 
on it!



Regards

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.3 QE validation status

2024-04-18 Thread Christian Rohmann

Hey Laura,


On 17.04.24 4:58 PM, Laura Flores wrote:

There are two PRs that were added later to the 18.2.3 milestone concerning
debian packaging:
https://github.com/ceph/ceph/pulls?q=is%3Apr+is%3Aopen+milestone%3Av18.2.3
The user is asking if these can be included.


I know everybody always wants their most anticipated PR in the next 
point release,
but please let me kindly point you to the issue of ceph-crash not 
working due to a small glitch in its directory permissions:


 * Post to the ML: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VACLBNVXTYNSXJSNXJSRAQNZHCHABDF4/

 * Bug report: https://tracker.ceph.com/issues/64548
 * Non-backport PR fixing this: https://github.com/ceph/ceph/pull/55917


This is really potentially a one-liner fix allowing ceph-crash 
reports to be sent again.
When I noticed this, I had 47 unreported crashes queued up in one of my 
clusters.
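
For anyone wanting to check their own clusters, a quick sketch using the 
built-in crash module:

 # summary / list of crashes the cluster already knows about
 ceph crash stat
 ceph crash ls-new

If crashes only start showing up here after fixing the directory 
ownership and restarting ceph-crash, you were affected as well.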




Regards


Christian




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw s3 bucket policies limitations (on users)

2024-04-03 Thread Christian Rohmann

Hey Garcetto,

On 29.03.24 4:13 PM, garcetto wrote:

   i am trying to set bucket policies to allow different users to access the
same bucket with different permissions, BUT it seems that is not yet
supported, am i wrong?

https://docs.ceph.com/en/reef/radosgw/bucketpolicy/#limitations

"We do not yet support setting policies on users, groups, or roles."


Maybe check out my previous, somewhat similar question: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/S2TV7GVFJTWPYA6NVRXDL2JXYUIQGMIN/

And PR https://github.com/ceph/ceph/pull/44434 could also be of interest.

I would love for RGW to support more detailed bucket policies, 
especially with external / Keystone authentication.
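
As far as I understand, what does work already is a bucket policy that 
names individual (local RGW) users as principals. A minimal sketch, with 
made-up bucket and user names, applied e.g. via "s3cmd setpolicy" or 
"aws s3api put-bucket-policy":

 {
   "Version": "2012-10-17",
   "Statement": [{
     "Effect": "Allow",
     "Principal": {"AWS": ["arn:aws:iam:::user/readonly-user"]},
     "Action": ["s3:GetObject", "s3:ListBucket"],
     "Resource": ["arn:aws:s3:::examplebucket", "arn:aws:s3:::examplebucket/*"]
   }]
 }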




Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hanging request in S3

2024-03-12 Thread Christian Kugler
Hi Casey,

Interesting. Especially since the request it hangs on is a GET request.
I set the option and restarted the RGW I test with.

The POSTs for deleting take a while but there are no longer blocking GET
or POST requests.
Thank you!

Best,
Christian

PS: Sorry for pressing the wrong reply button, Casey
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Journal size recommendations

2024-03-08 Thread Christian Rohmann

On 01.03.22 19:57, Eugen Block wrote:
can you be more specific what exactly you are looking for? Are you 
talking about the rocksDB size? And what is the unit for 5012? It’s 
really not clear to me what you’re asking. And since the 
recommendations vary between different use cases you might want to 
share more details about your use case.



FWIW, I suppose OP was asking about this setting: 
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_journal_size
And reading 
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#journal-settings 
states


"This section applies only to the older Filestore OSD back end. Since 
Luminous BlueStore has been default and preferred."



It's totally obsolete with bluestore.
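
If in doubt about which back end an OSD actually uses, something like this 
should tell you (osd.0 just as an example):

 ceph osd metadata 0 | grep osd_objectstore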



Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw dynamic bucket sharding will hang io

2024-03-08 Thread Christian Rohmann

On 08.03.24 14:25, Christian Rohmann wrote:
What do you mean by blocking IO? No bucket actions (read / write) or 
high IO utilization?


According to https://docs.ceph.com/en/latest/radosgw/dynamicresharding/

"Writes to the target bucket are blocked (but reads are not) briefly 
during resharding process."


Are you observing this not being that "briefly" then?
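
If so, it might be worth checking what the resharder is doing while the 
bucket appears blocked, e.g. (bucket name is a placeholder):

 radosgw-admin reshard list
 radosgw-admin reshard status --bucket=<bucket>
 radosgw-admin bucket limit check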



Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw dynamic bucket sharding will hang io

2024-03-08 Thread Christian Rohmann

On 08.03.24 07:22, nuabo tan wrote:

When reshard occurs, io will be blocked, why has this serious problem not been 
solved?


Do you care to elaborate on this a bit more?

Which Ceph release are you using?
Are you using multisite replication or are you talking about a single 
RGW site?


What do you mean by blocking IO? No bucket actions (read / write) or 
high IO utilization?




Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Hanging request in S3

2024-03-06 Thread Christian Kugler
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket get_obj_state: setting s->obj_tag to
107ace7a-a829-4d1c-9cb8-9db30644b786.395658.12884446303569321109
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket  bucket index object:
rechenzentrum.rgw.buckets.index:.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3.1.34
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket cache get:
name=rechenzentrum.rgw.log++bucket.sync-source-hints.sql20 : hit (negative
entry)
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket cache get:
name=rechenzentrum.rgw.log++bucket.sync-target-hints.sql20 : hit (negative
entry)
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket reflect(): flow manager
(bucket=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3): adding
source pipe:
{s={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0},d={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket reflect(): flow manager
(bucket=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3): adding dest
pipe:
{s={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0},d={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket reflect(): flow manager (bucket=): adding source pipe:
{s={b=*,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0},d={b=*,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket reflect(): flow manager (bucket=): adding dest pipe:
{s={b=*,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0},d={b=*,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket chain_cache_entry: cache_locator=
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket chain_cache_entry: couldn't find cache locator
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket couldn't put bucket_sync_policy cache entry, might have
raced with data changes
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s
s3:list_bucket RGWDataChangesLog::add_entry() bucket.name=sql20 shard_id=34
now=2024-03-06T18:36:17.978389+
cur_expiration=1970-01-01T00:00:00.00+

I don't see any clear error but somehow the last few lines seem odd to me:
- Where it previously said: flow manager
(bucket=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3)
  it now has no bucket anymore: flow manager (bucket=)
- no cache locator found. No idea if this is okay or not
- The cur_expiration a few lines later is set to unix time 0
  (1970-01-01T00:00:00.00+)
- I did this multiple times and it seems to always be shard 34 that has the
issue

Did someone see something like this before?
Any ideas how to remedy the situation or at least where to or what to look
for?

Best,
Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: debian-reef_OLD?

2024-03-05 Thread Christian Rohmann

On 04.03.24 22:24, Daniel Brown wrote:

debian-reef/

Now appears to be:

debian-reef_OLD/


Could this have been some sort of "release script" just messing up the 
renaming / symlinking to the most recent stable?




Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)

2024-02-29 Thread Christian Rohmann




On 23.02.24 16:18, Christian Rohmann wrote:
I just noticed issues with ceph-crash using the Debian /Ubuntu 
packages (package: ceph-base):


While the /var/lib/ceph/crash/posted folder is created by the package 
install,

it's not properly chowned to ceph:ceph by the postinst script.

[...]

You might want to check if you are affected as well.
Failing to post crashes to the local cluster results in them not being 
reported back via telemetry.


Sorry to bluntly bump this again, but did nobody else notice this on 
your clusters?
Call me egoistic, but the more clusters return crash reports the more 
stable my Ceph likely becomes ;-)



Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)

2024-02-23 Thread Christian Rohmann

Hey ceph-users,

I just noticed issues with ceph-crash using the Debian /Ubuntu packages 
(package: ceph-base):


While the /var/lib/ceph/crash/posted folder is created by the package 
install,

it's not properly chowned to ceph:ceph by the postinst script.
This might also affect RPM based installs somehow, but I did not look 
into that.


I opened a bug report with all the details and two ideas to fix this: 
https://tracker.ceph.com/issues/64548



The wrong ownership causes ceph-crash to NOT work at all. I myself 
missed quite a few crash reports. All of them were just sitting around 
on the machines, but were reported right after I did


 chown ceph:ceph /var/lib/ceph/crash/posted
 systemctl restart ceph-crash.service

You might want to check if you are affected as well.
Failing to post crashes to the local cluster results in them not being 
reported back via telemetry.
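
A quick way to check whether a node is affected (assuming the default 
packaging paths):

 # should print "ceph:ceph" - anything else means you are likely affected
 stat -c '%U:%G' /var/lib/ceph/crash/posted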



Regards

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Throughput metrics missing when updating Ceph Quincy to Reef

2024-02-05 Thread Christian Rohmann

On 01.02.24 10:10, Christian Rohmann wrote:

[...]
I am wondering if ceph-exporter [2] is also built and packaged via 
the ceph packages [3] for installations that use them?




[2] https://github.com/ceph/ceph/tree/main/src/exporter
[3] https://docs.ceph.com/en/latest/install/get-packages/


I could not find ceph-exporter in any of the packages or as a single 
binary, so I opened an issue:


https://tracker.ceph.com/issues/64317



Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how can install latest dev release?

2024-02-01 Thread Christian Rohmann

On 31.01.24 11:33, garcetto wrote:
thank you, but seems related to quincy, there is nothing on latest 
versions in the doc... maybe the doc is not updated?



I don't understand what you are missing. I just used a documentation 
link pointing to the Quincy version of this page, yes.
The "latest" documentation is at 
https://docs.ceph.com/en/latest/install/get-packages/#ceph-development-packages.
But it seems nothing has changed. There are dev packages available at 
the URLs mentioned there.



Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Throughput metrics missing when updating Ceph Quincy to Reef

2024-02-01 Thread Christian Rohmann
This change is documented at 
https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics,
also mentioning the deployment of ceph-exporter which is now used to 
collect per-host metrics from the local daemons.


While this deployment is done by cephadm if used, I am wondering if 
ceph-exporter [2] is also built and packaged via the ceph packages [3] 
for installations that use them?




Regards


Christian





[1] 
https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics

[2] https://github.com/ceph/ceph/tree/main/src/exporter
[3] https://docs.ceph.com/en/latest/install/get-packages/




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how can install latest dev release?

2024-01-31 Thread Christian Rohmann

On 31.01.24 09:38, garcetto wrote:

  how can i install latest dev release using cephadm?
I suppose you found 
https://docs.ceph.com/en/quincy/install/get-packages/#ceph-development-packages, 
but yes, that only seems to target a package installation.
Would be nice if there were also dev containers being built somewhere to 
use with cephadm.




Regards

Christian



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 DC with 4+5 EC not quite working

2024-01-14 Thread Christian Wuerdig
I could be wrong, however as far as I can see you have 9 chunks, which
requires 9 failure domains.
Your failure domain is set to datacenter, of which you only have 3. So that
won't work.

You need to set your failure domain to host and then create a crush rule that
chooses 3 DCs and then 3 hosts within each DC.
Something like this should work:
step choose indep 3 type datacenter
step chooseleaf indep 3 type host
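
If it helps, this is roughly how I would test and inject such a rule - a
sketch only, please dry-run it against your own crush map first (rule id 7
and the 9 replicas are taken from the rule / profile quoted below):

 ceph osd getcrushmap -o crushmap.bin
 crushtool -d crushmap.bin -o crushmap.txt
 # edit the rule in crushmap.txt, then recompile and test the mappings
 crushtool -c crushmap.txt -o crushmap-new.bin
 crushtool -i crushmap-new.bin --test --rule 7 --num-rep 9 --show-mappings | head
 ceph osd setcrushmap -i crushmap-new.bin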

On Fri, 12 Jan 2024 at 20:58, Torkil Svensgaard  wrote:

> We are looking to create a 3 datacenter 4+5 erasure coded pool but can't
> quite get it to work. Ceph version 17.2.7. These are the hosts (there
> will eventually be 6 hdd hosts in each datacenter):
>
> -33  886.00842  datacenter 714
>   -7  209.93135  host ceph-hdd1
>
> -69   69.86389  host ceph-flash1
>   -6  188.09579  host ceph-hdd2
>
>   -3  233.57649  host ceph-hdd3
>
> -12  184.54091  host ceph-hdd4
> -34  824.47168  datacenter DCN
> -73   69.86389  host ceph-flash2
>   -2  201.78067  host ceph-hdd5
>
> -81  288.26501  host ceph-hdd6
>
> -31  264.56207  host ceph-hdd7
>
> -36 1284.48621  datacenter TBA
> -77   69.86389  host ceph-flash3
> -21  190.83224  host ceph-hdd8
>
> -29  199.08838  host ceph-hdd9
>
> -11  193.85382  host ceph-hdd10
>
>   -9  237.28154  host ceph-hdd11
>
> -26  187.19536  host ceph-hdd12
>
>   -4  206.37102  host ceph-hdd13
>
> We did this:
>
> ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd
> plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default
> crush-failure-domain=datacenter crush-device-class=hdd
>
> ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
> ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
> ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
>
> Didn't quite work:
>
> "
> [WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg
> incomplete
>  pg 33.0 is creating+incomplete, acting
> [104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
> cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
> 'incomplete')
> "
>
> I then manually changed the crush rule from this:
>
> "
> rule cephfs.hdd.data {
>  id 7
>  type erasure
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class hdd
>  step chooseleaf indep 0 type datacenter
>  step emit
> }
> "
>
> To this:
>
> "
> rule cephfs.hdd.data {
>  id 7
>  type erasure
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class hdd
>  step choose indep 0 type datacenter
>  step chooseleaf indep 3 type host
>  step emit
> }
> "
>
> Based on some testing and dialogue I had with Red Hat support last year
> when we were on RHCS, it seemed to work. Then:
>
> ceph fs add_data_pool cephfs cephfs.hdd.data
> ceph fs subvolumegroup create hdd --pool_layout cephfs.hdd.data
>
> I started copying data to the subvolume and increased pg_num a couple of
> times:
>
> ceph osd pool set cephfs.hdd.data pg_num 256
> ceph osd pool set cephfs.hdd.data pg_num 2048
>
> But at some point it failed to activate new PGs eventually leading to this:
>
> "
> [WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
>  mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are
> blocked > 30 secs, oldest blocked for 25455 secs
> [WARN] MDS_TRIM: 1 MDSs behind on trimming
>  mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming
> (997/128) max_segments: 128, num_segments: 997
> [WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive
>  pg 33.6f6 is stuck inactive for 8h, current state
> activating+remapped, last acting [50,79,116,299,98,219,164,124,421]
>  pg 33.6fa is stuck inactive for 11h, current state
> activating+undersized+degraded+remapped, last acting
> [17,408,NONE,196,223,290,73,39,11]
>  pg 33.705 is stuck inactive for 11h, current state
> activating+undersized+degraded+remapped, last acting
> [33,273,71,NONE,411,96,28,7,161]
>  pg 33.721 is stuck inactive for 7h, current state
> activating+remapped, last acting [283,150,209,423,103,325,118,142,87]
>  pg 33.726 is stuck inactive for 11h, current state
> activating+undersized+degraded+remapped, last acting
> [234,NONE,416,121,54,141,277,265,19]
> [WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects
> degraded (0.000%), 3 pgs degraded, 3 pgs undersized
>  pg 33.6fa is stuck undersized for 7h, current state
> activating+undersized+degraded+remapped, last acting
> [17,408,NONE,196,223,290,73,39,11]
>  pg 33.705 is stuck undersized for 7h, current state
> 

[ceph-users] Re: RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures

2024-01-12 Thread Christian Rohmann

Hey Istvan,

On 10.01.24 03:27, Szabo, Istvan (Agoda) wrote:
I'm using this in the frontend https config on haproxy, and it works 
well so far:


stick-table type ip size 1m expire 10s store http_req_rate(10s)

tcp-request inspect-delay 10s
tcp-request content track-sc0 src
http-request deny deny_status 429 if { sc_http_req_rate(0) gt 1 }



But this serves as a basic rate limit for all requests coming from a 
single IP address, right?



My question was rather about limiting clients in regards to 
authentication requests / unauthorized requests,

which end up hammering the auth system (Keystone in my case) at full rate.
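
What I have in mind is something along these lines - counting 403 
responses per source IP and rejecting clients that produce too many of 
them. An untested sketch, names and thresholds are placeholders:

 frontend rgw_frontend
     bind :443 ssl crt /etc/haproxy/rgw.pem
     stick-table type ip size 1m expire 10m store gpc0,gpc0_rate(60s)
     http-request track-sc0 src
     http-request deny deny_status 429 if { sc_gpc0_rate(0) gt 20 }
     # count 403 responses (e.g. InvalidAccessKeyId) per source IP
     http-response sc-inc-gpc0(0) if { status 403 }
     default_backend rgw_backend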



Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures

2024-01-09 Thread Christian Rohmann

Happy New Year Ceph-Users!

With the holidays and people likely being away, I take the liberty to 
bluntly BUMP this question about protecting RGW from DoS below:



On 22.12.23 10:24, Christian Rohmann wrote:

Hey Ceph-Users,


RGW does have options [1] to rate limit ops or bandwidth per bucket or 
user.

But those only come into play when the request is authenticated.

I'd like to also protect the authentication subsystem from malicious 
or invalid requests.
So in case e.g. some EC2 credentials are not valid (anymore) and 
clients start hammering the RGW with those requests, I'd like to make 
it cheap to deal with those requests. Especially in case some external 
authentication like OpenStack Keystone [2] is used, valid access 
tokens are cached within the RGW. But requests with invalid 
credentials end up being sent at full rate to the external API [3] as 
there is no negative caching. And even if there was, that would only 
limit the external auth requests for the same set of invalid 
credentials, but it would surely reduce the load in that case:


Since the HTTP request is blocking  



[...]
2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to 
https://keystone.example.com/v3/s3tokens
2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request 
mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http 
request
2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request 
req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0

[...]



this does not only stress the external authentication API (keystone in 
this case), but also blocks RGW threads for the duration of the 
external call.


I am currently looking into using the capabilities of HAProxy to rate 
limit requests based on their resulting http-response [4]. So in 
essence to rate-limit or tarpit clients that "produce" a high number 
of 403 "InvalidAccessKeyId" responses. To have less collateral it 
might make sense to limit based on the presented credentials 
themselves. But this would require to extract and track HTTP headers 
or URL parameters (presigned URLs) [5] and to put them into tables.



* What are your thoughts on the matter?
* What kind of measures did you put in place?
* Does it make sense to extend RGWs capabilities to deal with those 
cases itself?

** adding negative caching
** rate limits on concurrent external authentication requests (or is 
there a pool of connections for those requests?)




Regards


Christian



[1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
[2] 
https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
[3] 
https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
[4] 
https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
[5] 
https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm - podman vs docker

2023-12-31 Thread Christian Wuerdig
The general complaint about docker is usually that, by default, it stops all
running containers when the docker daemon gets shut down. There is the
"live-restore" option (which has been around for a while) but that's turned
off by default (and requires a daemon restart to enable). It only supports
patch updates (no major version upgrades), though that might be sufficient
for you.
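
For reference, enabling it is just an entry in daemon.json plus a restart 
of the docker daemon:

 # /etc/docker/daemon.json
 {
   "live-restore": true
 }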

On Thu, 28 Dec 2023 at 03:30, Murilo Morais  wrote:

> Good morning everybody!
>
> Guys, are there any differences or limitations when using Docker instead of
> Podman?
>
> Context: I have a cluster with Debian 11 running Podman (3.0.1), but the
> iSCSI service, when restarted, the "tcmu-runner" binary is in "Z State" and
> the "rbd-target-api" script enters "D State" and never dies, causing the
> service not to start until I perform a reboot. On machines that use
> distributions based on Red Hat with podman 4+ this behavior does not
> happen.
>
> I don't want to use a repository that I don't know about just to update
> podman.
>
> I haven't tested it with Debian 12 yet, as we experienced some problems
> with bootstrap, so we decided to use Debian 11.
>
> I'm thinking about testing with Docker but I don't know what the difference
> is between both solutions in the CEPH context.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures

2023-12-22 Thread Christian Rohmann

Hey Ceph-Users,


RGW does have options [1] to rate limit ops or bandwidth per bucket or user.
But those only come into play when the request is authenticated.

I'd like to also protect the authentication subsystem from malicious or 
invalid requests.
So in case e.g. some EC2 credentials are not valid (anymore) and clients 
start hammering the RGW with those requests, I'd like to make it cheap 
to deal with those requests. Especially in case some external 
authentication like OpenStack Keystone [2] is used, valid access tokens 
are cached within the RGW. But requests with invalid credentials end up 
being sent at full rate to the external API [3] as there is no negative 
caching. And even if there was, that would only limit the external auth 
requests for the same set of invalid credentials, but it would surely 
reduce the load in that case:


Since the HTTP request is blocking  



[...]
2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to 
https://keystone.example.com/v3/s3tokens
2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request 
mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http 
request
2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request 
req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0

[...]



this does not only stress the external authentication API (keystone in 
this case), but also blocks RGW threads for the duration of the external 
call.


I am currently looking into using the capabilities of HAProxy to rate 
limit requests based on their resulting http-response [4]. So in essence 
to rate-limit or tarpit clients that "produce" a high number of 403 
"InvalidAccessKeyId" responses. To have less collateral it might make 
sense to limit based on the presented credentials themselves. But this 
would require to extract and track HTTP headers or URL parameters 
(presigned URLs) [5] and to put them into tables.



* What are your thoughts on the matter?
* What kind of measures did you put in place?
* Does it make sense to extend RGWs capabilities to deal with those 
cases itself?

** adding negative caching
** rate limits on concurrent external authentication requests (or is 
there a pool of connections for those requests?)




Regards


Christian



[1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
[2] 
https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
[3] 
https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
[4] 
https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
[5] 
https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC Profiles & DR

2023-12-05 Thread Christian Wuerdig
You can structure your crush map so that you get multiple EC chunks per
host in a way that you can still survive a host outage even though
you have fewer hosts than k+1.
For example if you run an EC=4+2 profile on 3 hosts you can structure your
crushmap so that you have 2 chunks per host. This means even if one host is
down you are still guaranteed to have 4 chunks available.
If you then set min_size = 4 you can still operate your cluster in that
situation - albeit risky since any additional failure in that time will
lead to data loss. However in a highly constrained setup it might be a
trade-off that's worth it for you.
There have been examples of this on this mailing list in the past.
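
Roughly what such a rule could look like for EC 4+2 across 3 hosts with 2 
chunks per host (rule name and id are placeholders):

 rule ec42_two_per_host {
     id 99
     type erasure
     step set_chooseleaf_tries 5
     step set_choose_tries 100
     step take default
     step choose indep 3 type host
     step chooseleaf indep 2 type osd
     step emit
 }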

On Wed, 6 Dec 2023 at 12:11, Rich Freeman  wrote:

> On Tue, Dec 5, 2023 at 6:35 AM Patrick Begou
>  wrote:
> >
> > Ok, so I've misunderstood the meaning of failure domain. If there is no
> > way to request using 2 osd/node and node as failure domain, with 5 nodes
> > k=3+m=1 is not secure enough and I will have to use k=2+m=2, so like a
> > raid1  setup. A little bit better than replication in the point of view
> > of global storage capacity.
> >
>
> I'm not sure what you mean by requesting 2osd/node.  If the failure
> domain is set to the host, then by default k/m refer to hosts, and the
> PGs will be spread across all OSDs on all hosts, but with any
> particular PG only being present on one OSD on each host.  You can get
> fancy with device classes and crush rules and such and be more
> specific with how they're allocated, but that would be the typical
> behavior.
>
> Since k/m refer to hosts, then k+m must be less than or equal to the
> number of hosts or you'll have a degraded pool because there won't be
> enough hosts to allocate them all.  It won't ever stack them across
> multiple OSDs on the same host with that configuration.
>
> k=2,m=2 with min=3 would require at least 4 hosts (k+m), and would
> allow you to operate degraded with a single host down, and the PGs
> would become inactive but would still be recoverable with two hosts
> down.  While strictly speaking only 4 hosts are required, you'd do
> better to have more than that since then the cluster can immediately
> recover from a loss, assuming you have sufficient space.  As you say
> it is no more space-efficient than RAID1 or size=2, and it suffers
> write amplification for modifications, but it does allow recovery
> after the loss of up to two hosts, and you can operate degraded with
> one host down which allows for somewhat high availability.
>
> --
> Rich
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Automatic triggering of the Ubuntu SRU process, e.g. for the recent 17.2.7 Quincy point release?

2023-11-12 Thread Christian Rohmann

Hey Yuri, hey ceph-users,

first of all, thanks for all your work on developing and maintaining Ceph.

I was just wondering if there is any sort of process or trigger towards the 
Ubuntu release team following a point release, for them to also create 
updated packages.
If you look at https://packages.ubuntu.com/jammy-updates/ceph, there 
still only is 17.2.6 as the current update available.
There was an [SRU] bug raised for 17.2.6 
(https://bugs.launchpad.net/cloud-archive/+bug/2018929); I have now opened a 
similar one (https://bugs.launchpad.net/cloud-archive/+bug/2043336), 
hoping I went the right way about triggering the packaging of this point release.


Even though the Ceph team does not build Quincy packages for Ubuntu 
22.04 LTS (Jammy) themselves, it would be nice to still treat it 
somewhat as a release channel and to automatically trigger these kinds 
of processes.




Regards


Christian



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Packages for 17.2.7 released without release notes / announcement (Re: Re: Status of Quincy 17.2.5 ?)

2023-10-30 Thread Christian Rohmann

Sorry to dig up this old thread ...

On 25.01.23 10:26, Christian Rohmann wrote:

On 20/10/2022 10:12, Christian Rohmann wrote:

1) May I bring up again my remarks about the timing:

On 19/10/2022 11:46, Christian Rohmann wrote:

I believe the upload of a new release to the repo prior to the 
announcement happens quite regularly - it might just be due to the 
technical process of releasing.
But I agree it would be nice to have a more "bit flip" approach to 
new releases in the repo and not have the packages appear as updates 
prior to the announcement and final release and update notes.
By my observations sometimes there are packages available on the 
download servers via the "last stable" folders such as 
https://download.ceph.com/debian-quincy/ quite some time before the 
announcement of a release is out.
I know it's hard to time this right with mirrors requiring some time 
to sync files, but would be nice to not see the packages or have 
people install them before there are the release notes and potential 
pointers to changes out. 


Today's 16.2.11 release shows the exact issue I described above 

1) 16.2.11 packages are already available via e.g. 
https://download.ceph.com/debian-pacific
2) release notes not yet merged: 
(https://github.com/ceph/ceph/pull/49839), thus 
https://ceph.io/en/news/blog/2022/v16-2-11-pacific-released/ show a 
404 :-)
3) No announcement like 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QOCU563UD3D3ZTB5C5BJT5WRSJL5CVSD/ 
to the ML yet.




I really appreciate the work (implementation and also testing) that goes 
into each release.
But the release of 17.2.7 showed the issue of "packages available before 
the news is out":


1) packages are available on e.g. download.ceph.com
2) There are NO release notes on at 
https://docs.ceph.com/en/latest/releases/ yet

3) And there is no announcement on the ML yet


It would be awesome if you could consider bit-flip releases with 
packages only available right with the communication / release notes.




Regards


Christian






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware recommendations for a Ceph cluster

2023-10-10 Thread Christian Wuerdig
On Mon, 9 Oct 2023 at 14:24, Anthony D'Atri  wrote:

>
>
> > AFAIK the standing recommendation for all flash setups is to prefer fewer
> > but faster cores
>
> Hrm, I think this might depend on what you’re solving for.  This is the
> conventional wisdom for MDS for sure.  My sense is that OSDs can use
> multiple cores fairly well, so I might look at the cores * GHz product.
> Especially since this use-case sounds like long-tail performance probably
> isn’t worth thousands.  Only four OSD servers, Neutron, Kingston.  I don’t
> think the OP has stated any performance goals other than being more
> suitable to OpenStack instances than LFF spinners.
>

Well, the 75F3 seems to retail for less than the 7713P, so it should
technically be cheaper but then availability and supplier quotes are always
an important factor.


>
> > so something like a 75F3 might be yielding better latency.
> > Plus you probably want to experiment with partitioning the NVMEs and
> > running multiple OSDs per drive - either 2 or 4.
>
> Mark Nelson has authored a series of blog posts that explore this in great
> detail over a number of releases.  TL;DR: with Quincy or Reef, especially,
> my sense is that multiple OSDs per NVMe device is not the clear win that it
> once was, and just eats more RAM.  Mark has also authored detailed posts
> about OSD performance vs cores per OSD, though IIRC those are for one OSD
> in isolation.  In a real-world cluster, especially one this small, I
> suspect that replication and the network will be bottlenecks before either
> of the factors discussed above.
>
>
Thanks for reminding me of those. One thing I'm missing from
https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/ is the NVMe
utilization - no point in buying NVMe drives that are blazingly fast (in terms
of sustained random 4k IOPS performance) if you have no chance to actually
utilize them.
In summary it seems that if you have many cores then multiple OSDs per NVMe
would provide a benefit, with fewer cores not so much. Still, it would also be
good to see the same benchmark with a faster CPU (but fewer cores) and see
what the actual difference is, but I guess duplicating the test setup with a
different CPU is a bit tricky budget-wise.


> ymmv.
>
>
>
> >
> > On Sat, 7 Oct 2023 at 08:23, Gustavo Fahnle  wrote:
> >
> >> Hi,
> >>
> >> Currently, I have an OpenStack installation with a Ceph cluster
> consisting
> >> of 4 servers for OSD, each with 16TB SATA HDDs. My intention is to add a
> >> second, independent Ceph cluster to provide faster disks for OpenStack
> VMs.
> >> The idea for this second cluster is to exclusively provide RBD services
> to
> >> OpenStack. I plan to start with a cluster composed of 3 mon/mgr nodes
> >> similar to what we currently have (3 virtualized servers with VMware)
> with
> >> 4 cores, 8GB of memory, 80GB disk and 10GB network
> >> each server.
> >> In the current cluster, these nodes have low resource consumption, less
> >> than 10% CPU usage, 40% memory usage, and less than 100Mb/s of network
> >> usage.
> >>
> >> For the OSDs, I'm thinking of starting with 3 or 4 servers, specifically
> >> Supermicro AS-1114S-WN10RT, each with:
> >>
> >> 1 AMD EPYC 7713P Gen 3 processor (64 Core, 128 Threads, 2.0GHz)
> >> 256GB of RAM
> >> 2 x NVME 1TB for the operating system
> >> 10 x NVME Kingston DC1500M U.2 7.68TB for the OSDs
> >> Two Intel NIC E810-XXVDA2 25GbE Dual Port (2 x SFP28) PCIe 4.0 x8 cards
> >> Connected to 2 MikroTik CRS518-16XS-2XQ-RM switches at 100GbE per server
> >> Connection to OpenStack would be via 4 x 10GB to our core switch.
> >>
> >> I would like to hear opinions about this configuration, recommendations,
> >> criticisms, etc.
> >>
> >> If any of you have references or experience with any of the components
> in
> >> this initial configuration, they would be very welcome.
> >>
> >> Thank you very much in advance.
> >>
> >> Gustavo Fahnle
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware recommendations for a Ceph cluster

2023-10-08 Thread Christian Wuerdig
AFAIK the standing recommendation for all-flash setups is to prefer fewer
but faster cores, so something like a 75F3 might yield better latency.
Plus you probably want to experiment with partitioning the NVMEs and
running multiple OSDs per drive - either 2 or 4.
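
For the record, splitting the drives is a one-liner with ceph-volume (or 
the equivalent "osds_per_device" option in a cephadm OSD spec), device 
names being examples:

 ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1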

On Sat, 7 Oct 2023 at 08:23, Gustavo Fahnle  wrote:

> Hi,
>
> Currently, I have an OpenStack installation with a Ceph cluster consisting
> of 4 servers for OSD, each with 16TB SATA HDDs. My intention is to add a
> second, independent Ceph cluster to provide faster disks for OpenStack VMs.
> The idea for this second cluster is to exclusively provide RBD services to
> OpenStack. I plan to start with a cluster composed of 3 mon/mgr nodes
> similar to what we currently have (3 virtualized servers with VMware) with
> 4 cores, 8GB of memory, 80GB disk and 10GB network
> each server.
> In the current cluster, these nodes have low resource consumption, less
> than 10% CPU usage, 40% memory usage, and less than 100Mb/s of network
> usage.
>
> For the OSDs, I'm thinking of starting with 3 or 4 servers, specifically
> Supermicro AS-1114S-WN10RT, each with:
>
> 1 AMD EPYC 7713P Gen 3 processor (64 Core, 128 Threads, 2.0GHz)
> 256GB of RAM
> 2 x NVME 1TB for the operating system
> 10 x NVME Kingston DC1500M U.2 7.68TB for the OSDs
> Two Intel NIC E810-XXVDA2 25GbE Dual Port (2 x SFP28) PCIe 4.0 x8 cards
> Connected to 2 MikroTik CRS518-16XS-2XQ-RM switches at 100GbE per server
> Connection to OpenStack would be via 4 x 10GB to our core switch.
>
> I would like to hear opinions about this configuration, recommendations,
> criticisms, etc.
>
> If any of you have references or experience with any of the components in
> this initial configuration, they would be very welcome.
>
> Thank you very much in advance.
>
> Gustavo Fahnle
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CVE-2023-43040 - Improperly verified POST keys in Ceph RGW?

2023-09-27 Thread Christian Rohmann

Hey Ceph-users,

I just noticed there is a post to oss-security 
(https://www.openwall.com/lists/oss-security/2023/09/26/10) about a 
security issue with Ceph RGW.

Signed by IBM / Red Hat and including a patch by DO.


I also raised an issue on the tracker 
(https://tracker.ceph.com/issues/63004) about this, as I could not find 
one yet.
It seems a weird way of disclosing such a thing, and I am wondering if 
anybody knows any more about this?




Regards


Christian



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] What is causing *.rgw.log pool to fill up / not be expired (Re: RGW multisite logs (data, md, bilog) not being trimmed automatically?)

2023-09-14 Thread Christian Rohmann
I am unfortunately still observing this issue of the RADOS pool 
"*.rgw.log" filling up with more and more objects:


On 26.06.23 18:18, Christian Rohmann wrote:

On the primary cluster I am observing an ever growing (objects and 
bytes) "sitea.rgw.log" pool, not so on the remote "siteb.rgw.log" 
which is only 300MiB and around 15k objects with no growth.
Metrics show that the growth of pool on primary is linear for at least 
6 months, so not sudden spikes or anything. Also sync status appears 
to be totally happy.

There are also no warnings in regards to large OMAPs or anything similar.


Could anybody kindly point me in the right direction to search for the 
cause of this?

What kinds of logs and data are stored in this pool?
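
For context, this is roughly how I have been looking at it so far (pool 
name as in my setup):

 # group the objects in the log pool by name prefix to see which log type grows
 rados -p sitea.rgw.log ls | cut -d. -f1 | sort | uniq -c | sort -rn | head

 # multisite log / sync state
 radosgw-admin sync error list | head
 radosgw-admin datalog status
 radosgw-admin mdlog status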



Thanks and with kind regards,


Christian



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Continuous spurious repairs without cause?

2023-09-06 Thread Christian Theune
Hi,

interesting, that’s something we can definitely try!
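
For the archives, I assume that boils down to something like this (to be 
verified on our Nautilus cluster):

 # what a running OSD currently uses
 ceph config show osd.0 osd_scrub_auto_repair
 # disable cluster-wide so plain scrub errors become visible again
 ceph config set osd osd_scrub_auto_repair false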

Thanks!

Christian

> On 5. Sep 2023, at 16:37, Manuel Lausch  wrote:
> 
> Hi,
> 
> in older versions of ceph with the auto-repair feature the PG state of
> scrubbing PGs had always the repair state as well.
> With later versions (I don't know exactly at which version) ceph
> differentiated scrubbing and repair again in the PG state.
> 
> I think as long as there are no errors loged all should be fine. If
> you disable auto repair, the issue should disapear as well. In case of
> scrub errors you will then see appropriate states. 
> 
> Regards
> Manuel
> 
> On Tue, 05 Sep 2023 14:14:56 +
> Eugen Block  wrote:
> 
>> Hi,
>> 
>> it sounds like you have auto-repair enabled (osd_scrub_auto_repair). I  
>> guess you could disable that to see what's going on with the PGs and  
>> their replicas. And/or you could enable debug logs. Are all daemons  
>> running the same ceph (minor) version? I remember a customer case  
>> where different ceph minor versions (but overall Octopus) caused  
>> damaged PGs, a repair fixed them everytime. After they updated all  
>> daemons to the same minor version those errors were gone.
>> 
>> Regards,
>> Eugen
>> 
>> Zitat von Christian Theune :
>> 
>>> Hi,
>>> 
>>> this is a bit older cluster (Nautilus, bluestore only).
>>> 
>>> We’ve noticed that the cluster is almost continuously repairing PGs.  
>>> However, they all finish successfully with “0 fixed”. We do not see  
>>> the trigger why Ceph decides to repair the PGs and it’s happening  
>>> for a lot of PGs, not any specific individual one.
>>> 
>>> Deep-scrubs are generally running, but currently a bit late as we  
>>> had some recoveries in the last week.
>>> 
>>> Logs look regular aside from the number of repairs. Here’s the last  
>>> weeks from the perspective of a single PG. There’s one repair, but  
>>> the same thing seems to happen for all PGs.
>>> 
>>> 2023-08-06 16:08:17.870 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-06 16:08:18.270 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-07 21:52:22.299 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-07 21:52:22.711 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-09 00:33:42.587 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-09 00:33:43.049 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-10 09:36:00.590 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 deep-scrub starts
>>> 2023-08-10 09:36:28.811 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 deep-scrub ok
>>> 2023-08-11 12:59:14.219 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-11 12:59:14.567 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-12 13:52:44.073 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-12 13:52:44.483 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-14 01:51:04.774 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 deep-scrub starts
>>> 2023-08-14 01:51:33.113 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 deep-scrub ok
>>> 2023-08-15 05:18:16.093 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-15 05:18:16.520 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-16 09:47:38.520 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-16 09:47:38.930 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-17 19:25:45.352 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-17 19:25:45.775 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-19 05:40:43.663 7fc49b1de640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub starts
>>> 2023-08-19 05:40:44.073 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scrub ok
>>> 2023-08-20 12:06:54.343 7fc49f1e6640  0 log_channel(cluster) log  
>>> [DBG] : 278.2f3 scr

[ceph-users] Re: Continuous spurious repairs without cause?

2023-09-06 Thread Christian Theune
Hi,

thanks for the hint. We’re definitely running the exact same binaries for all of them. :)

> On 5. Sep 2023, at 16:14, Eugen Block  wrote:
> 
> Hi,
> 
> it sounds like you have auto-repair enabled (osd_scrub_auto_repair). I guess 
> you could disable that to see what's going on with the PGs and their 
> replicas. And/or you could enable debug logs. Are all daemons running the 
> same ceph (minor) version? I remember a customer case where different ceph 
> minor versions (but overall Octopus) caused damaged PGs, a repair fixed them 
> everytime. After they updated all daemons to the same minor version those 
> errors were gone.
> 
> Regards,
> Eugen
> 
> Zitat von Christian Theune :
> 
>> Hi,
>> 
>> this is a bit older cluster (Nautilus, bluestore only).
>> 
>> We’ve noticed that the cluster is almost continuously repairing PGs. 
>> However, they all finish successfully with “0 fixed”. We do not see the 
>> trigger why Ceph decides to repair the PGs and it’s happening for a lot of 
>> PGs, not any specific individual one.
>> 
>> Deep-scrubs are generally running, but currently a bit late as we had some 
>> recoveries in the last week.
>> 
>> Logs look regular aside from the number of repairs. Here’s the last weeks 
>> from the perspective of a single PG. There’s one repair, but the same thing 
>> seems to happen for all PGs.
>> 
>> 2023-08-06 16:08:17.870 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-06 16:08:18.270 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-07 21:52:22.299 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-07 21:52:22.711 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-09 00:33:42.587 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-09 00:33:43.049 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-10 09:36:00.590 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub starts
>> 2023-08-10 09:36:28.811 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub ok
>> 2023-08-11 12:59:14.219 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-11 12:59:14.567 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-12 13:52:44.073 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-12 13:52:44.483 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-14 01:51:04.774 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub starts
>> 2023-08-14 01:51:33.113 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub ok
>> 2023-08-15 05:18:16.093 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-15 05:18:16.520 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-16 09:47:38.520 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-16 09:47:38.930 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-17 19:25:45.352 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-17 19:25:45.775 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-19 05:40:43.663 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-19 05:40:44.073 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-20 12:06:54.343 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-20 12:06:54.809 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-21 19:23:10.801 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub starts
>> 2023-08-21 19:23:39.936 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub ok
>> 2023-08-23 03:43:21.391 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08-23 03:43:21.844 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub ok
>> 2023-08-24 04:21:17.004 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub starts
>> 2023-08-24 04:21:47.972 7fc49f1e6640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 deep-scrub ok
>> 2023-08-25 06:55:13.588 7fc49b1de640  0 log_channel(cluster) log [DBG] : 
>> 278.2f3 scrub starts
>> 2023-08

[ceph-users] Continuous spurious repairs without cause?

2023-09-05 Thread Christian Theune
) log [DBG] : 
278.2f3 scrub starts
2023-09-04 03:16:15.295 7f37ca268640  0 log_channel(cluster) log [DBG] : 
278.2f3 scrub ok
2023-09-05 14:50:36.064 7f37ca268640  0 log_channel(cluster) log [DBG] : 
278.2f3 repair starts
2023-09-05 14:51:04.407 7f37c6260640  0 log_channel(cluster) log [DBG] : 
278.2f3 repair ok, 0 fixed

Googling didn’t help, unfortunately and the bug tracker doesn’t appear to have 
any relevant issue either.

Any ideas?

Liebe Grüße,
Christian Theune

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can ceph-volume manage the LVs optionally used for DB / WAL at all?

2023-08-26 Thread Christian Rohmann

On 25.08.23 09:09, Eugen Block wrote:

I'm still not sure if we're on the same page.


Maybe not, I'll respond inline to clarify.




By looking at 
https://docs.ceph.com/en/latest/man/8/ceph-volume/#cmdoption-ceph-volume-lvm-prepare-block.db 
it seems that ceph-volume wants an LV or partition. So it's 
apparently not just taking a VG itself? Also if there were multiple 
VGs / devices , I likely would need to at least pick those.


ceph-volume creates all required VGs/LVs automatically, and the OSD 
creation happens in batch mode, for example when run by cephadm:

ceph-volume lvm batch --yes /dev/sdb /dev/sdc /dev/sdd

In a non-cephadm deployment you can fiddle with ceph-volume manually, 
where you also can deploy single OSDs, with or without providing your 
own pre-built VGs/LVs. In a cephadm deployment manually creating OSDs 
will result in "stray daemons not managed by cephadm" warnings.


1) I am mostly asking about a non-cephadm environment and would just 
like to know if ceph-volume can also manage the VG of a DB/WAL device 
that is used for multiple OSDs and create the individual LVs which are 
used for DB or WAL devices when creating a single OSD. Below you give an 
example "before we upgraded to Pacific" in which you run lvcreate 
manually. Is that not required anymore with >= Quincy?
2) Even with cephadm there is the "db_devices" as part of the 
drivegroups. But the question remains if cephadm can use a single 
db_device for multiple OSDs.



Before we upgraded to Pacific we did manage our block.db devices 
manually with pre-built LVs, e.g.:


$ lvcreate -L 30G -n bluefsdb-30 ceph-journals
$ ceph-volume lvm create --data /dev/sdh --block.db 
ceph-journals/bluefsdb-30


As asked and explained in the paragraph above, this is what I am 
currently doing (lvcreate + ceph-volume lvm create). My question 
therefore is whether ceph-volume (!) could somehow create this LV for the DB 
automagically if I just gave it a device (or an existing VG)?
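
To illustrate, I am hoping for something along the lines of the batch 
mode, which to my understanding builds the DB LVs on the fast device by 
itself (device names are just examples):

 ceph-volume lvm batch --yes /dev/sda /dev/sdb /dev/sdc --db-devices /dev/nvme0n1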



Thank you very much for your patience in clarifying and responding to my 
questions.

Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can ceph-volume manage the LVs optionally used for DB / WAL at all?

2023-08-25 Thread Christian Rohmann

On 11.08.23 16:06, Eugen Block wrote:
if you deploy OSDs from scratch you don't have to create LVs manually, 
that is handled entirely by ceph-volume (for example on cephadm based 
clusters you only provide a drivegroup definition). 


By looking at 
https://docs.ceph.com/en/latest/man/8/ceph-volume/#cmdoption-ceph-volume-lvm-prepare-block.db 
it seems that ceph-volume wants an LV or partition. So it's apparently 
not just taking a VG itself? Also if there were multiple VGs / devices , 
I likely would need to at least pick those.


But I suppose this orchestration would then require cephadm 
(https://docs.ceph.com/en/latest/cephadm/services/osd/#drivegroups) and 
cannot be done via ceph-volume which merely takes care of ONE OSD at a time.



I'm not sure if automating db/wal migration has been considered, it 
might be (too) difficult. But moving the db/wal devices to 
new/different devices doesn't seem to be a reoccuring issue (corner 
case?), so maybe having control over that process for each OSD 
individually is the safe(r) option in case something goes wrong. 


Sorry for the confusion. I was not talking about any migrations, just 
the initial creation of spinning rust OSDs with DB or WAL on fast storage.



Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] When to use the auth profiles simple-rados-client and profile simple-rados-client-with-blocklist?

2023-08-22 Thread Christian Rohmann

Hey ceph-users,

1) When configuring Gnocchi to use Ceph storage (see 
https://gnocchi.osci.io/install.html#ceph-requirements)

I was wondering if one could use any of the auth profiles like
 * simple-rados-client
 * simple-rados-client-with-blocklist ?

Or are those for different use cases?

2) I was also wondering why the documentation mentions "(Monitor only)" 
but then it says

"Gives a user read-only permissions for monitor, OSD, and PG data."?

3) And are those profiles really for "read-only" users? Why don't they 
have "read-only" in their name like the rbd and the corresponding 
"rbd-read-only" profile?

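To make (1) concrete, the kind of cap assignment I am referring to would be 
something like this (the client name and pool are just examples, not a 
recommendation):

  ceph auth get-or-create client.gnocchi \
      mon 'profile simple-rados-client' \
      osd 'allow rwx pool=gnocchi'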


Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Can ceph-volume manage the LVs optionally used for DB / WAL at all?

2023-08-11 Thread Christian Rohmann

Hey ceph-users,

I was wondering if ceph-volume did anything in regards to the management 
(creation, setting metadata, ) of LVs which are used for

DB / WAL of an OSD?

Reading the documentation at 
https://docs.ceph.com/en/latest/man/8/ceph-volume/#new-db it seems to 
indicate that the LV to be used as e.g. DB needs to be created manually 
(without ceph-volume) and exist prior to using ceph-volume to move the 
DB to that LV? I suppose the same is true for "ceph-volume lvm create" 
or "ceph-volume lvm prepare" and "--block.db"
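
To illustrate what I mean, the manual flow I am comparing against would be 
something like this (device, VG and LV names are made up):

  $ vgcreate ceph-db /dev/nvme0n1
  $ lvcreate -L 50G -n db-osd-foo ceph-db
  $ ceph-volume lvm prepare --data /dev/sdh --block.db ceph-db/db-osd-foo

versus just handing ceph-volume the raw fast device and letting it carve up 
the VG/LVs itself, e.g. via the batch subcommand (if that is indeed what it 
does):

  $ ceph-volume lvm batch --yes /dev/sdh /dev/sdi --db-devices /dev/nvme0n1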


It's not that creating a few LVs is hard... it's just that ceph-volume 
does apply some structure to the naming of LVM VGs and LVs on the OSD 
device and also adds metadata. That would then be up to the user, right?




Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm new-db fails

2023-08-11 Thread Christian Rohmann

On 10/08/2023 13:30, Christian Rohmann wrote:

It's already fixed master, but the backports are all still pending ...


There are PRs for the backports now:

* https://tracker.ceph.com/issues/62060
* https://tracker.ceph.com/issues/62061
* https://tracker.ceph.com/issues/62062



Regards

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm new-db fails

2023-08-10 Thread Christian Rohmann



On 11/05/2022 23:21, Joost Nieuwenhuijse wrote:
After a reboot the OSD turned out to be corrupt. Not sure if 
ceph-volume lvm new-db caused the problem, or failed because of 
another problem.



I just ran into the same issue trying to add a db to an existing OSD.
Apparently this is a known bug: https://tracker.ceph.com/issues/55260

It's already fixed master, but the backports are all still pending ...



Regards

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Not all Bucket Shards being used

2023-08-02 Thread Christian Kugler
> Thank you for the information, Christian. When you reshard the bucket id is 
> updated (with most recent versions of ceph, a generation number is 
> incremented). The first bucket id matches the bucket marker, but after the 
> first reshard they diverge.

This makes a lot of sense and explains why the large omap objects do
not go away. It is the old shards that are too big.

> The bucket id is in the names of the currently used bucket index shards. 
> You’re searching for the marker, which means you’re finding older bucket 
> index shards.
>
> Change your commands to these:
>
> # rados -p raum.rgw.buckets.index ls \
>|grep 3caabb9a-4e3b-4b8a-8222-34c33dd63210.10648356.1 \
>|sort -V
>
> # rados -p raum.rgw.buckets.index ls \
>|grep 3caabb9a-4e3b-4b8a-8222-34c33dd63210.10648356.1 \
>|sort -V \
>|xargs -IOMAP sh -c \
>'rados -p raum.rgw.buckets.index listomapkeys OMAP | wc -l'

I don't think the outputs are very interesting here. They are as expected:
- 131 lines of rados objects (omap)
- each omap contains about 70k keys (below the 100k limit).

> When you refer to the “second zone”, what do you mean? Is this cluster using 
> multisite? If and only if your answer is “no”, then it’s safe to remove old 
> bucket index shards. Depending on the version of ceph running when reshard 
> was run, they were either intentionally left behind (earlier behavior) or 
> removed automatically (later behavior).

Yes, this cluster uses multisite. It is one realm, one zonegroup with
two zones (bidirectional sync).
Ceph warns about resharding on the non-metadata zone. So I did not do
that and only resharded on the metadata zone.
The resharding was done using a radosgw-admin v16.2.6 on a ceph
cluster running v17.2.5.
Is there a way to get rid of the old (big) shards without breaking something?

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Not all Bucket Shards being used

2023-07-25 Thread Christian Kugler
Hi Eric,

> 1. I recommend that you *not* issue another bucket reshard until you figure 
> out what’s going on.

Thanks, noted!

> 2. Which version of Ceph are you using?
17.2.5
I wanted to get the Cluster to Health OK before upgrading. I didn't
see anything that led me to believe that an upgrade could fix the
reshard issue.

> 3. Can you issue a `radosgw-admin metadata get bucket:` so we 
> can verify what the current marker is?

# radosgw-admin metadata get bucket:sql20
{
    "key": "bucket:sql20",
    "ver": {
        "tag": "_hGhtgzjcWY9rO9JP7YlWzt8",
        "ver": 3
    },
    "mtime": "2023-07-12T15:56:55.226784Z",
    "data": {
        "bucket": {
            "name": "sql20",
            "marker": "3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9",
            "bucket_id": "3caabb9a-4e3b-4b8a-8222-34c33dd63210.10648356.1",
            "tenant": "",
            "explicit_placement": {
                "data_pool": "",
                "data_extra_pool": "",
                "index_pool": ""
            }
        },
        "owner": "S3user",
        "creation_time": "2023-04-26T09:22:01.681646Z",
        "linked": "true",
        "has_bucket_info": "false"
    }
}

> 4. After you resharded previously, did you get command-line output along the 
> lines of:
> 2023-07-24T13:33:50.867-0400 7f10359f2a80 1 execute INFO: reshard of bucket 
> “" completed successfully

I think so, at least for the second reshard. But I wouldn't bet my
life on it. I fear I might have missed an error on the first one since
I have done a radosgw-admin bucket reshard so often and never seen it
fail.

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Not all Bucket Shards being used

2023-07-18 Thread Christian Kugler
Hi,

I have trouble with large OMAP files in a cluster in the RGW index pool. Some
background information about the cluster: There is CephFS and RBD usage on the
main cluster but for this issue I think only S3 is interesting.
There is one realm, one zonegroup with two zones which have a bidirectional sync
set up. Since this does not allow for autoresharding we have to do it by hand in
this cluster – looking forward to Reef!

From the logs:
cluster 2023-07-17T22:59:03.018722+ osd.75 (osd.75) 623978 :
cluster [WRN] Large omap object found. Object:
34:bcec3016:::.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.5:head
PG: 34.680c373d (34.5) Key count: 962091 Size (bytes): 277963182

The offending bucket looks like this:
# radosgw-admin bucket stats \
| jq '.[] | select(.marker == "3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9")
      | "\(.num_shards) \(.usage["rgw.main"].num_objects)"' -r
131 9463833

Last week the number of objects was about 12 million. Which is why I resharded
the offending bucket twice, I think. Once to 129 and the second time to 131
because I wanted some leeway (or lieway? scnr, Sage).

Unfortunately, even after a week the objects were still too big (the log line
above is quite recent), so I looked into it again.

# rados -p raum.rgw.buckets.index ls \
|grep .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9 \
|sort -V
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.0
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.1
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.2
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.3
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.4
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.5
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.6
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.7
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.8
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.9
.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.10
# rados -p raum.rgw.buckets.index ls \
|grep .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9 \
|sort -V \
|xargs -IOMAP sh -c \
'rados -p raum.rgw.buckets.index listomapkeys OMAP | wc -l'
1013854
1011007
1012287
1011232
1013565
998262
1012777
1012713
1012230
1010690
997111

Apparently, only 11 shards are in use. This would explain why the "Key count"
(from the log line) is about ten times higher than I would expect.

How can I deal with this issue?
One thing I could try to fix this would be to reshard to a lower number, but I
am not sure if there are any risks associated with "downsharding". After that I
could reshard to something like 97. Or I could directly "downshard" to 97.

Also, the second zone has a similar problem, but as the error message lets me
know, resharding there would be a bad idea. Will it just take more time until the sharding
is transferred to the second zone?

Best,
Christian Kugler
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-16 Thread Christian Wuerdig
Based on my understanding of CRUSH, it basically works down the hierarchy
and then randomly (but deterministically for a given CRUSH map) picks
buckets (based on the specific selection rule) on that level for the object,
and then it does this recursively until it ends up at the leaf nodes.
Given that you introduced a whole hierarchy level just below the top,
objects will now be distributed differently since the pseudo-random,
hash-based selection strategy may now, for example, put an object that used
to be in node-4 under FSN-DC16 instead.
So basically, when you fiddle with the hierarchy you can generally expect
lots of data movement everywhere downstream of your change.
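For illustration, the kind of rule that then uses the new level as the
failure domain would look roughly like this in a decompiled CRUSH map (just
a sketch, not taken from the cluster in question):

rule replicated_per_datacenter {
    id 2
    type replicated
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}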

On Sun, 16 Jul 2023 at 06:03, Niklas Hambüchen  wrote:

> Hi Ceph users,
>
> I have a Ceph 16.2.7 cluster that so far has been replicated over the
> `host` failure domain.
> All `hosts` have been chosen to be in different `datacenter`s, so that was
> sufficient.
>
> Now I wish to add more hosts, including some in already-used data centers,
> so I'm planning to use CRUSH's `datacenter` failure domain instead.
>
> My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph
> decides that it should now rebalance the entire cluster.
> This seems unnecessary, and wrong.
>
> Before, `ceph osd tree` (some OSDs omitted for legibility):
>
>
>  ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
> PRI-AFF
>   -1 440.73514  root default
>   -3 146.43625  host node-4
>2hdd   14.61089  osd.2up   1.0
> 1.0
>3hdd   14.61089  osd.3up   1.0
> 1.0
>   -7 146.43625  host node-5
>   14hdd   14.61089  osd.14   up   1.0
> 1.0
>   15hdd   14.61089  osd.15   up   1.0
> 1.0
>  -10 146.43625  host node-6
>   26hdd   14.61089  osd.26   up   1.0
> 1.0
>   27hdd   14.61089  osd.27   up   1.0
> 1.0
>
>
> After assigning of `datacenter` crush buckets:
>
>
>  ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
> PRI-AFF
>   -1 440.73514  root default
>  -18 146.43625  datacenter FSN-DC16
>   -7 146.43625  host node-5
>   14hdd   14.61089  osd.14   up   1.0
> 1.0
>   15hdd   14.61089  osd.15   up   1.0
> 1.0
>  -17 146.43625  datacenter FSN-DC18
>  -10 146.43625  host node-6
>   26hdd   14.61089  osd.26   up   1.0
> 1.0
>   27hdd   14.61089  osd.27   up   1.0
> 1.0
>  -16 146.43625  datacenter FSN-DC4
>   -3 146.43625  host node-4
>2hdd   14.61089  osd.2up   1.0
> 1.0
>3hdd   14.61089  osd.3up   1.0
> 1.0
>
>
> This shows that the tree is essentially unchanged, it just "gained a
> level".
>
> In `ceph status` I now get:
>
>  pgs: 1167541260/1595506041 objects misplaced (73.177%)
>
> If I remove the `datacenter` level again, then the misplacement disappears.
>
> On a minimal testing cluster, this misplacement issue did not appear.
>
> Why does Ceph think that these objects are misplaced when I add the
> datacenter level?
> Is there a more correct way to do this?
>
>
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW accessing real source IP address of a client (e.g. in S3 bucket policies)

2023-07-06 Thread Christian Rohmann

Hey Casey, all,

On 16/06/2023 17:00, Casey Bodley wrote:



But when applying a bucket policy with aws:SourceIp it seems to only work if I 
set the internal IP of the HAProxy instance, not the public IP of the client.
So the actual remote address is NOT used in my case.


Did I miss any config setting anywhere?


your 'rgw remote addr param' config looks right. with that same
config, i was able to set a bucket policy that denied access based on


I found the issue. Embarrassingly it was simply a NAT-Hairpin which was 
applied to the traffic from the server I was testing with.
In short: Even though I targeted the public IP from the HAProxy instance 
the internal IP address of my test server was maintained as source since 
both machines are on the same network segment.
That is why I first thought the LB IP was applied to the policy, but not 
the actual public source IP of the client. In reality it was simply the 
private, RFC1918, IP of the test machine that came in as source.




Sorry for the noise and thanks for your help.

Christian


P.S. With IPv6, this would not have happened.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW multisite logs (data, md, bilog) not being trimmed automatically?

2023-06-29 Thread Christian Rohmann
There was a similar issue reported at 
https://tracker.ceph.com/issues/48103 and yet another ML post at

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/5LGXQINAJBIGFUZP5WEINVHNPBJEV5X7

May I second the question whether it's safe to run radosgw-admin autotrim on 
those logs?
If so, why is that required and why does there seem to be no periodic trimming 
happening?




Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore compression - Which algo to choose? Zstd really still that bad?

2023-06-27 Thread Christian Rohmann

Hey Igor,

On 27/06/2023 12:06, Igor Fedotov wrote:
I can't say anything about your primary question on zstd 
benefits/drawbacks but I'd like to emphasize that compression ratio at 
BlueStore is (to a major degree) determined by the input data flow 
characteristics (primarily write block size), object store allocation 
unit size (bluestore_min_alloc_size) and some parameters (e.g. maximum 
blob size) that determine how input data chunks are logically split 
when landing on disk.
E.g. if one has min_alloc_size set to 4K and write block size is in 
(4K-8K] then resulting compressed block would never be less than 4K. 
Hence compression ratio is never more than 2.
Similarly if min_alloc_size is 64K there would be no benefit in 
compression at all for the above input since target allocation units 
are always larger than input blocks.
The rationale of the above behavior is that compression is applied 
exclusively on input blocks - there is no additional processing to 
merge input and existing data and compress them all together.



Thanks for the emphasis on input data and its block size. Yes, that is 
certainly the most important factor for the compression efficiency and 
the choice of a suitable algorithm for a certain use case.
In my case the pool is RBD only, so (by default) the blocks are 4M if I 
am not mistaken. I also understand that even though larger blocks 
generally compress better, there is no relation between 
different blocks in regard to compression dictionaries (along the 
lines of de-duplication). In the end, in my use case it boils down to the 
type of data stored on the RBD images and how compressible that might be.
But those blocks are only written once, and I am ready to invest 
more CPU cycles to reduce the size on disk.


I am simply looking for data others might have collected on their similar 
use cases.
Also I am still wondering if there really is nobody who has worked/played 
more with zstd since it has become so popular in recent months...



Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW multisite logs (data, md, bilog) not being trimmed automatically?

2023-06-26 Thread Christian Rohmann

Hey ceph-users,

I am running two (now) Quincy clusters doing RGW multi-site replication 
with only one actually being written to by clients.

The other site is intended simply as a remote copy.

On the primary cluster I am observing an ever growing (objects and 
bytes) "sitea.rgw.log" pool, not so on the remote "siteb.rgw.log" which 
is only 300MiB and around 15k objects with no growth.
Metrics show that the growth of the pool on the primary is linear for at least 6 
months, so no sudden spikes or anything. Also the sync status appears to be 
totally happy.

There are also no warnings in regards to large OMAPs or anything similar.

I was under the impression that RGW will trim its three logs (md, bi, 
data) automatically and only keep data that has not yet been replicated 
by the other zonegroup members?
The config option rgw_sync_log_trim_interval ("ceph config get mgr rgw_sync_log_trim_interval") is 
set to 1200, so 20 minutes.


So I am wondering if there might be some inconsistency or how I can best 
analyze what the cause for the accumulation of log data is?
There are older questions on the ML, such as [1], but there was not 
really a solution or root cause identified.


I know there is manual trimming, but I'd rather want to analyze the 
current situation and figure out what the cause for the lack of 
auto-trimming is.



  * Do I need to go through all buckets and count logs and look at 
their timestamps? Which queries do make sense here?
  * Is there usually any logging of the log trimming activity that I 
should expect? Or that might indicate why trimming does not happen?
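
For reference, the kind of queries I have in mind are just the standard 
radosgw-admin views of the three logs, e.g. (bucket name is a placeholder):

  radosgw-admin sync status
  radosgw-admin datalog status
  radosgw-admin mdlog status
  radosgw-admin bilog list --bucket <bucket>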



Regards

Christian


[1] 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/WZCFOAMLWV3XCGJ3TVLHGMJFVYNZNKLD/




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Radogw ignoring HTTP_X_FORWARDED_FOR header

2023-06-26 Thread Christian Rohmann

Hello Yosr,

On 26/06/2023 11:41, Yosr Kchaou wrote:

We are facing an issue with getting the right value for the header
HTTP_X_FORWARDED_FOR when getting client requests. We need this value to do
the source ip check validation.

[...]

Currently, RGW sees that all requests come from 127.0.0.1. So it is still
considering the nginx ip address and not the client who made the request.
May I point you to my recent post to this ML about this very question: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IKGLAROSVWHSRZQSYTLLHVRWFPOLBEGL/


I am still planning to reproduce this issue with simple examples and 
headers set manually via e.g. curl to rule out anything stupid I might 
have misconfigured in my case. I just did not find the time yet.


But did you sniff any traffic to the backend or verify how the headers 
look in your case? Any debug logging ("debug rgw = 20") where you can 
see what RGW thinks of the incoming request?
Did you test with S3 bucket policies, or how did you come to the 
conclusion that RGW is not using the X_FORWARDED_FOR header? Or what is 
your indication that things are not working as expected?


From what I can see, the rgw client log does NOT print the external IP 
from the header, but the source IP of the incoming TCP connection:


    2023-06-26T11:14:37.070+ 7f0389e0b700  1 beast: 0x7f051c776660: 192.168.1.1 - someid [26/Jun/2023:11:14:36.990 +] "PUT /bucket/object HTTP/1.1" 200 43248 - "aws-sdk-go/1.27.0 (go1.16.15; linux; amd64) S3Manager" - latency=0.07469s



while the rgw ops log does indeed print the remote_address in remote_addr:

{
  "bucket": "bucket",
  "time": "2023-06-26T11:16:08.721465Z",
  "time_local": "2023-06-26T11:16:08.721465+",
  "remote_addr": "xxx.xxx.xxx.xxx",
  "user": "someuser",
  "operation": "put_obj",
  "uri": "PUT /bucket/object HTTP/1.1",
  "http_status": "200",
  "error_code": "",
  "bytes_sent": 0,
  "bytes_received": 64413,
  "object_size": 64413,
  "total_time": 155,
  "user_agent": "aws-sdk-go/1.27.0 (go1.16.15; linux; amd64) S3Manager",
  "referrer": "",
  "trans_id": "REDACTED",
  "authentication_type": "Keystone",
  "access_key_id": "REDACTED",
  "temp_url": false
}



So in my case it's not that RGW does not receive and log this info, but 
more that it is not applying it in a bucket policy (as far as my 
analysis of the issue goes).




Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bluestore compression - Which algo to choose? Zstd really still that bad?

2023-06-26 Thread Christian Rohmann

Hey ceph-users,

we've been using the default "snappy" to have Ceph compress data on 
certain pools - namely backups / copies of volumes of a VM environment.

So it's write once, and no random access.
I am now wondering if switching to another algo (there is snappy, zlib, 
lz4, or zstd) would improve the compression ratio (significantly)?
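
For context, the switch itself would just be the per-pool settings, roughly 
like this (pool name as an example):

  ceph osd pool set backups compression_algorithm zstd
  ceph osd pool set backups compression_mode aggressive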


* Does anybody have any real world data on snappy vs. $anyother?

Using zstd is tempting as it's used in various other applications 
(btrfs, MongoDB, ...) for inline-compression with great success.
For Ceph though there is still a warning ([1]) in the docs about it not being 
recommended. But I am wondering if this still stands with e.g. [2] 
merged.
And there was [3] trying to improve the performance, but this reads as 
if it only led to a dead end and no code changes?



In any case does anybody have any numbers to help with the decision on 
the compression algo?




Regards


Christian


[1] 
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#confval-bluestore_compression_algorithm

[2] https://github.com/ceph/ceph/pull/33790
[3] https://github.com/facebook/zstd/issues/910
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph quincy repo update to debian bookworm...?

2023-06-22 Thread Christian Peters

Hi ceph users/maintainers,

I installed ceph quincy on debian bullseye as a ceph client and now want 
to update to bookworm.

I see that there is at the moment only bullseye supported.

https://download.ceph.com/debian-quincy/dists/bullseye/

Will there be an update of

deb https://download.ceph.com/debian-quincy/ bullseye main

to

deb https://download.ceph.com/debian-quincy/ bookworm main

in the near future!?

Regards,

Christian



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

2023-06-21 Thread Christian Theune
Aaaand another dead end: there is too much meta-data involved (bucket and 
object ACLs, lifecycle, policy, …) for a perfect migration to be possible. 
Also, lifecycles _might_ be affected if mtimes change.

So, I’m going to try and go back to a single-cluster multi-zone setup. For that 
I’m going to change all buckets with explicit placements to remove the explicit 
placement markers (those were created from old versions of Ceph and weren’t 
intentional by us, they perfectly reflect the default placement configuration).

Here’s the patch I’m going to try on top of our Nautilus branch now:
https://github.com/flyingcircusio/ceph/commit/b3a317987e50f089efc4e9694cf6e3d5d9c23bd5

All our buckets with explicit placements conform perfectly to the default 
placement, so this seems safe.

Otherwise Zone migration was perfect until I noticed the objects with explicit 
placements in our staging and production clusters. (The dev cluster seems to 
have been purged intermediately, so this wasn’t noticed).

I’m actually wondering whether explicit placements are really a sensible thing 
to have, even in multi-cluster multi-zone setups. AFAICT due to realms you 
might end up with different zonegroups referring to the same pools and this 
should only run through proper abstractions … o_O

Cheers,
Christian

> On 14. Jun 2023, at 17:42, Christian Theune  wrote:
> 
> Hi,
> 
> further note to self and for posterity … ;)
> 
> This turned out to be a no-go as well, because you can’t silently switch the 
> pools to a different storage class: the objects will be found, but the index 
> still refers to the old storage class and lifecycle migrations won’t work.
> 
> I’ve brainstormed for further options and it appears that the last resort is 
> to use placement targets and copy the buckets explicitly - twice, because on 
> Nautilus I don’t have renames available, yet. :( 
> 
> This will require temporary downtimes prohibiting users to access their 
> bucket. Fortunately we only have a few very large buckets (200T+) that will 
> take a while to copy. We can pre-sync them of course, so the downtime will 
> only be during the second copy.
> 
> Christian
> 
>> On 13. Jun 2023, at 14:52, Christian Theune  wrote:
>> 
>> Following up to myself and for posterity:
>> 
>> I’m going to try to perform a switch here using (temporary) storage classes 
>> and renaming of the pools to ensure that I can quickly change the STANDARD 
>> class to a better EC pool and have new objects located there. After that 
>> we’ll add (temporary) lifecycle rules to all buckets to ensure their objects 
>> will be migrated to the STANDARD class.
>> 
>> Once that is finished we should be able to delete the old pool and the 
>> temporary storage class.
>> 
>> First tests appear successfull, but I’m a bit struggling to get the bucket 
>> rules working (apparently 0 days isn’t a real rule … and the debug interval 
>> setting causes high frequent LC runs but doesn’t seem move objects just yet. 
>> I’ll play around with that setting a bit more, though, I think I might have 
>> tripped something that only wants to process objects every so often and on 
>> an interval of 10 a day is still 2.4 hours … 
>> 
>> Cheers,
>> Christian
>> 
>>> On 9. Jun 2023, at 11:16, Christian Theune  wrote:
>>> 
>>> Hi,
>>> 
>>> we are running a cluster that has been alive for a long time and we tread 
>>> carefully regarding updates. We are still a bit lagging and our cluster 
>>> (that started around Firefly) is currently at Nautilus. We’re updating and 
>>> we know we’re still behind, but we do keep running into challenges along 
>>> the way that typically are still unfixed on main and - as I started with - 
>>> have to tread carefully.
>>> 
>>> Nevertheless, mistakes happen, and we found ourselves in this situation: we 
>>> converted our RGW data pool from replicated (n=3) to erasure coded (k=10, 
>>> m=3, with 17 hosts) but when doing the EC profile selection we missed that 
>>> our hosts are not evenly balanced (this is a growing cluster and some 
>>> machines have around 20TiB capacity for the RGW data pool, wheres newer 
>>> machines have around 160TiB and we rather should have gone with k=4, m=3.  
>>> In any case, having 13 chunks causes too many hosts to participate in each 
>>> object. Going for k+m=7 will allow distribution to be more effective as we 
>>> have 7 hosts that have the 160TiB sizing.
>>> 
>>> Our original migration used the “cache tiering” approach, but that only 
>>> works once when moving from replicated to EC and can not be used for 
>>> further

[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

2023-06-16 Thread Christian Theune
What got lost is that I need to change the pool’s m/k parameters, which is only 
possible by creating a new pool and moving all data from the old pool. Changing 
the crush rule doesn’t allow you to do that. 

> On 16. Jun 2023, at 23:32, Nino Kotur  wrote:
> 
> If you create new crush rule for ssd/nvme/hdd and attach it to existing pool 
> you should be able to do the migration seamlessly while everything is 
> online... However impact to user will depend on storage devices load and 
> network utilization as it will create chaos on cluster network.
> 
> Or did i get something wrong?
> 
> 
> 
>  
> Kind regards,
> Nino
> 
> 
> On Wed, Jun 14, 2023 at 5:44 PM Christian Theune  wrote:
> Hi,
> 
> further note to self and for posterity … ;)
> 
> This turned out to be a no-go as well, because you can’t silently switch the 
> pools to a different storage class: the objects will be found, but the index 
> still refers to the old storage class and lifecycle migrations won’t work.
> 
> I’ve brainstormed for further options and it appears that the last resort is 
> to use placement targets and copy the buckets explicitly - twice, because on 
> Nautilus I don’t have renames available, yet. :( 
> 
> This will require temporary downtimes prohibiting users to access their 
> bucket. Fortunately we only have a few very large buckets (200T+) that will 
> take a while to copy. We can pre-sync them of course, so the downtime will 
> only be during the second copy.
> 
> Christian
> 
> > On 13. Jun 2023, at 14:52, Christian Theune  wrote:
> > 
> > Following up to myself and for posterity:
> > 
> > I’m going to try to perform a switch here using (temporary) storage classes 
> > and renaming of the pools to ensure that I can quickly change the STANDARD 
> > class to a better EC pool and have new objects located there. After that 
> > we’ll add (temporary) lifecycle rules to all buckets to ensure their 
> > objects will be migrated to the STANDARD class.
> > 
> > Once that is finished we should be able to delete the old pool and the 
> > temporary storage class.
> > 
> > First tests appear successfull, but I’m a bit struggling to get the bucket 
> > rules working (apparently 0 days isn’t a real rule … and the debug interval 
> > setting causes high frequent LC runs but doesn’t seem move objects just 
> > yet. I’ll play around with that setting a bit more, though, I think I might 
> > have tripped something that only wants to process objects every so often 
> > and on an interval of 10 a day is still 2.4 hours … 
> > 
> > Cheers,
> > Christian
> > 
> >> On 9. Jun 2023, at 11:16, Christian Theune  wrote:
> >> 
> >> Hi,
> >> 
> >> we are running a cluster that has been alive for a long time and we tread 
> >> carefully regarding updates. We are still a bit lagging and our cluster 
> >> (that started around Firefly) is currently at Nautilus. We’re updating and 
> >> we know we’re still behind, but we do keep running into challenges along 
> >> the way that typically are still unfixed on main and - as I started with - 
> >> have to tread carefully.
> >> 
> >> Nevertheless, mistakes happen, and we found ourselves in this situation: 
> >> we converted our RGW data pool from replicated (n=3) to erasure coded 
> >> (k=10, m=3, with 17 hosts) but when doing the EC profile selection we 
> >> missed that our hosts are not evenly balanced (this is a growing cluster 
> >> and some machines have around 20TiB capacity for the RGW data pool, wheres 
> >> newer machines have around 160TiB and we rather should have gone with k=4, 
> >> m=3.  In any case, having 13 chunks causes too many hosts to participate 
> >> in each object. Going for k+m=7 will allow distribution to be more 
> >> effective as we have 7 hosts that have the 160TiB sizing.
> >> 
> >> Our original migration used the “cache tiering” approach, but that only 
> >> works once when moving from replicated to EC and can not be used for 
> >> further migrations.
> >> 
> >> The amount of data is at 215TiB somewhat significant, so using an approach 
> >> that scales when copying data[1] to avoid ending up with months of 
> >> migration.
> >> 
> >> I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on 
> >> a rados/pool level) and I guess we can only fix this on an application 
> >> level using multi-zone replication.
> >> 
> >> I have the setup nailed in general, but I’m running into issues with 
> >> buckets in ou

[ceph-users] Re: RGW accessing real source IP address of a client (e.g. in S3 bucket policies)

2023-06-16 Thread Christian Rohmann

On 15/06/2023 15:46, Casey Bodley wrote:

   * In case of HTTP via headers like "X-Forwarded-For". This is
apparently supported only for logging the source in the "rgw ops log" ([1])?
Or is this info used also when evaluating the source IP condition within
a bucket policy?

yes, the aws:SourceIp condition key does use the value from
X-Forwarded-For when present


I have an HAProxy in front of the RGWs which has

"option forwardfor" set  to add the "X-Forwarded-For" header.

Then the RGWs have  "rgw remote addr param = http_x_forwarded_for" set,
according to 
https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_remote_addr_param


and I also see remote_addr properly logged within the rgw ops log.



But when applying a bucket policy with aws:SourceIp it seems to only 
work if I set the internal IP of the HAProxy instance, not the public IP 
of the client.

So the actual remote address is NOT used in my case.
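
For reference, the kind of policy I am testing with looks roughly like this 
(bucket name and network are placeholders):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::mybucket/*",
      "Condition": {
        "IpAddress": {"aws:SourceIp": "203.0.113.0/24"}
      }
    }
  ]
}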


Did I miss any config setting anywhere?




Regards and thanks for your help


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW accessing real source IP address of a client (e.g. in S3 bucket policies)

2023-06-15 Thread Christian Rohmann

Hello Ceph-Users,

Context or motivation of my question is S3 bucket policies and other 
cases using the source IP address as a condition.


I was wondering if and how RadosGW is able to access the source IP 
address of clients if receiving their connections via a loadbalancer / 
reverse proxy like HAProxy.
So naturally that is where the connection originates from in that case, 
rendering a policy based on IP addresses useless.


Depending on whether the connection is balanced as HTTP or TCP, there are 
two ways to carry information about the actual source:


 * In case of HTTP via headers like "X-Forwarded-For". This is 
apparently supported only for logging the source in the "rgw ops log" ([1])?
Or is this info used also when evaluating the source IP condition within 
a bucket policy?


 * In case of TCP loadbalancing, there is the proxy protocol v2. This 
unfortunately seems not even supported by the BEAST library which RGW uses.

    I opened feature requests ...

     ** https://tracker.ceph.com/issues/59422
     ** https://github.com/chriskohlhoff/asio/issues/1091
     ** https://github.com/boostorg/beast/issues/2484

   but there is no outcome yet.


Regards


Christian


[1] 
https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_remote_addr_param

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

2023-06-14 Thread Christian Theune
Hi,

further note to self and for posterity … ;)

This turned out to be a no-go as well, because you can’t silently switch the 
pools to a different storage class: the objects will be found, but the index 
still refers to the old storage class and lifecycle migrations won’t work.

I’ve brainstormed for further options and it appears that the last resort is to 
use placement targets and copy the buckets explicitly - twice, because on 
Nautilus I don’t have renames available, yet. :( 

This will require temporary downtimes prohibiting users to access their bucket. 
Fortunately we only have a few very large buckets (200T+) that will take a 
while to copy. We can pre-sync them of course, so the downtime will only be 
during the second copy.

Christian

> On 13. Jun 2023, at 14:52, Christian Theune  wrote:
> 
> Following up to myself and for posterity:
> 
> I’m going to try to perform a switch here using (temporary) storage classes 
> and renaming of the pools to ensure that I can quickly change the STANDARD 
> class to a better EC pool and have new objects located there. After that 
> we’ll add (temporary) lifecycle rules to all buckets to ensure their objects 
> will be migrated to the STANDARD class.
> 
> Once that is finished we should be able to delete the old pool and the 
> temporary storage class.
> 
> First tests appear successfull, but I’m a bit struggling to get the bucket 
> rules working (apparently 0 days isn’t a real rule … and the debug interval 
> setting causes high frequent LC runs but doesn’t seem move objects just yet. 
> I’ll play around with that setting a bit more, though, I think I might have 
> tripped something that only wants to process objects every so often and on an 
> interval of 10 a day is still 2.4 hours … 
> 
> Cheers,
> Christian
> 
>> On 9. Jun 2023, at 11:16, Christian Theune  wrote:
>> 
>> Hi,
>> 
>> we are running a cluster that has been alive for a long time and we tread 
>> carefully regarding updates. We are still a bit lagging and our cluster 
>> (that started around Firefly) is currently at Nautilus. We’re updating and 
>> we know we’re still behind, but we do keep running into challenges along the 
>> way that typically are still unfixed on main and - as I started with - have 
>> to tread carefully.
>> 
>> Nevertheless, mistakes happen, and we found ourselves in this situation: we 
>> converted our RGW data pool from replicated (n=3) to erasure coded (k=10, 
>> m=3, with 17 hosts) but when doing the EC profile selection we missed that 
>> our hosts are not evenly balanced (this is a growing cluster and some 
>> machines have around 20TiB capacity for the RGW data pool, wheres newer 
>> machines have around 160TiB and we rather should have gone with k=4, m=3.  
>> In any case, having 13 chunks causes too many hosts to participate in each 
>> object. Going for k+m=7 will allow distribution to be more effective as we 
>> have 7 hosts that have the 160TiB sizing.
>> 
>> Our original migration used the “cache tiering” approach, but that only 
>> works once when moving from replicated to EC and can not be used for further 
>> migrations.
>> 
>> The amount of data is at 215TiB somewhat significant, so using an approach 
>> that scales when copying data[1] to avoid ending up with months of migration.
>> 
>> I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a 
>> rados/pool level) and I guess we can only fix this on an application level 
>> using multi-zone replication.
>> 
>> I have the setup nailed in general, but I’m running into issues with buckets 
>> in our staging and production environment that have `explicit_placement` 
>> pools attached, AFAICT is this an outdated mechanisms but there are no 
>> migration tools around. I’ve seen some people talk about patched versions of 
>> the `radosgw-admin metadata put` variant that (still) prohibits removing 
>> explicit placements.
>> 
>> AFAICT those explicit placements will be synced to the secondary zone and 
>> the effect that I’m seeing underpins that theory: the sync runs for a while 
>> and only a few hundred objects show up in the new zone, as the 
>> buckets/objects are already found in the old pool that the new zone uses due 
>> to the explicit placement rule.
>> 
>> I’m currently running out of ideas, but open for any other options.
>> 
>> Looking at 
>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z
>>  I’m wondering whether the relevant patch is available somewhere, or whether 
>> I’ll have to try building that patch again on my own.
&

[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

2023-06-13 Thread Christian Theune
Following up to myself and for posterity:

I’m going to try to perform a switch here using (temporary) storage classes and 
renaming of the pools to ensure that I can quickly change the STANDARD class to 
a better EC pool and have new objects located there. After that we’ll add 
(temporary) lifecycle rules to all buckets to ensure their objects will be 
migrated to the STANDARD class.

Once that is finished we should be able to delete the old pool and the 
temporary storage class.
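
For posterity, the storage-class plumbing for this is roughly the following 
sketch (names are placeholders; the temporary class is added to the existing 
default placement, and which pool each class ends up pointing at is the part 
handled via the pool renames described above):

radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id default-placement --storage-class TEMP_OLD
radosgw-admin zone placement add --rgw-zone default \
    --placement-id default-placement --storage-class TEMP_OLD \
    --data-pool <pool-for-that-class>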

First tests appear successful, but I’m a bit struggling to get the bucket 
rules working (apparently 0 days isn’t a real rule …) and the debug interval 
setting causes highly frequent LC runs but doesn’t seem to move objects just yet. 
I’ll play around with that setting a bit more, though; I think I might have 
tripped something that only wants to process objects every so often, and on an 
interval of 10 a day is still 2.4 hours … 
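
For reference, the lifecycle rule I am experimenting with is essentially a 
plain S3 transition to STANDARD, e.g. applied like this (bucket name is a 
placeholder):

aws s3api put-bucket-lifecycle-configuration --bucket <bucket> \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "move-to-standard",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Transitions": [{"Days": 1, "StorageClass": "STANDARD"}]
      }]
    }'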

Cheers,
Christian

> On 9. Jun 2023, at 11:16, Christian Theune  wrote:
> 
> Hi,
> 
> we are running a cluster that has been alive for a long time and we tread 
> carefully regarding updates. We are still a bit lagging and our cluster (that 
> started around Firefly) is currently at Nautilus. We’re updating and we know 
> we’re still behind, but we do keep running into challenges along the way that 
> typically are still unfixed on main and - as I started with - have to tread 
> carefully.
> 
> Nevertheless, mistakes happen, and we found ourselves in this situation: we 
> converted our RGW data pool from replicated (n=3) to erasure coded (k=10, 
> m=3, with 17 hosts) but when doing the EC profile selection we missed that 
> our hosts are not evenly balanced (this is a growing cluster and some 
> machines have around 20TiB capacity for the RGW data pool, wheres newer 
> machines have around 160TiB and we rather should have gone with k=4, m=3.  In 
> any case, having 13 chunks causes too many hosts to participate in each 
> object. Going for k+m=7 will allow distribution to be more effective as we 
> have 7 hosts that have the 160TiB sizing.
> 
> Our original migration used the “cache tiering” approach, but that only works 
> once when moving from replicated to EC and can not be used for further 
> migrations.
> 
> The amount of data is at 215TiB somewhat significant, so using an approach 
> that scales when copying data[1] to avoid ending up with months of migration.
> 
> I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a 
> rados/pool level) and I guess we can only fix this on an application level 
> using multi-zone replication.
> 
> I have the setup nailed in general, but I’m running into issues with buckets 
> in our staging and production environment that have `explicit_placement` 
> pools attached, AFAICT is this an outdated mechanisms but there are no 
> migration tools around. I’ve seen some people talk about patched versions of 
> the `radosgw-admin metadata put` variant that (still) prohibits removing 
> explicit placements.
> 
> AFAICT those explicit placements will be synced to the secondary zone and the 
> effect that I’m seeing underpins that theory: the sync runs for a while and 
> only a few hundred objects show up in the new zone, as the buckets/objects 
> are already found in the old pool that the new zone uses due to the explicit 
> placement rule.
> 
> I’m currently running out of ideas, but open for any other options.
> 
> Looking at 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z
>  I’m wondering whether the relevant patch is available somewhere, or whether 
> I’ll have to try building that patch again on my own.
> 
> Going through the docs and the code I’m actually wondering whether 
> `explicit_placement` is actually a really crufty residual piece that won’t 
> get used in newer clusters but older clusters don’t really have an option to 
> get away from?
> 
> In my specific case, the placement rules are identical to the explicit 
> placements that are stored on (apparently older) buckets and the only thing I 
> need to do is to remove them. I can accept a bit of downtime to avoid any 
> race conditions if needed, so maybe having a small tool to just remove those 
> entries while all RGWs are down would be fine. A call to `radosgw-admin 
> bucket stat` takes about 18s for all buckets in production and I guess that 
> would be a good comparison for what timing to expect when running an update 
> on the metadata.
> 
> I’ll also be in touch with colleagues from Heinlein and 42on but I’m open to 
> other suggestions.
> 
> Hugs,
> Christian
> 
> [1] We currently have 215TiB data in 230M objects. Using the “official” 
> “cache-flush-evict-all” appr

[ceph-users] RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

2023-06-09 Thread Christian Theune
Hi,

we are running a cluster that has been alive for a long time and we tread 
carefully regarding updates. We are still a bit lagging and our cluster (that 
started around Firefly) is currently at Nautilus. We’re updating and we know 
we’re still behind, but we do keep running into challenges along the way that 
typically are still unfixed on main and - as I started with - have to tread 
carefully.

Nevertheless, mistakes happen, and we found ourselves in this situation: we 
converted our RGW data pool from replicated (n=3) to erasure coded (k=10, m=3, 
with 17 hosts), but when doing the EC profile selection we missed that our hosts 
are not evenly balanced (this is a growing cluster and some machines have 
around 20TiB capacity for the RGW data pool, whereas newer machines have around 
160TiB), and we rather should have gone with k=4, m=3. In any case, having 13 
chunks causes too many hosts to participate in each object. Going for k+m=7 
will allow distribution to be more effective as we have 7 hosts that have the 
160TiB sizing.

Our original migration used the “cache tiering” approach, but that only works 
once when moving from replicated to EC and can not be used for further 
migrations.

The amount of data is, at 215TiB, somewhat significant, so we need an approach that 
scales when copying data[1] to avoid ending up with months of migration.

I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a 
rados/pool level) and I guess we can only fix this on an application level 
using multi-zone replication.

I have the setup nailed in general, but I’m running into issues with buckets in 
our staging and production environment that have `explicit_placement` pools 
attached. AFAICT this is an outdated mechanism, but there are no migration 
tools around. I’ve seen some people talk about patched versions of 
`radosgw-admin metadata put`, since the stock variant (still) prohibits removing explicit 
placements.

AFAICT those explicit placements will be synced to the secondary zone and the 
effect that I’m seeing underpins that theory: the sync runs for a while and 
only a few hundred objects show up in the new zone, as the buckets/objects are 
already found in the old pool that the new zone uses due to the explicit 
placement rule.

I’m currently running out of ideas, but open for any other options.

Looking at 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z
 I’m wondering whether the relevant patch is available somewhere, or whether 
I’ll have to try building that patch again on my own.

Going through the docs and the code I’m actually wondering whether 
`explicit_placement` is actually a really crufty residual piece that won’t get 
used in newer clusters but older clusters don’t really have an option to get 
away from?

In my specific case, the placement rules are identical to the explicit 
placements that are stored on (apparently older) buckets and the only thing I 
need to do is to remove them. I can accept a bit of downtime to avoid any race 
conditions if needed, so maybe having a small tool to just remove those entries 
while all RGWs are down would be fine. A call to `radosgw-admin bucket stat` 
takes about 18s for all buckets in production and I guess that would be a good 
comparison for what timing to expect when running an update on the metadata.
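
For illustration, the round-trip such a tool (or a patched radosgw-admin) 
would have to do per bucket is essentially this sketch - with the caveat 
mentioned above that the stock `metadata put` refuses to drop the explicit 
placement (and whether the entrypoint `bucket:<name>`, the 
`bucket.instance:...` object, or both need the edit I am not entirely sure):

radosgw-admin metadata get bucket:<name> > bucket.json
# blank out explicit_placement.data_pool / data_extra_pool / index_pool
radosgw-admin metadata put bucket:<name> < bucket.json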

I’ll also be in touch with colleagues from Heinlein and 42on but I’m open to 
other suggestions.

Hugs,
Christian

[1] We currently have 215TiB data in 230M objects. Using the “official” 
“cache-flush-evict-all” approach was unfeasible here as it only yielded around 
50MiB/s. Using cache limits and targetting the cache sizes to 0 caused proper 
parallelization and was able to flush/evict at almost constant 1GiB/s in the 
cluster. 


-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-05-22 Thread Christian Wuerdig
Hm, this thread is confusing.
In the context of S3, client-side encryption means the user is responsible
for encrypting the data with their own keys before submitting it. As far as I'm
aware, client-side encryption doesn't require any specific server support -
it's a function of the client SDK used, which provides the convenience of
encrypting your data before upload and decrypting it after download -
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingClientSideEncryption.html
But you can always encrypt your data and then upload it via RGW; there is
nothing anywhere that prevents that since uploaded objects are just a
sequence of bytes. Metadata won't be encrypted then, though.

You can also do server-side encryption by bringing your own keys -
https://docs.ceph.com/en/quincy/radosgw/encryption/#customer-provided-keys
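
As a rough illustration of the customer-provided-key variant (SSE-C) from the
client side - bucket, object and key file are placeholders:

# 32-byte key that only the client keeps
openssl rand -out sse.key 32
aws s3 cp ./file.bin s3://mybucket/file.bin \
    --sse-c AES256 --sse-c-key fileb://sse.key
aws s3 cp s3://mybucket/file.bin ./file.bin \
    --sse-c AES256 --sse-c-key fileb://sse.key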

I suspect you're asking for server-side encryption with keys managed by
ceph on a per-user basis?


On Tue, 23 May 2023 at 03:28, huxia...@horebdata.cn 
wrote:

> Hi, Stefan,
>
> Thanks a lot for the message. It seems that client-side encryption (or per
> use) is still on the way and not ready yet for today.
>
> Are there  practical methods to implement encryption for CephFS with
> today' technique? e.g using LUKS or other tools?
>
> Kind regards,
>
>
> Samuel
>
>
>
>
> huxia...@horebdata.cn
>
> From: Stefan Kooman
> Date: 2023-05-22 17:19
> To: Alexander E. Patrakov; huxia...@horebdata.cn
> CC: ceph-users
> Subject: Re: [ceph-users] Re: Encryption per user Howto
> On 5/21/23 15:44, Alexander E. Patrakov wrote:
> > Hello Samuel,
> >
> > On Sun, May 21, 2023 at 3:48 PM huxia...@horebdata.cn
> >  wrote:
> >>
> >> Dear Ceph folks,
> >>
> >> Recently one of our clients approached us with a request on encrpytion
> per user, i.e. using individual encrytion key for each user and encryption
> files and object store.
> >>
> >> Does anyone know (or have experience) how to do with CephFS and Ceph
> RGW?
> >
> > For CephFS, this is unachievable.
>
> For a couple of years already, work is being done to have fscrypt
> support for CephFS [1]. When that work ends up in mainline kernel (and
> distro kernels at some point) this will be possible.
>
> Gr. Stefan
>
> [1]: https://lwn.net/Articles/829448/
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg_autoscaler using uncompressed bytes as pool current total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings?

2023-04-21 Thread Christian Rohmann

Hey ceph-users,

May I ask (nag) again about this issue? I am wondering if anybody can 
confirm my observations.
I raised a bug, https://tracker.ceph.com/issues/54136, but apart from the 
assignment to a dev a while ago there was no response yet.

Maybe I am just holding it wrong, please someone enlighten me.


Thank you and with kind regards

Christian




On 02/02/2022 20:10, Christian Rohmann wrote:


Hey ceph-users,


I am debugging a mgr pg_autoscaler WARN which states a 
target_size_bytes on a pool would overcommit the available storage.
There is only one pool with value for  target_size_bytes (=5T) defined 
and that apparently would consume more than the available storage:


--- cut ---
# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes
[WRN] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED: 1 subtrees have 
overcommitted pool target_size_bytes
    Pools ['backups', 'images', 'device_health_metrics', '.rgw.root', 
'redacted.rgw.control', 'redacted.rgw.meta', 'redacted.rgw.log', 
'redacted.rgw.otp', 'redacted.rgw.buckets.index', 
'redacted.rgw.buckets.data', 'redacted.rgw.buckets.non-ec'] overcommit 
available storage by 1.011x due to target_size_bytes 15.0T on pools 
['redacted.rgw.buckets.data'].

--- cut ---


But then looking at the actual usage it seems strange that 15T (5T * 3 
replicas) should not fit onto the remaining 122 TiB AVAIL:



--- cut ---
# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE AVAIL    USED RAW USED  %RAW USED
hdd    293 TiB  122 TiB  171 TiB   171 TiB  58.44
TOTAL  293 TiB  122 TiB  171 TiB   171 TiB  58.44

--- POOLS ---
POOL ID  PGS   STORED   (DATA) (OMAP)   OBJECTS  USED (DATA)   (OMAP)   %USED  MAX AVAIL QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
backups   1  1024   92 TiB   92 TiB  3.8 MiB   28.11M  156 TiB  156 TiB   11 MiB  64.77 28 TiB N/A    N/A    N/A  39 TiB  123 TiB
images    2    64  1.7 TiB  1.7 TiB  249 KiB  471.72k  5.2 TiB  5.2 TiB  748 KiB   5.81 28 TiB N/A    N/A    N/A 0 B  0 B
device_health_metrics    19 1   82 MiB  0 B   82 MiB   43  245 MiB  0 B  245 MiB  0 28 TiB N/A    N/A    N/A 0 B  0 B
.rgw.root    21    32   23 KiB   23 KiB 0 B   25  4.1 MiB  4.1 MiB  0 B  0 28 TiB N/A    N/A    N/A 0 B  0 B
redacted.rgw.control 22    32  0 B  0 B 0 B    8  0 B  0 B  0 B  0 28 TiB N/A    N/A    N/A 0 B  0 B
redacted.rgw.meta    23    32  1.7 MiB  394 KiB  1.3 MiB    1.38k  237 MiB  233 MiB  3.9 MiB  0 28 TiB N/A    N/A    N/A 0 B  0 B
redacted.rgw.log 24    32   53 MiB  500 KiB   53 MiB    7.60k  204 MiB   47 MiB  158 MiB  0 28 TiB N/A    N/A    N/A 0 B  0 B
redacted.rgw.otp 25    32  5.2 KiB  0 B  5.2 KiB    0   16 KiB  0 B   16 KiB  0 28 TiB N/A    N/A    N/A 0 B  0 B
redacted.rgw.buckets.index   26    32  1.2 GiB  0 B  1.2 GiB    7.46k  3.5 GiB  0 B  3.5 GiB  0 28 TiB N/A    N/A    N/A 0 B  0 B
redacted.rgw.buckets.data    27   128  3.1 TiB  3.1 TiB 0 B    3.53M  9.5 TiB  9.5 TiB  0 B  10.11 28 TiB N/A    N/A    N/A 0 B  0 B
redacted.rgw.buckets.non-ec  28    32  0 B  0 B 0 B    0  0 B  0 B  0 B  0 28 TiB N/A    N/A    N/A 0 B  0 B

--- cut ---


I then looked at how those values are determined at 
https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L509.
Apparently "total_bytes" are compared with the capacity of the 
root_map. I added a debug line and found that the total in my cluster 
was already at:


  total=325511007759696

so in excess of 300 TiB - Looking at "ceph df" again this usage seems 
strange.




Looking at how this total is calculated at 
https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L441,
you see that the larger value (max) of "actual_raw_used" vs. 
"target_bytes*raw_used_rate" is determined and then summed up.



I dumped the values for all pools my cluster with yet another line of 
debug code:


---cut ---
pool_id 1 - actual_raw_used=303160109187420.0, target_bytes=0 
raw_used_rate=3.0
pool_id 2 - actual_raw_used=5714098884702.0, target_bytes=0 
raw_used_rate=3.0

pool_id 19 - actual_raw_used=256550760.0, target_bytes=0 raw_used_rate=3.0
pool_id 21 - actual_raw_used=71433.0, target_bytes=0 raw_used_r

[ceph-users] Re: Eccessive occupation of small OSDs

2023-04-02 Thread Christian Wuerdig
With failure domain host your max usable cluster capacity is essentially
constrained by the total capacity of the smallest host which is 8TB if I
read the output correctly. You need to balance your hosts better by
swapping drives.

On Fri, 31 Mar 2023 at 03:34, Nicola Mori  wrote:

> Dear Ceph users,
>
> my cluster is made up of 10 old machines, with uneven number of disks and
> disk size. Essentially I have just one big data pool (6+2 erasure code,
> with host failure domain) for which I am currently experiencing a very poor
> available space (88 TB of which 40 TB occupied, as reported by df -h on
> hosts mounting the cephfs) compared to the raw one (196.5 TB). I have a
> total of 104 OSDs and 512 PGs for the pool; I cannot increment the PG
> number since the machines are old and with very low amount of RAM, and some
> of them are already overloaded.
>
> In this situation I'm seeing a high occupation of small OSDs (500 MB) with
> respect to bigger ones (2 and 4 TB) even if the weight is set equal to disk
> capacity (see below for ceph osd tree). For example OSD 9 is at 62%
> occupancy even with weight 0.5 and reweight 0.75, while the highest
> occupancy for 2 TB OSDs is 41% (OSD 18) and 4 TB OSDs is 23% (OSD 79). I
> guess this high occupancy for 500 MB OSDs combined with erasure code size
> and host failure domain might be the cause of the poor available space,
> could this be true? The upmap balancer is currently running but I don't
> know if and how much it could improve the situation.
> Any hint is greatly appreciated, thanks.
>
> Nicola
>
> # ceph osd tree
> ID   CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT  PRI-AFF
>  -1 196.47754  root default
>  -7  14.55518  host aka
>   4hdd1.81940  osd.4 up   1.0  1.0
>  11hdd1.81940  osd.11up   1.0  1.0
>  18hdd1.81940  osd.18up   1.0  1.0
>  26hdd1.81940  osd.26up   1.0  1.0
>  32hdd1.81940  osd.32up   1.0  1.0
>  41hdd1.81940  osd.41up   1.0  1.0
>  48hdd1.81940  osd.48up   1.0  1.0
>  55hdd1.81940  osd.55up   1.0  1.0
>  -3  14.55518  host balin
>   0hdd1.81940  osd.0 up   1.0  1.0
>   8hdd1.81940  osd.8 up   1.0  1.0
>  15hdd1.81940  osd.15up   1.0  1.0
>  22hdd1.81940  osd.22up   1.0  1.0
>  29hdd1.81940  osd.29up   1.0  1.0
>  34hdd1.81940  osd.34up   1.0  1.0
>  43hdd1.81940  osd.43up   1.0  1.0
>  49hdd1.81940  osd.49up   1.0  1.0
> -13  29.10950  host bifur
>   3hdd3.63869  osd.3 up   1.0  1.0
>  14hdd3.63869  osd.14up   1.0  1.0
>  27hdd3.63869  osd.27up   1.0  1.0
>  37hdd3.63869  osd.37up   1.0  1.0
>  50hdd3.63869  osd.50up   1.0  1.0
>  59hdd3.63869  osd.59up   1.0  1.0
>  64hdd3.63869  osd.64up   1.0  1.0
>  69hdd3.63869  osd.69up   1.0  1.0
> -17  29.10950  host bofur
>   2hdd3.63869  osd.2 up   1.0  1.0
>  21hdd3.63869  osd.21up   1.0  1.0
>  39hdd3.63869  osd.39up   1.0  1.0
>  57hdd3.63869  osd.57up   1.0  1.0
>  66hdd3.63869  osd.66up   1.0  1.0
>  72hdd3.63869  osd.72up   1.0  1.0
>  76hdd3.63869  osd.76up   1.0  1.0
>  79hdd3.63869  osd.79up   1.0  1.0
> -21  29.10376  host dwalin
>  88hdd1.81898  osd.88up   1.0  1.0
>  89hdd1.81898  osd.89up   1.0  1.0
>  90hdd1.81898  osd.90up   1.0  1.0
>  91hdd1.81898  osd.91up   1.0  1.0
>  92hdd1.81898  osd.92up   1.0  1.0
>  93hdd1.81898  osd.93up   1.0  1.0
>  94hdd1.81898  osd.94up   1.0  1.0
>  95hdd1.81898  osd.95up   1.0  1.0
>  96hdd1.81898  osd.96up   1.0  1.0
>  97hdd1.81898  osd.97up   1.0  1.0
>  98hdd1.81898  osd.98up   1.0  1.0
>  99hdd1.81898  osd.99up   1.0  1.0
> 100hdd1.81898  osd.100   up   1.0  

[ceph-users] External Auth (AssumeRoleWithWebIdentity) , STS by default, generic policies and isolation by ownership

2023-03-15 Thread Christian Rohmann

Hello ceph-users,

Unhappy with the capabilities regarding bucket access policies when 
using the Keystone authentication module,
I posted to this ML a while back - 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/S2TV7GVFJTWPYA6NVRXDL2JXYUIQGMIN/


In general I'd still like to hear how others are making use of external 
authentication and STS, and what your experiences are in replacing e.g. 
Keystone authentication.



In the meantime we looked into OIDC authentication (via Keycloak) and 
the potential there.
While this works in general (AssumeRoleWithWebIdentity comes back with 
an STS token that can then be used to access S3 buckets),
I am wondering about a few things:


1) How to enable STS for everyone (without user-individual policy to 
AssumeRole)


In the documentation on STS 
(https://docs.ceph.com/en/quincy/radosgw/STS/#sts-in-ceph) and also 
STS-Lite (https://docs.ceph.com/en/quincy/radosgw/STSLite/#sts-lite)
it's implied that one has to attach a dedicated policy to allow for STS 
to each user individually. This does not scale well with thousands of 
users. Also, when using federated / external authentication, there is no
explicit user creation: "A shadow user is created corresponding to every 
federated user. The user id is derived from the ‘sub’ field of the 
incoming web token."


Is there a way to automatically have a role corresponding to each user 
that can be assumed via an OIDC token?
So an implicit role that would allow for an externally authenticated 
user to have full access to S3 and all buckets owned?
Looking at STS Lite documentation, it seems all the more natural to be 
able to allow keystone users to make use of STS.


Is there any way to apply such an AssumeRole policy "globally" or for a 
whole set of users at the same time?
I just found PR https://github.com/ceph/ceph/pull/44434 aiming to add 
policy variables such as ${aws:username} to allow for generic policies.
But this is more about restricting bucket names or granting access to 
certain patterns of names.
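
For reference (and as a sketch only - the role name, the Keycloak realm URL in the Federated ARN and the app_id condition below are made-up placeholders), the per-role plumbing we experimented with looks roughly like this:

radosgw-admin role create --role-name=oidc-s3-full \
  --assume-role-policy-doc='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Federated":["arn:aws:iam:::oidc-provider/keycloak.example.com/auth/realms/demo"]},"Action":["sts:AssumeRoleWithWebIdentity"],"Condition":{"StringEquals":{"keycloak.example.com/auth/realms/demo:app_id":"account"}}}]}'

radosgw-admin role-policy put --role-name=oidc-s3-full --policy-name=s3-full \
  --policy-doc='{"Version":"2012-10-17","Statement":{"Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::*"}}'

Having to repeat this per role / per group of users is exactly the part that does not scale and that a generic default (or policy variables) would solve.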




2) Isolation in S3 Multi-Tenancy with external IdP 
(AssumeRoleWithWebIdentity), how does bucket ownership come into play?


Following the question about generic policies for STS, I am wondering 
about the role (no pun intended) that bucket ownership or the tenant 
plays here.

If one creates a role policy of e.g.

{"Version":"2012-10-17","Statement":{"Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::*"}}

Would this allow someone assuming this role access to all, "*", buckets, 
or just those owned by the user that created this role policy?



In case of Keystone auth the owner of a bucket is the project, not the 
individual (human) user. So this creates somewhat of a tenant which I'd 
want to isolate.




3) Allowing users to create their own roles and policies by default

Is there a way to allow users to create their own roles and policies to 
use them by default?
All the examples talk about the requirement for admin caps and 
individual setting of '--caps="user-policy=*"'.


If there was a default role + policy (question #1) that could be applied 
to externally authenticated users, I'd like for them to be able to
create new roles and policies to grant access to their buckets to other 
users.






Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Trying to throttle global backfill

2023-03-09 Thread Rice, Christian
I received a few suggestions, and resolved my issue.

Anthony D'Atri suggested mclock (newer than my nautilus version), adding 
"--osd_recovery_max_single_start 1” (didn’t seem to take), 
“osd_op_queue_cut_off=high” (which I didn’t get to checking), and pgremapper 
(from github).

Pgremapper did the trick to cancel the backfill which had been initiated by an 
unfortunate OSD name-changing sequence.  Big winner, achieved EXACTLY what I 
needed, which was to undo an unfortunate recalculation of placement groups.

Before: 310842802/17308319325 objects misplaced (1.796%)
Ran: pgremapper cancel-backfill --yes
After: 421709/17308356309 objects misplaced (0.002%)

The “before” scenario was causing over 10GiB/s of backfill traffic.  The 
“after” scenario was a very cool 300-400MiB/s, entirely within the realm of 
sanity.  The cluster is temporarily split between two datacenters, being 
physically lifted and shifted over a period of a month.

Alex Gorbachev also suggested setting osd-recovery-sleep.  That was probably 
the solution I was looking for to throttle backfill operations at the 
beginning, and I’ll be keeping that in my toolbox, as well.
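
For the archives: the knob Alex meant can be injected at runtime, e.g. (values purely illustrative; on Nautilus the hdd/ssd-specific variants exist as well):

sudo ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd 0.25 --osd_recovery_sleep_ssd 0.05'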

As always, I’m HUGELY appreciative of the community response.  I learned a lot 
in the process, had an outage-inducing scenario rectified very quickly, and got 
back to work.  Thanks so much!  Happy to answer any followup questions and 
return the favor when I can.

From: Rice, Christian 
Date: Wednesday, March 8, 2023 at 3:57 PM
To: ceph-users 
Subject: [EXTERNAL] [ceph-users] Trying to throttle global backfill
I have a large number of misplaced objects, and I have all osd settings to “1” 
already:

sudo ceph tell osd.\* injectargs '--osd_max_backfills=1 
--osd_recovery_max_active=1 --osd_recovery_op_priority=1'


How can I slow it down even more?  The cluster is too large, it’s impacting 
other network traffic 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Trying to throttle global backfill

2023-03-08 Thread Rice, Christian
I have a large number of misplaced objects, and I have all osd settings to “1” 
already:

sudo ceph tell osd.\* injectargs '--osd_max_backfills=1 
--osd_recovery_max_active=1 --osd_recovery_op_priority=1'


How can I slow it down even more?  The cluster is too large, it’s impacting 
other network traffic 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: Renaming a ceph node

2023-02-15 Thread Rice, Christian
Hi all, so I used the rename-bucket option this morning for OSD node renames, 
and it was a success.  Works great even on Luminous.

I looked at the swap-bucket command and I felt it was leaning toward real data 
migration from old OSDs to new OSDs and I was a bit timid because there wasn’t 
a second host, just a name change.  So when I looked at rename-bucket, it just 
seemed too simple not to try first.  And I did, and it was.  I renamed two host 
buckets (they housed discrete storage classes, so no dangerous loss of data 
redundancy), and even some rack buckets.

sudo ceph osd crush rename-bucket  

and no data moved.  I first thought I’d wait til the hosts were shutdown, but 
after I stopped the OSDs on the nodes, it seemed safe enough, and it was.
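
For anyone doing the same later, the whole dance boiled down to something like this (a sketch; host names are placeholders):

ceph osd set noout
# on the node itself: systemctl stop ceph-osd.target
ceph osd crush rename-bucket oldhostname newhostname
# bring the node back up under its new name, then
ceph osd unset noout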

In my particular case, I was migrating nodes to a new datacenter, just 
new names and IPs.  I also moved a mon/mgr/rgw; I merely had to delete the 
mon first, then reprovision it later.

The rgw and mgr worked fine.  I pre-edited ceph.conf to add the new networks, 
remove the old mon name, add the new mon name, so on startup it worked.

I’m not a ceph admin but I play one on the tele.

From: Eugen Block 
Date: Wednesday, February 15, 2023 at 12:44 AM
To: ceph-users@ceph.io 
Subject: [EXTERNAL] [ceph-users] Re: Renaming a ceph node
Hi,

I haven't done this in a production cluster yet, only in small test
clusters without data. But there's a rename-bucket command:

ceph osd crush rename-bucket  
  rename bucket  to 

It should do exactly that, just rename the bucket within the crushmap
without changing the ID. That command also exists in Luminous, I
believe. To have an impression of the impact I'd recommend to test in
a test cluster first.

Regards,
Eugen


Zitat von Manuel Lausch :

> Hi,
>
> yes you can rename a node without massive rebalancing.
>
> The following I tested with pacific. But I think this should work with
> older versions as well.
> You need to rename the node in the crushmap between shutting down the
> node with the old name and starting it with the new name.
> You only must keep the ID from the node in the crushmap!
>
> Regards
> Manuel
>
>
> On Mon, 13 Feb 2023 22:22:35 +
> "Rice, Christian"  wrote:
>
>> Can anyone please point me at a doc that explains the most
>> efficient procedure to rename a ceph node WITHOUT causing a massive
>> misplaced objects churn?
>>
>> When my node came up with a new name, it properly joined the
>> cluster and owned the OSDs, but the original node with no devices
>> remained.  I expect this affected the crush map such that a large
>> qty of objects got reshuffled.  I want no object movement, if
>> possible.
>>
>> BTW this old cluster is on luminous. ☹
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Renaming a ceph node

2023-02-13 Thread Rice, Christian
Can anyone please point me at a doc that explains the most efficient procedure 
to rename a ceph node WITHOUT causing a massive misplaced objects churn?

When my node came up with a new name, it properly joined the cluster and owned 
the OSDs, but the original node with no devices remained.  I expect this 
affected the crush map such that a large qty of objects got reshuffled.  I want 
no object movement, if possible.

BTW this old cluster is on luminous. ☹

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of Quincy 17.2.5 ?

2023-01-25 Thread Christian Rohmann

Hey everyone,


On 20/10/2022 10:12, Christian Rohmann wrote:

1) May I bring up again my remarks about the timing:

On 19/10/2022 11:46, Christian Rohmann wrote:

I believe the upload of a new release to the repo prior to the 
announcement happens quite regularly - it might just be due to the 
technical process of releasing.
But I agree it would be nice to have a more "bit flip" approach to 
new releases in the repo and not have the packages appear as updates 
prior to the announcement and final release and update notes.
By my observations sometimes there are packages available on the 
download servers via the "last stable" folders such as 
https://download.ceph.com/debian-quincy/ quite some time before the 
announcement of a release is out.
I know it's hard to time this right with mirrors requiring some time 
to sync files, but would be nice to not see the packages or have 
people install them before there are the release notes and potential 
pointers to changes out. 


Today's 16.2.11 release shows the exact issue I described above:

1) 16.2.11 packages are already available via e.g. 
https://download.ceph.com/debian-pacific
2) release notes not yet merged 
(https://github.com/ceph/ceph/pull/49839), thus 
https://ceph.io/en/news/blog/2022/v16-2-11-pacific-released/ shows a 404 :-)
3) No announcement like 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QOCU563UD3D3ZTB5C5BJT5WRSJL5CVSD/ 
to the ML yet.



Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2023-01-16 Thread Christian Rohmann

Hello,

On 04/05/2021 09:49, Frank Schilder wrote:

I created a ticket: https://tracker.ceph.com/issues/50637


We just observed this very issue on Pacific (16.2.10), which I also 
commented on in the ticket.
I wonder if this case really is that seldom: first having some issues 
causing slow ops, and then a total failure of an OSD?



It would be nice to fix this though, so as not to "block" the warning 
status with something that's not actually a warning.




Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.11 branch

2022-12-15 Thread Christian Rohmann

On 15/12/2022 10:31, Christian Rohmann wrote:


May I kindly ask for an update on how things are progressing? Mostly I 
am interested in the (persisting) implications for testing new point 
releases (e.g. 16.2.11) with more and more bugfixes in them.


I guess I just had not looked at the right ML - it's being worked on 
already ... 
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/CQPQJXD6OVTZUH43I4U3GGOP2PKYOREJ/




Sorry for the nagging,


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.11 branch

2022-12-15 Thread Christian Rohmann

Hey Laura, Greg, all,

On 31/10/2022 17:15, Gregory Farnum wrote:

If you don't mind me asking Laura, have those issues regarding the

testing lab been resolved yet?


There are currently a lot of folks working to fix the testing lab issues.
Essentially, disk corruption affected our ability to reach quay.ceph.io.
We've made progress this morning, but we are still working to understand
the root cause of the corruption. We expect to re-deploy affected services
soon so we can resume testing for v16.2.11.

We got a note about this today, so I wanted to clarify:

For Reasons, the sepia lab we run teuthology in currently uses a Red
Hat Enterprise Virtualization stack — meaning, mostly KVM with a lot
of fancy orchestration all packaged up, backed by Gluster. (Yes,
really — a full Ceph integration was never built and at one point this
was deemed the most straightforward solution compared to running
all-up OpenStack backed by Ceph, which would have been the available
alternative.) The disk images stored in Gluster started reporting
corruption last week (though Gluster was claiming to be healthy), and
with David's departure and his backup on vacation it took a while for
the remaining team members to figure out what was going on and
identify strategies to resolve or work around it.

The relevant people have figured out a lot more of what was going on,
and Adam (David's backup) is back now so we're expecting things to
resolve more quickly at this point. And indeed the team's looking at
other options for providing this infrastructure going forward. 
-Greg



May I kindly ask for an update on how things are progressing? Mostly I 
am interested in the (persisting) implications for testing new point 
releases (e.g. 16.2.11) with more and more bugfixes in them.



Thanks a bunch!


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW Forcing buckets to be encrypted (SSE-S3) by default (via a global bucket encryption policy)?

2022-11-23 Thread Christian Rohmann

Hey ceph-users,

loosely related to my question about client-side encryption in the Cloud 
Sync module 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/I366AIAGWGXG3YQZXP6GDQT4ZX2Y6BXM/)


I am wondering if there are other options to ensure data is encrypted at 
rest and also only replicated as encrypted data ...



My thoughts / findings so far:

AWS S3 supports setting a bucket encryption policy 
(https://docs.aws.amazon.com/AmazonS3/latest/userguide/default-bucket-encryption.html) 
to "ApplyServerSideEncryptionByDefault" - so automatically apply SSE to 
all objects without the clients to explicitly request this per object.


Ceph RGW has received support for such a policy via the bucket encryption 
API with 
https://github.com/ceph/ceph/commit/95acefb2f5e5b1a930b263bbc7d18857d476653c.
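
Just to illustrate what the per-bucket default looks like from the client side (a sketch using aws-cli; endpoint and bucket name are placeholders):

aws --endpoint-url https://rgw.example.com s3api put-bucket-encryption \
  --bucket mybucket \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

aws --endpoint-url https://rgw.example.com s3api get-bucket-encryption --bucket mybucket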


I am now just wondering if there is any way to not only allow bucket 
creators to apply such a policy themselves, but to apply this as a 
global default in RGW, forcing all buckets to have SSE enabled - 
transparently.


If there is no way to achieve this just yet, what are your thoughts 
about adding such an option to RGW?



Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cloud sync to minio fails after creating the bucket

2022-11-21 Thread Christian Rohmann

On 21/11/2022 12:50, ma...@roterruler.de wrote:

Could this "just" be the bug https://tracker.ceph.com/issues/55310 (duplicate
https://tracker.ceph.com/issues/57807) about Cloud Sync being broken since 
Pacific?

Wow - yes, the issue seems to be exactly the same that I'm facing -.-


But there is a fix commited, pending backports to Quincy / Pacific: 
https://tracker.ceph.com/issues/57306




Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cloud sync to minio fails after creating the bucket

2022-11-21 Thread Christian Rohmann

On 21/11/2022 11:04, ma...@roterruler.de wrote:

Hi list,

I'm currently implementing a sync between ceph and a minio cluster to 
continously sync the buckets and objects to an offsite location. I followed the 
guide on https://croit.io/blog/setting-up-ceph-cloud-sync-module

After the sync starts it successfully creates the first bucket, but somehow 
tries over and over again to create the bucket instead of adding the objects 
itself. This is from the minio logs:



Could this "just" be the bug https://tracker.ceph.com/issues/55310 
(duplicate https://tracker.ceph.com/issues/57807) about Cloud Sync being 
broken since Pacific?




Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW replication and multiple endpoints

2022-11-14 Thread Christian Rohmann

Hey Kamil

On 14/11/2022 13:54, Kamil Madac wrote:

Hello,

I'm trying to create a RGW Zonegroup with two zones, and to have data
replicated between the zones. Each zone is a separate Ceph cluster. There is
a possibility to use a list of endpoints in the zone definitions (not just a single
endpoint) which will then be used for the replication between zones, so I
tried to use it instead of using a LB in front of the clusters for the
replication.

[...]

When node is back again, replication continue to work.

What is the reason to have possibility to have multiple endpoints in the
zone configuration when outage of one of them makes replication not
working?


We are running a similar setup and ran into similar issues before when 
doing rolling restarts of the RGWs.


1) Mostly it's a single metadata shard never syncing up and requiring a 
complete "metadata init". But this issue will likely be addressed via 
https://tracker.ceph.com/issues/39657


2) But we also observed issues with one RGW being unavailable or just 
slow and as a result influencing the whole sync process. I suppose the 
HTTP client used within the RGW syncer does not do a good job of tracking 
which remote RGW is healthy, or a slow-reading RGW could just be locking 
all the shards ...


3) But as far as "cooperating" goes there are improvements being worked 
on, see https://tracker.ceph.com/issues/41230 or 
https://github.com/ceph/ceph/pull/45958 which then makes better use of 
having multiple distinct RGWs in both zones.
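
For completeness, this is roughly how we check and adjust the endpoint lists and the sync state (zone name and URLs are placeholders):

radosgw-admin zonegroup get
radosgw-admin zone modify --rgw-zone=secondary \
  --endpoints=http://rgw1.example.com:8080,http://rgw2.example.com:8080
radosgw-admin period update --commit
radosgw-admin sync status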



Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Upgrade/migrate host operating system for ceph nodes (CentOS/Rocky)

2022-11-03 Thread Prof. Dr. Christian Dietrich



Hi all,

we're running a ceph cluster with v15.2.17 and cephadm on various CentOS 
hosts. Since CentOS 8.x is EOL, we'd like to upgrade/migrate/reinstall 
the OS, possibly migrating to Rocky or CentOS stream:


host | CentOS   | Podman
-|--|---
osd* | 7.9.2009 | 1.6.4   x5
osd* | 8.4.2105 | 3.0.1   x2
mon0 | 8.4.2105 | 3.2.3
mon1 | 8.4.2105 | 3.0.1
mon2 | 8.4.2105 | 3.0.1
mds* | 7.9.2009 | 1.6.4   x2

We have a few specific questions:
1) Does anyone have experience using Rocky Linux 8 or 9 or CentOS stream 
with ceph? Rocky is not mentioned specifically in the cephadm docs [2].


2) Is the Podman compatibility list [1] still up to date? CentOS Stream 
8 as of 2022-10-19 appears to have Podman version 4.x, IIRC. 4.x does 
not appear in the compatibility table. Anyone using Podman 4.x 
successfully (with which ceph version)?


Thanks in advance,

Chris


[1]: 
https://docs.ceph.com/en/quincy/cephadm/compatibility/#compatibility-with-podman-versions


[2]: 
https://docs.ceph.com/en/quincy/cephadm/install/#cephadm-install-distros

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.11 branch

2022-10-28 Thread Christian Rohmann

On 28/10/2022 00:25, Laura Flores wrote:

Hi Oleksiy,

The Pacific RC has not been declared yet since there have been problems in
our upstream testing lab. There is no ETA yet for v16.2.11 for that reason,
but the full diff of all the patches that were included will be published
to ceph.io when v16.2.11 is released. There will also be a diff published
in the documentation on this page:
https://docs.ceph.com/en/latest/releases/pacific/

In the meantime, here is a link to the diff in commits between v16.2.10 and
the Pacific branch: https://github.com/ceph/ceph/compare/v16.2.10...pacific


There also is https://tracker.ceph.com/versions/656 which seems to be 
tracking

the open issues tagged for this particular point release.


If you don't mind me asking Laura, have those issues regarding the 
testing lab been resolved yet?


There are quite a few bugfixes in the pending release 16.2.11 which we 
are waiting for. TBH I was about
to ask if it would not be sensible to do an intermediate release and not 
let it grow bigger and

bigger (with even more changes / fixes)  going out at once.



Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using multiple SSDs as DB

2022-10-25 Thread Christian
Thank you!

Robert Sander  wrote on Fri, 21 Oct 2022

> This is a bug in certain versions of ceph-volume:
>
> https://tracker.ceph.com/issues/56031
>
> It should be fixed in the latest releases.


For completeness's sake: The cluster is on 16.2.10.
Issue is resolved and marked as backported. 16.2.10 was released shortly
before the backport.
Fixed version for Pacific should be 16.2.11.

A partial workaround I found was limiting data_devices to 8 and db_devices
to 1. This resulted in correct db usage for one db device.
I then tried 16 data devices and 2 db devices: this did not work; it (would
have) resulted in 8 extra Ceph OSDs with no db device.
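
Until the fixed ceph-volume is out, explicitly sizing the DB volume in the spec
should also work around it. A sketch (the block_db_size option and the 111G value
reflect my understanding of the drivegroup spec, so please double-check before applying):

cat > osd-spec.yaml <<'EOF'
service_type: osd
service_id: hdd-with-ssd-db
placement:
  host_pattern: '*'
spec:
  objectstore: bluestore
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  block_db_size: 111G
EOF
ceph orch apply -i osd-spec.yaml --dry-run
# and, once the preview looks right:
ceph orch apply -i osd-spec.yaml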

Best,
Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Using multiple SSDs as DB

2022-10-21 Thread Christian
Hi,

I have a problem fully utilizing some disks with cephadm service spec. The
host I have has the following disks:
4 SSD 900GB
32 HDD 10TB

I would like to use the SSDs as DB devices and the HDD devices as block. 8
HDDs per SSD and the available size for the DB would be about 111GB
(900GB/8).
The spec I used does not fully utilize the SSDs though. Instead of 1/8th of
the SSD, it uses about 28GB, so 1/32nd of the SSD.

The spec I use:
spec:
  objectstore: bluestore
  filter_logic: AND
  data_devices:
rotational: 1
  db_devices:
rotational: 0

I saw "limit" in the docs but it sounds like it would limit the amount of
SSDs used for DB devices.

How can I use all of the SSDs‘ capacity?

Best,
Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of Quincy 17.2.5 ?

2022-10-20 Thread Christian Rohmann



On 19/10/2022 16:30, Laura Flores wrote:
Dan is correct that 17.2.5 is a hotfix release. There was a flaw in 
the release process for 17.2.4 in which five commits were not included 
in the release. The users mailing list will hear an official 
announcement about this hotfix release later this week.


Thanks for the info.


1) May I bring up again my remarks about the timing:

On 19/10/2022 11:46, Christian Rohmann wrote:

I believe the upload of a new release to the repo prior to the 
announcement happens quite regularly - it might just be due to the 
technical process of releasing.
But I agree it would be nice to have a more "bit flip" approach to new 
releases in the repo and not have the packages appear as updates prior 
to the announcement and final release and update notes.
By my observations sometimes there are packages available on the 
download servers via the "last stable" folders such as 
https://download.ceph.com/debian-quincy/ quite some time before the 
announcement of a release is out.
I know it's hard to time this right with mirrors requiring some time to 
sync files, but would be nice to not see the packages or have people 
install them before there are the release notes and potential pointers 
to changes out.



2) Also, in cases like the 17.2.4 release containing a regression, it 
would be great to have both the N and N-1 releases there to allow users to 
downgrade to a previous point release quickly in case they run into issues.
Otherwise one needs to configure the N-1 repo manually to still have 
access to the N-1 release.


And with this just being links in the filesystem, this should not even 
take up much space on the download servers or their mirrors.




Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Mirror de.ceph.com broken?

2022-10-20 Thread Christian Rohmann

Hey ceph-users,

it seems that the German ceph mirror http://de.ceph.com/ listed

at https://docs.ceph.com/en/latest/install/mirrors/#locations

does not hold any data.

The index page shows some plesk default page and also deeper links like 
http://de.ceph.com/debian-17.2.4/ return 404.



Regards

Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of Quincy 17.2.5 ?

2022-10-19 Thread Christian Rohmann

On 19/10/2022 11:26, Chris Palmer wrote:
I've noticed that packages for Quincy 17.2.5 appeared in the debian 11 
repo a few days ago. However I haven't seen any mention of it 
anywhere, can't find any release notes, and the documentation still 
shows 17.2.4 as the latest version.


Is 17.2.5 documented and ready for use yet? It's a bit risky having it 
sitting undocumented in the repo for any length of time when it might 
inadvertently be applied when doing routine patching... (I spotted it, 
but one day someone might not).


I believe the upload of a new release to the repo prior to the 
announcement happens quite regularly - it might just be due to the 
technical process of releasing.
But I agree it would be nice to have a more "bit flip" approach to new 
releases in the repo and not have the packages appear as updates prior 
to the announcement and final release and update notes.



Regards

Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-13 Thread Christian Rohmann

Hey Boris,

On 07/10/2022 11:30, Boris Behrens wrote:

I just wanted to reshard a bucket but mistyped the amount of shards. In a
reflex I hit ctrl-c and waited. It looked like the resharding did not
finish so I canceled it, and now the bucket is in this state.
How can I fix it? It does not show up in the stale-instances list. It's also
a multisite environment (we only sync metadata).
I believe resharding is not supported with rgw multisite 
(https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#multisite)
but is being worked on / implemented for the Quincy release, see 
https://tracker.ceph.com/projects/rgw/issues?query_id=247


But you are not syncing the data in your deployment? Maybe that's a 
different case then?




Regards

Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW multisite Cloud Sync module with support for client side encryption?

2022-09-12 Thread Christian Rohmann

Hello Ceph-Users,

I have a question regarding support for any client side encryption in 
the Cloud Sync Module for RGW 
(https://docs.ceph.com/en/latest/radosgw/cloud-sync-module/).


While a "regular" multi-site setup 
(https://docs.ceph.com/en/latest/radosgw/multisite/) is usually syncing 
data between Ceph clusters, RGWs and other supporting
infrastructure in the same administrative domain, this might be different 
when looking at cloud sync.
One could set up a sync to e.g. AWS S3 or any other compatible S3 
implementation that is provided as a service and by another provider.


1) I was wondering if there is any transparent way to apply client side 
encryption to those objects that are sent to the remote service?
Even something like a single static key (see 
https://github.com/ceph/ceph/blob/1c9e84a447bb628f2235134f8d54928f7d6b7796/doc/radosgw/encryption.rst#automatic-encryption-for-testing-only) 
would protect against the remote provider being able to look at the data.



2) What happens to objects that are encrypted on the source RGW and via 
SSE-S3? (https://docs.ceph.com/en/quincy/radosgw/encryption/#sse-s3)
I suppose those remain encrypted? But this does require users to 
actively make use of SSE-S3, right?




Thanks again with kind regards,


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Suggestion to build ceph storage

2022-06-19 Thread Christian Wuerdig
On Sun, 19 Jun 2022 at 02:29, Satish Patel  wrote:

> Greeting folks,
>
> We are planning to build Ceph storage for mostly cephFS for HPC workload
> and in future we are planning to expand to S3 style but that is yet to be
> decided. Because we need mass storage, we bought the following HW.
>
> 15 Total servers and each server has a 12x18TB HDD (spinning disk) . We
> understand SSD/NvME would be best fit but it's way out of budget.
>
> I hope you have extra HW on hand for Monitor and MDS  servers


> Ceph recommends using a faster disk for wal/db if the data disk is slow and
> in my case I do have a slower disk for data.
>
> Question:
> 1. Let's say if i want to put a NvME disk for wal/db then what size i
> should buy.
>

The official recommendation is to budget 4% of OSD size for WAL/DB - so in
your case that would be 720GB per OSD. Especially if you want to go to S3
later you should stick closer to that limit since RGW is a heavy meta data
user.
Also with 12 OSD per node you should have at least 2 NVME - so 2x4TB might
do or maybe 3x3TB
The WAL/DB device is a Single Point of Failure for all OSDs attached (in
other words - if the WAL/DB device fails then all OSDs that have their
WAL/DB located there need to be rebuilt)
Make sure you budget for a good number of DWPD (I assume in an HPC scenario
you'll have a lot of scratch data) and test it with O_DIRECT and F_SYNC and
QD=1 and BS=4K to find one that can reliably handle high IOPS under that
condition.
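
A quick sketch of that test (destructive for whatever is on the target device, so
point it at a scratch disk or file):

fio --name=sync-write-test --filename=/dev/nvme0n1 --direct=1 --fsync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

Drives that collapse to a few hundred IOPS here are the ones you want to find out
about before the cluster is built.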


> 2. Do I need to partition wal/db for each OSD or just a single
> partition can share for all OSDs?
>

You need one partition per OSD


> 3. Can I put the OS on the same disk where the wal/db is going to sit ?
> (This way i don't need to spend extra money for extra disk)
>

Yes you can but in your case that would mean putting the WAL/DB on the HDD
- I would predict your HPC users not being very impressed with the
resulting performance but YMMV


> Any suggestions you have for this kind of storage would be much
> appreciated.
>

Budget plenty of RAM to deal with  recovery scenarios - I'd say in your
case 256GB minimum.
Normally you build a POC and test the heck out of it to cover your usage
scenarios but you already bought the HW so not a lot you can change now -
but you should test and tune your setup before you put production data on
it to ensure that you have a good idea how the system is going to behave
when it gets under load. Make sure you test failure scenarios (failing
OSDs,  failing nodes, network cuts, failing MDS etc.) so you know what to
expect and how to handle them

Another bottleneck in CephFS setups tends to be the MDS - again in your
setup you probably want at least 2 MDS in active-active (i.e. shared load)
plus 1 or 2 on standby as failover but others on this list have more
experience with that.


___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Laggy OSDs

2022-03-30 Thread Rice, Christian
We had issues with slow ops on SSD AND NVMe; mostly fixed by raising aio-max-nr 
from 64K to 1M, e.g. "fs.aio-max-nr=1048576" if I remember correctly.
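
For reference, something along these lines makes it persistent across reboots:

echo 'fs.aio-max-nr = 1048576' > /etc/sysctl.d/90-ceph-aio.conf
sysctl --system
cat /proc/sys/fs/aio-max-nr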

On 3/29/22, 2:13 PM, "Alex Closs"  wrote:

Hey folks,

We have a 16.2.7 cephadm cluster that's had slow ops and several 
(constantly changing) laggy PGs. The set of OSDs with slow ops seems to change 
at random, among all 6 OSD hosts in the cluster. All drives are enterprise SATA 
SSDs, by either Intel or Micron. We're still not ruling out a network issue, 
but wanted to troubleshoot from the Ceph side in case something broke there.

ceph -s:

 health: HEALTH_WARN
 3 slow ops, oldest one blocked for 246 sec, daemons 
[osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops.

 services:
 mon: 5 daemons, quorum ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 
(age 28h)
 mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh, 
ceph-mon1.iogajr
 osd: 143 osds: 143 up (since 92m), 143 in (since 2w)
 rgw: 3 daemons active (3 hosts, 1 zones)

 data:
 pools: 26 pools, 3936 pgs
 objects: 33.14M objects, 144 TiB
 usage: 338 TiB used, 162 TiB / 500 TiB avail
 pgs: 3916 active+clean
 19 active+clean+laggy
 1 active+clean+scrubbing+deep

 io:
 client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr

This is actually much faster than it's been for much of the past hour, it's 
been as low as 50 kb/s and dozens of iops in both directions (where the cluster 
typically does 300MB to a few gigs, and ~4k iops)

The cluster has been on 16.2.7 since a few days after release without 
issue. The only recent change was an apt upgrade and reboot on the hosts (which 
was last Friday and didn't show signs of problems).

Happy to provide logs, let me know what would be useful. Thanks for reading 
this wall :)

-Alex

MIT CSAIL
he/they
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Christian Wuerdig
I would not host multiple OSDs on a spinning drive (unless it's one of those
Seagate MACH.2 drives that have two independent heads) - head seek time
will most likely kill performance. The main reason to host multiple OSDs on
a single SSD or NVMe is typically to make use of the large IOPS capacity
which Ceph can struggle to fully utilize on a single drive. With spinners
you usually don't have that "problem" (quite the opposite usually)

On Wed, 23 Mar 2022 at 19:29, Boris Behrens  wrote:

> Good morning Istvan,
> those are rotating disks and we don't use EC. Splitting up the 16TB disks
> into two 8TB partitions and have two OSDs on one disk also sounds
> interesting, but would it solve the problem?
>
> I also thought to adjust the PGs for the data pool from 4096 to 8192. But I
> am not sure if this will solve the problem or make it worse.
>
> Until now, everything I've tried didn't work.
>
> On Wed, 23 Mar 2022 at 05:10, Szabo, Istvan (Agoda) <
> istvan.sz...@agoda.com> wrote:
>
> > Hi,
> >
> > I think you are having similar issue as me in the past.
> >
> > I have 1.6B objects on a cluster average 40k and all my osd had spilled
> > over.
> >
> > Also slow ops, wrongly marked down…
> >
> > My osds are 15.3TB ssds, so my solution was to store block+db together on
> > the ssds, put 4 osd/ssd and go up to 100pg/osd so 1 disk holds 400pg
> approx.
> > Also turned on balancer with upmap and max deviation 1.
> >
> > I’m using ec 4:2, let’s see how long it lasts. My bottleneck is always
> the
> > pg number, too small pg number for too many objects.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > On 2022. Mar 22., at 23:34, Boris Behrens  wrote:
> >
> >
> > The number 180 PGs is because of the 16TB disks. 3/4 of our OSDs had cache
> > SSDs (not NVMe though, and most of them are 10 OSDs per SSD) but this problem
> > only came in with Octopus.
> >
> > We also thought this might be the db compaction, but it doesn't match up.
> > It might happen when the compaction runs, but it looks also like it
> > happens when there are other operations like table_file_deletion,
> > and it happens on OSDs that have SSD-backed block.db devices (like 5 OSDs
> > share one SAMSUNG MZ7KM1T9HAJM-5, and the IOPS/throughput on the SSD is
> > not huge (100 IOPS r/s, 300 IOPS w/s when compacting an OSD on it, and around
> > 50 MB/s r/w throughput)
> >
> > I also can not reproduce it via "ceph tell osd.NN compact", so I am not
> > 100% sure it is the compaction.
> >
> > What do you mean with "grep for latency string"?
> >
> > Cheers
> > Boris
> >
> > On Tue, 22 Mar 2022 at 15:53, Konstantin Shalygin <
> > k0...@k0ste.ru> wrote:
> >
> > 180 PG per OSD is usually overhead, also 40k obj per PG is not much, but I
> > don't think this will work without block.db NVMe. I think your "wrong out
> > marks" happen at the time of RocksDB compaction. With default log settings
> > you can try to grep for 'latency' strings.
> >
> > Also, https://tracker.ceph.com/issues/50297
> >
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> >
> > On 22 Mar 2022, at 14:29, Boris Behrens  wrote:
> >
> >
> > * the 8TB disks hold around 80-90 PGs (16TB around 160-180)
> >
> > * per PG we've around 40k objects 170m objects in 1.2PiB of storage
> >
> >
> >
> >
> > --
> > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> > groüen Saal.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___

[ceph-users] Re: How to clear "Too many repaired reads on 1 OSDs" on pacific

2022-03-01 Thread Christian Rohmann

On 28/02/2022 20:54, Sascha Vogt wrote:
Is there a way to clear the error counter on pacific? If so, how? 


No, not anymore. See https://tracker.ceph.com/issues/54182


Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)

2022-02-10 Thread Christian Rohmann

Hey Stefan,

thanks for getting back to me!


On 10/02/2022 10:05, Stefan Schueffler wrote:

since my last mail in December, we changed our ceph setup like this:

we added one SSD osd on each ceph host (which were pure HDD before). Then, we moved 
the problematic pool "de-dus5.rgw.buckets.index“ to those dedicated SSDs (by 
adding a corresponding crush map).

Since then, no further PG corruptions occurred.

This now has a two sided result:

on the one side, we now do not observe the problematic behavior anymore,

on the other side, this means, by using just spinning HDDs something is buggy 
with ceph. If the HDD can not fulfill the data IO requirements, then it 
probably should not lead to data/PG corruption…
And, just a blind guess, we only have a few IO requests in our RGW gateway per 
second - even with spinning HDDs there should not be a problem to store / 
update the index pool.

I would guess that it correlates with our setup having 7001 shards in the 
problematic bucket, and the implementation of „multisite“ feature, which will 
do 7001 „status“ requests per second to check and synchronize between the 
different rgw sites. And _this_ amount of random IO can not be satisfied by 
utilizing HDDs…
Anyway it should not lead to corrupted PGs.



We also have a multi-site setup, with one HDD-only cluster and one 
cluster (primary) with NVMe SSDs for the OSD journaling. There are more 
inconsistencies on the HDD-only cluster, but we do observe those on the 
other cluster as well.


If you follow the issue at https://tracker.ceph.com/issues/53663 there 
is even another user (Dieter Roels) observing this issue now.
He is talking about RADOSGW crashes potentially causing the 
inconsistencies. We already guessed it could be rolling restarts. But we 
cannot put our finger on it yet.


And yes, no amount of IO contention should ever cause data corruption.
In this case I believe there might be a correlation to the multisite 
feature hitting OMAP and stored metadata much harder than with regular 
RADOSGW usage.
And if there is a race condition or missing lock /semaphore or something 
along this line, this certainly is affected by the latency on the 
underlying storage.




Could you maybe manually trigger a deep-scrub on all your OSDs, just to 
see if that does anything?
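
Something like the following should kick that off on everything (it adds load, so 
maybe stagger it on a busy cluster):

for osd in $(ceph osd ls); do ceph osd deep-scrub "${osd}"; done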





Thanks again for keeping in touch!
Regards


Christian






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)

2022-02-08 Thread Christian Rohmann

Hey there again,

there now was a question from Neha Ojha in 
https://tracker.ceph.com/issues/53663
about providing OSD debug logs for a manual deep-scrub on (inconsistent) 
PGs.


I did provide the logs of two of those deep-scrubs via ceph-post-file 
already.


But since data inconsistencies are the worst kind of bug, and their 
unpredictable occurrence does not make things easier, we likely need 
more evidence to have a chance to narrow this down. And since you seem 
to observe something similar, could you gather and post debug info about 
them to the ticket as well, maybe?
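
In case it helps to reproduce on your side, what I did boils down to roughly this 
(a sketch - adjust the OSD id, PG id and log path to the primary of an affected PG):

ceph tell osd.8 injectargs '--debug_osd 20/20'
ceph pg deep-scrub 136.1
# wait for the deep-scrub to finish, then revert and upload the log
ceph tell osd.8 injectargs '--debug_osd 1/5'
ceph-post-file /var/log/ceph/ceph-osd.8.log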


Regards

Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)

2022-02-07 Thread Christian Rohmann

Hello Ceph-Users!

On 22/12/2021 00:38, Stefan Schueffler wrote:

The other Problem, regarding the OSD scrub errors, we have this:

ceph health detail shows „PG_DAMAGED: Possible data damage: x pgs 
inconsistent.“
Every now and then new pgs get inconsistent. All inconsistent pgs 
belong to the buckets-index-pool de-dus5.rgw.buckets.index


ceph health detail
pg 136.1 is active+clean+inconsistent, acting [8,3,0]

rados -p de-dus5.rgw.buckets.index list-inconsistent-obj 136.1
No scrub information available for pg 136.1
error 2: (2) No such file or directory

rados list-inconsistent-obj 136.1
No scrub information available for pg 136.1
error 2: (2) No such file or directory

ceph pg deep-scrub 136.1
instructing pg 136.1 on osd.8 to deep-scrub

… until now nothing changed, the list-inconsistent-obj does not show 
any information (did i miss some cli arguments?)


Usually, we simply do a
ceph pg repair 136.1
which most of the time silently does whatever it is supposed to do, 
and the error disappears. Shortly after, it reappears at random, with 
some other (or the same) pg out of the rgw.buckets.index - pool…


Strange you don't see any actual inconsistent objects ...



1)  For me it's usually looking at which pool actually has 
inconsistencies via e.g. :


$  for pool in $(rados lspools); do echo "${pool} $(rados 
list-inconsistent-pg ${pool})"; done


 device_health_metrics []
 .rgw.root []
 zone.rgw.control []
 zone.rgw.meta []
 zone.rgw.log 
["5.3","5.5","5.a","5.b","5.10","5.11","5.19","5.1a","5.1d","5.1e"]

 zone.rgw.otp []
 zone.rgw.buckets.index 
["7.4","7.5","7.6","7.9","7.b","7.11","7.13","7.14","7.18","7.1e"]

 zone.rgw.buckets.data []
 zone.rgw.buckets.non-ec []

(This is from now) and you can see how only metadata pools are actually 
affected.



2)  I then simply looped over the pgs with "rados list-inconsistent-obj 
$pg" and this is the object.name, errors and last_reqid:



 "data_log.14","omap_digest_mismatch","client.4349063.0:12045734"
 "data_log.59","omap_digest_mismatch","client.4364800.0:11773451"
 "data_log.30","omap_digest_mismatch","client.4349063.0:10935030"
 "data_log.42","omap_digest_mismatch","client.4348139.0:112695680"
 "data_log.63","omap_digest_mismatch","client.4348139.0:116876563"
 "data_log.44","omap_digest_mismatch","client.4349063.0:11358410"
 "data_log.11","omap_digest_mismatch","client.4349063.0:10259566"
 "data_log.61","omap_digest_mismatch","client.4349063.0:10259594"
 "data_log.28","omap_digest_mismatch","client.4349063.0:11358396"
 "data_log.39","omap_digest_mismatch","client.4349063.0:11364174"
 "data_log.55","omap_digest_mismatch","client.4349063.0:11358415"
 "data_log.15","omap_digest_mismatch","client.4364800.0:9518143"
 "data_log.27","omap_digest_mismatch","client.4349063.0:11473205"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.6","omap_digest_mismatch","client.4349063.0:11274164"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2217176.214.1","omap_digest_mismatch","client.4349063.0:12168097"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2217176.214.10","omap_digest_mismatch","client.4348139.0:112993744"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2202949.678.0","omap_digest_mismatch","client.4349063.0:10289913"
 
".dir.9cba42a3-dd1c-46d4-bdd2-ef634d12c0a5.56337947.1562","omap_digest_mismatch","client.4364800.0:10934595"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.9","omap_digest_mismatch","client.4349063.0:10431941"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.0","omap_digest_mismatch","client.4349063.0:10431932"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2202949.678.10","omap_digest_mismatch","client.4349063.0:10460106"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.8","omap_digest_mismatch","client.4349063.0:11696943"
 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2217176.214.0","omap_digest_mismatch","client.4349063.0:9845513"
 
".dir.9cba42a3-dd1c-46d4-bdd2-ef634d12c0a5.61963196.333.1","omap_digest_mismatch","client.4364800.0:9593089"


As you can see, it's always some omap data that suffers from 
inconsistencies.
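
For reference, the list above came out of a quick loop roughly like this one, run 
per affected pool (assuming jq and the usual list-inconsistent-obj JSON layout):

for pg in $(rados list-inconsistent-pg zone.rgw.log | jq -r '.[]'); do
  rados list-inconsistent-obj "${pg}" | \
    jq -r '.inconsistents[] | [.object.name, (.errors | join(",")), .selected_object_info.last_reqid] | @csv'
done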





Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg_autoscaler using uncompressed bytes as pool current total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings?

2022-02-02 Thread Christian Rohmann
raw_used=10035209699328.0, 
target_bytes=5497558138880 raw_used_rate=3.0

pool_id 28 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0
--- cut ---


All values but those of pool_id 1 (backups) make sense. For backups it's 
just reporting a MUCH larger actual_raw_used value than what is shown 
via ceph df.
The only difference of that pool compared to the others is the enabled 
compression:



--- cut ---
# ceph osd pool get backups compression_mode
compression_mode: aggressive
--- cut ---


Apparently there already was a similar issue 
(https://tracker.ceph.com/issues/41567) with a resulting commit 
(https://github.com/ceph/ceph/commit/dd6e752826bc762095be4d276e3c1b8d31293eb0) 

changing the source of "pool_logical_used" from the "bytes_used" to the 
"stored" field.


But how does that take compressed (away) data into account? Does 
"bytes_used" count all the "stored" bytes, summing up all uncompressed 
bytes for pools with compression?
This surely must be a bug then, as those bytes are not really 
"actual_raw_used".




I was about to raise a bug, but I wanted to ask here on the ML first if 
I misunderstood the mechanisms at play here.

Thanks and with kind regards,


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)

2021-12-21 Thread Christian Rohmann

Thanks for your response Stefan,

On 21/12/2021 10:07, Stefan Schueffler wrote:

Even without adding a lot of rgw objects (only a few PUTs per minute), we have 
thousands and thousands of rgw bucket.sync log entries in the rgw log pool 
(this seems to be a separate problem), and as such we accumulate „large omap 
objects“ over time.


Since you are doing RADOSGW as well, those OMAP objects are usually 
bucket index files 
(https://docs.ceph.com/en/latest/rados/operations/health-checks/#large-omap-objects). 
Since there is no dynamic resharding 
(https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#rgw-dynamic-bucket-index-resharding) 
until Quincy 
(https://tracker.ceph.com/projects/rgw/issues?utf8=%E2%9C%93_filter=1%5B%5D=cf_3%5Bcf_3%5D=%3D%5Bcf_3%5D%5B%5D=multisite-reshard%5B%5D=%5B%5D=project%5B%5D=tracker%5B%5D=status%5B%5D=priority%5B%5D=subject%5B%5D=assigned_to%5B%5D=updated_on%5B%5D=category%5B%5D=fixed_version%5B%5D=cf_3_by=%5B%5D=) 
you need to have enough shards created for each bucket by default.


At about 200k objects (~ keys) per shard you should otherwise receive this 
warning (the threshold used to be 2 million, see 
https://github.com/ceph/ceph/pull/29175/files).
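
To check how close the buckets are to that threshold (and how many shards they 
currently have), something like this helps:

radosgw-admin bucket limit check

and, if a bucket really needs it (mind the multisite caveats discussed on this list):

radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<new shard count>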




we also face the same or at least a very similar  problem. We are running 
pacific (16.2.6 and 16.2.7, upgraded from 16.2.x to y to z) on both sides of 
the rgw multisite. In our case, the scrub errors occur on the secondary side 
only

Regarding your scrub errors: do you still have those coming up at random?
Could you check with "list-inconsistent-obj" whether yours are within the 
OMAP data and in the metadata pools only?





Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)

2021-12-21 Thread Christian Rohmann

Hello Eugen,

On 20/12/2021 22:02, Eugen Block wrote:
you wrote that this cluster was initially installed with Octopus, so 
no upgrade ceph wise? Are all RGW daemons on the exact same ceph 
(minor) versions?
I remember one of our customers reporting inconsistent objects on a 
regular basis although no hardware issues were detectable. They 
replicate between two sites, too. A couple of months ago both sites 
were updated to the same exact ceph minor version (also Octopus), they 
haven't faced inconsistencies since then. I don't have details about 
the ceph version(s) though, only that both sites were initially 
installed with Octopus. Maybe it's worth checking your versions? 



Yes, everything has the same version:


{
[...]
   "overall": {
   "ceph version 15.2.15 
(2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)": 34

   }
}

I just observed another 3 scrub errors. Strangely, they never seem to have 
occurred on the same pgs again.
I shall be running another deep-scrub on those OSDs to narrow this 
down.




But I am somewhat suspecting this to be a potential issue with the OMAP 
validation part of the scrubbing.
For RADOSGW there are large OMAP structures with lots of movement, and 
the issues are only with the metadata pools.





Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)

2021-12-20 Thread Christian Rohmann

Hello Ceph-Users,

for about 3 weeks now I see batches of scrub errors on a 4 node Octopus 
cluster:


# ceph health detail
HEALTH_ERR 7 scrub errors; Possible data damage: 6 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 7 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
    pg 5.3 is active+clean+inconsistent, acting [9,12,6]
    pg 5.4 is active+clean+inconsistent, acting [15,17,18]
    pg 7.2 is active+clean+inconsistent, acting [13,15,10]
    pg 7.9 is active+clean+inconsistent, acting [5,19,4]
    pg 7.e is active+clean+inconsistent, acting [1,15,20]
    pg 7.18 is active+clean+inconsistent, acting [5,10,0]


this cluster only serves RADOSGW and it's a multisite master.

I already found another thread 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/LXMQSRNSCPS5YJMFXIS3K5NMROHZKDJU/), 
but with no recent comments about such an issue.


In my case I am still seeing more scrub errors every few days. All those 
inconsistencies are "omap_digest_mismatch" in the "zone.rgw.log" or 
"zone.rgw.buckets.index" pool and are spread all across nodes and OSDs.


I already raised a bug ticket (https://tracker.ceph.com/issues/53663), 
but am wondering if any of you have ever observed something similar?
Traffic to and from the object storage seems totally fine and I can even 
run a manual deep-scrub with no errors and then receive 3-4 errors the 
next day.



Is there anything I could look into / collect when the next 
inconsistency occurs?

Could there be any misconfiguration causing this?


Thanks and with kind regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-12-19 Thread Christian Rohmann

Hello Tomasz,


I observe a strange accumulation of inconsistencies for an RGW-only 
(+multisite) setup, with errors just like those you reported.
I collected some info and raised a bug ticket:  
https://tracker.ceph.com/issues/53663
Two more inconsistencies have just shown up hours after repairing the 
other, adding to the theory of something really odd going on.




Did you upgrade to Octopus in the end then? Any more issues with such 
inconsistencies on your side Tomasz?




Regards

Christian



On 20/10/2021 10:33, Tomasz Płaza wrote:
As the upgrade process states, the RGWs are the last ones to be upgraded, so 
they are still on Nautilus (CentOS 7). Those logs showed up after the 
upgrade of the first OSD host. It is a multisite setup, so I am a 
little afraid of upgrading the RGWs now.


Etienne:

Sorry for answering in this thread, but somehow I do not get messages 
directed only to ceph-users list. I did "rados list-inconsistent-pg" 
and got many entries like:


{
  "object": {
    "name": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
    "nspace": "",
    "locator": "",
    "snap": "head",
    "version": 82561410
  },
  "errors": [
    "omap_digest_mismatch"
  ],
  "union_shard_errors": [],
  "selected_object_info": {
    "oid": {
  "oid": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
  "key": "",
  "snapid": -2,
  "hash": 3316145293,
  "max": 0,
  "pool": 230,
  "namespace": ""
    },
    "version": "107760'82561410",
    "prior_version": "106468'82554595",
    "last_reqid": "client.392341383.0:2027385771",
    "user_version": 82561410,
    "size": 0,
    "mtime": "2021-10-19T16:32:25.699134+0200",
    "local_mtime": "2021-10-19T16:32:25.699073+0200",
    "lost": 0,
    "flags": [
  "dirty",
  "omap",
  "data_digest"
    ],
    "truncate_seq": 0,
    "truncate_size": 0,
    "data_digest": "0x",
    "omap_digest": "0x",
    "expected_object_size": 0,
    "expected_write_size": 0,
    "alloc_hint_flags": 0,
    "manifest": {
  "type": 0
    },
    "watchers": {}
  },
  "shards": [
    {
  "osd": 56,
  "primary": true,
  "errors": [],
  "size": 0,
  "omap_digest": "0xf4cf0e1c",
  "data_digest": "0x"
    },
    {
  "osd": 58,
  "primary": false,
  "errors": [],
  "size": 0,
  "omap_digest": "0xf4cf0e1c",
  "data_digest": "0x"
    },
    {
  "osd": 62,
  "primary": false,
  "errors": [],
  "size": 0,
  "omap_digest": "0x4bd5703a",
  "data_digest": "0x"
    }
  ]
}


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: Why you might want packages not containers for Ceph deployments

2021-11-18 Thread Christian Wuerdig
I think Marc uses containers - but they've chosen Apache Mesos as
orchestrator and ceph-adm doesn't work with that.
Currently essentially two ceph container orchestrators exist - rook, which
is a ceph orchestrator for kubernetes, and cephadm, which is an orchestrator
expecting docker or podman.
Admittedly I don't fully understand the nuanced differences between rook
(which can be added as a module to the ceph orchestrator cli) and cephadm
(no idea how this is related to the ceph orch cli) - they kinda seem to do
the same thing but slightly differently (or not?).

On Fri, 19 Nov 2021 at 16:51, Tony Liu  wrote:

> Instead of complaining, take some time to learn more about container would
> help.
>
> Tony
> 
> From: Marc 
> Sent: November 18, 2021 10:50 AM
> To: Pickett, Neale T; Hans van den Bogert; ceph-users@ceph.io
> Subject: [ceph-users] Re: [EXTERNAL] Re: Why you might want packages not
> containers for Ceph deployments
>
> > We also use containers for ceph and love it. If for some reason we
> > couldn't run ceph this way any longer, we would probably migrate
> > everything to a different solution. We are absolutely committed to
> > containerization.
>
> I wonder if you are really using containers. Are you not just using
> ceph-adm? If you would be using containers you would have selected your OC
> already, and would be pissed about how the current containers are being
> developed and have to use a 2nd system.
>
>
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-08 Thread Christian Wuerdig
In addition to what the others said - generally there is little point
in splitting block and wal partitions - just stick to one for both.
What model are your SSDs and how well do they handle small direct
writes? Because that's what you'll be getting on them and the wrong
type of SSD can make things worse rather than better.
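
If you do end up adding SSDs, a single block.db per OSD is enough - something
like this (device names are made up, adjust to your layout):

 # ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1

When no separate --block.wal is given, bluestore simply keeps the WAL on the
DB device, which is what you want in almost all cases.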

On Tue, 9 Nov 2021 at 00:08, Boris Behrens  wrote:
>
> Hi,
> we run a larger octopus s3 cluster with only rotating disks.
> 1.3 PiB with 177 OSDs, some with a SSD block.db and some without.
>
> We have a ton of spare 2TB disks and we just wondered if we can bring them
> to good use.
> For every 10 spinning disks we could add one 2TB SSD and we would create
> two partitions per OSD (130GB for block.db and 20GB for block.wal). This
> would leave some empty space on the SSD for wear leveling.
>
> The question now is: would we benefit from this? Most of the data that is
> written to the cluster is very large (50GB and above). This would take a
> lot of work into restructuring the cluster and also two other clusters.
>
> And does it make a difference to have only a block.db partition or a
> block.db and a block.wal partition?
>
> Cheers
>  Boris
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph] Recovery is very Slow

2021-10-28 Thread Christian Wuerdig
Yes, just expose each disk as an individual OSD and you'll already be
better off. Depending on what type of SSD they are - if they can sustain
high random write IOPS you may even want to consider partitioning each
disk and creating 2 OSDs per SSD to make better use of the available IO
capacity.
For all-flash storage CPU utilization is also a factor - generally
fewer cores with a higher clock speed would be preferred over a cpu
with more cores but lower clock speeds in such a setup.
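
For the two-OSDs-per-device part, ceph-volume can do the splitting for you,
e.g. (device names are only placeholders):

 # ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1

But only bother with that if the drives really have IOPS headroom - otherwise
one OSD per disk is perfectly fine.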


On Thu, 28 Oct 2021 at 21:25, Lokendra Rathour
 wrote:
>
> Hey Janne,
> Thanks for the feedback, we only wanted to have huge space to test more with 
> more data. Do you advise some other way to plan this out?
> So I have 15 disks with 1 TB each.  Creating multiple OSD would help or 
> please advise.
>
> thanks,
> Lokendra
>
>
> On Thu, Oct 28, 2021 at 1:52 PM Janne Johansson  wrote:
>>
>> Den tors 28 okt. 2021 kl 10:18 skrev Lokendra Rathour 
>> :
>> >
>> > Hi Christian,
>> > Thanks for the update.
>> > I have 5 SSD on each node i.e. a total of 15 SSD using which I have 
>> > created this RAID 0 Disk, which in Ceph becomes three OSD. Each OSD with 
>> > around 4.4 TB of disk. and in total it is coming around 13.3 TB.
>> > Do you feel local RAID is an issue here? Keeping independent disks can 
>> > help recovery fast or increase the performance? please advice.
>>
>>
>> That is a very poor way to set up ceph storage.
>>
>>
>> --
>> May the most significant bit of your life be positive.
>
>
>
> --
> ~ Lokendra
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Open discussing: Designing 50GB/s CephFS or S3 ceph cluster

2021-10-21 Thread Christian Wuerdig
   - What is the expected file/object size distribution and count?
   - Is it write-once or modify-often data?
   - What's your overall required storage capacity?
   - 18 OSDs per WAL/DB drive seems a lot - recommended is ~6-8
   - With 12TB OSDs the recommended WAL/DB size is 120-480GB (1-4%) per OSD
   to avoid spillover - if you go RGW then you may want to aim more towards 4%
   since RGW can use quite a bit of OMAP data (especially when you store many
   small objects). Not sure about CephFS (some rough numbers at the end of this list)
   - So you may want to look at 4x NVME and probably 3.2TB instead of 1.6
  - Rule-of-thumb is 1 Thread per HDD OSD - so if you want to give
   yourself some extra wiggle room a 7402 might be better - especially since
   EC is a bit heavier on CPU
   - Running EC 8+3 with failure domain host means you should have at least
   12 nodes which means you'd need to push 4GB/sec/node which seems
   theoretically possible but is quite close to the network interface
   capacity. And whether you could actually push 4GB/sec into a node in this
   config I don't know. But overall 12 nodes seems like the minimum
   - With 12 nodes you have a raw storage capacity of around 5PB - assuming
   you don't run you cluster more than 80% full and EC 8+3 means max of 3PB
   usable data capacity (again assuming your objects are large enough to not
   cause significant space amplification wrt. bluestore min block size)
  - You will probably run more nodes than that so if you don't need the
  actual capacity then consider going replicated instead which generally
  performs better than EC
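
To put rough numbers on the WAL/DB sizing point above (back-of-the-envelope
only, based on the figures already mentioned):

   1% of 12TB = ~120GB per OSD, 4% of 12TB = ~480GB per OSD
   at 6-8 OSDs per NVMe that is ~0.7-1TB (1%) up to ~2.9-3.8TB (4%)
   of DB/WAL space per NVMe device

So the percentage you pick pretty much dictates how many NVMe devices of what
size you need per node.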


On Fri, 22 Oct 2021 at 05:24, huxia...@horebdata.cn 
wrote:

> Dear Cephers,
>
> I am thinking of designing a cephfs or S3 cluster, with a target to
> achieve a minimum of 50GB/s (write) bandwidth. For each node, I prefer 4U
> 36x 3.5" Supermicro server with 36x 12TB 7200K RPM HDDs, 2x Intel P4610
> 1.6TB NVMe SSD as DB/WAL, a single CPU socket with AMD 7302, and 256GB DDR4
> memory. Each node comes with 2x 25Gb networking in mode 4 bonded. 8+3 EC
> will be used.
>
> My questions are the following:
>
> 1   How many nodes should be deployed in order to achieve a minimum of
> 50GB/s, if possible, with the above hardware setting?
>
> 2   How many Cephfs MDS are required? (suppose 1MB request size), and how
> many clients are needed for reach a total of 50GB/s?
>
> 3   From the perspective of getting the maximum bandwidth, which one
> should i choose, CephFS or Ceph S3?
>
> Any comments, suggestions, or improvement tips are warmly welcome
>
> best regards,
>
> Samuel
>
>
>
> huxia...@horebdata.cn
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metrics for object sizes

2021-10-14 Thread Christian Rohmann

On 23/04/2021 03:53, Szabo, Istvan (Agoda) wrote:

Objects inside RGW buckets like in couch base software they have their own 
metrics and has this information.


Not as detailed as you would like, but how about using the bucket stats 
on bucket size and number of objects?

 $ radosgw-admin bucket stats --bucket mybucket


Doing bucket_size / number_of_objects gives you an average object size 
per bucket, and that certainly is an indication of 
buckets with rather small objects.
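
If you want that as a quick one-liner, something like the following should
work (field names are taken from the Octopus-era bucket stats JSON, so please
double-check them on your version; jq needs to be installed):

 $ radosgw-admin bucket stats --bucket mybucket \
 | jq '.usage."rgw.main".size_actual / .usage."rgw.main".num_objects'

which prints the average object size in bytes for that bucket.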



Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-06 Thread Christian Wuerdig
Hm, generally ceph is mostly latency sensitive which would more translate
into IOPs limits rather than bandwidth. In a single threaded write scenario
your throughput is limited by the latency of the write path which is
generally network + OSD write path + disk. People have managed to get write
latencies under 1ms on all-flash setups but around 0.8ms seems the best you
can achieve which generally puts an upper limit of ~1200 IOPS on a single
threaded client if you do direct synchronized IO. But there shouldn't
really be much in the path that artificially limits bandwidth.
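
To spell the arithmetic out: with direct, synchronous, queue-depth-1 IO the
client only ever has one write in flight, so 1s / 0.0008s per write = ~1250
writes/s - that's where the ~1200 IOPS ceiling for a single such client comes
from, and more parallelism (higher iodepth or more clients) is the only way
past it.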

Bluestore does deferred writes only for small writes - which are the writes
that will hit the WAL; writes larger than that will hit the backing store
(i.e. HDD) directly. I think the default is 32KB but I could be wrong.
Obviously even for small writes the WAL will eventually have to be flushed,
so your longer-term performance is still impacted by your HDD speed.
So that might be why the throughput suffers for larger block sizes, since
they will hit the drives directly.
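
If you want to verify what your build actually uses rather than trust my
memory, the thresholds can be read back from the cluster, e.g.:

 # ceph config get osd bluestore_prefer_deferred_size_hdd
 # ceph config show osd.0 | grep deferred

(option names as in recent releases - they may differ slightly between
versions).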

It's been pointed out in the past that disabling the HDD write cache can
actually improve latency quite substantially (e.g.
https://ceph-users.ceph.narkive.com/UU9QMu9W/disabling-write-cache-on-sata-hdds-reduces-write-latency-7-times)
- might be worth a try
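
For reference, the usual way to try that is something like (drive letter is
just an example, and the setting does not survive a reboot unless you persist
it via a udev rule or similar):

 # hdparm -W 0 /dev/sdX    # disable the volatile write cache on a SATA drive
 # hdparm -W /dev/sdX      # read back the current setting

For SAS drives, sdparm with WCE=0 is the rough equivalent.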


On Wed, 6 Oct 2021 at 10:07, Zakhar Kirpichenko  wrote:

> I'm not sure, fio might be showing some bogus values in the summary, I'll
> check the readings again tomorrow.
>
> Another thing I noticed is that writes seem bandwidth-limited and don't
> scale well with block size and/or number of threads. I.e. one clients
> writes at about the same speed regardless of the benchmark settings. A
> person on reddit, where I asked this question as well, suggested that in a
> replicated pool writes and reads are handled by the primary PG, which would
> explain this write bandwidth limit.
>
> /Z
>
> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, 
> wrote:
>
>> Maybe some info is missing but 7k write IOPs at 4k block size seem fairly
>> decent (as you also state) - the bandwidth automatically follows from that
>> so not sure what you're expecting?
>> I am a bit puzzled though - by my math 7k IOPS at 4k should only be
>> 27MiB/sec - not sure how the 120MiB/sec was achieved
>> The read benchmark seems in line with 13k IOPS at 4k making around
>> 52MiB/sec bandwidth which again is expected.
>>
>>
>> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko  wrote:
>>
>>> Hi,
>>>
>>> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
>>> and
>>> its performance is kind of disappointing. I would very much appreciate an
>>> advice and/or pointers :-)
>>>
>>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>>>
>>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
>>> 384 GB RAM
>>> 2 x boot drives
>>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
>>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
>>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
>>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>>>
>>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
>>> apparmor is disabled, energy-saving features are disabled. The network
>>> between the CEPH nodes is 40G, CEPH access network is 40G, the average
>>> latencies are < 0.15 ms. I've personally tested the network for
>>> throughput,
>>> latency and loss, and can tell that it's operating as expected and
>>> doesn't
>>> exhibit any issues at idle or under load.
>>>
>>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
>>> smaller NVME drives in each node used as DB/WAL and each HDD allocated .
>>> ceph osd tree output:
>>>
>>> ID   CLASS  WEIGHT     TYPE NAME           STATUS  REWEIGHT  PRI-AFF
>>>  -1         288.37488  root default
>>> -13         288.37488      datacenter ste
>>> -14         288.37488          rack rack01
>>>  -7          96.12495              host ceph01
>>>   0    hdd    9.38680                  osd.0      up      1.0      1.0
>>>   1    hdd    9.38680                  osd.1      up      1.0      1.0
>>>   2    hdd    9.38680                  osd.2      up      1.0      1.0
>>>   3    hdd    9.38680                  osd.3      up      1.0      1.0
>>>   4    hdd    9.38680                  osd.4      up      1.0      1.0
>>>   5    hdd    9.38680                  osd.5      up      1.0      1.0
>>>   6    hdd    9.38680                  osd.6

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Christian Wuerdig
Maybe some info is missing but 7k write IOPs at 4k block size seem fairly
decent (as you also state) - the bandwidth automatically follows from that
so not sure what you're expecting?
I am a bit puzzled though - by my math 7k IOPS at 4k should only be
27MiB/sec - not sure how the 120MiB/sec was achieved
The read benchmark seems in line with 13k IOPS at 4k making around
52MiB/sec bandwidth which again is expected.
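
For completeness, the back-of-the-envelope math behind those numbers:

   7,000 IOPS x 4 KiB = 28,000 KiB/s ≈ 27 MiB/s
   13,000 IOPS x 4 KiB = 52,000 KiB/s ≈ 51 MiB/s

so the read figure is roughly in line, while 120MiB/sec at 7k IOPS would imply
an effective IO size of ~17KiB rather than 4KiB - which is why it looks odd to
me.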


On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko  wrote:

> Hi,
>
> I built a CEPH 16.2.x cluster with relatively fast and modern hardware, and
> its performance is kind of disappointing. I would very much appreciate an
> advice and/or pointers :-)
>
> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>
> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> 384 GB RAM
> 2 x boot drives
> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>
> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> apparmor is disabled, energy-saving features are disabled. The network
> between the CEPH nodes is 40G, CEPH access network is 40G, the average
> latencies are < 0.15 ms. I've personally tested the network for throughput,
> latency and loss, and can tell that it's operating as expected and doesn't
> exhibit any issues at idle or under load.
>
> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
> smaller NVME drives in each node used as DB/WAL and each HDD allocated .
> ceph osd tree output:
>
> ID   CLASS  WEIGHT     TYPE NAME           STATUS  REWEIGHT  PRI-AFF
>  -1         288.37488  root default
> -13         288.37488      datacenter ste
> -14         288.37488          rack rack01
>  -7          96.12495              host ceph01
>   0    hdd    9.38680                  osd.0      up      1.0      1.0
>   1    hdd    9.38680                  osd.1      up      1.0      1.0
>   2    hdd    9.38680                  osd.2      up      1.0      1.0
>   3    hdd    9.38680                  osd.3      up      1.0      1.0
>   4    hdd    9.38680                  osd.4      up      1.0      1.0
>   5    hdd    9.38680                  osd.5      up      1.0      1.0
>   6    hdd    9.38680                  osd.6      up      1.0      1.0
>   7    hdd    9.38680                  osd.7      up      1.0      1.0
>   8    hdd    9.38680                  osd.8      up      1.0      1.0
>   9   nvme    5.82190                  osd.9      up      1.0      1.0
>  10   nvme    5.82190                  osd.10     up      1.0      1.0
> -10          96.12495              host ceph02
>  11    hdd    9.38680                  osd.11     up      1.0      1.0
>  12    hdd    9.38680                  osd.12     up      1.0      1.0
>  13    hdd    9.38680                  osd.13     up      1.0      1.0
>  14    hdd    9.38680                  osd.14     up      1.0      1.0
>  15    hdd    9.38680                  osd.15     up      1.0      1.0
>  16    hdd    9.38680                  osd.16     up      1.0      1.0
>  17    hdd    9.38680                  osd.17     up      1.0      1.0
>  18    hdd    9.38680                  osd.18     up      1.0      1.0
>  19    hdd    9.38680                  osd.19     up      1.0      1.0
>  20   nvme    5.82190                  osd.20     up      1.0      1.0
>  21   nvme    5.82190                  osd.21     up      1.0      1.0
>  -3          96.12495              host ceph03
>  22    hdd    9.38680                  osd.22     up      1.0      1.0
>  23    hdd    9.38680                  osd.23     up      1.0      1.0
>  24    hdd    9.38680                  osd.24     up      1.0      1.0
>  25    hdd    9.38680                  osd.25     up      1.0      1.0
>  26    hdd    9.38680                  osd.26     up      1.0      1.0
>  27    hdd    9.38680                  osd.27     up      1.0      1.0
>  28    hdd    9.38680                  osd.28     up      1.0      1.0
>  29    hdd    9.38680                  osd.29     up      1.0      1.0
>  30    hdd    9.38680                  osd.30     up      1.0      1.0
>  31   nvme    5.82190                  osd.31     up      1.0      1.0
>  32   nvme    5.82190                  osd.32     up      1.0      1.0
>
> ceph df:
>
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
> hdd    253 TiB  241 TiB  13 TiB  13 TiB    5.00
> nvme    35 TiB   35 TiB  82 GiB  82 GiB    0.23
> TOTAL  288 TiB  276 TiB  13 TiB  13 TiB    4.42
>
> --- POOLS ---
> POOL     ID  PGS  STORED   OBJECTS   USED     %USED  MAX AVAIL
> images   12  256   24 GiB    3.15k    73 GiB   0.03     76 TiB
> volumes  13  256  839 GiB  232.16k   2.5 

[ceph-users] Re: Erasure coded pool chunk count k

2021-10-05 Thread Christian Wuerdig
A couple of notes to this:

Ideally you should have at least 2 more failure domains than your base
resilience (K+M for EC or size=N for replicated) - reasoning: Maintenance
needs to be performed so chances are every now and then you take a host
down for a few hours or possibly days to do some upgrade, fix some broken
things, etc. This means you're running in degraded state since only K+M-1
shards are available. While in that state a drive in another host dies on
you. Now recovery for that is blocked because you have insufficient failure
domains available and things start getting a bit uncomfortable depending on
how large M is. Or a whole host dies on you in that state ...
Generally planning your cluster resources right along the fault lines is
going to bite you and cause high levels of stress and anxiety. I know -
budgets have a limit but still, there is plenty of history on this list for
desperate calls for help simply because clusters were only planned for the
happy day case.

Unlike replicated pools you cannot change your profile on an EC-pool after
it has been created - so if you decide to change EC profile this means
creating a new pool and migrating. Just something to keep in mind.
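
As a concrete illustration (profile name, k/m values, failure domain and pg
count are only examples - pick whatever matches your cluster):

 # ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
 # ceph osd pool create mypool.ec 64 64 erasure ec-4-2

The profile is baked into the pool at creation time - you can define new
profiles later, but they won't apply to the existing pool, hence the
create-a-new-pool-and-migrate route.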

On Tue, 5 Oct 2021 at 14:58, Anthony D'Atri  wrote:

>
> The larger the value of K relative to M, the more efficient the raw ::
> usable ratio ends up.
>
> There are tradeoffs and caveats.  Here are some of my thoughts; if I’m
> off-base here, I welcome enlightenment.
>
>
>
> When possible, it’s ideal to have at least K+M failure domains — often
> racks, sometimes hosts, chassis, etc.  Thus smaller clusters, say with 5-6
> nodes, aren’t good fits for larger sums of K+M if your data is valuable.
>
> Larger sums of K+M also mean that more drives will be touched by each read
> or write, especially during recovery.  This could be a factor if one is
> IOPS-limited.  Same with scrubs.
>
> When using a pool for, eg. RGW buckets, larger sums of K+M may result in
> greater overhead when storing small objects, since Ceph / RGW only AIUI
> writes full stripes.  So say you have an EC pool of 17,3 on drives with the
> default 4kB bluestore_min_alloc_size.  A 1kB S3 object would thus allocate
> 17+3=20 x 4kB == 80kB of storage, which is 7900% overhead.  This is an
> extreme example to illustrate the point.
>
> Larger sums of K+M may present more IOPs to each storage drive, dependent
> on workload and the EC plugin selected.
>
> With larger objects (including RBD) the modulo factor is dramatically
> smaller.  One’s use-case and dataset per-pool may thus inform the EC
> profiles that make sense; workloads that are predominately smaller objects
> might opt for replication instead.
>
> There was a post ….. a year ago? suggesting that values with small prime
> factors are advantageous, but I never saw a discussion of why that might be.
>
> In some cases where one might be pressured to use replication with only 2
> copies of data, a 2,2 EC profile might achieve the same efficiency with
> greater safety.
>
> Geo / stretch clusters or ones in challenging environments are a special
> case; they might choose values of M equal to or even larger than K.
>
> That said, I think 4,2 is a reasonable place to *start*, adjusted by one’s
> specific needs.  You get a raw :: usable ratio of 1.5 without getting too
> complicated.
>
> ymmv
>
>
>
>
>
>
> >
> > Hi,
> >
> > It depends on hardware, failure domain, use case, overhead.
> >
> > I don’t see an easy way to chose k and m values.
> >
> > -
> > Etienne Menguy
> > etienne.men...@croit.io
> >
> >
> >> On 4 Oct 2021, at 16:57, Golasowski Martin 
> wrote:
> >>
> >> Hello guys,
> >> how does one estimate number of chunks for erasure coded pool ( k = ? )
> ? I see that number of m chunks determines the pool’s resiliency, however I
> did not find clear guideline how to determine k.
> >>
> >> Red Hat states that they support only the following combinations:
> >>
> >> k=8, m=3
> >> k=8, m=4
> >> k=4, m=2
> >>
> >> without any rationale behind them.
> >> The table is taken from
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/storage_strategies_guide/erasure_code_pools
> .
> >>
> >> Thanks!
> >>
> >> Regards,
> >> Martin
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


  1   2   >