[ceph-users] Re: How does mclock work?

2024-01-09 Thread Sridhar Seshasayee
Hello Frédéric,

Please see answers below.


> Could someone please explain how mclock works regarding reads and writes?
> Does mclock intervene on both read and write iops? Or only on reads or only
> on writes?
>

mClock schedules both read and write ops.


>
> And what type of underlying hardware performance is calculated and
> considered by mclock? Seems to be only write performance.
>

Random write performance is considered for setting the maximum IOPS
capacity of an OSD. This, along with the sequential bandwidth
capability of the OSD, is used to calculate the cost per IO that is
internally used by mClock for scheduling ops. In addition, the mClock
profiles use the capacity information to allocate reservation and limit for
different classes of service (e.g., client, background-recovery,
scrub, snaptrim, etc.).

The write performance is used to set a lower bound on the amount of
bandwidth to be allocated to different classes of service. For example,
the 'balanced' profile allocates 50% of the OSD's IOPS capacity to client
ops. In other words, a minimum guarantee of 50% of the OSD's
bandwidth is allocated to client ops (read or write). If you look at the
'balanced' profile, there is no upper limit set for client ops (i.e., it is
set to MAX), which means that reads can potentially use the maximum possible
bandwidth (i.e., not constrained by the max IOPS capacity) if there
are no other competing ops.

Please see
https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/#built-in-profiles
for more information about mClock profiles.
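For reference, the profile and the capacity values mClock is working with can
be inspected and, if needed, overridden per OSD. A rough sketch (option names
as in the Reef docs linked above; the values below are only placeholders to
adapt to your hardware):

# show the scheduler, profile and capacity settings an OSD is using
ceph config show osd.0 | grep -E 'osd_op_queue|osd_mclock'

# switch the mClock profile cluster-wide, if desired
ceph config set osd osd_mclock_profile balanced

# override the measured random-IOPS capacity for a specific OSD if the
# auto-detected value does not reflect the device
ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350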


> The mclock documentation shows HDDs and SSDs specific configuration
> options (capacity and sequential bandwidth) but nothing regarding hybrid
> setups and these configuration options do not distinguish reads and writes.
> But read and write performance are often not on par for a single drive and
> even less when using hybrid setups.
>
> With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if
> mclock only considers write performance, it may fail to properly schedule
> read iops (does mclock schedule read iops?) as the calculated iops capacity
> would be way too high for reads.
>
> With HDD only setups (RocksDB+WAL+Data on HDD), if mclock only considers
> write performance, the OSD may not take advantage of higher read
> performance.
>
> Can someone please shed some light on this?
>

As mentioned above, as long as there are no competing ops, the mClock
profiles ensure that there is nothing constraining client
ops from using the full available bandwidth of an OSD for both reads and
writes, regardless of the type of setup (hybrid, HDD,
SSD) being employed. The important aspect is to ensure that the configured
IOPS capacity for the OSD is a fairly accurate
representation of the underlying device's capability. This is because the
reservation criteria based on the IOPS capacity help
maintain an acceptable level of performance when other active ops are competing.

You could run some synthetic benchmarks to confirm the above, i.e., that read
and write performance are in line with expectations under the default mClock
profile.
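For example, with rados bench (the pool name and runtimes are placeholders):

# write benchmark on a test pool, keeping the objects for the read tests
rados bench -p testpool 60 write --no-cleanup

# sequential and random read benchmarks against the objects written above
rados bench -p testpool 60 seq
rados bench -p testpool 60 rand

# remove the benchmark objects afterwards
rados -p testpool cleanup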

I hope this helps.

-Sridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: About ceph disk slowops effect to cluster

2024-01-09 Thread David Yang
The 2*10Gbps shared network seems to be full (1.9GB/s).
Is it possible to reduce part of the workload and wait for the cluster
to return to a healthy state?
Tip: Erasure coding needs to collect all data blocks when recovering
data, so it takes up a lot of network card bandwidth and processor
resources.
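If it is the recovery/backfill traffic itself that saturates the links, it
can also be throttled while the client workload is reduced. A sketch for
recent releases that use the mClock scheduler (values are examples only):

# allow manual recovery/backfill limits to take effect under mClock
ceph config set osd osd_mclock_override_recovery_settings true

# reduce recovery/backfill concurrency per OSD
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1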
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures

2024-01-09 Thread Szabo, Istvan (Agoda)
Hi,

I'm using something like this in the HAProxy HTTPS frontend config; it has
worked well so far:

stick-table type ip size 1m expire 10s store http_req_rate(10s)

tcp-request inspect-delay 10s
tcp-request content track-sc0 src
http-request deny deny_status 429 if { sc_http_req_rate(0) gt 1 }
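
A quick way to verify the limit from a client machine (the endpoint URL is a
placeholder):

# fire a burst of requests; once the per-IP rate is exceeded, HAProxy
# should answer 429 instead of passing the request on to RGW
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{http_code}\n' https://rgw.example.com/
done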


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---




From: Christian Rohmann 
Sent: Tuesday, January 9, 2024 3:33 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: RGW rate-limiting or anti-hammering for (external) 
auth requests // Anti-DoS measures

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Happy New Year Ceph-Users!

With the holidays and people likely being away, I take the liberty to
bluntly BUMP this question about protecting RGW from DoS below:


On 22.12.23 10:24, Christian Rohmann wrote:
> Hey Ceph-Users,
>
>
> RGW does have options [1] to rate limit ops or bandwidth per bucket or
> user.
> But those only come into play when the request is authenticated.
>
> I'd like to also protect the authentication subsystem from malicious
> or invalid requests.
> So in case e.g. some EC2 credentials are not valid (anymore) and
> clients start hammering the RGW with those requests, I'd like to make
> it cheap to deal with those requests. Especially in case some external
> authentication like OpenStack Keystone [2] is used, valid access
> tokens are cached within the RGW. But requests with invalid
> credentials end up being sent at full rate to the external API [3] as
> there is no negative caching. And even if there was, that would only
> limit the external auth requests for the same set of invalid
> credentials, but it would surely reduce the load in that case:
>
> Since the HTTP request is blocking  
>
>
>> [...]
>> 2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to
>> https://keystone.example.com/v3/s3tokens
>> 2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request
>> mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
>> 2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http
>> request
>> 2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request
>> req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0
>> [...]
>
>
> this not only stresses the external authentication API (Keystone in
> this case), but also blocks RGW threads for the duration of the
> external call.
>
> I am currently looking into using the capabilities of HAProxy to rate
> limit requests based on their resulting http-response [4]. So in
> essence to rate-limit or tarpit clients that "produce" a high number
> of 403 "InvalidAccessKeyId" responses. To have less collateral it
> might make sense to limit based on the presented credentials
> themselves. But this would require to extract and track HTTP headers
> or URL parameters (presigned URLs) [5] and to put them into tables.
>
>
> * What are your thoughts on the matter?
> * What kind of measures did you put in place?
> * Does it make sense to extend RGWs capabilities to deal with those
> cases itself?
> ** adding negative caching
> ** rate limits on concurrent external authentication requests (or is
> there a pool of connections for those requests?)
>
>
>
> Regards
>
>
> Christian
>
>
>
> [1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
> [2]
> https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
> [3]
> https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
> [4]
> https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
> [5]
> https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Join us for the User + Dev Monthly Meetup - January 18th!

2024-01-09 Thread Laura Flores
Hi Ceph users and developers,

You are invited to join us at the User + Dev meeting this week Thursday,
January 18th at 10:00 AM Eastern Time! See below for more meeting details.

The focus topic, "Ceph Feature Request from the DKIST Data Center: Add a
service backed by tape that is analogous to AWS Glacier", will be presented
by Joel Davidow, a Ceph operator from the National Solar Observatory. In
his talk, he will propose a feature request to add support for tape as a
storage class with lifecycle management for object storage.

Feel free to add questions or additional topics under the "Open Discussion"
section on the agenda: https://pad.ceph.com/p/ceph-user-dev-monthly-minutes

If you have an idea for a focus topic you'd like to present at a future
meeting, you are welcome to submit it to this Google Form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4vJDGBrp6d-D3-BlQ/viewform?usp=sf_link
Any Ceph user or developer is eligible to submit!

Thanks,
Laura Flores

Meeting link: https://meet.jit.si/ceph-user-dev-monthly

Time Conversions:
UTC:   Thursday, January 18, 15:00 UTC
Mountain View, CA, US: Thursday, January 18,  7:00 PST
Phoenix, AZ, US:   Thursday, January 18,  8:00 MST
Denver, CO, US:Thursday, January 18,  8:00 MST
Huntsville, AL, US:Thursday, January 18,  9:00 CST
Raleigh, NC, US:   Thursday, January 18, 10:00 EST
London, England:   Thursday, January 18, 15:00 GMT
Paris, France: Thursday, January 18, 16:00 CET
Helsinki, Finland: Thursday, January 18, 17:00 EET
Tel Aviv, Israel:  Thursday, January 18, 17:00 IST
Pune, India:   Thursday, January 18, 20:30 IST
Brisbane, Australia:   Friday, January 19,  1:00 AEST
Singapore, Asia:   Thursday, January 18, 23:00 +08
Auckland, New Zealand: Friday, January 19,  4:00 NZDT

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds crashes after up:replay state

2024-01-09 Thread Paul Mezzanini

There isn't one specific thing I can point my finger at that would be "_this_ 
is where all the pain comes from".  Some of these issues are also our own 
doing.   We have been getting too comfortable seeing the cluster in HEALTH_WARN 
with "1 clients failing to respond to capability release" and the like.  Some 
of these are client bugs, some are mds issues, others are issues with HPC 
cluster job workflow.  Getting comfortable driving with the check engine light 
on is how you end up with a new ventilated engine block.

What I can talk about is that when we get into a state with a huge journal / 
lots of log segments and experience a failure, the recovery is painful. Similar 
to what Lars ran into, there is a missing heartbeat in the recovery path. I 
can't remember off the top of my head which state it is in 
(replay, reconnect, rejoin, etc.). This alone has stopped our cluster from 
coming back automatically after an off-hours failure.
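
One way to buy time in that situation is to temporarily raise the MDS beacon
grace so the recovering rank is not failed over, and to stop the standbys so
it cannot migrate. A rough sketch only (the grace value and the daemon name
are placeholders; adjust to your deployment):

# let the recovering MDS miss beacons for up to 10 minutes
ceph config set global mds_beacon_grace 600

# stop a standby MDS daemon so the rank cannot move (cephadm deployments)
ceph orch daemon stop mds.cephfs.host2.xyzabc

# revert once the MDS is active again
ceph config rm global mds_beacon_grace
ceph orch daemon start mds.cephfs.host2.xyzabc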

In recovery, the memory usage is easily 3x what was being used at the time of 
the crash. Lots of swap is slow, but it at least gets you out of jail on this 
one. What can't be maneuvered around is how much of the recovery process is 
single threaded. One thread goes into overdrive while memory slowly goes over 
the edge. This being recovery, however, I can understand being careful.
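
For anyone hitting the same wall, the "lots of swap" escape hatch is just a
big temporary swap file on the MDS host, e.g. (the size is a placeholder;
works on filesystems that support fallocate-backed swap, otherwise use dd):

# add a large temporary swap file so the recovering MDS is not OOM-killed
fallocate -l 128G /var/tmp/mds-swap
chmod 600 /var/tmp/mds-swap
mkswap /var/tmp/mds-swap
swapon /var/tmp/mds-swap

# remove it again once the MDS has settled
swapoff /var/tmp/mds-swap && rm /var/tmp/mds-swap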

In operation, we've been bitten by the finisher thread numerous times, 
specifically when removing a large number of empty directories while snapshots 
exist. This is what was going on when we hit our last outage with the huge 
journal and thus the painful recovery.


--

Paul Mezzanini
Platform Engineer III
Research Computing

Rochester Institute of Technology

 “End users is a description, not a goal.”






From: Milind Changire 
Sent: Sunday, January 7, 2024 10:54 PM
To: Paul Mezzanini
Cc: Lars Köppel; Patrick Donnelly; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds crashes after up:replay state

Hi Paul,
Could you create a ceph tracker (tracker.ceph.com) and list out things
that are suboptimal according to your investigation?
We'd like to hear more on this.

Alternatively, you could list the issues with mds here.

Thanks,
Milind

On Sun, Jan 7, 2024 at 4:37 PM Paul Mezzanini  wrote:
>
> We've seen it use as much as 1.6 TB of RAM/swap. Swap makes it slow, but a 
> slow recovery is better than no recovery. My coworker looked into it at the 
> source code level, and while it is doing some things suboptimally, that's how 
> it's currently written.
>
> The MDS code needs some real love if ceph is going to offer file services 
> that can match what the back end storage can actually provide.
>
> --
>
> Paul Mezzanini
> Platform Engineer III
> Research Computing
>
> Rochester Institute of Technology
>
>  Sent from my phone, please excuse typos and brevity
> 
> From: Lars Köppel 
> Sent: Sunday, January 7, 2024 4:20:05 AM
> To: Paul Mezzanini 
> Cc: Patrick Donnelly ; ceph-users@ceph.io 
> 
> Subject: Re: [ceph-users] Re: mds crashes after up:replay state
>
> Hi Paul,
>
> your suggestion was correct. The mds went through the replay state and was 
> active for a few minutes. But then it got killed because of too high 
> memory consumption.
> @mds.cephfs.storage01.pgperp.service: Main process exited, code=exited, 
> status=137/n/a
> How could I raise the memory limit for the mds?
>
> From the looks of it in htop, there seems to be a memory leak, because it 
> consumed over 200 GB of memory while reporting that it actually used 20 - 30 
> GB.
> Is this possible?
>
> Best regards
> Lars
>
>
> [ariadne.ai Logo]   Lars Köppel
> Developer
> Email:  lars.koep...@ariadne.ai
> Phone:  +49 6221 5993580
> ariadne.ai (Germany) GmbH
> Häusserstraße 3, 69115 Heidelberg
> Amtsgericht Mannheim, HRB 744040
> Geschäftsführer: Dr. Fabian Svara
> https://ariadne.ai
>
>
> On Sat, Jan 6, 2024 at 3:33 PM Paul Mezzanini 
> mailto:pfm...@rit.edu>> wrote:
> I'm replying from my phone so hopefully this works well.  This sounds 
> suspiciously similar to an issue we have run into where there is an internal 
> loop in the MDS that doesn't have heartbeat in it. If that loop goes for too 
> long, it is marked as failed and the process jumps to another server and 
> starts again.
>
> We get around it by "wedging it in a corner" and removing the ability to 
> migrate. This is as simple as stopping all standby MDS services and just 
> waiting for the MDS to complete.
>
>
>
> --
>
> Paul Mezzanini
> Platform Engineer III
> Research Computing
>
> Rochester Institute of Technology
>
>  Sent from my phone, please excuse typos and brevity
> 
> From: Lars Köppel mailto:lars.koep...@ariadne.ai>>
> Sent: Saturday, January 6, 2024 7:22:14 AM
> To: Patrick Donnelly mailto:pdonn...@redhat.com>>
> Cc: ceph-users@ceph.io 
> mailto:ceph-users@ceph.io>>
> 

[ceph-users] Re: How does mclock work?

2024-01-09 Thread Anthony D'Atri
There was a client SSD sorta like that, a bit of Optane with TLC or QLC, but it 
didn't seem to sell well.  Optane was groovy tech, but with certain challenges 
as well.

> On Jan 9, 2024, at 14:30, Mark Nelson  wrote:
> 
> With HDDs and a lot of metadata, it's tough to get away from it imho.  In an 
> alternate universe it would have been really neat if Intel could have worked 
> with the HDD vendors to put like 16GB of user accessible optane on every HDD. 
>  Enough for the WAL and L0 (and maybe L1).
> 
> 
> Mark
> 
> 
> On 1/9/24 08:53, Anthony D'Atri wrote:
>> Not strictly an answer to your worthy question, but IMHO this supports my 
>> stance that hybrid OSDs aren't worth the hassle.
>> 
>>> On Jan 9, 2024, at 06:13, Frédéric Nass  
>>> wrote:
>>> 
>>> With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if 
>>> mclock only considers write performance, it may fail to properly schedule 
>>> read iops (does mclock schedule read iops?) as the calculated iops capacity 
>>> would be way too high for reads.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How does mclock work?

2024-01-09 Thread Mark Nelson
With HDDs and a lot of metadata, it's tough to get away from it imho.  
In an alternate universe it would have been really neat if Intel could 
have worked with the HDD vendors to put like 16GB of user accessible 
optane on every HDD.  Enough for the WAL and L0 (and maybe L1).



Mark


On 1/9/24 08:53, Anthony D'Atri wrote:

Not strictly an answer to your worthy question, but IMHO this supports my 
stance that hybrid OSDs aren't worth the hassle.


On Jan 9, 2024, at 06:13, Frédéric Nass  wrote:

With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if mclock 
only considers write performance, it may fail to properly schedule read iops 
(does mclock schedule read iops?) as the calculated iops capacity would be way 
too high for reads.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2024-01-09 Thread Frank Schilder
Quick answers:


  *   ... osd_deep_scrub_randomize_ratio ... but not on Octopus: is it still a 
valid parameter?

Yes, this parameter exists and can be used to prevent premature deep-scrubs. 
The effect is dramatic.


  *   ... essentially by playing with osd_scrub_min_interval,...

The main parameter is actually osd_deep_scrub_randomize_ratio; all other 
parameters have less effect in terms of scrub load. osd_scrub_min_interval is 
the second most important parameter and needs increasing for large 
SATA-/NL-SAS HDDs. For sufficiently fast drives the default of 24h is good 
(although it might be a bit aggressive/paranoid).


  *   Another small question: you opt for osd_max_scrubs=1 just to make sure
your I/O is not adversely affected by scrubbing, or is there a more
profound reason for that?

Well, not affecting user IO too much is quite a profound reason, and many admins 
try to avoid scrubbing at all while users are on the system. It makes IO 
somewhat unpredictable and can trigger user complaints.

However, there is another profound reason: for HDDs it increases deep-scrub 
load (that is, interference with user IO) a lot while it actually slows down 
the deep-scrubbing. HDDs can't handle the implied random IO of concurrent 
deep-scrubs well. On my system I saw that with osd_max_scrubs=2 the scrub time 
for a PG more than doubled. In other words: more scrub load, 
less scrub progress = useless, do not do this.
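
Put together as config commands, the combination above looks roughly like
this (the interval values are examples for large HDDs and need tuning per
cluster):

# stop random early deep-scrubs so osd_deep_scrub_interval acts as a
# minimum interval between deep-scrubs of a PG
ceph config set osd osd_deep_scrub_randomize_ratio 0

# one (deep-)scrub per OSD at a time
ceph config set osd osd_max_scrubs 1

# give large HDDs more time between shallow scrubs (example: 3 days)
ceph config set osd osd_scrub_min_interval 259200

# deep-scrub each PG no more often than every ~2 weeks (example value)
ceph config set osd osd_deep_scrub_interval 1209600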

I plan to document the script a bit more and am waiting for some deep-scrub 
histograms to converge to equilibrium. This takes months for our large pools, 
but I would like to have the numbers for an example of how it should look.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Fulvio Galeazzi
Sent: Monday, January 8, 2024 4:21 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: How to configure something like 
osd_deep_scrub_min_interval?

Hallo Frank,
just found this post, thank you! I have also been puzzled/struggling
with scrub/deep-scrub and found your post very useful: I will give this a
try soon.

One thing, first: I am using Octopus, too, but I cannot find any
documentation about osd_deep_scrub_randomize_ratio. I do see that in
past releases, but not on Octopus: is it still a valid parameter?

Let me check whether I understood your procedure: you optimize scrub
time distribution essentially by playing with osd_scrub_min_interval,
thus "forcing" the automated algorithm to preferentially select
older-scrubbed PGs, am I correct?

Another small question: you opt for osd_max_scrubs=1 just to make sure
your I/O is not adversely affected by scrubbing, or is there a more
profound reason for that?

   Thanks!

Fulvio

On 12/13/23 13:36, Frank Schilder wrote:
> Hi all,
>
> since there seems to be some interest, here some additional notes.
>
> 1) The script is tested on octopus. It seems that there was a change in the 
> output of ceph commands used and it might need some tweaking to get it to 
> work on other versions.
>
> 2) If you want to give my findings a shot, you can do so in a gradual way. 
> The most important change is setting osd_deep_scrub_randomize_ratio=0 (with 
> osd_max_scrubs=1), this will make osd_deep_scrub_interval work exactly as the 
> requested osd_deep_scrub_min_interval setting, PGs with a deep-scrub stamp 
> younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the 
> one change to test, all other settings have less impact. The script will not 
> report some numbers at the end, but the histogram will be correct. Let it run 
> a few deep-scrub-interval rounds until the histogram is evened out.
>
> If you start your test after using osd_max_scrubs>1 for a while -as I did - 
> you will need a lot of patience and might need to mute some scrub warnings 
> for a while.
>
> 3) The changes are mostly relevant for large HDDs that take a long time to 
> deep-scrub (many small objects). The overall load reduction, however, is 
> useful in general.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Fulvio Galeazzi
GARR-Net Department
tel.: +39-334-6533-250
skype: fgaleazzi70
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How does mclock work?

2024-01-09 Thread Anthony D'Atri
Not strictly an answer to your worthy question, but IMHO this supports my 
stance that hybrid OSDs aren't worth the hassle.  

> On Jan 9, 2024, at 06:13, Frédéric Nass  
> wrote:
> 
> With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if mclock 
> only considers write performance, it may fail to properly schedule read iops 
> (does mclock schedule read iops?) as the calculated iops capacity would be 
> way too high for reads.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-09 Thread Igor Fedotov

Hi Marek,

I haven't looked through those upgrade logs yet, but here are some 
comments regarding the last OSD startup attempt.


First of all, answering your question:


_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)



Is it a mandatory part of fsck?


This is caused by a previous non-graceful OSD process shutdown. BlueStore is 
unable to find an up-to-date allocation map and recovers it from RocksDB. And 
since fsck is a read-only procedure, the recovered allocmap is not saved - hence 
all the following BlueStore startups (within fsck or OSD init) trigger another 
rebuild attempt. To avoid that you might want to run repair instead of fsck - 
this will persist the up-to-date allocation map and avoid rebuilding it on the 
next startup. This will only work until the next non-graceful shutdown - hence 
an unsuccessful OSD startup attempt might break the allocmap state again.
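
For example (a sketch reusing the path from your fsck runs; make sure the OSD
container is stopped first):

# persist an up-to-date allocation map so the next startup does not need
# the full recovery from onodes
ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command repair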

Secondly, looking at the OSD startup log, one can see that the actual OSD log 
ends with that allocmap recovery as well:


2024-01-09T11:25:30.718449+01:00 osd1 ceph-osd[1734062]: 
bluestore(/var/lib/ceph/osd/ceph-1) _init_alloc::NCB::restore_allocator() 
failed! Run Full Recovery from ONodes (might take a while) ...


Subsequent log line indicating OSD daemon termination is from systemd:

2024-01-09T11:25:33.516258+01:00 osd1 systemd[1]: Stopping 
ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service - Ceph osd.1 for 
2c565e24-7850-47dc-a751-a6357cbbaf2a...


And honestly, these lines provide almost no clue as to why the termination 
happened. No obvious OSD failures or similar are shown. Perhaps the 
containerized environment hides the details, e.g. by cutting off the tail of 
the OSD log.
So you might want to proceed with the investigation by running repair prior to 
starting the OSD, as per the above. This will skip the allocmap recovery and 
hopefully work around the problem during startup - if the issue is indeed 
caused by the allocmap recovery.
Additionally, you might want to increase the debug_bluestore log level for 
osd.1 before starting it up, to get more insight into what's happening.

Alternatively, you might want to play with the OSD log target settings to 
write the osd.1 log to a file rather than using the system-wide logging 
infrastructure - hopefully this will be more helpful.
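
For instance, something along these lines before the next start attempt (a
sketch; the log path is just an example):

# more BlueStore detail and a dedicated log file for osd.1
ceph config set osd.1 debug_bluestore 5/20
ceph config set osd.1 log_to_file true
ceph config set osd.1 log_file /var/log/ceph/osd.1-debug.log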

Thanks,
Igor

On 09/01/2024 13:31, Jan Marek wrote:

Hi Igor,

I've sent you logs via filesender.cesnet.cz, if someone would
be interested, they are here:

https://filesender.cesnet.cz/?s=download=047b1ec4-4df0-4e8a-90fc-31706eb168a4

Some points:

1) I've found that the osd1 server had the wrong time (3 minutes in the
future). I've corrected that. Yes, I know that's bad, but we moved the
servers to another network segment where they have no access to the time
servers on the Internet, so I had to reconfigure them to use our own NTP
servers.

2) I've tried to start the osd.1 service with this sequence:

a)

ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

(without setting the log properly :-( )

b)

export CEPH_ARGS="--log-file osd.1.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

- here I have one question: why does this log still contain this line:

_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)

Is it a mandatory part of fsck?

Log is attached.

c)

systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service

still crashing, gzip-ed log attached too.

Many thanks for exploring the problem.

Sincerely
Jan Marek

Dne Po, led 08, 2024 at 12:00:05 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed fsck logs for the OSDs other than osd.0 look good so it would be
interesting to see OSD startup logs for them. Preferably to have that for
multiple (e.g. 3-4) OSDs to get the pattern.

Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file
sharing site for that.


Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I searched the logs and found that I have logs from 22.12.2023,
when I did the upgrade (I have set logging to journald).

Would you be interested in those logs? The file is 30 MB in
bzip2 format; how can I share it with you?

It contains the crash log from starting osd.1 too, but I can cut that out
of it and send it to the list...

Sincerely
Jan Marek

Dne Čt, led 04, 2024 at 02:43:48 CET napsal(a) Jan Marek:

Hi Igor,

I've run this one-liner:

for i in {0..12}; do
  export CEPH_ARGS="--log-file osd.${i}.log --debug-bluestore 5/20"
  ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} --command fsck
done

On osd.0 it crashed very quickly; on osd.1 it is still working.

I've send those logs in one e-mail.

But!

I've tried to list the disk devices in the monitor view, and I've got a
very interesting screenshot - some parts I've emphasized with red
rectangles.


[ceph-users] How does mclock work?

2024-01-09 Thread Frédéric Nass

  
 
Hello, 
  
Could someone please explain how mclock works regarding reads and writes? Does 
mclock intervene on both read and write iops? Or only on reads or only on 
writes? 
  
And what type of underlying hardware performance is calculated and considered 
by mclock? Seems to be only write performance. 
  
The mclock documentation shows HDDs and SSDs specific configuration options 
(capacity and sequential bandwidth) but nothing regarding hybrid setups and 
these configuration options do not distinguish reads and writes. But read and 
write performance are often not on par for a single drive and even less when 
using hybrid setups. 
  
With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if mclock 
only considers write performance, it may fail to properly schedule read iops 
(does mclock schedule read iops?) as the calculated iops capacity would be way 
too high for reads. 
  
With HDD only setups (RocksDB+WAL+Data on HDD), if mclock only considers write 
performance, the OSD may not take advantage of higher read performance. 
  
Can someone please shed some light on this? 
  
Best regards,   
 
Frédéric Nass 

Sous-direction Infrastructures et Services
Direction du Numérique 
Université de Lorraine
Tél : +33 3 72 74 11 35  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures

2024-01-09 Thread Christian Rohmann

Happy New Year Ceph-Users!

With the holidays and people likely being away, I take the liberty to 
bluntly BUMP this question about protecting RGW from DoS below:



On 22.12.23 10:24, Christian Rohmann wrote:

Hey Ceph-Users,


RGW does have options [1] to rate limit ops or bandwidth per bucket or 
user.

But those only come into play when the request is authenticated.

I'd like to also protect the authentication subsystem from malicious 
or invalid requests.
So in case e.g. some EC2 credentials are not valid (anymore) and 
clients start hammering the RGW with those requests, I'd like to make 
it cheap to deal with those requests. Especially in case some external 
authentication like OpenStack Keystone [2] is used, valid access 
tokens are cached within the RGW. But requests with invalid 
credentials end up being sent at full rate to the external API [3] as 
there is no negative caching. And even if there was, that would only 
limit the external auth requests for the same set of invalid 
credentials, but it would surely reduce the load in that case:


Since the HTTP request is blocking  



[...]
2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to 
https://keystone.example.com/v3/s3tokens
2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request 
mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http 
request
2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request 
req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0

[...]



this not only stresses the external authentication API (Keystone in 
this case), but also blocks RGW threads for the duration of the 
external call.


I am currently looking into using the capabilities of HAProxy to rate 
limit requests based on their resulting http-response [4]. So in 
essence to rate-limit or tarpit clients that "produce" a high number 
of 403 "InvalidAccessKeyId" responses. To have less collateral it 
might make sense to limit based on the presented credentials 
themselves. But this would require to extract and track HTTP headers 
or URL parameters (presigned URLs) [5] and to put them into tables.



* What are your thoughts on the matter?
* What kind of measures did you put in place?
* Does it make sense to extend RGWs capabilities to deal with those 
cases itself?

** adding negative caching
** rate limits on concurrent external authentication requests (or is 
there a pool of connections for those requests?)




Regards


Christian



[1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
[2] 
https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
[3] 
https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
[4] 
https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
[5] 
https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io