[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-15 Thread Chris Smart
On Mon, 2022-08-15 at 09:00 +, Frank Schilder wrote:
> Hi Chris,

> 

Hi Frank, thanks for the reply.

> I also have serious problems identifying problematic ceph-fs clients
> (using mimic). I don't think that even in the newest ceph version
> there are useful counters for that. Just last week I had the case
> that a client caused an all-time peak in cluster load and I was not
> able to locate the client due to the lack of useful rate counters.
> There are two problems with ceph fs' load monitoring: the complete
> lack of rate-based IO load counters down to client+PID level, and the
> fact that the warnings generated actually flag the wrong clients.
> 

Yikes, sounds familiar...

> The hallmark of the last problem is basically explained in this
> thread, specifically, this message:
> 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/TWNF2PWM7SONLCT4OLAJLMLXHK3ABPUB/
> 
> It states that warnings are generated for *inactive* clients, not for
> clients that are actually causing the trouble. Worse yet, the
> proposed solution counteracts the problem that MDS client caps recall
> is usually way too slow. I had to increase it to 64K just to get the
> MDS cache balanced, because MDSes don't have a concept of rate-
> limiting clients that go bonkers. The effect is that the MDSes
> punish all others because of a single rogue client instead of rate-
> limiting the bad one.
> 

Thanks for linking to that thread, it's very interesting.

> The first problem is essentially that useful IO rate counters are
> missing, for example, for each client the rates with which it
> acquires and releases caps. What I really would love to see are
> warnings for "clients acquiring caps much faster than releasing"
> (with client ID and PID) and MDS-side rate-balancing essentially
> throttling such aggressive clients. Every client holding more than,
> say, 2*max-caps caps should be throttled so that caps-acquire rate =
> caps-release rate. I also don't understand why the MDS is not going
> after the rich clients first. I get warnings all the time that a
> client with 4000 caps is not releasing fast enough while some fat
> cats sit on millions and are not flagged as problematic. Why is the
> recall rate not proportional to the amount of caps a client holds?
> 

I don't know the answer, but is it the case that the number of caps in
itself doesn't necessarily indicate a bad client? If I had a long-
running job that slowly trawled through millions of files but didn't
release caps, then I might end up with millions of caps but not really
be putting any pressure on the MDS?

Versus someone who's got 12 parallel threads running linking and
unlinking thousands of the same files?

If that's true, then maybe some kind of counter that tracks the rate of
caps vs number of metadata updates required or something... I don't
know.


> Another counter that is missing is an actual IO rate counter. MDS
> requests are in no way indicative of a client's IO activity. Once it
> has the caps for a file it talks to OSDs directly. This communication
> is not reflected in any counter I'm aware of. To return to my case
> above, I had clients with more than 50K average load requests, but
> these were completely harmless (probably served from local cache).
> The MDS did not show any unusual behaviour like growing cache and the
> like. Everything looked normal except for OSD server load which sky-
> rocketed to unprecedented levels due to some client's IO requests.
> 

Oh, yeah I think we're thinking similar things and that num_caps itself
doesn't necessarily indicate a problematic client... Do you know what
the request load means? Sounds like it's not actually anything to do
with performance load, but maybe just amount? I don't know what that
metric really is...


> It must have been small random IO and the only way currently to
> identify such clients is network packet traffic. Unfortunately, our
> network monitoring system has a few blind spots and I was not able to
> find out which client was bombarding the OSDs with a packet storm.
> Proper IO rate counters down to PID level and appropriate warnings
> about aggressive clients would really help and are dearly missing.
> 

Yeah, I see... that would be really useful. I'm not sure if my
situation is the same or not, I feel like my MDS is just not able to
keep up and that the OSDs are actually OK... but I don't know for sure.

Thanks, I appreciate all the information! I'm hopeful that with some
help I might be able to work out problematic clients, maybe some
combination of num_caps, ops, load, etc... I still think that would be
useful to know, even if the bottlenecks in my cluster can be discovered
and remedied...
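
As a starting point I might try something like this on the active MDS
to rank sessions by caps and request load (just a sketch - the exact
field names in the session dump may differ on Luminous, so treat them
as assumptions to verify):

  ceph daemon mds.$(hostname) session ls | \
    jq -r '.[] | [.id, .num_caps, .request_load_avg] | @tsv' | \
    sort -k2 -nr | head -20

and then map the top session ids back to hosts/users via the
client_metadata in the same output.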

Cheers,
-c

> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Chris Smart 
> Sent: 14 August 2022 05:47:12
> To: ceph-users@ceph.io
> Subject: [ceph-users] What is client request_load_avg?
> 

[ceph-users] Ceph User + Dev Monthly August Meetup

2022-08-15 Thread Neha Ojha
Hi everyone,

This month's Ceph User + Dev Monthly meetup is on August 18,
14:00-15:00 UTC. We are planning to get some user feedback on
BlueStore compression modes. Please add other topics to the agenda:
https://pad.ceph.com/p/ceph-user-dev-monthly-minutes.

Hope to see you there!

Thanks,
Neha



[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-15 Thread Chris Smart
On Tue, 2022-08-16 at 13:21 +1000, distro...@gmail.com wrote:
> 
> I'm not quite sure of the relationship of operations between MDS and
> OSD data. The MDS gets written to nvme pool and clients access data
> directly on OSD nodes, but do MDS operations also need to wait for
> OSDs
> to perform operations? I think it makes sense that they do (for
> example, to unlink a file MDS needs to check if there are any other
> hardlinks to it, and if not, then the data can be deleted from OSDs
> and
> the metadata updated to remove the file)?
> 
> So to that end, would slow performing OSDs also impact MDS
> performance?
> Maybe it's stuck waiting for the OSDs to do their thing, and they
> aren't fast enough... but then wouldn't I see much more %wa?
> 

Related datapoints I forgot to mention:

We get lots of "MDS health slow requests are blocked" error messages
every couple of minutes. Looking at August 13th logs, we had 911 log
lines about the clearing of these slow requests.

The message with the highest number was 11,193 slow requests cleared,
and the average was 472.

I know we also have some OSD disks in the cluster with SMART errors,
which I'm looking to replace. However, we do not see the same number of
slow OSD requests - "only" 13 log lines about requests blocked on OSDs.
I do plan to chase those down though and see if I can work out whether
it's an unhealthy disk or intermittent network/host issues.

However, my point is that if the MDS were bottlenecked by slow OSDs, I
feel like I should see more corresponding blocked OSD request
messages?...

Cheers,
-c



[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-15 Thread distroguy
On Mon, 2022-08-15 at 08:33 +, Eugen Block wrote:
> Hi,
> 
> do you see high disk utilization on the OSD nodes? 

Hi Eugen, thanks for the reply, much appreciated.

> How is the load on  
> the active MDS?

Yesterday I rebooted the three MDS nodes one at a time (which obviously
included a failover to a freshly booted node) and since then the
performance has improved. It could be a total coincidence though and
I'd really like to try and understand more of what's really going on.

The load seems to stay pretty low on the active MDS server (currently
1.56, 1.62, 1.57) and it has free ram (60G used, 195G free).

The MDS servers almost never have CPU time spent waiting on I/O
(occasionally ~0.2 wa), so there does not seem to be a bottleneck to
disk or network.

However, the ceph-mds process is pretty much constantly over 100% CPU
and often over 200%. It's a single process, right? That makes me think
that some operations are too slow or some task is pegging a core at
100%.

Perhaps profiling the MDS server somehow might tell me the kind of
thing it's stuck on?
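
(If I go down that path, I'm assuming something like perf on the MDS
host would be a reasonable first look, e.g.:

  perf top -p $(pidof ceph-mds)
  # or record ~30 seconds for later inspection:
  perf record -g -p $(pidof ceph-mds) -- sleep 30 && perf report

provided the ceph debug symbols are installed, plus 'ceph daemon
mds.$(hostname) dump_ops_in_flight' to see what the slow operations
actually are.)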

> How much RAM is configured for the MDS  
> (mds_cache_memory_limit)?

Currently set to 51539607552, so ~50G?

We do often see usage go over this limit and, as far as I understand,
that triggers the MDS to ask clients to release unused caps (we do get
clients that don't respond).

I think restarting the MDS causes the clients to drop all of their
unused caps, but hold the used ones for when the new MDS comes online
(so as not to overwhelm it)?

I'm not sure whether increasing the cache size helps (because it can
store more caps and put less pressure on the system when it tries to
drop them), or whether that actually increases pressure (because it has
more to track and more things to do).

We do have RAM free on the node though so we could increase it if you
think it might help?
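
(If we try it, I assume something like this would bump the limit to
64G at runtime on Luminous, with the matching change in ceph.conf to
make it persistent - just a sketch, I haven't run it yet:

  ceph tell mds.* injectargs '--mds_cache_memory_limit=68719476736'
)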

> You can list all MDS sessions with 'ceph daemon mds.<name> session ls'
> to identify all your clients

Thanks, yeah there is a lot of nice info in there, although I'm not
quite sure which elements are useful. That's where I saw
"request_load_avg", which is the metric I'm not quite sure about.

We do have ~5000 active clients (and that number is pretty consistent).

The top 5 clients have over a million caps each, with the top client
having over 5 million itself.

> and 'ceph daemon mds.<name> dump_blocked_ops' to show blocked requests.

There are no blocked ops at the moment, according to 'ceph daemon
mds.$(hostname) dump_blocked_ops', but I can try again once the system
performance degrades.

I feel like I need to get some of these metrics out into Prometheus or
something, so that I can look for historical trends (and add alerts).

> But simply killing  
> sessions isn't a solution, so first you need to find out where the  
> bottleneck is.

Yeah, I totally agree with finding the real bottleneck, thanks for your
help.

My thinking could be totally wrong, but the reason I was looking into
identifying and killing problematic clients is that we get these
bursts where some clients might be making some harsh requests (like
multiple jobs trying to read/link/unlink millions of tiny files at
once). If I can identify them I could try to 1) stop them to restore
cluster performance for everyone else and 2) get them to find a better
way to do that task so we can avoid the issue...

To your point about finding the source of the bottleneck though, I'd
much rather the Ceph cluster was able to handle anything that was
thrown at it... :-) My feeling is that the MDS is easily overwhelmed,
hopefully profiling somehow can help shine a light there.

> Do you see hung requests or something? Anything in  
> 'dmesg' on the client side?

I don't see anything useful on the client side in dmesg, unfortunately.
Just lots of clients talking to mons successfully. The clients are
using kernel ceph, and mounting with relatime (that could explain lots
of caps, even on a ro mount) and acl (I assume this puts extra
load/checks on the MDS).

At a guess, we can probably optimise the client mounts with noatime
instead and maybe remove acl if we're not using them - not sure of the
impact on workloads though, so I haven't tried.
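
(For the record, I'd expect the change to look roughly like this on a
client, e.g. in fstab - hypothetical mon addresses and credentials, and
untested against our workloads:

  mon1,mon2,mon3:/  /mnt/cephfs  ceph  name=myuser,secretfile=/etc/ceph/myuser.secret,noatime,_netdev  0 0
)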

I'm not quite sure of the relationship of operations between MDS and
OSD data. The MDS gets written to nvme pool and clients access data
directly on OSD nodes, but do MDS operations also need to wait for OSDs
to perform operations? I think it makes sense that they do (for
example, to unlink a file MDS needs to check if there are any other
hardlinks to it, and if not, then the data can be deleted from OSDs and
the metadata updated to remove the file)?

So to that end, would slow performing OSDs also impact MDS performance?
Maybe it's stuck waiting for the OSDs to do their thing, and they
aren't fast enough... but then wouldn't I see much more %wa?

One thing that I noticed yesterday is that when the cluster is under
pressure the I/O and throughput of the MDS to the metadata pool goes
very spiky (OSD pool did 

[ceph-users] Re: CephFS perforamnce degradation in root directory

2022-08-15 Thread Xiubo Li



On 8/9/22 4:07 PM, Robert Sander wrote:

Hi,

we have a cluster with 7 nodes each with 10 SSD OSDs providing CephFS 
to a CloudStack system as primary storage.


When copying a large file into the root directory of the CephFS the 
bandwidth drops from 500MB/s to 50MB/s after around 30 seconds. We see 
some MDS activity in the output of "ceph fs status" at the same time.


When copying the same file to a subdirectory of the CephFS the 
performance stays at 500MB/s for the whole time. MDS activity does not 
seems to influence the performance here.


There are approx. 270 other files in the root directory. CloudStack
stores VM images in qcow2 format there.


Is this a known issue?
Is there something special with the root directory of a CephFS wrt 
write performance?


AFAIK there is nothing special about the root dir. From my local test
there is no difference with the subdir.


BTW, could you test it more than once for the root dir? When you are
doing this for the first time Ceph may need to allocate the disk
space, which will take a little time.
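
For example, something simple like this (just an illustration - adjust
the path and size for your setup), run a few times in a row, should
show whether the slowdown persists after the first copy:

  dd if=/dev/zero of=/mnt/cephfs/testfile bs=1M count=4096 oflag=direct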


Thanks.



Regards




[ceph-users] Re: Ceph needs your help with defining availability!

2022-08-15 Thread Kamoltat Sirivadhna
Hi guys,

thank you so much for filling out the Ceph Cluster Availability survey!

we have received a total of 59 responses from various groups of people,
which is enough to help us understand more profoundly what availability
means to everyone.

As promised, here is the link to the results of the survey:
https://docs.google.com/forms/d/1J5Ab5KCy6fceXxHI8KDqY2Qx3FzR-V9ivKp_vunEWZ0/viewanalytics

Also, I've summarized some of the written responses such that it is easier
for you to make sense of the results.

I hope you will find these responses helpful and please feel free to reach
out if you have any questions!

Response summary of the question:

“””

In your own words, please describe what availability means to you in a Ceph
cluster. (For example, is it the ability to serve read and write requests
even if the cluster is in a degraded state?).

“””

In summary, the majority of people consider the definition of availability
to be the ability to serve I/O with reasonable performance (some suggest
10-20%, others say it should be user configurable) + the ability to provide
other services. A couple of people define availability as all PGs being in
the state of active+clean, but we will come to learn that many people
disagree with this in the next question. Interestingly, a handful of people
suggest that cluster availability shouldn’t be binary, but rather a scale
or tiers, e.g., one response suggests that we should have:


   1. Fully available - all services can serve I/O with normal performance.
   2. Partially available:
      1. some access method, although configured, is not available, e.g., CephFS works and RGW doesn’t.
      2. only reads or writes are possible on some storage pools.
      3. some storage pools are completely unavailable while others are completely or partially available.
      4. performance is severely degraded.
      5. some services are stopped/crashed.
   3. Unavailable - when "Partially available" is not reached.


Moreover, some suggest that we should track availability as per pool basis
to deal with a scenario where we have different crush rules or when we can
afford a pool to be unavailable. Furthermore, some responses care more
about the availability of one service than another, e.g., one response
states that they wouldn’t care about the availability of RADOS if RGW is
unavailable.

Response summary of the question:

“””

Do you agree with the following metric in evaluating a cluster's
availability:

"All placement group (PG) state in a cluster must have 'active'  in them,
if at least 1 PG does not have 'active' in them, then the cluster as a
whole is deemed as unavailable".

“””

35.8 % of Users answered `No`

35.8% of Users answered `Yes`

28.3% of Users answered `maybe`

The data clearly shows that we can’t use this alone as the criterion for
availability. Therefore, here are some of the reasons why 64.1% do not
fully agree with the statement.

If the client does not interact with that particular PG then it is not
important, e.g., if 1 PG is inactive and the s3 endpoint is down but CephFS
can still serve I/O, we cannot say that the cluster is unavailable. Some
disagree because they believe that a PG relates to a single pool,
therefore, that particular pool will be unavailable, not the cluster.
Furthermore, some suggest that there are events that might lead to PGs
being inactive, such as provisioning a new OSD, creating a pool, or a PG
split; however, these events don’t necessarily indicate unavailability.

Response summary of the question:

“””

From your own experience, what are some of the most common events that
cause a Ceph cluster to be considered unavailable based on your definition
of availability.

“””

Top four responses:


   1. Network-related issues, e.g., network failure/instability.
   2. OSD-related issues, e.g., failure, slow ops, flapping.
   3. Disk-related issues, e.g., dead disks.
   4. PGs-related issues, e.g., many PGs became stale, unknown, and stuck in peering.


Response summary of the question:

“””

Are there any events that you might consider a cluster to be unavailable
but you feel like it is not worth tracking and is dismissible?

“””

Top three responses:


   1. No, all unavailable events are worth tracking.
   2. Network related issues
   3. Scheduled upgrades or maintenance



On Tue, Aug 9, 2022 at 1:51 PM Kamoltat Sirivadhna 
wrote:

> Hi John,
>
> Yes, I'm planning to summarize the results after this week. I will
> definitely share it with the community.
>
> Best,
>
> On Tue, Aug 9, 2022 at 1:19 PM John Bent  wrote:
>
>> Hello Kamoltat,
>>
>> This sounds very interesting. Will you be sharing the results of the
>> survey back with the community?
>>
>> Thanks,
>>
>> John
>>
>> On Sat, Aug 6, 2022 at 4:49 AM Kamoltat Sirivadhna 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> One of the features we are looking into implementing for our upcoming
>>> Ceph release (Reef) is 

[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug

2022-08-15 Thread Daniel Williams
ceph-post-file: a9802e30-0096-410e-b5c0-f2e6d83acfd6

On Tue, Aug 16, 2022 at 3:13 AM Patrick Donnelly 
wrote:

> On Mon, Aug 15, 2022 at 11:39 AM Daniel Williams 
> wrote:
> >
> > Using ubuntu with apt repository from ceph.
> >
> > Ok that helped me figure out that it's .mgr not mgr.
> > # ceph -v
> > ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy
> (stable)
> > # export CEPH_CONF='/etc/ceph/ceph.conf'
> > # export CEPH_KEYRING='/etc/ceph/ceph.client.admin.keyring'
> > # export CEPH_ARGS='--log_to_file true --log-file ceph-sqlite.log
> --debug_cephsqlite 20 --debug_ms 1'
> > # sqlite3
> > SQLite version 3.31.1 2020-01-27 19:55:54
> > Enter ".help" for usage hints.
> > sqlite> .load libcephsqlite.so
> > sqlite> .open file:///.mgr:devicehealth/main.db?vfs=ceph
> > sqlite> .tables
> > Segmentation fault (core dumped)
> >
> > # dpkg -l | grep ceph | grep sqlite
> > ii  libsqlite3-mod-ceph  17.2.3-1focal  amd64  SQLite3 VFS for Ceph
> >
> > Attached ceph-sqlite.log
>
> No real good hint in the log unfortunately. I will need the core dump
> to see where things went wrong. Can you upload it with
>
> https://docs.ceph.com/en/quincy/man/8/ceph-post-file/
>
> ?
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>


[ceph-users] Re: The next quincy point release

2022-08-15 Thread Patrick Donnelly
This must go in the next quincy release:

https://github.com/ceph/ceph/pull/47288

but we're still waiting on reviews and final tests before merging into main.

On Mon, Aug 15, 2022 at 11:02 AM Yuri Weinstein  wrote:
>
> We plan to start QE validation for the next quincy point release this week.
>
> Dev leads please tag all PRs needed to be included ("needs-qa") ASAP
> so they can be tested and merged on time.
>
> Thx
> YuriW
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>


-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug

2022-08-15 Thread Patrick Donnelly
On Mon, Aug 15, 2022 at 11:39 AM Daniel Williams  wrote:
>
> Using ubuntu with apt repository from ceph.
>
> Ok that helped me figure out that it's .mgr not mgr.
> # ceph -v
> ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
> # export CEPH_CONF='/etc/ceph/ceph.conf'
> # export CEPH_KEYRING='/etc/ceph/ceph.client.admin.keyring'
> # export CEPH_ARGS='--log_to_file true --log-file ceph-sqlite.log 
> --debug_cephsqlite 20 --debug_ms 1'
> # sqlite3
> SQLite version 3.31.1 2020-01-27 19:55:54
> Enter ".help" for usage hints.
> sqlite> .load libcephsqlite.so
> sqlite> .open file:///.mgr:devicehealth/main.db?vfs=ceph
> sqlite> .tables
> Segmentation fault (core dumped)
>
> # dpkg -l | grep ceph | grep sqlite
> ii  libsqlite3-mod-ceph  17.2.3-1focal  amd64  SQLite3 VFS for Ceph
>
> Attached ceph-sqlite.log

No real good hint in the log unfortunately. I will need the core dump
to see where things went wrong. Can you upload it with

https://docs.ceph.com/en/quincy/man/8/ceph-post-file/

?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug

2022-08-15 Thread Daniel Williams
Using ubuntu with apt repository from ceph.

Ok that helped me figure out that it's .mgr not mgr.
# ceph -v
ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy
(stable)
# export CEPH_CONF='/etc/ceph/ceph.conf'
# export CEPH_KEYRING='/etc/ceph/ceph.client.admin.keyring'
# export CEPH_ARGS='--log_to_file true --log-file
ceph-sqlite.log --debug_cephsqlite 20 --debug_ms 1'
# sqlite3
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> .load libcephsqlite.so
sqlite> .open file:///.mgr:devicehealth/main.db?vfs=ceph
sqlite> .tables
Segmentation fault (core dumped)

# dpkg -l | grep ceph | grep sqlite
ii  libsqlite3-mod-ceph  17.2.3-1focal  amd64  SQLite3 VFS for Ceph

Attached ceph-sqlite.log


On Mon, Aug 15, 2022 at 11:10 PM Patrick Donnelly 
wrote:

> Hello Daniel,
>
> On Mon, Aug 15, 2022 at 10:38 AM Daniel Williams 
> wrote:
> >
> > My managers are crashing reading the sqlite database for devicehealth:
> > .mgr:devicehealth/main.db-journal
> > debug -2> 2022-08-15T11:14:09.184+ 7fa5721b7700  5 cephsqlite:
> > Read: (client.53284882) [.mgr:devicehealth/main.db-journal]
> 0x5601da0c0008
> > 4129788~65536
> > debug -1> 2022-08-15T11:14:09.184+ 7fa5721b7700  5
> client.53284882:
> > SimpleRADOSStriper: read: main.db-journal: 4129788~65536
> > debug  0> 2022-08-15T11:14:09.200+ 7fa664aca700 -1 *** Caught
> > signal (Segmentation fault) **
> >
> > I upgraded to 17.2.3 but it seems like I'll need to do a sqlite recovery
> on
> > the database, since the devicehealth module is now non-optional.
> >
> > I tried:
> > sqlite3 -cmd '.load libcephsqlite.so' '.open
> > file:///mgr:devicehealth/main.db?vfs=ceph'
> > but that didn't work
> > Error: unable to open database ".open
> > file:///mgr:devicehealth/main.db?vfs=ceph": unable to open database file
> >
> > Any suggestions?
>
> Are you on Ubuntu or CentOS?
>
> You can try to figure out where things are going wrong loading the
> database via:
>
> env CEPH_ARGS='--log_to_file true --log-file foo.log
> --debug_cephsqlite 20 --debug_ms 1'  sqlite3 ...
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>


[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug

2022-08-15 Thread Patrick Donnelly
Hello Daniel,

On Mon, Aug 15, 2022 at 10:38 AM Daniel Williams  wrote:
>
> My managers are crashing reading the sqlite database for devicehealth:
> .mgr:devicehealth/main.db-journal
> debug -2> 2022-08-15T11:14:09.184+ 7fa5721b7700  5 cephsqlite:
> Read: (client.53284882) [.mgr:devicehealth/main.db-journal] 0x5601da0c0008
> 4129788~65536
> debug -1> 2022-08-15T11:14:09.184+ 7fa5721b7700  5 client.53284882:
> SimpleRADOSStriper: read: main.db-journal: 4129788~65536
> debug  0> 2022-08-15T11:14:09.200+ 7fa664aca700 -1 *** Caught
> signal (Segmentation fault) **
>
> I upgraded to 17.2.3 but it seems like I'll need to do a sqlite recovery on
> the database, since the devicehealth module is now non-optional.
>
> I tried:
> sqlite3 -cmd '.load libcephsqlite.so' '.open
> file:///mgr:devicehealth/main.db?vfs=ceph'
> but that didn't work
> Error: unable to open database ".open
> file:///mgr:devicehealth/main.db?vfs=ceph": unable to open database file
>
> Any suggestions?

Are you on Ubuntu or CentOS?

You can try to figure out where things are going wrong loading the database via:

env CEPH_ARGS='--log_to_file true --log-file foo.log
--debug_cephsqlite 20 --debug_ms 1'  sqlite3 ...

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



[ceph-users] Re: Some odd results while testing disk performance related to write caching

2022-08-15 Thread Dan van der Ster
Hi,

We have some docs about this in the Ceph hardware recommendations:
https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches

I added some responses inline..

On Fri, Aug 5, 2022 at 7:23 PM Torbjörn Jansson  wrote:
>
> Hello
>
> i got a small 3 node ceph cluster and i'm doing some bench marking related to
> performance with drive write caching.
>
> the reason i started was because i wanted to test the SSDs i have for their
> performance for use as db device for the osds and make sure they are setup as
> good as i can get it.
>
> i read that turning off write cache can be beneficial even when it sounds
> backwards.

"write cache" is a volatile cache -- so when it is enabled, Linux
knows that it is writing to a volatile area on the device and
therefore it needs to issue flushes to persist data. Linux considers
these devices to be in "write back" mode.
When the write cache is disabled, then Linux knows it is writing to a
persisted area, and therefore doesn't bother sending flushes anymore
-- these devices are in "write through" mode.
And btw, new data centre class devices have firmware and special
hardware to accelerate those persisted writes when the volatile cache
is disabled. This is the so-called media cache.

> this seems to be true.
> i used mainly fio and "iostat -x" to test using something like:
> fio --filename=/dev/ceph-db-0/bench --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=5 --iodepth=1 --runtime=60 --time_based --group_reporting
>
> and then testing this with write cache turned off and on to compare the 
> results.
> also with and without sync in fio command above.
>
> one thing i observed related to turning off the write cache on drives was that
> it appears a reboot is needed for it to have any effect.

This depends on the OS -- if you set the cache using the approach
mentioned in the docs above, then in all distros we tested it keeps
WCE and "write through" consistent with each other.

> and this is where it gets strange and the part i don't get.
>
> the disks i have, seagate nytro sas3 ssd, according to the drive manual the
> drive don't care what you set the WCE bit to and it will do write caching
> internally regardless.
> most likely because it is an enterprise disk with built in power loss 
> protection.
>
> BUT it makes a big difference to the performance and the flush per seconds in
> iostat.
> so it appears that if you boot and the drive got its write cache disabled 
> right
> from the start (dmesg contains stuff like: "sd 0:0:0:0: [sda] Write cache:
> disabled") then linux wont send any flush to the drive and you get good
> performance.
> if you change the write caching on a drive during runtime (sdparm for sas or
> hdparm for sata) then it wont change anything.

Check the cache_type at e.g. /sys/class/scsi_disk/0\:0\:0\:0/cache_type
"write back" -> flush is sent
"write through" -> flush not sent

> why is that? why do i have to do a reboot?
> i mean, lets say you boot with write cache disabled, linux decides to never
> send flush and you change it after boot to enable the cache, if there is no
> flush then you risk your data in case of a power loss, or?

On all devices we have, if we have "write through" at boot, then set
(with hdparm or sdparm) WCE=1 or echo "write back" > ...
then the cache_type is automatically set correctly to "write back" and
flushes are sent.

There is another /sys/ entry to toggle flush behaviour: echo "write
through" > /sys/block/sda/queue/write_cache
This is apparently a way to lie to the OS so it stops sending flushes
(without manipulating the WCE mode of the underlying device).
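
To make the knobs concrete (device names here are just examples):

  # what the SCSI disk driver believes about the device's cache:
  cat /sys/class/scsi_disk/0:0:0:0/cache_type
  # what the block layer will do about flushes:
  cat /sys/block/sda/queue/write_cache
  # persistently disable the volatile write cache on a SAS drive:
  sdparm -s WCE=0 --save /dev/sdX

With cache_type at "write through" you should see the flushes per
second you were watching in 'iostat -x' drop away.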

Cheers, Dan

> this is not very obvious or good behavior i think (i hope i'm wrong and some
> one can enlighten me)
>
>
> for sas drives sdparm -s WCE=0 --save /dev/sdX appears to do the right thing
> and it survives a reboot.
> but for sata disks hdparm -W 0 -K 1 /dev/sdX makes the change but as long as
> drive is connected to sas controller it still gets the write cache enabled at
> boot so i bet sas controller also messes with the write cache setting on the
> drives.
>


[ceph-users] Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug

2022-08-15 Thread Daniel Williams
My managers are crashing reading the sqlite database for devicehealth:
.mgr:devicehealth/main.db-journal
debug -2> 2022-08-15T11:14:09.184+ 7fa5721b7700  5 cephsqlite:
Read: (client.53284882) [.mgr:devicehealth/main.db-journal] 0x5601da0c0008
4129788~65536
debug -1> 2022-08-15T11:14:09.184+ 7fa5721b7700  5 client.53284882:
SimpleRADOSStriper: read: main.db-journal: 4129788~65536
debug  0> 2022-08-15T11:14:09.200+ 7fa664aca700 -1 *** Caught
signal (Segmentation fault) **

I upgraded to 17.2.3 but it seems like I'll need to do a sqlite recovery on
the database, since the devicehealth module is now non-optional.

I tried:
sqlite3 -cmd '.load libcephsqlite.so' '.open
file:///mgr:devicehealth/main.db?vfs=ceph'
but that didn't work
Error: unable to open database ".open
file:///mgr:devicehealth/main.db?vfs=ceph": unable to open database file

Any suggestions?

Also, I've seen some pretty crazy bugs in Quincy now (rebalancing uses 100%
CPU - still not fixed - and the mgr crashing); maybe I jumped in too early?
Is this normal at the start of a release? Is there guidance on a roughly
safe point release to wait for before upgrading to a new release?


[ceph-users] Re: Recovery very slow after upgrade to quincy

2022-08-15 Thread Torkil Svensgaard



On 15-08-2022 08:24, Satoru Takeuchi wrote:

On Sat, Aug 13, 2022 at 1:35 Robert W. Eckert wrote:


Interesting, a few weeks ago I added a new disk to each of my 3 node
cluster and saw the same 2 Mb/s recovery.What I had noticed was that
one OSD was using very high CPU and seems to have been the primary node on
the affected PGs.I couldn’t find anything overly wrong with the OSD,
network , etc.

You may want to look at the output of

ceph pg ls

to see if the recovery is sourced from one specific OSD or one host, then
check that host /osd for high CPU/memory.



Probably you hit this bug.

https://tracker.ceph.com/issues/56530

It can be bypassed by setting "osd_op_queue=wpq" configuration.


Thanks both of you. Doing "ceph config set osd osd_op_queue wpq" and 
restarting the OSDs seems to have fixed it.
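
(For anyone else hitting this: assuming access to the OSD admin
sockets, something like

  ceph daemon osd.0 config get osd_op_queue

should confirm the running value after the restart.)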


Mvh.

Torkil


Best,
Satoru








-Original Message-
From: Torkil Svensgaard 
Sent: Friday, August 12, 2022 7:50 AM
To: ceph-users@ceph.io
Cc: Ruben Vestergaard 
Subject: [ceph-users] Recovery very slow after upgrade to quincy

6 hosts with 2 x 10G NICs, data in 2+2 EC pool. 17.2.0, upgrade from
pacific.

cluster:
  id:
  health: HEALTH_WARN
  2 host(s) running different kernel versions
  2071 pgs not deep-scrubbed in time
  837 pgs not scrubbed in time

services:
  mon:5 daemons, quorum
test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
  mgr:dcn-ceph-01.dzercj(active, since 6h), standbys:
dcn-ceph-03.lrhaxo
  mds:1/1 daemons up, 2 standby
  osd:118 osds: 118 up (since 6d), 118 in (since 6d); 66
remapped pgs
  rbd-mirror: 2 daemons active (2 hosts)

data:
  volumes: 1/1 healthy
  pools:   9 pools, 2737 pgs
  objects: 246.02M objects, 337 TiB
  usage:   665 TiB used, 688 TiB / 1.3 PiB avail
  pgs: 42128281/978408875 objects misplaced (4.306%)
   2332 active+clean
   281  active+clean+snaptrim_wait
   66   active+remapped+backfilling
   36   active+clean+snaptrim
   11   active+clean+scrubbing+deep
   8active+clean+scrubbing
   1active+clean+scrubbing+deep+snaptrim_wait
   1active+clean+scrubbing+deep+snaptrim
   1active+clean+scrubbing+snaptrim

io:
  client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
  recovery: 2.0 MiB/s, 3 objects/s


Low load, low latency, low network traffic. Tried
osd_mclock_profile=high_recovery_ops, no difference. Disabling scrubs and
snaptrim, no difference.

Am I missing something obvious I should have done after the upgrade?

Mvh.

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk


--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark


[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

2022-08-15 Thread Eugen Block

Hi,

do you see high disk utilization on the OSD nodes? How is the load on  
the active MDS? How much RAM is configured for the MDS  
(mds_cache_memory_limit)?
You can list all MDS sessions with 'ceph daemon mds.<name> session ls'
to identify all your clients and 'ceph daemon mds.<name>
dump_blocked_ops' to show blocked requests. But simply killing
sessions isn't a solution, so first you need to find out where the  
bottleneck is. Do you see hung requests or something? Anything in  
'dmesg' on the client side?



Zitat von Chris Smart :


Hi all,

I have recently inherited a 10 node Ceph cluster running Luminous (12.2.12)
which is running specifically for CephFS (and I don't know much about MDS)
with only one active MDS server (two standby).
It's not a great cluster IMO: the cephfs_data pool is on high-density nodes
with high-capacity SATA drives, but at least the cephfs_metadata pool is on
nvme drives.

Access to the cluster regularly goes slow for clients and I'm seeing lots
of warnings like this:

MDSs behind on trimming (MDS_TRIM)
MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
MDSs report slow requests (MDS_SLOW_REQUEST)
MDSs have many clients failing to respond to capability release
(MDS_CLIENT_LATE_RELEASE_MANY)

If there is only one client that's failing to respond to capability release
I can see the client id in the output and work out what user that is and
get their job stopped. Performance then usually improves a bit.

However, if there is more than one, the output only shows a summary of the
number of clients and I don't know who the clients are to get their jobs
cancelled.
Is there a way I can work out what clients these are? I'm guessing some
kind of combination of in_flight_ops, blocked_ops and total num_caps?

However, I also feel like just having a large number of caps isn't
_necessarily_ an indicator of a problem; sometimes restarting the MDS and
forcing clients to drop unused caps helps, sometimes it doesn't.

I'm curious if there's a better way to determine any clients that might be
causing issues in the cluster?
To that end, I've noticed there is a metric called "request_load_avg" in
the output of ceph mds client ls but I can't quite find any information
about it. It _seems_ like it could indicate a client that's doing lots and
lots of requests and therefore a useful metric to see what client might be
smashing the cluster, but does anyone know for sure?

Many thanks,
Chris






[ceph-users] Re: CephFS perforamnce degradation in root directory

2022-08-15 Thread Robert Sander

Am 09.08.22 um 10:07 schrieb Robert Sander:

When copying the same file to a subdirectory of the CephFS the 
performance stays at 500MB/s for the whole time. MDS activity does not 
seems to influence the performance here.


There is a new datapoint:

When mounting the subdirectory (and not CephFS's root), the performance
also degrades when writing at the top of that mount, while it stays up
when writing into a subdirectory below it.


Is there something special at the mountpoint?

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin


[ceph-users] Re: Recovery very slow after upgrade to quincy

2022-08-15 Thread Satoru Takeuchi
On Sat, Aug 13, 2022 at 1:35 Robert W. Eckert wrote:

> Interesting, a few weeks ago I added a new disk to each of my 3 node
> cluster and saw the same 2 Mb/s recovery.What I had noticed was that
> one OSD was using very high CPU and seems to have been the primary node on
> the affected PGs.I couldn’t find anything overly wrong with the OSD,
> network , etc.
>
> You may want to look at the output of
>
> ceph pg ls
>
> to see if the recovery is sourced from one specific OSD or one host, then
> check that host /osd for high CPU/memory.


Probably you hit this bug.

https://tracker.ceph.com/issues/56530

It can be bypassed by setting "osd_op_queue=wpq" configuration.

Best,
Satoru


>
>
>
>
>
> -Original Message-
> From: Torkil Svensgaard 
> Sent: Friday, August 12, 2022 7:50 AM
> To: ceph-users@ceph.io
> Cc: Ruben Vestergaard 
> Subject: [ceph-users] Recovery very slow after upgrade to quincy
>
> 6 hosts with 2 x 10G NICs, data in 2+2 EC pool. 17.2.0, upgrade from
> pacific.
>
> cluster:
>  id:
>  health: HEALTH_WARN
>  2 host(s) running different kernel versions
>  2071 pgs not deep-scrubbed in time
>  837 pgs not scrubbed in time
>
>services:
>  mon:5 daemons, quorum
> test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
>  mgr:dcn-ceph-01.dzercj(active, since 6h), standbys:
> dcn-ceph-03.lrhaxo
>  mds:1/1 daemons up, 2 standby
>  osd:118 osds: 118 up (since 6d), 118 in (since 6d); 66
> remapped pgs
>  rbd-mirror: 2 daemons active (2 hosts)
>
>data:
>  volumes: 1/1 healthy
>  pools:   9 pools, 2737 pgs
>  objects: 246.02M objects, 337 TiB
>  usage:   665 TiB used, 688 TiB / 1.3 PiB avail
>  pgs: 42128281/978408875 objects misplaced (4.306%)
>   2332 active+clean
>   281  active+clean+snaptrim_wait
>   66   active+remapped+backfilling
>   36   active+clean+snaptrim
>   11   active+clean+scrubbing+deep
>   8active+clean+scrubbing
>   1active+clean+scrubbing+deep+snaptrim_wait
>   1active+clean+scrubbing+deep+snaptrim
>   1active+clean+scrubbing+snaptrim
>
>io:
>  client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
>  recovery: 2.0 MiB/s, 3 objects/s
>
>
> Low load, low latency, low network traffic. Tried
> osd_mclock_profile=high_recovery_ops, no difference. Disabling scrubs and
> snaptrim, no difference.
>
> Am I missing something obvious I should have done after the upgrade?
>
> Mvh.
>
> Torkil
>
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk