[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous
On Mon, 2022-08-15 at 09:00 +, Frank Schilder wrote:
> Hi Chris,

Hi Frank, thanks for the reply.

> I also have serious problems identifying problematic ceph-fs clients
> (using mimic). I don't think that even in the newest ceph version
> there are useful counters for that. Just last week I had the case
> that a client caused an all-time peak in cluster load and I was not
> able to locate the client due to the lack of useful rate counters.
> There are two problems with ceph fs' load monitoring. The first is
> the complete lack of rate-based IO load counters down to client+PID
> level and that warnings generated actually flag the wrong clients.

Yikes, sounds familiar...

> The hallmark of the last problem is basically explained in this
> thread, specifically, this message:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/TWNF2PWM7SONLCT4OLAJLMLXHK3ABPUB/
>
> It states that warnings are generated for *inactive* clients, not for
> clients that are actually causing the trouble. Worse yet, the
> proposed solution counteracts the problem that MDS client caps recall
> is usually way too slow. I had to increase it to 64K just to get the
> MDS cache balanced, because MDSes don't have a concept of
> rate-limiting clients that go bonkers. The effect is that the MDSes
> punish all others because of a single rogue client instead of
> rate-limiting the bad one.

Thanks for linking to that thread, it's very interesting.

> The first problem is essentially that useful IO rate counters are
> missing, for example, for each client the rates with which it
> acquires and releases caps. What I really would love to see are
> warnings for "clients acquiring caps much faster than releasing"
> (with client ID and PID) and MDS-side rate-balancing essentially
> throttling such aggressive clients. Every client holding more than,
> say, 2*max-caps caps should be throttled so that caps-acquire rate =
> caps-release rate.
> I also don't understand why the MDS is not going after the rich
> clients first. I get all the time warnings that a client with 4000
> caps is not releasing fast enough while some fat cats sit on millions
> and are not flagged as problematic. Why is the recall rate not
> proportional to the amount of caps a client holds?

I don't know the answer, but is it the case that the number of caps in
itself doesn't necessarily indicate a bad client? If I had a
long-running job that slowly trawled through millions of files but
didn't release caps, then I might end up with millions but I'm not
really putting any pressure on the MDS? Versus someone who's got 12
parallel threads running linking and unlinking thousands of the same
files? If that's true, then maybe some kind of counter that tracks the
rate of caps vs number of metadata updates required or something... I
don't know.

> Another counter that is missing is an actual IO rate counter. MDS
> requests are in no way indicative of a client's IO activity. Once it
> has the caps for a file it talks to OSDs directly. This communication
> is not reflected in any counter I'm aware of. To return to my case
> above, I had clients with more than 50K average load requests, but
> these were completely harmless (probably served from local cache).
> The MDS did not show any unusual behaviour like growing cache and the
> like. Everything looked normal except for OSD server load which
> sky-rocketed to unprecedented levels due to some client's IO requests.

Oh, yeah I think we're thinking similar things and that num_caps itself
doesn't necessarily indicate a problematic client... Do you know what
the request load means? Sounds like it's not actually anything to do
with performance load, but maybe just amount? I don't know what that
metric really is...

> It must have been small random IO and the only way currently to
> identify such clients is network packet traffic.
> Unfortunately, our network monitoring system has a few blind spots
> and I was not able to find out which client was bombarding the OSDs
> with a packet storm. Proper IO rate counters down to PID level and
> appropriate warnings about aggressive clients would really help and
> are dearly missing.

Yeah, I see... that would be really useful. I'm not sure if my
situation is the same or not, I feel like my MDS is just not able to
keep up and that the OSDs are actually OK... but I don't know for sure.
Thanks, I appreciate all the information! I'm hopeful that with some
help I might be able to work out problematic clients, maybe some
combination of num_caps, ops, load, etc... I still think that would be
useful to know, even if the bottlenecks in my cluster can be discovered
and remedied...

Cheers,
-c

> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Chris Smart
> Sent: 14 August 2022 05:47:12
> To: ceph-users@ceph.io
> Subject: [ceph-users] What is client request_load_avg?
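The missing "caps acquired vs. released" rate that Frank describes can be approximated from the client side by sampling `ceph daemon mds.<name> session ls` twice and diffing. A rough sketch: the `id` and `num_caps` fields match the session ls output, but the sampling approach and the example values are my own assumptions, not an existing Ceph feature.

```python
def caps_rates(snap_before, snap_after, interval_s):
    """Net caps acquired per second per client between two `session ls`
    JSON snapshots taken interval_s apart. A positive rate means the
    client is acquiring caps faster than it releases them."""
    before = {s["id"]: s["num_caps"] for s in snap_before}
    return {
        s["id"]: (s["num_caps"] - before[s["id"]]) / interval_s
        for s in snap_after
        if s["id"] in before
    }

# Fabricated snapshots; a real run would parse the JSON output of two
# invocations of `ceph daemon mds.<name> session ls` taken 60 s apart.
snap1 = [{"id": 4242, "num_caps": 4000}, {"id": 4243, "num_caps": 5_000_000}]
snap2 = [{"id": 4242, "num_caps": 4600}, {"id": 4243, "num_caps": 5_000_000}]
print(caps_rates(snap1, snap2, 60))  # client 4242 nets +10 caps/s
```

A fat cat holding millions of caps but acquiring none would show a rate near zero here, which matches the point that raw num_caps alone does not identify the troublemaker.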
[ceph-users] Ceph User + Dev Monthly August Meetup
Hi everyone, This month's Ceph User + Dev Monthly meetup is on August 18, 14:00-15:00 UTC. We are planning to get some user feedback on BlueStore compression modes. Please add other topics to the agenda: https://pad.ceph.com/p/ceph-user-dev-monthly-minutes. Hope to see you there! Thanks, Neha ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous
On Tue, 2022-08-16 at 13:21 +1000, distro...@gmail.com wrote:
>
> I'm not quite sure of the relationship of operations between MDS and
> OSD data. The MDS gets written to nvme pool and clients access data
> directly on OSD nodes, but do MDS operations also need to wait for
> OSDs to perform operations? I think it makes sense that they do (for
> example, to unlink a file MDS needs to check if there are any other
> hardlinks to it, and if not, then the data can be deleted from OSDs
> and the metadata updated to remove the file)?
>
> So to that end, would slow performing OSDs also impact MDS
> performance? Maybe it's stuck waiting for the OSDs to do their thing,
> and they aren't fast enough... but then wouldn't I see much more %wa?

Related datapoints I forgot to mention: We get lots of "MDS health slow
requests are blocked" error messages every couple of minutes. Looking
at August 13th logs, we had 911 log lines about the clearing of these
slow requests. The message with the highest number was 11,193 slow
requests cleared; the average is 472.

I know we also have some OSD disks in the cluster with SMART errors,
which I'm looking to replace. However, we do not see the same number of
slow OSD requests - "only" 13 lines about blocked requests due to OSD
messages. I do plan to chase those down though and see if I can work
out if it's unhealthy disk, or intermittent network/host issues.

However, my point is that if MDS was bottlenecked due to slow OSDs, I
feel like I should see more corresponding blocked request OSD
messages?...

Cheers,
-c
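Stats like those (count, max, average of the slow-request clearance messages) are easy to extract from the cluster log with a small script. A sketch: the sample log lines below are fabricated approximations of the health-clear messages, so the regex would need adjusting to the exact wording your MDS emits.

```python
import re

# Fabricated lines modelled on MDS health-clear cluster log messages;
# adapt the pattern to the real wording in your ceph.log.
log_lines = [
    "cluster [INF] MDS health message cleared (mds.0): 11193 slow requests are blocked > 30 sec",
    "cluster [INF] MDS health message cleared (mds.0): 472 slow requests are blocked > 30 sec",
    "cluster [INF] MDS health message cleared (mds.0): 35 slow requests are blocked > 30 sec",
]

# Pull the blocked-request count out of each clearance line.
counts = [int(m.group(1))
          for line in log_lines
          if (m := re.search(r"(\d+) slow requests", line))]

print(f"lines={len(counts)} max={max(counts)} avg={sum(counts) / len(counts):.0f}")
```

Feeding a day's worth of real log lines through the same loop gives the per-day figures quoted above (911 lines, max 11,193, average 472) without manual counting.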
[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous
On Mon, 2022-08-15 at 08:33 +, Eugen Block wrote:
> Hi,
>
> do you see high disk utilization on the OSD nodes?

Hi Eugen, thanks for the reply, much appreciated.

> How is the load on the active MDS?

Yesterday I rebooted the three MDS nodes one at a time (which obviously
included a failover to a freshly booted node) and since then the
performance has improved. It could be a total coincidence though and
I'd really like to try and understand more of what's really going on.

The load seems to stay pretty low on the active MDS server (currently
1.56, 1.62, 1.57) and it has free ram (60G used, 195G free). The MDS
servers almost never have CPU spent waiting on access (occasionally
~0.2 wa), so there does not seem to be a bottleneck to disk or network.
However, the ceph-mds process is pretty much constantly over 100% CPU
and often over 200%. Given it's a single process, right? It makes me
think that some operations are too slow or some task is pegging the CPU
at 100%. Perhaps profiling the MDS server somehow might tell me the
kind of thing it's stuck on?

> How much RAM is configured for the MDS (mds_cache_memory_limit)?

Currently set to 51539607552, so ~50G? We do often see this go over,
and as far as I understand, this triggers MDS to ask clients to release
unused caps (we do get clients who don't respond). I think restarting
the MDS causes the clients to drop all of their unused caps, but hold
the used ones for when the new MDS comes online (so as not to overwhelm
it)? I'm not sure whether increasing the cache size helps (because it
can store more caps and put less pressure on the system when it tries
to drop them), or whether that actually increases pressure (because it
has more to track and more things to do). We do have RAM free on the
node though, so we could increase it if you think it might help?

> You can list all MDS sessions with 'ceph daemon mds.
> session ls' to identify all your clients

Thanks, yeah there is a lot of nice info in there, although I'm not
quite sure which elements are useful. That's where I saw the
"request_load_avg" which I'm not quite sure what it means. We do have
~5000 active clients (and that number is pretty consistent). The top 5
clients have over a million caps each, with the top client having over
5 million itself.

> and 'ceph daemon mds. dump_blocked_ops' to show blocked requests.

There are no blocked ops at the moment, according to (ceph daemon
mds.$(hostname) dump_blocked_ops) but I can try again once the system
performance degrades. I feel like I need to get some of these metrics
out into Prometheus or something, so that I can look for historical
trends (and add alerts).

> But simply killing sessions isn't a solution, so first you need to
> find out where the bottleneck is.

Yeah, I totally agree with finding the real bottleneck, thanks for your
help. My thinking could be totally wrong, but the reason I was looking
into identifying and killing problematic clients was because we get
these bursts where some clients might be doing some harsh requests
(like multiple jobs trying to read/link/unlink millions of tiny files
at once) and if I can identify them I could try to 1) stop them to
restore cluster performance for everyone else and 2) get them to find a
better way to do that task so we can avoid the issue... To your point
about finding the source of the bottleneck though, I'd much rather the
Ceph cluster was able to handle anything that was thrown at it... :-)
My feeling is that the MDS is easily overwhelmed; hopefully profiling
somehow can help shine a light there.

> Do you see hung requests or something? Anything in 'dmesg' on the
> client side?

I don't see anything useful on the client side in dmesg, unfortunately.
Just lots of clients talking to mons successfully.
The clients are using kernel ceph, and mounting with relatime (that
could explain lots of caps, even on a ro mount) and acl (assume this
puts extra load/checks on MDS). At a guess, we can probably optimise
the client mounts with noatime instead and maybe remove acl if we're
not using them - not sure of the impact to workloads though, so haven't
tried.

I'm not quite sure of the relationship of operations between MDS and
OSD data. The MDS gets written to nvme pool and clients access data
directly on OSD nodes, but do MDS operations also need to wait for OSDs
to perform operations? I think it makes sense that they do (for
example, to unlink a file MDS needs to check if there are any other
hardlinks to it, and if not, then the data can be deleted from OSDs and
the metadata updated to remove the file)?

So to that end, would slow performing OSDs also impact MDS performance?
Maybe it's stuck waiting for the OSDs to do their thing, and they
aren't fast enough... but then wouldn't I see much more %wa?

One thing that I noticed yesterday is that when the cluster is under
pressure the I/O and throughput of the MDS to the metadata pool goes
very spiky (OSD pool did
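For picking candidate problem clients out of `session ls` output, sorting on a couple of the session fields is a reasonable first pass. A sketch: `id`, `num_caps` and `request_load_avg` are real fields in the session ls JSON, but the sample values here are invented.

```python
def top_clients(sessions, key, n=5):
    """Return the n sessions with the highest value for the given field."""
    return sorted(sessions, key=lambda s: s.get(key, 0), reverse=True)[:n]

# Invented sample of `ceph daemon mds.<name> session ls` output.
sessions = [
    {"id": 101, "num_caps": 5_200_000, "request_load_avg": 12},
    {"id": 102, "num_caps": 4000, "request_load_avg": 50_000},
    {"id": 103, "num_caps": 1_100_000, "request_load_avg": 800},
]

print([s["id"] for s in top_clients(sessions, "num_caps", n=2)])          # cap hoarders
print([s["id"] for s in top_clients(sessions, "request_load_avg", n=2)])  # busy requesters
```

As the thread discusses, the two rankings can disagree completely (client 101 holds the caps, client 102 generates the requests), which is exactly why no single field identifies the rogue client.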
[ceph-users] Re: CephFS perforamnce degradation in root directory
On 8/9/22 4:07 PM, Robert Sander wrote:
> Hi,
>
> we have a cluster with 7 nodes each with 10 SSD OSDs providing CephFS
> to a CloudStack system as primary storage. When copying a large file
> into the root directory of the CephFS the bandwidth drops from
> 500MB/s to 50MB/s after around 30 seconds. We see some MDS activity
> in the output of "ceph fs status" at the same time.
>
> When copying the same file to a subdirectory of the CephFS the
> performance stays at 500MB/s for the whole time. MDS activity does
> not seem to influence the performance here.
>
> There are approx. 270 other files in the root directory. CloudStack
> stores VM images in qcow2 format there.
>
> Is this a known issue? Is there something special with the root
> directory of a CephFS wrt write performance?

AFAIK there is nothing special about the root dir. From my local test
there is no difference with a subdir.

BTW, could you test it more than once for the root dir? When you do
this for the first time Ceph may need to allocate the disk space, which
can take a little time.

Thanks.

Regards
[ceph-users] Re: Ceph needs your help with defining availability!
Hi guys, thank you so much for filling out the Ceph Cluster Availability survey! we have received a total of 59 responses from various groups of people, which is enough to help us understand more profoundly what availability means to everyone. As promised, here is the link to the results of the survey: https://docs.google.com/forms/d/1J5Ab5KCy6fceXxHI8KDqY2Qx3FzR-V9ivKp_vunEWZ0/viewanalytics Also, I've summarized some of the written responses such that it is easier for you to make sense of the results. I hope you will find these responses helpful and please feel free to reach out if you have any questions! Response summary of the question: “”” In your own words, please describe what availability means to you in a Ceph cluster. (For example, is it the ability to serve read and write requests even if the cluster is in a degraded state?). “”” In summary, the majority of people consider the definition of availability to be the ability to serve I/O with reasonable performance (some suggest 10-20%, others say it should be user configurable) + the ability to provide other services. A couple of people define availability as all PGs being in the state of active+clean, but we will come to learn that many people disagree with this in the next question. Interestingly, a handful of people suggests that cluster availability shouldn’t be binary, but rather a scale or tiers, e.g., one response suggests that we should have: 1. Fully available - all services can serve I/O normal performance. 2. Partially available 1. some access method, although configured, is not available e.g., CephFS works and RGW doesn’t. 2. only reads or writes are possible on some storage pools. 3. some storage pools are completely unavailable while others are completely or partially available. 4. performance is severely degraded. 5. some services are stopped/crashed. 3. Unavailable - when Partially available is not reached. 
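The tiered scheme suggested in that survey response could be expressed as a simple classifier. A minimal sketch: the service names and state labels below are my own placeholders, not Ceph health states.

```python
def availability_tier(services):
    """Map per-service states ('ok', 'degraded', 'down') onto the
    three-tier scheme from the survey response above."""
    states = list(services.values())
    if all(s == "ok" for s in states):
        return "fully available"
    if any(s in ("ok", "degraded") for s in states):
        return "partially available"
    return "unavailable"

print(availability_tier({"rados": "ok", "cephfs": "ok", "rgw": "ok"}))
print(availability_tier({"rados": "ok", "cephfs": "ok", "rgw": "down"}))
print(availability_tier({"rados": "down", "cephfs": "down", "rgw": "down"}))
```

The interesting design question the responses raise is where the boundaries sit: per-pool tracking, configurable performance thresholds, and so on would all refine the "partially available" bucket.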
Moreover, some suggest that we should track availability on a per-pool basis to deal with a scenario where we have different crush rules or when we can afford a pool to be unavailable. Furthermore, some responses care more about the availability of one service than another, e.g., one response states that they wouldn't care about the availability of RADOS if RGW is unavailable.

Response summary of the question:
"""
Do you agree with the following metric in evaluating a cluster's availability: "All placement group (PG) states in a cluster must have 'active' in them; if at least 1 PG does not have 'active' in it, then the cluster as a whole is deemed unavailable".
"""

35.8% of users answered `No`
35.8% of users answered `Yes`
28.3% of users answered `Maybe`

The data clearly shows that we can't have just this as the criterion for availability. Here are some of the reasons why 64.1% do not fully agree with the statement:

1. If the client does not interact with that particular PG then it is not important, e.g., if 1 PG is inactive and the S3 endpoint is down but CephFS can still serve I/O, we cannot say that the cluster is unavailable.
2. Some disagree because they believe that a PG relates to a single pool; therefore, that particular pool will be unavailable, not the cluster.
3. Furthermore, some suggest that there are events that might lead to PGs being inactive, such as provisioning a new OSD, creating a pool, or a PG split; however, these events don't necessarily indicate unavailability.

Response summary of the question:
"""
From your own experience, what are some of the most common events that cause a Ceph cluster to be considered unavailable based on your definition of availability.
"""

Top four responses:
1. Network-related issues, e.g., network failure/instability.
2. OSD-related issues, e.g., failure, slow ops, flapping.
3. Disk-related issues, e.g., dead disks.
4. PG-related issues, e.g., many PGs became stale, unknown, and stuck in peering.
Response summary of the question:
"""
Are there any events that you might consider a cluster to be unavailable but you feel like it is not worth tracking and is dismissible?
"""

Top three responses:
1. No, all unavailable events are worth tracking.
2. Network-related issues.
3. Scheduled upgrades or maintenance.

On Tue, Aug 9, 2022 at 1:51 PM Kamoltat Sirivadhna wrote:
> Hi John,
>
> Yes, I'm planning to summarize the results after this week. I will
> definitely share it with the community.
>
> Best,
>
> On Tue, Aug 9, 2022 at 1:19 PM John Bent wrote:
>
>> Hello Kamoltat,
>>
>> This sounds very interesting. Will you be sharing the results of the
>> survey back with the community?
>>
>> Thanks,
>>
>> John
>>
>> On Sat, Aug 6, 2022 at 4:49 AM Kamoltat Sirivadhna wrote:
>>
>>> Hi everyone,
>>>
>>> One of the features we are looking into implementing for our upcoming
>>> Ceph release (Reef) is
[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug
ceph-post-file: a9802e30-0096-410e-b5c0-f2e6d83acfd6 On Tue, Aug 16, 2022 at 3:13 AM Patrick Donnelly wrote: > On Mon, Aug 15, 2022 at 11:39 AM Daniel Williams > wrote: > > > > Using ubuntu with apt repository from ceph. > > > > Ok that helped me figure out that it's .mgr not mgr. > > # ceph -v > > ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy > (stable) > > # export CEPH_CONF='/etc/ceph/ceph.conf' > > # export CEPH_KEYRING='/etc/ceph/ceph.client.admin.keyring' > > # export CEPH_ARGS='--log_to_file true --log-file ceph-sqlite.log > --debug_cephsqlite 20 --debug_ms 1' > > # sqlite3 > > SQLite version 3.31.1 2020-01-27 19:55:54 > > Enter ".help" for usage hints. > > sqlite> .load libcephsqlite.so > > sqlite> .open file:///.mgr:devicehealth/main.db?vfs=ceph > > sqlite> .tables > > Segmentation fault (core dumped) > > > > # dpkg -l | grep ceph | grep sqlite > > ii libsqlite3-mod-ceph 17.2.3-1focal > amd64SQLite3 VFS for Ceph > > > > Attached ceph-sqlite.log > > No real good hint in the log unfortunately. I will need the core dump > to see where things went wrong. Can you upload it with > > https://docs.ceph.com/en/quincy/man/8/ceph-post-file/ > > ? > > -- > Patrick Donnelly, Ph.D. > He / Him / His > Principal Software Engineer > Red Hat, Inc. > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: The next quincy point release
This must go in the next quincy release:

https://github.com/ceph/ceph/pull/47288

but we're still waiting on reviews and final tests before merging into main.

On Mon, Aug 15, 2022 at 11:02 AM Yuri Weinstein wrote:
>
> We plan to start QE validation for the next quincy point release this week.
>
> Dev leads please tag all PRs needed to be included ("needs-qa") ASAP
> so they can be tested and merged on time.
>
> Thx
> YuriW

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug
On Mon, Aug 15, 2022 at 11:39 AM Daniel Williams wrote: > > Using ubuntu with apt repository from ceph. > > Ok that helped me figure out that it's .mgr not mgr. > # ceph -v > ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable) > # export CEPH_CONF='/etc/ceph/ceph.conf' > # export CEPH_KEYRING='/etc/ceph/ceph.client.admin.keyring' > # export CEPH_ARGS='--log_to_file true --log-file ceph-sqlite.log > --debug_cephsqlite 20 --debug_ms 1' > # sqlite3 > SQLite version 3.31.1 2020-01-27 19:55:54 > Enter ".help" for usage hints. > sqlite> .load libcephsqlite.so > sqlite> .open file:///.mgr:devicehealth/main.db?vfs=ceph > sqlite> .tables > Segmentation fault (core dumped) > > # dpkg -l | grep ceph | grep sqlite > ii libsqlite3-mod-ceph 17.2.3-1focal > amd64SQLite3 VFS for Ceph > > Attached ceph-sqlite.log No real good hint in the log unfortunately. I will need the core dump to see where things went wrong. Can you upload it with https://docs.ceph.com/en/quincy/man/8/ceph-post-file/ ? -- Patrick Donnelly, Ph.D. He / Him / His Principal Software Engineer Red Hat, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug
Using ubuntu with apt repository from ceph. Ok that helped me figure out that it's .mgr not mgr. # ceph -v ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable) # export CEPH_CONF='/etc/ceph/ceph.conf' # export CEPH_KEYRING='/etc/ceph/ceph.client.admin.keyring' # export CEPH_ARGS='--log_to_file true --log-file ceph-sqlite.log --debug_cephsqlite 20 --debug_ms 1' # sqlite3 SQLite version 3.31.1 2020-01-27 19:55:54 Enter ".help" for usage hints. sqlite> .load libcephsqlite.so sqlite> .open file:///.mgr:devicehealth/main.db?vfs=ceph sqlite> .tables Segmentation fault (core dumped) # dpkg -l | grep ceph | grep sqlite ii libsqlite3-mod-ceph 17.2.3-1focal amd64SQLite3 VFS for Ceph Attached ceph-sqlite.log On Mon, Aug 15, 2022 at 11:10 PM Patrick Donnelly wrote: > Hello Daniel, > > On Mon, Aug 15, 2022 at 10:38 AM Daniel Williams > wrote: > > > > My managers are crashing reading the sqlite database for deviceheatlth: > > .mgr:devicehealth/main.db-journal > > debug -2> 2022-08-15T11:14:09.184+ 7fa5721b7700 5 cephsqlite: > > Read: (client.53284882) [.mgr:devicehealth/main.db-journal] > 0x5601da0c0008 > > 4129788~65536 > > debug -1> 2022-08-15T11:14:09.184+ 7fa5721b7700 5 > client.53284882: > > SimpleRADOSStriper: read: main.db-journal: 4129788~65536 > > debug 0> 2022-08-15T11:14:09.200+ 7fa664aca700 -1 *** Caught > > signal (Segmentation fault) ** > > > > I upgraded to 17.2.3 but it seems like I'll need to do a sqlite recovery > on > > the database, since the devicehealth module is now non-optional. > > > > I tried: > > sqlite3 -cmd '.load libcephsqlite.so' '.open > > file:///mgr:devicehealth/main.db?vfs=ceph' > > but that didn't work > > Error: unable to open database ".open > > file:///mgr:devicehealth/main.db?vfs=ceph": unable to open database file > > > > Any suggestions? > > Are you on Ubuntu or CentOS? 
> You can try to figure out where things are going wrong loading the
> database via:
>
> env CEPH_ARGS='--log_to_file true --log-file foo.log
> --debug_cephsqlite 20 --debug_ms 1' sqlite3 ...
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
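If the damaged database can be opened at all (for example after copying it out of RADOS to a local file), one generic salvage approach is a dump-and-reload that skips whatever cannot be read. A hedged sketch using Python's stdlib sqlite3 module, not a cephsqlite-specific tool:

```python
import sqlite3

def salvage(src_path, dst_path):
    """Copy every readable schema object and row from src into a fresh
    database, silently dropping anything the damaged file refuses to
    yield."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    for stmt in src.iterdump():  # emits schema + data as SQL statements
        if stmt in ("BEGIN TRANSACTION;", "COMMIT;"):
            continue  # manage the transaction ourselves
        try:
            dst.execute(stmt)
        except sqlite3.Error:
            pass  # unreadable row/object: skip it and keep going
    dst.commit()
    src.close()
    dst.close()
```

Newer sqlite3 CLI builds also ship a `.recover` command that attempts something similar at a lower level, which may rescue more from a file with a corrupted journal.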
[ceph-users] Re: Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug
Hello Daniel, On Mon, Aug 15, 2022 at 10:38 AM Daniel Williams wrote: > > My managers are crashing reading the sqlite database for deviceheatlth: > .mgr:devicehealth/main.db-journal > debug -2> 2022-08-15T11:14:09.184+ 7fa5721b7700 5 cephsqlite: > Read: (client.53284882) [.mgr:devicehealth/main.db-journal] 0x5601da0c0008 > 4129788~65536 > debug -1> 2022-08-15T11:14:09.184+ 7fa5721b7700 5 client.53284882: > SimpleRADOSStriper: read: main.db-journal: 4129788~65536 > debug 0> 2022-08-15T11:14:09.200+ 7fa664aca700 -1 *** Caught > signal (Segmentation fault) ** > > I upgraded to 17.2.3 but it seems like I'll need to do a sqlite recovery on > the database, since the devicehealth module is now non-optional. > > I tried: > sqlite3 -cmd '.load libcephsqlite.so' '.open > file:///mgr:devicehealth/main.db?vfs=ceph' > but that didn't work > Error: unable to open database ".open > file:///mgr:devicehealth/main.db?vfs=ceph": unable to open database file > > Any suggestions? Are you on Ubuntu or CentOS? You can try to figure out where things are going wrong loading the database via: env CEPH_ARGS='--log_to_file true --log-file foo.log --debug_cephsqlite 20 --debug_ms 1' sqlite3 ... -- Patrick Donnelly, Ph.D. He / Him / His Principal Software Engineer Red Hat, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Some odd results while testing disk performance related to write caching
Hi, We have some docs about this in the Ceph hardware recommendations: https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches I added some responses inline.. On Fri, Aug 5, 2022 at 7:23 PM Torbjörn Jansson wrote: > > Hello > > i got a small 3 node ceph cluster and i'm doing some bench marking related to > performance with drive write caching. > > the reason i started was because i wanted to test the SSDs i have for their > performance for use as db device for the osds and make sure they are setup as > good as i can get it. > > i read that turning off write cache can be beneficial even when it sounds > backwards. "write cache" is a volatile cache -- so when it is enabled, Linux knows that it is writing to a volatile area on the device and therefore it needs to issue flushes to persist data. Linux considers these devices to be in "write back" mode. When the write cache is disabled, then Linux knows it is writing to a persisted area, and therefore doesn't bother sending flushes anymore -- these devices are in "write through" mode. And btw, new data centre class devices have firmware and special hardware to accelerate those persisted writes when the volatile cache is disabled. This is the so-called media cache. > this seems to be true. > i used mainly fio and "iostat -x" to test using something like: > fio --filename=/dev/ceph-db-0/bench --direct=1 --sync=1 --rw=write --bs=4k > --numjobs=5 --iodepth=1 --runtime=60 --time_based --group_reporting > > and then testing this with write cache turned off and on to compare the > results. > also with and without sync in fio command above. > > one thing i observed related to turning off the write cache on drives was that > it appears a reboot is needed for it to have any effect. This is depending on the OS -- if you set the cache using the approach mentioned in the docs above, then in all distros we tested it keeps WCE and "write through" consistent with each other. 
> and this is where it gets strange and the part i don't get. > > the disks i have, seagate nytro sas3 ssd, according to the drive manual the > drive don't care what you set the WCE bit to and it will do write caching > internally regardless. > most likely because it is an enterprise disk with built in power loss > protection. > > BUT it makes a big difference to the performance and the flush per seconds in > iostat. > so it appears that if you boot and the drive got its write cache disabled > right > from the start (dmesg contains stuff like: "sd 0:0:0:0: [sda] Write cache: > disabled") then linux wont send any flush to the drive and you get good > performance. > if you change the write caching on a drive during runtime (sdparm for sas or > hdparm for sata) then it wont change anything. Check the cache_type at e.g. /sys/class/scsi_disk/0\:0\:0\:0/cache_type "write back" -> flush is sent "write through" -> flush not sent > why is that? why do i have to do a reboot? > i mean, lets say you boot with write cache disabled, linux decides to never > send flush and you change it after boot to enable the cache, if there is no > flush then you risk your data in case of a power loss, or? On all devices we have, if we have "write through" at boot, then set (with hdparm or sdparm) WCE=1 or echo "write back" > ... then the cache_type is automatically set correctly to "write back" and flushes are sent. There is another /sys/ entry to toggle flush behaviour: echo "write through" > /sys/block/sda/queue/write_cache This is apparently a way to lie to the OS so it stops sending flushes (without manipulating the WCE mode of the underlying device). Cheers, Dan > this is not very obvious or good behavior i think (i hope i'm wrong and some > one can enlighten me) > > > for sas drives sdparm -s WCE=0 --save /dev/sdX appears to do the right thing > and it survives a reboot. 
> but for sata disks hdparm -W 0 -K 1 /dev/sdX makes the change but as
> long as the drive is connected to a sas controller it still gets the
> write cache enabled at boot, so i bet the sas controller also messes
> with the write cache setting on the drives.
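The cache_type rule Dan describes boils down to a one-line predicate: Linux issues FLUSH commands only while it believes the device cache is volatile. A trivial sketch; the sysfs path in the comment is shown as an assumption about where the value lives on a typical SCSI disk.

```python
def flushes_sent(cache_type: str) -> bool:
    """Per the behaviour described above: the kernel sends FLUSH only
    when it treats the device cache as volatile, i.e. cache_type reads
    'write back'; 'write through' means no flushes are issued."""
    return cache_type.strip() == "write back"

# On a live host the value would come from sysfs, e.g.:
#   cache_type = open("/sys/class/scsi_disk/0:0:0:0/cache_type").read()
print(flushes_sent("write back"))     # flushes are sent
print(flushes_sent("write through"))  # flushes are not sent
```

This also explains the observed reboot dependency: what matters for flush behaviour is the cache_type the kernel recorded, not what the drive's firmware actually does with the WCE bit.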
[ceph-users] Quincy: Corrupted devicehealth sqlite3 database from MGR crashing bug
My managers are crashing reading the sqlite database for devicehealth:
.mgr:devicehealth/main.db-journal

debug -2> 2022-08-15T11:14:09.184+ 7fa5721b7700 5 cephsqlite: Read: (client.53284882) [.mgr:devicehealth/main.db-journal] 0x5601da0c0008 4129788~65536
debug -1> 2022-08-15T11:14:09.184+ 7fa5721b7700 5 client.53284882: SimpleRADOSStriper: read: main.db-journal: 4129788~65536
debug 0> 2022-08-15T11:14:09.200+ 7fa664aca700 -1 *** Caught signal (Segmentation fault) **

I upgraded to 17.2.3 but it seems like I'll need to do a sqlite recovery on the database, since the devicehealth module is now non-optional.

I tried:

sqlite3 -cmd '.load libcephsqlite.so' '.open file:///mgr:devicehealth/main.db?vfs=ceph'

but that didn't work:

Error: unable to open database ".open file:///mgr:devicehealth/main.db?vfs=ceph": unable to open database file

Any suggestions?

Also I've seen some pretty crazy bugs in Quincy now (rebalancing uses 100% cpu - still not fixed, and the mgr crashing), maybe I jumped in too early? Is this normal at the start of a release? Is there guidance for a roughly safe subversion to wait for before upgrading to a new release?
[ceph-users] Re: Recovery very slow after upgrade to quincy
On 15-08-2022 08:24, Satoru Takeuchi wrote:
> On Sat, Aug 13, 2022 at 1:35, Robert W. Eckert wrote:
>> Interesting, a few weeks ago I added a new disk to each of my 3 node
>> cluster and saw the same 2 Mb/s recovery. What I had noticed was that
>> one OSD was using very high CPU and seems to have been the primary node
>> on the affected PGs. I couldn't find anything overly wrong with the OSD,
>> network, etc.
>>
>> You may want to look at the output of
>>
>> ceph pg ls
>>
>> to see if the recovery is sourced from one specific OSD or one host,
>> then check that host/OSD for high CPU/memory.
>
> Probably you hit this bug:
>
> https://tracker.ceph.com/issues/56530
>
> It can be bypassed by setting the "osd_op_queue=wpq" configuration.
>
> Best,
> Satoru

Thanks both of you. Doing "ceph config set osd osd_op_queue wpq" and restarting the OSDs seems to have fixed it.

Mvh.

Torkil

> -----Original Message-----
> From: Torkil Svensgaard
> Sent: Friday, August 12, 2022 7:50 AM
> To: ceph-users@ceph.io
> Cc: Ruben Vestergaard
> Subject: [ceph-users] Recovery very slow after upgrade to quincy
>
> 6 hosts with 2 x 10G NICs, data in 2+2 EC pool. 17.2.0, upgrade from pacific.
>
>   cluster:
>     id:
>     health: HEALTH_WARN
>             2 host(s) running different kernel versions
>             2071 pgs not deep-scrubbed in time
>             837 pgs not scrubbed in time
>
>   services:
>     mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
>     mgr:        dcn-ceph-01.dzercj (active, since 6h), standbys: dcn-ceph-03.lrhaxo
>     mds:        1/1 daemons up, 2 standby
>     osd:        118 osds: 118 up (since 6d), 118 in (since 6d); 66 remapped pgs
>     rbd-mirror: 2 daemons active (2 hosts)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   9 pools, 2737 pgs
>     objects: 246.02M objects, 337 TiB
>     usage:   665 TiB used, 688 TiB / 1.3 PiB avail
>     pgs:     42128281/978408875 objects misplaced (4.306%)
>              2332 active+clean
>              281  active+clean+snaptrim_wait
>              66   active+remapped+backfilling
>              36   active+clean+snaptrim
>              11   active+clean+scrubbing+deep
>              8    active+clean+scrubbing
>              1    active+clean+scrubbing+deep+snaptrim_wait
>              1    active+clean+scrubbing+deep+snaptrim
>              1    active+clean+scrubbing+snaptrim
>
>   io:
>     client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
>     recovery: 2.0 MiB/s, 3 objects/s
>
> Low load, low latency, low network traffic. Tried
> osd_mclock_profile=high_recovery_ops, no difference. Disabling scrubs
> and snaptrim, no difference.
>
> Am I missing something obvious I should have done after the upgrade?
>
> Mvh.
>
> Torkil
>
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance
DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
[ceph-users] Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous
Hi,

do you see high disk utilization on the OSD nodes? How is the load on the active MDS? How much RAM is configured for the MDS (mds_cache_memory_limit)?

You can list all MDS sessions with 'ceph daemon mds. session ls' to identify all your clients, and 'ceph daemon mds. dump_blocked_ops' to show blocked requests. But simply killing sessions isn't a solution, so first you need to find out where the bottleneck is. Do you see hung requests or anything similar? Anything in 'dmesg' on the client side?

Quoting Chris Smart:

Hi all,

I have recently inherited a 10 node Ceph cluster running Luminous (12.2.12) which is running specifically for CephFS (and I don't know much about MDS), with only one active MDS server (two standby). It's not a great cluster IMO: the cephfs_data pool is on high-density nodes with high-capacity SATA drives, but at least the cephfs_metadata pool is on NVMe drives.

Access to the cluster regularly goes slow for clients and I'm seeing lots of warnings like this:

MDSs behind on trimming (MDS_TRIM)
MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
MDSs report slow requests (MDS_SLOW_REQUEST)
MDSs have many clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE_MANY)

If there is only one client that's failing to respond to capability release, I can see the client id in the output, work out which user that is, and get their job stopped. Performance then usually improves a bit. However, if there is more than one, the output only shows a summary of the number of clients, and I don't know who the clients are to get their jobs cancelled.

Is there a way I can work out which clients these are? I'm guessing some kind of combination of in_flight_ops, blocked_ops and total num_caps? However, I also feel like just having a large number of caps isn't _necessarily_ an indicator of a problem; sometimes restarting the MDS and forcing clients to drop unused caps helps, sometimes it doesn't.
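To narrow down which sessions hold the most caps, the JSON that 'session ls' prints can simply be sorted by num_caps. Below is a rough sketch, assuming field names like num_caps and client_metadata as seen in Luminous-era dumps (treat them as assumptions and check against your own output; the sample data here is entirely invented):

```python
import json

# Invented sample resembling 'ceph daemon mds.<name> session ls' output;
# real dumps carry many more fields per session.
session_ls_json = """
[
  {"id": 53284882, "num_caps": 1200543,
   "client_metadata": {"hostname": "hpc-node-17", "entity_id": "cephfs-user"}},
  {"id": 53284991, "num_caps": 4312,
   "client_metadata": {"hostname": "login-01", "entity_id": "cephfs-user"}}
]
"""

sessions = json.loads(session_ls_json)

# Sort sessions so the biggest cap holders come first.
worst = sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)

for s in worst:
    meta = s.get("client_metadata", {})
    print(s["id"], s.get("num_caps", 0), meta.get("hostname", "?"))
```

The hostname from client_metadata is usually enough to map a session back to a user's job. As noted above, a big cap count alone isn't proof of a problem, but it's a quick way to build a shortlist.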
I'm curious if there's a better way to determine any clients that might be causing issues in the cluster? To that end, I've noticed there is a metric called "request_load_avg" in the output of ceph mds client ls, but I can't quite find any information about it. It _seems_ like it could indicate a client that's doing lots and lots of requests, and therefore a useful metric to see which client might be hammering the cluster, but does anyone know for sure?

Many thanks,
Chris
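As far as I understand it (treat this as an assumption rather than a statement about the Luminous source), the MDS tracks per-client request load with exponentially decaying counters: each request adds a hit, and the accumulated value decays with a configured half-life, so request_load_avg behaves like a smoothed recent-request rate rather than a lifetime total. A toy model of that kind of counter, with an arbitrary half-life that is not what the MDS uses:

```python
import math

class DecayCounter:
    """Toy exponentially decaying counter: hits add 1, value halves every half_life seconds."""

    def __init__(self, half_life=5.0):
        self.k = math.log(2) / half_life  # decay rate derived from the half-life
        self.value = 0.0
        self.last = 0.0                   # time of the last update, in seconds

    def _decay_to(self, now):
        # Apply the decay accumulated since the last update.
        self.value *= math.exp(-self.k * (now - self.last))
        self.last = now

    def hit(self, now, n=1.0):
        self._decay_to(now)
        self.value += n

    def get(self, now):
        self._decay_to(now)
        return self.value

c = DecayCounter(half_life=5.0)
for t in range(10):        # a burst: one request per second for 10 seconds
    c.hit(float(t))

v_after_burst = c.get(10.0)  # high right after the burst
v_later = c.get(30.0)        # four half-lives later it has mostly faded
print(round(v_after_burst, 2), round(v_later, 2))
```

The useful property is exactly what you want in a "who is hammering the cluster" metric: a client that stopped sending requests minutes ago reads near zero, while a client in the middle of a burst reads high.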
[ceph-users] Re: CephFS performance degradation in root directory
On 09.08.22 at 10:07, Robert Sander wrote:

When copying the same file to a subdirectory of the CephFS the performance stays at 500MB/s for the whole time. MDS activity does not seem to influence the performance here.

Here is a new data point: when mounting the subdirectory (and not CephFS's root), the performance also degrades, while it stays up when writing into a subdirectory. Is there something special about the mountpoint?

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG: HRB 220009 B / Amtsgericht Berlin-Charlottenburg, Managing Director: Peer Heinlein -- Registered office: Berlin
[ceph-users] Re: Recovery very slow after upgrade to quincy
On Sat, Aug 13, 2022 at 1:35, Robert W. Eckert wrote:
> Interesting, a few weeks ago I added a new disk to each of my 3 node
> cluster and saw the same 2 Mb/s recovery. What I had noticed was that
> one OSD was using very high CPU and seems to have been the primary node
> on the affected PGs. I couldn't find anything overly wrong with the OSD,
> network, etc.
>
> You may want to look at the output of
>
> ceph pg ls
>
> to see if the recovery is sourced from one specific OSD or one host,
> then check that host/OSD for high CPU/memory.

Probably you hit this bug:

https://tracker.ceph.com/issues/56530

It can be bypassed by setting the "osd_op_queue=wpq" configuration.

Best,
Satoru

> -----Original Message-----
> From: Torkil Svensgaard
> Sent: Friday, August 12, 2022 7:50 AM
> To: ceph-users@ceph.io
> Cc: Ruben Vestergaard
> Subject: [ceph-users] Recovery very slow after upgrade to quincy
>
> 6 hosts with 2 x 10G NICs, data in 2+2 EC pool. 17.2.0, upgrade from pacific.
>
>   cluster:
>     id:
>     health: HEALTH_WARN
>             2 host(s) running different kernel versions
>             2071 pgs not deep-scrubbed in time
>             837 pgs not scrubbed in time
>
>   services:
>     mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
>     mgr:        dcn-ceph-01.dzercj (active, since 6h), standbys: dcn-ceph-03.lrhaxo
>     mds:        1/1 daemons up, 2 standby
>     osd:        118 osds: 118 up (since 6d), 118 in (since 6d); 66 remapped pgs
>     rbd-mirror: 2 daemons active (2 hosts)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   9 pools, 2737 pgs
>     objects: 246.02M objects, 337 TiB
>     usage:   665 TiB used, 688 TiB / 1.3 PiB avail
>     pgs:     42128281/978408875 objects misplaced (4.306%)
>              2332 active+clean
>              281  active+clean+snaptrim_wait
>              66   active+remapped+backfilling
>              36   active+clean+snaptrim
>              11   active+clean+scrubbing+deep
>              8    active+clean+scrubbing
>              1    active+clean+scrubbing+deep+snaptrim_wait
>              1    active+clean+scrubbing+deep+snaptrim
>              1    active+clean+scrubbing+snaptrim
>
>   io:
>     client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
>     recovery: 2.0 MiB/s, 3 objects/s
>
> Low load, low latency, low network traffic. Tried
> osd_mclock_profile=high_recovery_ops, no difference. Disabling scrubs
> and snaptrim, no difference.
>
> Am I missing something obvious I should have done after the upgrade?
>
> Mvh.
>
> Torkil
>
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk
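Robert's 'ceph pg ls' suggestion can be automated: with JSON output, tally which OSD is acting primary for the backfilling PGs and see whether one stands out. A rough sketch follows; the field names (pg_stats, state, acting_primary) are written from memory of the 'ceph pg ls -f json' format, so verify them against your own output, and the sample data below is invented:

```python
import json
from collections import Counter

# Invented sample resembling 'ceph pg ls -f json' output.
pg_ls_json = """
{"pg_stats": [
  {"pgid": "2.1a", "state": "active+remapped+backfilling", "acting_primary": 17},
  {"pgid": "2.2b", "state": "active+remapped+backfilling", "acting_primary": 17},
  {"pgid": "2.3c", "state": "active+remapped+backfilling", "acting_primary": 42},
  {"pgid": "2.4d", "state": "active+clean", "acting_primary": 5}
]}
"""

stats = json.loads(pg_ls_json)["pg_stats"]

# Count how often each OSD is acting primary for a backfilling PG.
primaries = Counter(
    pg["acting_primary"] for pg in stats if "backfilling" in pg["state"]
)

for osd, count in primaries.most_common():
    print(f"osd.{osd}: primary for {count} backfilling PG(s)")
```

If one OSD dominates the list, that is the host to inspect for high CPU or memory before reaching for osd_op_queue=wpq.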