[ceph-users] Question regarding bluestore labels
I have a question regarding bluestore labels, specifically for a block.db partition. To make a long story short, we are currently in a position where, on checking the label of a block.db partition, it appears corrupted. I have seen another thread on here suggesting to copy the label from a working OSD to the non-working OSD, then re-add the correct values to the label with ceph-bluestore-tool. Where this was mentioned, it was with an OSD's main block device in mind; would the same logic apply if we were working with a db device instead? This assumes the corrupted label is the only issue with the db device, and there are no other issues.

Regards,
Bailey
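For what it's worth, a minimal sketch of the label-copy approach for a db device. The device path and values here are placeholders, and since the thread referenced above discussed the main block device, treat the db case as untested; back up the label region before writing anything:

$ # back up the first 4 KiB of the device, where the bluestore label lives
$ dd if=/dev/vg-db/lv-db of=/root/db-label.bak bs=4096 count=1

$ # inspect whatever is still readable
$ ceph-bluestore-tool show-label --dev /dev/vg-db/lv-db

$ # after copying a label from a healthy db device of identical size,
$ # set the fields that must differ per OSD, e.g.:
$ ceph-bluestore-tool set-label-key --dev /dev/vg-db/lv-db -k osd_uuid -v <osd-fsid>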
[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?
> On Jun 7, 2024, at 13:20, Mark Lehrer wrote:
>
>> server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
>
> Thanks for the reply. The servers have dual Xeon Gold 6154 CPUs with
> 384 GB

So roughly 7 vcores / HTs per OSD? Is your Ceph a recent release?

> The drives are older, first gen NVMe - WDC SN620.

Those appear to be a former SanDisk product and lower performers than more recent drives; how much of a factor that is I can't say. Which specific SKU? There appear to be low- and standard-endurance SKUs, 3.84 or 1.92 TB and 3.2 or 1.6 TB respectively. What is the lifetime used like on them? Less than 80%?

If you really want to eliminate uncertainties:

* Ensure they're updated to the latest firmware
* In rolling fashion, destroy the OSDs, secure-erase each drive, redeploy the OSDs

> osd_memory_target is at the default. Mellanox CX5 and SN2700
> hardware. The test client is a similar machine with no drives.

This is via RBD? Do you have the client RBD cache on or off?

> The CPUs are 80% idle during the test.

Do you have the server BMC/BIOS profile set to performance? Deep C-states disabled via TuneD or other means?

> The OSDs (according to iostat)

Careful: iostat's metrics are of limited utility on SSDs, especially NVMe.

> I did find it interesting that the wareq-sz option in iostat is around
> 5 during the test - I was expecting 16. Is there a way to tweak this
> in bluestore?

Not my area of expertise, but I once tried to make OSDs with a >4KB BlueStore block size; they crashed at startup. 4096 is hardcoded in various places. Quality SSD firmware will coalesce writes to NAND. If your firmware surfaces host vs. NAND writes, you might capture deltas over, say, a week of workload and calculate the WAF.

> These drives are terrible at under 8K I/O. Not that it really matters since
> we're not I/O bound at all.

"I/O bound" can be tricky; be careful with that assumption, as there are multiple facets. I can't find anything specific, but that makes me suspect that internally the IU isn't the usual 4KB, perhaps to save a few bucks on DRAM.

> I can also increase threads from 8 to 32 and the iops are roughly
> quadruple so that's good at least. Single thread writes are about 250
> iops and like 3.7 MB/sec. So sad.

Assuming that the pool you're writing to spans all 60 OSDs, what is your PG count on that pool? Are there multiple pools in the cluster? As reported by `ceph osd df`, on average how many PG replicas are on each OSD?

> The rados bench process is also under 50% CPU utilization of a single
> core. This seems like a thread/semaphore kind of issue if I had to
> guess. It's tricky to debug when there is no obvious bottleneck.

rados bench is a good smoke test, but fio may better represent the E2E experience (see the sketch after this message).

> Thanks,
> Mark
>
> On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri wrote:
>>
>> Please describe:
>>
>> * server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
>>
>>> On Jun 7, 2024, at 11:32, Mark Lehrer wrote:
>>>
>>> I've been using MySQL on Ceph forever, and have been down this road
>>> before but it's been a couple of years so I wanted to see if there is
>>> anything new here.
>>>
>>> So the TL;DR version of this email - is there a good way to improve
>>> 16K write IOPs with a small number of threads? The OSDs themselves
>>> are idle so is this just a weakness in the algorithms or do ceph
>>> clients need some profiling? Or "other"?
>>>
>>> Basically, this is one of the worst possible Ceph workloads so it is
>>> fun to try to push the limits. I also happen to have a MySQL instance
>>> that is reaching the write IOPs limit so this is also a last-ditch
>>> effort to keep it on Ceph.
>>>
>>> This cluster is as straightforward as it gets... 6 servers with 10
>>> SSDs each, 100 Gb networking. I'm using size=3. During operations,
>>> the OSDs are more or less idle so I don't suspect any hardware
>>> limitations.
>>>
>>> MySQL has no parallelism so the number of threads and effective queue
>>> depth stay pretty low. Therefore, as a proxy for MySQL I use rados
>>> bench with 16K writes and 8 threads. The RBD actually gets about 2x
>>> this level - still not so great.
>>>
>>> I get about 2000 IOPs with this test:
>>>
>>> # rados bench -p volumes 10 write -t 8 -b 16K
>>> hints = 1
>>> Maintaining 8 concurrent writes of 16384 bytes to objects of size
>>> 16384 for up to 10 seconds or 0 objects
>>> Object prefix: benchmark_data_fstosinfra-5_3652583
>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>     0       0         0         0         0         0           -           0
>>>     1       8      2050      2042   31.9004   31.9062  0.00247633  0.00390848
>>>     2       8      4306      4298   33.5728     35.25  0.00278488  0.00371784
>>>     3       8      6607      6599   34.3645   35.9531  0.00277546  0.00363139
>>>     4       7      8951      8944   34.9323   36.6406  0.00414908  0.00357249
>>>     5
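To expand on the fio suggestion above, here is a minimal sketch using fio's rbd engine to mimic the 16K / QD8 pattern. Pool, image, and client names are placeholders, and the image must exist beforehand:

$ fio --name=mysql-16k --ioengine=rbd --clientname=admin --pool=volumes \
      --rbdname=fio-test --rw=randwrite --bs=16k --iodepth=8 \
      --direct=1 --time_based --runtime=60 --group_reporting

Unlike rados bench, this exercises librbd end to end (client cache settings, object map, and so on), so it tends to be a closer proxy for what MySQL actually sees.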
[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?
> server RAM and CPU
> * osd_memory_target
> * OSD drive model

Thanks for the reply. The servers have dual Xeon Gold 6154 CPUs with 384 GB. The drives are older, first gen NVMe - WDC SN620. osd_memory_target is at the default. Mellanox CX5 and SN2700 hardware. The test client is a similar machine with no drives.

The CPUs are 80% idle during the test. The OSDs (according to iostat) hover around 50% util during the test and are close to 0 at other times.

I did find it interesting that the wareq-sz option in iostat is around 5 during the test - I was expecting 16. Is there a way to tweak this in bluestore? These drives are terrible at under 8K I/O. Not that it really matters, since we're not I/O bound at all.

I can also increase threads from 8 to 32 and the iops roughly quadruple, so that's good at least. Single thread writes are about 250 iops and like 3.7 MB/sec. So sad.

The rados bench process is also under 50% CPU utilization of a single core. This seems like a thread/semaphore kind of issue if I had to guess. It's tricky to debug when there is no obvious bottleneck.

Thanks,
Mark

On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri wrote:
>
> Please describe:
>
> * server RAM and CPU
> * osd_memory_target
> * OSD drive model
>
>> On Jun 7, 2024, at 11:32, Mark Lehrer wrote:
>>
>> I've been using MySQL on Ceph forever, and have been down this road
>> before but it's been a couple of years so I wanted to see if there is
>> anything new here.
>>
>> So the TL;DR version of this email - is there a good way to improve
>> 16K write IOPs with a small number of threads? The OSDs themselves
>> are idle so is this just a weakness in the algorithms or do ceph
>> clients need some profiling? Or "other"?
>>
>> Basically, this is one of the worst possible Ceph workloads so it is
>> fun to try to push the limits. I also happen to have a MySQL instance
>> that is reaching the write IOPs limit so this is also a last-ditch
>> effort to keep it on Ceph.
>>
>> This cluster is as straightforward as it gets... 6 servers with 10
>> SSDs each, 100 Gb networking. I'm using size=3. During operations,
>> the OSDs are more or less idle so I don't suspect any hardware
>> limitations.
>>
>> MySQL has no parallelism so the number of threads and effective queue
>> depth stay pretty low. Therefore, as a proxy for MySQL I use rados
>> bench with 16K writes and 8 threads. The RBD actually gets about 2x
>> this level - still not so great.
>>
>> I get about 2000 IOPs with this test:
>>
>> # rados bench -p volumes 10 write -t 8 -b 16K
>> hints = 1
>> Maintaining 8 concurrent writes of 16384 bytes to objects of size
>> 16384 for up to 10 seconds or 0 objects
>> Object prefix: benchmark_data_fstosinfra-5_3652583
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>     0       0         0         0         0         0           -           0
>>     1       8      2050      2042   31.9004   31.9062  0.00247633  0.00390848
>>     2       8      4306      4298   33.5728     35.25  0.00278488  0.00371784
>>     3       8      6607      6599   34.3645   35.9531  0.00277546  0.00363139
>>     4       7      8951      8944   34.9323   36.6406  0.00414908  0.00357249
>>     5       8     11292     11284    35.257   36.5625  0.00291434  0.00353997
>>     6       8     13588     13580   35.3588    35.875  0.00306094  0.00353084
>>     7       7     15933     15926   35.5432   36.6562  0.00308388   0.0035123
>>     8       8     18361     18353   35.8399   37.9219  0.00314996  0.00348327
>>     9       8     20629     20621   35.7947   35.4375  0.00352998   0.0034877
>>    10       5     23010     23005   35.9397     37.25  0.00395566  0.00347376
>> Total time run:         10.003
>> Total writes made:      23010
>> Write size:             16384
>> Object size:            16384
>> Bandwidth (MB/sec):     35.9423
>> Stddev Bandwidth:       1.63433
>> Max bandwidth (MB/sec): 37.9219
>> Min bandwidth (MB/sec): 31.9062
>> Average IOPS:           2300
>> Stddev IOPS:            104.597
>> Max IOPS:               2427
>> Min IOPS:               2042
>> Average Latency(s):     0.0034737
>> Stddev Latency(s):      0.00163661
>> Max latency(s):         0.115932
>> Min latency(s):         0.00179735
>> Cleaning up (deleting benchmark objects)
>> Removed 23010 objects
>> Clean up completed and total clean up time: 7.44664
>>
>> Are there any good options to improve this? It seems like the client
>> side is the bottleneck since the OSD servers are at like 15%
>> utilization.
>>
>> Thanks,
>> Mark
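On the WAF point raised upthread: "host vs NAND writes" can be approximated with nvme-cli. A sketch, noting that "Data Units Written" counts host writes in thousands of 512-byte units per the NVMe spec, while NAND-side counters are vendor-specific and may not be exposed on the SN620 at all:

$ # snapshot now, and again after ~a week of workload
$ nvme smart-log /dev/nvme0n1 | grep -i 'data.units.written'

$ # host bytes written = (delta of data_units_written) * 1000 * 512
$ # WAF = NAND bytes written / host bytes written -- the numerator needs
$ # a vendor plugin or extended SMART log that reports NAND writes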
[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?
Please describe:

* server RAM and CPU
* osd_memory_target
* OSD drive model

> On Jun 7, 2024, at 11:32, Mark Lehrer wrote:
>
> I've been using MySQL on Ceph forever, and have been down this road
> before but it's been a couple of years so I wanted to see if there is
> anything new here.
>
> So the TL;DR version of this email - is there a good way to improve
> 16K write IOPs with a small number of threads? The OSDs themselves
> are idle so is this just a weakness in the algorithms or do ceph
> clients need some profiling? Or "other"?
>
> Basically, this is one of the worst possible Ceph workloads so it is
> fun to try to push the limits. I also happen to have a MySQL instance
> that is reaching the write IOPs limit so this is also a last-ditch
> effort to keep it on Ceph.
>
> This cluster is as straightforward as it gets... 6 servers with 10
> SSDs each, 100 Gb networking. I'm using size=3. During operations,
> the OSDs are more or less idle so I don't suspect any hardware
> limitations.
>
> MySQL has no parallelism so the number of threads and effective queue
> depth stay pretty low. Therefore, as a proxy for MySQL I use rados
> bench with 16K writes and 8 threads. The RBD actually gets about 2x
> this level - still not so great.
>
> I get about 2000 IOPs with this test:
>
> # rados bench -p volumes 10 write -t 8 -b 16K
> hints = 1
> Maintaining 8 concurrent writes of 16384 bytes to objects of size
> 16384 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_fstosinfra-5_3652583
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1       8      2050      2042   31.9004   31.9062  0.00247633  0.00390848
>     2       8      4306      4298   33.5728     35.25  0.00278488  0.00371784
>     3       8      6607      6599   34.3645   35.9531  0.00277546  0.00363139
>     4       7      8951      8944   34.9323   36.6406  0.00414908  0.00357249
>     5       8     11292     11284    35.257   36.5625  0.00291434  0.00353997
>     6       8     13588     13580   35.3588    35.875  0.00306094  0.00353084
>     7       7     15933     15926   35.5432   36.6562  0.00308388   0.0035123
>     8       8     18361     18353   35.8399   37.9219  0.00314996  0.00348327
>     9       8     20629     20621   35.7947   35.4375  0.00352998   0.0034877
>    10       5     23010     23005   35.9397     37.25  0.00395566  0.00347376
> Total time run:         10.003
> Total writes made:      23010
> Write size:             16384
> Object size:            16384
> Bandwidth (MB/sec):     35.9423
> Stddev Bandwidth:       1.63433
> Max bandwidth (MB/sec): 37.9219
> Min bandwidth (MB/sec): 31.9062
> Average IOPS:           2300
> Stddev IOPS:            104.597
> Max IOPS:               2427
> Min IOPS:               2042
> Average Latency(s):     0.0034737
> Stddev Latency(s):      0.00163661
> Max latency(s):         0.115932
> Min latency(s):         0.00179735
> Cleaning up (deleting benchmark objects)
> Removed 23010 objects
> Clean up completed and total clean up time: 7.44664
>
> Are there any good options to improve this? It seems like the client
> side is the bottleneck since the OSD servers are at like 15%
> utilization.
>
> Thanks,
> Mark
[ceph-users] Ceph RBD, MySQL write IOPs - what is possible?
I've been using MySQL on Ceph forever, and have been down this road before, but it's been a couple of years so I wanted to see if there is anything new here.

So the TL;DR version of this email: is there a good way to improve 16K write IOPs with a small number of threads? The OSDs themselves are idle, so is this just a weakness in the algorithms, or do ceph clients need some profiling? Or "other"?

Basically, this is one of the worst possible Ceph workloads, so it is fun to try to push the limits. I also happen to have a MySQL instance that is reaching its write IOPs limit, so this is also a last-ditch effort to keep it on Ceph.

This cluster is as straightforward as it gets... 6 servers with 10 SSDs each, 100 Gb networking. I'm using size=3. During operations, the OSDs are more or less idle, so I don't suspect any hardware limitations.

MySQL has no parallelism, so the number of threads and effective queue depth stay pretty low. Therefore, as a proxy for MySQL I use rados bench with 16K writes and 8 threads. The RBD actually gets about 2x this level - still not so great.

I get about 2000 IOPs with this test:

# rados bench -p volumes 10 write -t 8 -b 16K
hints = 1
Maintaining 8 concurrent writes of 16384 bytes to objects of size 16384 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_fstosinfra-5_3652583
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1       8      2050      2042   31.9004   31.9062  0.00247633  0.00390848
    2       8      4306      4298   33.5728     35.25  0.00278488  0.00371784
    3       8      6607      6599   34.3645   35.9531  0.00277546  0.00363139
    4       7      8951      8944   34.9323   36.6406  0.00414908  0.00357249
    5       8     11292     11284    35.257   36.5625  0.00291434  0.00353997
    6       8     13588     13580   35.3588    35.875  0.00306094  0.00353084
    7       7     15933     15926   35.5432   36.6562  0.00308388   0.0035123
    8       8     18361     18353   35.8399   37.9219  0.00314996  0.00348327
    9       8     20629     20621   35.7947   35.4375  0.00352998   0.0034877
   10       5     23010     23005   35.9397     37.25  0.00395566  0.00347376
Total time run:         10.003
Total writes made:      23010
Write size:             16384
Object size:            16384
Bandwidth (MB/sec):     35.9423
Stddev Bandwidth:       1.63433
Max bandwidth (MB/sec): 37.9219
Min bandwidth (MB/sec): 31.9062
Average IOPS:           2300
Stddev IOPS:            104.597
Max IOPS:               2427
Min IOPS:               2042
Average Latency(s):     0.0034737
Stddev Latency(s):      0.00163661
Max latency(s):         0.115932
Min latency(s):         0.00179735
Cleaning up (deleting benchmark objects)
Removed 23010 objects
Clean up completed and total clean up time: 7.44664

Are there any good options to improve this? It seems like the client side is the bottleneck, since the OSD servers are at like 15% utilization.

Thanks,
Mark
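A quick back-of-the-envelope check on those numbers (my arithmetic, not from the original post): with a fixed queue depth and a serial client, throughput is bounded by per-op latency:

    IOPS ≈ queue_depth / avg_latency
    8 / 0.0034737 s ≈ 2300 IOPS        (matches the bench output above)
    250 IOPS single-threaded => ~4 ms per 16K write

So the reported figures are self-consistent with a ~3.5-4 ms round trip per write: improving low-queue-depth IOPS means shaving per-op latency, not adding server capacity.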
[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities
Hello Petr,

- On Jun 4, 2024, at 12:13, Petr Bena petr@bena.rocks wrote:

> Hello,
>
> I wanted to try out (lab ceph setup) what exactly happens when part of the
> data on an OSD disk gets corrupted. I created a simple test where I went
> through the block device data until I found something that resembled user
> data (using dd and hexdump) (/dev/sdd is the block device used by the OSD):
>
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 | hexdump -C
> 00000000  6e 20 69 64 3d 30 20 65  78 65 3d 22 2f 75 73 72  |n id=0 exe="/usr|
> 00000010  2f 73 62 69 6e 2f 73 73  68 64 22 20 68 6f 73 74  |/sbin/sshd" host|
>
> Then I deliberately overwrote 32 bytes using random data:
>
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/urandom of=/dev/sdd bs=32 count=1 seek=33920
>
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 | hexdump -C
> 00000000  25 75 af 3e 87 b0 3b 04  78 ba 79 e3 64 fc 76 d2  |%u.>..;.x.y.d.v.|
> 00000010  9e 94 00 c2 45 a5 e1 d2  a8 86 f1 25 fc 18 07 5a  |....E......%...Z|
>
> At this point I would expect some sort of data corruption. I restarted
> the OSD daemon on this host to make sure it flushes any potentially
> buffered data. It restarted OK without noticing anything, which was
> expected.
>
> Then I ran
>
> ceph osd scrub 5
> ceph osd deep-scrub 5
>
> and waited for all scheduled scrub operations for all PGs to finish.
>
> No inconsistency was found. No errors were reported, the scrubs just
> finished OK, and the data are still visibly corrupt via hexdump.
>
> Did I just hit some block of data that WAS used by the OSD but was marked
> deleted and therefore no longer used, or am I missing something?

Possibly, if you deep-scrubbed all PGs. I remember marking bad sectors in the past and still getting a success from ceph-bluestore-tool fsck. To be sure, you could overwrite the very same sector, stop the OSD, and then run (see also the sketch after this message):

$ ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-X/

or (in a containerized environment)

$ cephadm shell --name osd.X ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-X/

osd.X being the OSD associated with drive /dev/sdd.

Regards,
Frédéric.

> I would expect Ceph to detect the disk corruption and automatically replace
> the invalid data with a valid copy?
>
> I use only replica pools in this lab setup, for RBD and CephFS.
>
> Thanks
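For anyone wanting to repeat Petr's experiment deterministically, a sketch (pool and object names are placeholders): plant an object with a known payload so the bytes you corrupt are guaranteed to be live data, then deep-scrub exactly the PG that holds it:

$ # write an object with a recognizable payload
$ printf 'CORRUPTME%.0s' {1..1000} > /tmp/pattern.bin
$ rados -p testpool put corrupt-test /tmp/pattern.bin

$ # find the PG and acting OSDs for that object
$ ceph osd map testpool corrupt-test

$ # stop the primary OSD, locate the pattern on its device (hexdump/grep),
$ # overwrite it with dd as above, restart the OSD, then:
$ ceph pg deep-scrub <pgid>
$ rados list-inconsistent-obj <pgid>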
[ceph-users] Re: Excessively Chatty Daemons RHCS v5
Hi Joshua,

These messages actually deserve more attention than you might think, I believe. You may be hitting this one [1], which Mark (comment #4) also hit with 16.2.10 (RHCS 5). The PR is here: https://github.com/ceph/ceph/pull/51669

Could you try raising osd_max_scrubs to 2 or 3 (it now defaults to 3 in Quincy and Reef) and see if these logs disappear over the next hours/days? A short sketch follows below the quoted message.

Regards,
Frédéric.

- On Jun 4, 2024, at 18:39, Joshua Arulsamy jarul...@uwyo.edu wrote:

> Hi,
>
> I recently upgraded my RHCS cluster from v4 to v5 and moved to containerized
> daemons (podman) along the way. I noticed that there are a huge number of logs
> going to journald on each of my hosts. I am unsure why there are so many.
>
> I tried changing the logging level at runtime with commands like these (from
> the ceph docs):
>
> ceph tell osd.\* config set debug_osd 0/5
>
> I tried adjusting several different subsystems (also with 0/0), but I noticed
> that the logs seem to come at the same rate and with the same content. I'm not
> sure what to try next. Is there a way to trace where the logs are coming from?
>
> Some of the sample log entries are events like this on the OSD nodes:
>
> Jun 04 10:34:02 pf-osd1 ceph-osd-0[182875]: 2024-06-04T10:34:02.470-0600
> 7fc049c03700 -1 osd.0 pg_epoch: 703151 pg[35.39s0( v 703141'789389
> (701266'780746,703141'789389] local-lis/les=702935/702936 n=48162 ec=63726/27988
> lis/c=702935/702935 les/c/f=702936/702936/0 sis=702935)
> [0,194,132,3,177,159,83,18,149,14,145]p0(0) r=0 lpr=702935 crt=703141'789389
> lcod 703141'789388 mlcod 703141'789388 active+clean planned DEEP_SCRUB_ON_ERROR]
> scrubber : handle_scrub_reserve_grant: received unsolicited
> reservation grant from osd 177(4) (0x55fdea6c4000)
>
> These are very verbose messages and occur roughly every 0.5 seconds per daemon.
> On a cluster with 200 daemons this is getting unmanageable and is flooding my
> syslog servers.
>
> Any advice on how to tame all the logs would be greatly appreciated!
>
> Best,
>
> Josh
>
> Joshua Arulsamy
> HPC Systems Architect
> Advanced Research Computing Center
> University of Wyoming
> jarul...@uwyo.edu
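As referenced above, a minimal sketch of the suggested change using the central config store; osd_max_scrubs is runtime-changeable, so no daemon restarts are needed:

$ # check the current value
$ ceph config get osd osd_max_scrubs

$ # raise it for all OSDs
$ ceph config set osd osd_max_scrubs 3

$ # verify what a given daemon actually sees
$ ceph tell osd.0 config get osd_max_scrubs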