[ceph-users] Question regarding bluestore labels

2024-06-07 Thread Bailey Allison
I have a question regarding bluestore labels, specifically for a block.db
partition.


To make a long story short, we are currently in a position where, on checking
the label of a block.db partition, it appears to be corrupted.

I have seen another thread on here suggesting copying the label from a
working OSD to the non-working OSD, then re-adding the correct values to the
label with ceph-bluestore-tool.


Where this was mentioned, it was with an OSD in mind; would the same logic
apply if we were working with a db device instead? This assumes the only
issue with the db is the corrupted label and there are no other issues.
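
For clarity, the kind of procedure I have in mind is roughly this (a sketch only; device paths and values are placeholders, and I'm assuming the bdev label occupies the first 4 KiB of the device):

# Inspect the (possibly corrupted) label on the db device
ceph-bluestore-tool show-label --dev /dev/vg_db/broken_db_lv

# Copy the first 4 KiB from a known-good db device of the same kind,
# then fix up the fields that are unique to this device/OSD afterwards
dd if=/dev/vg_db/good_db_lv of=/dev/vg_db/broken_db_lv bs=4096 count=1

# Re-set the per-device values, e.g. osd_uuid and size
ceph-bluestore-tool set-label-key --dev /dev/vg_db/broken_db_lv -k osd_uuid -v <fsid of the OSD>
ceph-bluestore-tool set-label-key --dev /dev/vg_db/broken_db_lv -k size -v <db device size in bytes>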


Regards,


Bailey

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Anthony D'Atri


> On Jun 7, 2024, at 13:20, Mark Lehrer  wrote:
> 
>> * server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
> 
> Thanks for the reply.  The servers have dual Xeon Gold 6154 CPUs with
> 384 GB

So roughly 7 vcores / HTs per OSD?  Your Ceph is a recent release?

> The drives are older, first gen NVMe - WDC SN620.

Those appear to be a former SanDisk product and lower performers than more recent 
drives; how much of a factor that is I can't say.
Which specific SKU?  There appear to be low- and standard-endurance SKUs: 3.84 T 
or 1.92 T, and 3.2 T or 1.6 T respectively.

What is the lifetime used like on them?  Less than 80%?  If you really want to 
eliminate uncertainties:

* Ensure they're updated to the latest firmware
* In rolling fashion, destroy the OSDs, secure-erase each OSD, redeploy the OSDs
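
Roughly, per OSD (a sketch with placeholder IDs/devices; a cephadm/orchestrator cluster would use the ceph orch equivalents, and you'd do one failure domain at a time, waiting for HEALTH_OK in between):

systemctl stop ceph-osd@12                     # stop the daemon
ceph osd destroy 12 --yes-i-really-mean-it     # keep the ID, mark the OSD destroyed
nvme format /dev/nvme2n1 --ses=1               # secure-erase: destroys all data on the drive
ceph-volume lvm create --osd-id 12 --data /dev/nvme2n1   # redeploy, reusing the same ID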

> osd_memory_target is at the default.  Mellanox CX5 and SN2700
> hardware.  The test client is a similar machine with no drives.

This is via RBD?  Do you have the client RBD cache on or off?
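
(Note that rados bench goes through librados, so the RBD cache only matters on the real RBD/MySQL path.) One way to flip it for comparison, as a sketch; new client sessions pick up the change:

ceph config set client rbd_cache false   # writeback cache off, to compare small sync writes
ceph config set client rbd_cache true    # or back on
# equivalently in ceph.conf under [client]:  rbd cache = false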

> The CPUs are 80% idle during the test.

Do you have the server BMC/BIOS profile set to performance?  Deep C-states 
disabled via TuneD or other means?
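
On RHEL-family hosts, for example, a sketch of checking and pinning this (profile names and kernel parameters are generic suggestions, not specific to your cluster):

tuned-adm active                          # which profile is in effect now
tuned-adm profile latency-performance     # or network-latency; limits deep C-states
cpupower idle-info                        # confirm which idle states remain enabled
# or via kernel cmdline: intel_idle.max_cstate=0 processor.max_cstate=1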

> The OSDs (according to iostat)

Careful, iostat's metrics are of limited utility on SSDs, especially NVMe.

> I did find it interesting that the wareq-sz option in iostat is around
> 5 during the test - I was expecting 16.  Is there a way to tweak this
> in bluestore?

Not my area of expertise, but I once tried to make OSDs with a >4KB BlueStore 
block size; they crashed at startup.  4096 is hardcoded in various places.

Quality SSD firmware will coalesce writes to NAND.  If your firmware surfaces 
host vs NAND writes, you might capture deltas over, say, a week of workload and 
calculate the WAF.
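
A sketch of that bookkeeping with nvme-cli; the field names and the vendor plugin are assumptions and vary by drive:

# Host writes: standard SMART log; 1 "Data Units Written" unit = 512,000 bytes
nvme smart-log /dev/nvme0n1 | grep -i 'data units written'
# NAND writes usually need a vendor plugin, e.g. for WDC/SanDisk drives:
nvme wdc vs-smart-add-log /dev/nvme0n1
# WAF over the interval = delta(NAND bytes written) / delta(host bytes written)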

> These drives are terrible at under 8K I/O.  Not that it really matters since 
> we're not I/O bound at all.

Being I/O bound can be tricky; be careful with that assumption, as there are multiple 
facets.  I can't find anything specific, but that makes me suspect that 
internally the IU (indirection unit) isn't the usual 4KB, perhaps to save a few bucks on DRAM.  

> I can also increase threads from 8 to 32 and the iops are roughly
> quadruple so that's good at least.  Single thread writes are about 250
> iops and like 3.7MB/sec.  So sad.

Assuming that the pool you're writing to spans all 60 OSDs, what is your PG 
count on that pool?  Are there multiple pools in the cluster?  As reported by 
`ceph osd df`, on average how many PG replicas are on each OSD?
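
For reference, a sketch of pulling those numbers (assuming jq is available; "volumes" is the pool name from your bench command):

ceph osd pool get volumes pg_num
ceph osd pool get volumes size
ceph osd df -f json | jq '[.nodes[].pgs] | add / length'   # average PG replicas per OSD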

> The rados bench process is also under 50% CPU utilization of a single
> core.  This seems like a thread/semaphore kind of issue if I had to
> guess.  It's tricky to debug when there is no obvious bottleneck.

rados bench is a good smoke test, but fio may better represent the E2E 
experience.
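
As a sketch, something like this against a throwaway image would exercise the full librbd path with a MySQL-like 16K/QD8 pattern (pool and image names are placeholders):

rbd create volumes/fio-test --size 10G
fio --name=16k-randwrite --ioengine=rbd --clientname=admin --pool=volumes \
    --rbdname=fio-test --rw=randwrite --bs=16k --iodepth=8 --numjobs=1 \
    --time_based --runtime=60 --group_reporting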

> 
> Thanks,
> Mark
> 
> 
> 
> 
> On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri  wrote:
>> 
>> Please describe:
>> 
>> * server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
>> 
>>> On Jun 7, 2024, at 11:32, Mark Lehrer  wrote:
>>> 
>>> I've been using MySQL on Ceph forever, and have been down this road
>>> before but it's been a couple of years so I wanted to see if there is
>>> anything new here.
>>> 
>>> So the TL;DR version of this email - is there a good way to improve
>>> 16K write IOPs with a small number of threads?  The OSDs themselves
>>> are idle, so is this just a weakness in the algorithms, or do Ceph
>>> clients need some profiling?  Or "other"?
>>> 
>>> Basically, this is one of the worst possible Ceph workloads, so it is
>>> fun to try to push the limits.  I also happen to have a MySQL instance
>>> that is reaching the write IOPs limit, so this is also a last-ditch
>>> effort to keep it on Ceph.
>>> 
>>> This cluster is as straightforward as it gets... 6 servers with 10
>>> SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
>>> the OSDs are more or less idle so I don't suspect any hardware
>>> limitations.
>>> 
>>> MySQL has no parallelism so the number of threads and effective queue
>>> depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
>>> bench with 16K writes and 8 threads.  The RBD actually gets about 2x
>>> this level - still not so great.
>>> 
>>> I get about 2000 IOPs with this test:
>>> 
>>> # rados bench -p volumes 10 write -t 8 -b 16K
>>> hints = 1
>>> Maintaining 8 concurrent writes of 16384 bytes to objects of size
>>> 16384 for up to 10 seconds or 0 objects
>>> Object prefix: benchmark_data_fstosinfra-5_3652583
>>> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>   0   0 0 0 0 0   -   0
>>>   1   8  2050  2042   31.9004   31.9062  0.00247633  0.00390848
>>>   2   8  4306  4298   33.5728 35.25  0.00278488  0.00371784
>>>   3   8  6607  6599   34.3645   35.9531  0.00277546  0.00363139
>>>   4   7  8951  8944   34.9323   36.6406  0.00414908  0.00357249
>>>   5   

[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Mark Lehrer
> * server RAM and CPU
> * osd_memory_target
> * OSD drive model

Thanks for the reply.  The servers have dual Xeon Gold 6154 CPUs with
384 GB.  The drives are older, first gen NVMe - WDC SN620.
osd_memory_target is at the default.  Mellanox CX5 and SN2700
hardware.  The test client is a similar machine with no drives.

The CPUs are 80% idle during the test.  The OSDs (according to iostat)
hover around 50% util during the test and are close to 0 at other
times.

I did find it interesting that the wareq-sz option in iostat is around
5 during the test - I was expecting 16.  Is there a way to tweak this
in bluestore?  These drives are terrible at under 8K I/O.  Not that it
really matters since we're not I/O bound at all.

I can also increase threads from 8 to 32 and the iops are roughly
quadruple so that's good at least.  Single thread writes are about 250
iops and like 3.7MB/sec.  So sad.

The rados bench process is also under 50% CPU utilization of a single
core.  This seems like a thread/semaphore kind of issue if I had to
guess.  It's tricky to debug when there is no obvious bottleneck.

Thanks,
Mark




On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri  wrote:
>
> Please describe:
>
> * server RAM and CPU
> * osd_memory_target
> * OSD drive model
>
> > On Jun 7, 2024, at 11:32, Mark Lehrer  wrote:
> >
> > I've been using MySQL on Ceph forever, and have been down this road
> > before but it's been a couple of years so I wanted to see if there is
> > anything new here.
> >
> > So the TL;DR version of this email - is there a good way to improve
> > 16K write IOPs with a small number of threads?  The OSDs themselves
> > are idle, so is this just a weakness in the algorithms, or do Ceph
> > clients need some profiling?  Or "other"?
> >
> > Basically, this is one of the worst possible Ceph workloads, so it is
> > fun to try to push the limits.  I also happen to have a MySQL instance
> > that is reaching the write IOPs limit, so this is also a last-ditch
> > effort to keep it on Ceph.
> >
> > This cluster is as straightforward as it gets... 6 servers with 10
> > SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
> > the OSDs are more or less idle so I don't suspect any hardware
> > limitations.
> >
> > MySQL has no parallelism so the number of threads and effective queue
> > depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
> > bench with 16K writes and 8 threads.  The RBD actually gets about 2x
> > this level - still not so great.
> >
> > I get about 2000 IOPs with this test:
> >
> > # rados bench -p volumes 10 write -t 8 -b 16K
> > hints = 1
> > Maintaining 8 concurrent writes of 16384 bytes to objects of size
> > 16384 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_fstosinfra-5_3652583
> >  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >0   0 0 0 0 0   -   0
> >1   8  2050  2042   31.9004   31.9062  0.00247633  0.00390848
> >2   8  4306  4298   33.5728 35.25  0.00278488  0.00371784
> >3   8  6607  6599   34.3645   35.9531  0.00277546  0.00363139
> >4   7  8951  8944   34.9323   36.6406  0.00414908  0.00357249
> >5   8 11292 11284    35.257   36.5625  0.00291434  0.00353997
> >6   8 13588 13580   35.3588    35.875  0.00306094  0.00353084
> >7   7 15933 15926   35.5432   36.6562  0.00308388   0.0035123
> >8   8 18361 18353   35.8399   37.9219  0.00314996  0.00348327
> >9   8 20629 20621   35.7947   35.4375  0.00352998   0.0034877
> >   10   5 23010 23005   35.9397 37.25  0.00395566  0.00347376
> > Total time run: 10.003
> > Total writes made:  23010
> > Write size: 16384
> > Object size:16384
> > Bandwidth (MB/sec): 35.9423
> > Stddev Bandwidth:   1.63433
> > Max bandwidth (MB/sec): 37.9219
> > Min bandwidth (MB/sec): 31.9062
> > Average IOPS:   2300
> > Stddev IOPS:104.597
> > Max IOPS:   2427
> > Min IOPS:   2042
> > Average Latency(s): 0.0034737
> > Stddev Latency(s):  0.00163661
> > Max latency(s): 0.115932
> > Min latency(s): 0.00179735
> > Cleaning up (deleting benchmark objects)
> > Removed 23010 objects
> > Clean up completed and total clean up time :7.44664
> >
> >
> > Are there any good options to improve this?  It seems like the client
> > side is the bottleneck since the OSD servers are at like 15%
> > utilization.
> >
> > Thanks,
> > Mark
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Anthony D'Atri
Please describe:

* server RAM and CPU
* osd_memory_target
* OSD drive model

> On Jun 7, 2024, at 11:32, Mark Lehrer  wrote:
> 
> I've been using MySQL on Ceph forever, and have been down this road
> before but it's been a couple of years so I wanted to see if there is
> anything new here.
> 
> So the TL;DR version of this email - is there a good way to improve
> 16K write IOPs with a small number of threads?  The OSDs themselves
> are idle, so is this just a weakness in the algorithms, or do Ceph
> clients need some profiling?  Or "other"?
> 
> Basically, this is one of the worst possible Ceph workloads, so it is
> fun to try to push the limits.  I also happen to have a MySQL instance
> that is reaching the write IOPs limit, so this is also a last-ditch
> effort to keep it on Ceph.
> 
> This cluster is as straightforward as it gets... 6 servers with 10
> SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
> the OSDs are more or less idle so I don't suspect any hardware
> limitations.
> 
> MySQL has no parallelism so the number of threads and effective queue
> depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
> bench with 16K writes and 8 threads.  The RBD actually gets about 2x
> this level - still not so great.
> 
> I get about 2000 IOPs with this test:
> 
> # rados bench -p volumes 10 write -t 8 -b 16K
> hints = 1
> Maintaining 8 concurrent writes of 16384 bytes to objects of size
> 16384 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_fstosinfra-5_3652583
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>0   0 0 0 0 0   -   0
>1   8  2050  2042   31.9004   31.9062  0.00247633  0.00390848
>2   8  4306  4298   33.5728 35.25  0.00278488  0.00371784
>3   8  6607  6599   34.3645   35.9531  0.00277546  0.00363139
>4   7  8951  8944   34.9323   36.6406  0.00414908  0.00357249
>5   8 11292 11284    35.257   36.5625  0.00291434  0.00353997
>6   8 13588 13580   35.3588    35.875  0.00306094  0.00353084
>7   7 15933 15926   35.5432   36.6562  0.00308388   0.0035123
>8   8 18361 18353   35.8399   37.9219  0.00314996  0.00348327
>9   8 20629 20621   35.7947   35.4375  0.00352998   0.0034877
>   10   5 23010 23005   35.9397 37.25  0.00395566  0.00347376
> Total time run: 10.003
> Total writes made:  23010
> Write size: 16384
> Object size:16384
> Bandwidth (MB/sec): 35.9423
> Stddev Bandwidth:   1.63433
> Max bandwidth (MB/sec): 37.9219
> Min bandwidth (MB/sec): 31.9062
> Average IOPS:   2300
> Stddev IOPS:104.597
> Max IOPS:   2427
> Min IOPS:   2042
> Average Latency(s): 0.0034737
> Stddev Latency(s):  0.00163661
> Max latency(s): 0.115932
> Min latency(s): 0.00179735
> Cleaning up (deleting benchmark objects)
> Removed 23010 objects
> Clean up completed and total clean up time :7.44664
> 
> 
> Are there any good options to improve this?  It seems like the client
> side is the bottleneck since the OSD servers are at like 15%
> utilization.
> 
> Thanks,
> Mark
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Mark Lehrer
I've been using MySQL on Ceph forever, and have been down this road
before but it's been a couple of years so I wanted to see if there is
anything new here.

So the TL;DR version of this email - is there a good way to improve
16K write IOPs with a small number of threads?  The OSDs themselves
are idle, so is this just a weakness in the algorithms, or do Ceph
clients need some profiling?  Or "other"?

Basically, this is one of the worst possible Ceph workloads, so it is
fun to try to push the limits.  I also happen to have a MySQL instance
that is reaching the write IOPs limit, so this is also a last-ditch
effort to keep it on Ceph.

This cluster is as straightforward as it gets... 6 servers with 10
SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
the OSDs are more or less idle so I don't suspect any hardware
limitations.

MySQL has no parallelism so the number of threads and effective queue
depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
bench with 16K writes and 8 threads.  The RBD actually gets about 2x
this level - still not so great.

I get about 2000 IOPs with this test:

# rados bench -p volumes 10 write -t 8 -b 16K
hints = 1
Maintaining 8 concurrent writes of 16384 bytes to objects of size
16384 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_fstosinfra-5_3652583
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   8  2050  2042   31.9004   31.9062  0.00247633  0.00390848
2   8  4306  4298   33.5728 35.25  0.00278488  0.00371784
3   8  6607  6599   34.3645   35.9531  0.00277546  0.00363139
4   7  8951  8944   34.9323   36.6406  0.00414908  0.00357249
5   8 11292 11284    35.257   36.5625  0.00291434  0.00353997
6   8 13588 13580   35.3588    35.875  0.00306094  0.00353084
7   7 15933 15926   35.5432   36.6562  0.00308388   0.0035123
8   8 18361 18353   35.8399   37.9219  0.00314996  0.00348327
9   8 20629 20621   35.7947   35.4375  0.00352998   0.0034877
   10   5 23010 23005   35.9397 37.25  0.00395566  0.00347376
Total time run: 10.003
Total writes made:  23010
Write size: 16384
Object size:16384
Bandwidth (MB/sec): 35.9423
Stddev Bandwidth:   1.63433
Max bandwidth (MB/sec): 37.9219
Min bandwidth (MB/sec): 31.9062
Average IOPS:   2300
Stddev IOPS:104.597
Max IOPS:   2427
Min IOPS:   2042
Average Latency(s): 0.0034737
Stddev Latency(s):  0.00163661
Max latency(s): 0.115932
Min latency(s): 0.00179735
Cleaning up (deleting benchmark objects)
Removed 23010 objects
Clean up completed and total clean up time :7.44664


Are there any good options to improve this?  It seems like the client
side is the bottleneck since the OSD servers are at like 15%
utilization.

Thanks,
Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-07 Thread Frédéric Nass
Hello Petr,

- On 4 Jun 24, at 12:13, Petr Bena petr@bena.rocks wrote:

> Hello,
> 
> I wanted to try out (in a lab Ceph setup) what exactly is going to happen
> when part of the data on an OSD disk gets corrupted. I created a simple test
> where I was going through the block device data until I found something
> that resembled user data (using dd and hexdump) (/dev/sdd is a block
> device that is used by an OSD)
> 
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 | hexdump -C
> 00000000  6e 20 69 64 3d 30 20 65  78 65 3d 22 2f 75 73 72  |n id=0 exe="/usr|
> 00000010  2f 73 62 69 6e 2f 73 73  68 64 22 20 68 6f 73 74  |/sbin/sshd" host|
> 
> Then I deliberately overwrote 32 bytes using random data:
> 
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/urandom of=/dev/sdd bs=32 count=1 seek=33920
> 
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 | hexdump -C
> 00000000  25 75 af 3e 87 b0 3b 04  78 ba 79 e3 64 fc 76 d2  |%u.>..;.x.y.d.v.|
> 00000010  9e 94 00 c2 45 a5 e1 d2  a8 86 f1 25 fc 18 07 5a  |....E......%...Z|
> 
> At this point I would expect some sort of data corruption. I restarted
> the OSD daemon on this host to make sure it flushes any potentially
> buffered data. It restarted OK without noticing anything, which was
> expected.
> 
> Then I ran
> 
> ceph osd scrub 5
> 
> ceph osd deep-scrub 5
> 
> And waiting for all scheduled scrub operations for all PGs to finish.
> 
> No inconsistency was found. No errors were reported; the scrubs just finished OK,
> and the data is still visibly corrupt via hexdump.
> 
> Did I just hit some block of data that WAS used by the OSD, but was marked
> deleted and therefore no longer used, or am I missing something?

Possibly, if you deep-scrubbed all PGs. I remember marking bad sectors in the
past and still getting a successful fsck from ceph-bluestore-tool.

To be sure, you could overwrite the very same sector, stop the OSD, and then run:

$ ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-X/

or (in containerized environment)

$ cephadm shell --name osd.X ceph-bluestore-tool fsck --deep yes --path 
/var/lib/ceph/osd/ceph-X/

osd.X being the OSD associated with drive /dev/sdd.
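
As a complement, the object-store-level checks after the scrubs would look roughly like this (pool name and PG id are placeholders):

ceph health detail | grep -i inconsist
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj <pgid> --format=json-pretty
ceph pg repair <pgid>        # only once a PG is actually flagged inconsistent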

Regards,
Frédéric.


> I would expect CEPH to detect disk corruption and automatically replace the
> invalid data with a valid copy?
> 
> I use only replica pools in this lab setup, for RBD and CephFS.
> 
> Thanks
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Excessively Chatty Daemons RHCS v5

2024-06-07 Thread Frédéric Nass
Hi Joshua,

These messages actually deserve more attention than you think, I believe. You 
may be hitting this one [1], which Mark (comment #4) also hit with 16.2.10 (RHCS 5).
The PR is here: https://github.com/ceph/ceph/pull/51669

Could you try raising osd_max_scrubs to 2 or 3 (it now defaults to 3 in Quincy and 
Reef) and see if these logs disappear over the next hours/days?
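
For example, a sketch of making that change both persistent and immediate:

ceph config set osd osd_max_scrubs 3             # persists in the mon config db
ceph tell osd.\* config set osd_max_scrubs 3     # apply to running OSDs right away
ceph config get osd osd_max_scrubs               # verify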

Regards,
Frédéric.

- On 4 Jun 24, at 18:39, Joshua Arulsamy jarul...@uwyo.edu wrote:

> Hi,
> 
> I recently upgraded my RHCS cluster from v4 to v5 and moved to containerized
> daemons (podman) along the way. I noticed that there are a huge number of logs
> going to journald on each of my hosts. I am unsure why there are so many.
> 
> I tried changing the logging level at runtime with commands like these (from 
> the
> ceph docs):
> 
> ceph tell osd.\* config set debug_osd 0/5
> 
> I tried adjusting several different subsystems (also with 0/0), but I noticed
> that the logs keep coming at the same rate and with the same content. I'm not
> sure what to try next. Is there a way to trace where the logs are coming from?
> 
> Some of the sample log entries are events like this on the OSD nodes:
> 
> Jun 04 10:34:02 pf-osd1 ceph-osd-0[182875]: 2024-06-04T10:34:02.470-0600
> 7fc049c03700 -1 osd.0 pg_epoch: 703151 pg[35.39s0( v 703141'789389
> (701266'780746,703141'789389] local-lis/les=702935/702936 n=48162 
> ec=63726/27988
> lis/c=702935/702935 les/c/f=702936/702936/0 sis=702935)
> [0,194,132,3,177,159,83,18,149,14,145]p0(0) r=0 lpr=702935 crt=703141'789389
> lcod 703141'789388 mlcod 703141'789388 active+clean planned 
> DEEP_SCRUB_ON_ERROR]
> scrubber : handle_scrub_reserve_grant: received unsolicited
> reservation grant from osd 177(4) (0x55fdea6c4000)
> 
> These are very verbose messages and occur roughly every 0.5 second per daemon.
> On a cluster with 200 daemons this is getting unmanageable and is flooding my
> syslog servers.
> 
> Any advice on how to tame all the logs would be greatly appreciated!
> 
> Best,
> 
> Josh
> 
> Joshua Arulsamy
> HPC Systems Architect
> Advanced Research Computing Center
> University of Wyoming
> jarul...@uwyo.edu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io