[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-10 Thread Mark Lehrer
If they can do 1 TB/s with a single 16K write thread, that will be
quite impressive :D  Otherwise not really applicable.  Ceph scaling
has always been good.

More seriously, would you mind sending a link to this?


Thanks!

Mark

On Mon, Jun 10, 2024 at 12:01 PM Anthony D'Atri  wrote:
>
> Eh?  cf. Mark and Dan's 1TB/s presentation.
>
> On Jun 10, 2024, at 13:58, Mark Lehrer  wrote:
>
>  It
> seems like Ceph still hasn't adjusted to SSD performance.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-10 Thread Mark Lehrer
> Not the most helpful response, but on a (admittedly well-tuned)

Actually this was the most helpful since you ran the same rados bench
command.  I'm trying to stay away from rbd & qemu issues and just test
rados bench on a non-virtualized client.

I have a test instance with newer drives, CPUs, and Ceph code; I'll see
what that looks like.

Maged's comments were quite useful as far as iops per thread.  It
seems like Ceph still hasn't adjusted to SSD performance.  This kind
of feels like MongoDB before the WiredTiger engine... slow
performance but with all the system resources close to idle due to
threads being blocked.

Thanks,
Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Mark Lehrer
> server RAM and CPU
> * osd_memory_target
> * OSD drive model

Thanks for the reply.  The servers have dual Xeon Gold 6154 CPUs with
384 GB.  The drives are older, first gen NVMe - WDC SN620.
osd_memory_target is at the default.  Mellanox CX5 and SN2700
hardware.  The test client is a similar machine with no drives.

The CPUs are 80% idle during the test.  The OSDs (according to iostat)
hover around 50% util during the test and are close to 0 at other
times.

I did find it interesting that the wareq-sz value in iostat is around
5 during the test - I was expecting 16.  Is there a way to tweak this
in bluestore?  These drives are terrible at under-8K I/O.  Not that it
really matters, since we're not I/O bound at all.
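
A rough way to check what the OSDs are actually using (run against any
one OSD; osd.0 is just a placeholder, and the second form needs to be run
on the OSD host via the admin socket):

  ceph config show osd.0 | grep -E 'bluestore_(min_alloc_size|prefer_deferred_size)'
  ceph daemon osd.0 config get bluestore_min_alloc_size_ssd   # on the OSD host

Keep in mind min_alloc_size is baked in when the OSD is created, so
changing it only affects OSDs built afterwards.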

I can also increase threads from 8 to 32 and the iops are roughly
quadruple so that's good at least.  Single thread writes are about 250
iops and like 3.7MB/sec.  So sad.
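
For reference, those numbers are just the same rados bench command
(shown further down in this thread) with -t varied - a sketch using this
cluster's pool name:

  rados bench -p volumes 10 write -t 1 -b 16K    # ~250 iops here
  rados bench -p volumes 10 write -t 8 -b 16K    # ~2000-2300 iops
  rados bench -p volumes 10 write -t 32 -b 16K   # roughly 4x the 8-thread result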

The rados bench process is also under 50% CPU utilization of a single
core.  This seems like a thread/semaphore kind of issue if I had to
guess.  It's tricky to debug when there is no obvious bottleneck.

Thanks,
Mark




On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri  wrote:
>
> Please describe:
>
> * server RAM and CPU
> * osd_memory_target
> * OSD drive model
>
> > On Jun 7, 2024, at 11:32, Mark Lehrer  wrote:
> >
> > I've been using MySQL on Ceph forever, and have been down this road
> > before but it's been a couple of years so I wanted to see if there is
> > anything new here.
> >
> > So the TL;DR version of this email - is there a good way to improve
> > 16K write IOPs with a small number of threads?  The OSDs themselves
> > are idle, so is this just a weakness in the algorithms, or do Ceph
> > clients need some profiling?  Or "other"?
> >
> > Basically, this is one of the worst possible Ceph workloads so it is
> > fun to try to push the limits.  I also happen to have a MySQL instance
> > that is reaching the write IOPs limit, so this is also a last-ditch
> > effort to keep it on Ceph.
> >
> > This cluster is as straightforward as it gets... 6 servers with 10
> > SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
> > the OSDs are more or less idle so I don't suspect any hardware
> > limitations.
> >
> > MySQL has essentially no parallelism within a connection, so the number
> > of threads and effective queue depth stay pretty low.  Therefore, as a
> > proxy for MySQL I use rados bench with 16K writes and 8 threads.  The RBD
> > actually gets about 2x this level - still not so great.
> >
> > I get about 2000 IOPs with this test:
> >
> > # rados bench -p volumes 10 write -t 8 -b 16K
> > hints = 1
> > Maintaining 8 concurrent writes of 16384 bytes to objects of size
> > 16384 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_fstosinfra-5_3652583
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> >     0       0         0         0         0         0            -           0
> >     1       8      2050      2042   31.9004   31.9062   0.00247633  0.00390848
> >     2       8      4306      4298   33.5728     35.25   0.00278488  0.00371784
> >     3       8      6607      6599   34.3645   35.9531   0.00277546  0.00363139
> >     4       7      8951      8944   34.9323   36.6406   0.00414908  0.00357249
> >     5       8     11292     11284    35.257   36.5625   0.00291434  0.00353997
> >     6       8     13588     13580   35.3588    35.875   0.00306094  0.00353084
> >     7       7     15933     15926   35.5432   36.6562   0.00308388   0.0035123
> >     8       8     18361     18353   35.8399   37.9219   0.00314996  0.00348327
> >     9       8     20629     20621   35.7947   35.4375   0.00352998   0.0034877
> >    10       5     23010     23005   35.9397     37.25   0.00395566  0.00347376
> > Total time run: 10.003
> > Total writes made:  23010
> > Write size: 16384
> > Object size:16384
> > Bandwidth (MB/sec): 35.9423
> > Stddev Bandwidth:   1.63433
> > Max bandwidth (MB/sec): 37.9219
> > Min bandwidth (MB/sec): 31.9062
> > Average IOPS:   2300
> > Stddev IOPS:104.597
> > Max IOPS:   2427
> > Min IOPS:   2042
> > Average Latency(s): 0.0034737
> > Stddev Latency(s):  0.00163661
> > Max latency(s): 0.115932
> > Min latency(s): 0.00179735
> > Cleaning up (deleting benchmark objects)
> > Removed 23010 objects
> > Clean up completed and total clean up time :7.44664
> >
> >
> > Are there any good options to improve this?  It seems like the client
> > side is the bottleneck since the OSD servers are at like 15%
> > utilization.
> >
> > Thanks,
> > Mark
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Mark Lehrer
I've been using MySQL on Ceph forever, and have been down this road
before but it's been a couple of years so I wanted to see if there is
anything new here.

So the TL;DR version of this email - is there a good way to improve
16K write IOPs with a small number of threads?  The OSDs themselves
are idle, so is this just a weakness in the algorithms, or do Ceph
clients need some profiling?  Or "other"?

Basically, this is one of the worst possible Ceph workloads so it is
fun to try to push the limits.  I also happen to have a MySQL instance
that is reaching the write IOPs limit, so this is also a last-ditch
effort to keep it on Ceph.

This cluster is as straightforward as it gets... 6 servers with 10
SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
the OSDs are more or less idle so I don't suspect any hardware
limitations.

MySQL has essentially no parallelism within a connection, so the number
of threads and effective queue depth stay pretty low.  Therefore, as a
proxy for MySQL I use rados bench with 16K writes and 8 threads.  The RBD
actually gets about 2x this level - still not so great.

I get about 2000 IOPs with this test:

# rados bench -p volumes 10 write -t 8 -b 16K
hints = 1
Maintaining 8 concurrent writes of 16384 bytes to objects of size
16384 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_fstosinfra-5_3652583
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1       8      2050      2042   31.9004   31.9062   0.00247633  0.00390848
    2       8      4306      4298   33.5728     35.25   0.00278488  0.00371784
    3       8      6607      6599   34.3645   35.9531   0.00277546  0.00363139
    4       7      8951      8944   34.9323   36.6406   0.00414908  0.00357249
    5       8     11292     11284    35.257   36.5625   0.00291434  0.00353997
    6       8     13588     13580   35.3588    35.875   0.00306094  0.00353084
    7       7     15933     15926   35.5432   36.6562   0.00308388   0.0035123
    8       8     18361     18353   35.8399   37.9219   0.00314996  0.00348327
    9       8     20629     20621   35.7947   35.4375   0.00352998   0.0034877
   10       5     23010     23005   35.9397     37.25   0.00395566  0.00347376
Total time run: 10.003
Total writes made:  23010
Write size: 16384
Object size:16384
Bandwidth (MB/sec): 35.9423
Stddev Bandwidth:   1.63433
Max bandwidth (MB/sec): 37.9219
Min bandwidth (MB/sec): 31.9062
Average IOPS:   2300
Stddev IOPS:104.597
Max IOPS:   2427
Min IOPS:   2042
Average Latency(s): 0.0034737
Stddev Latency(s):  0.00163661
Max latency(s): 0.115932
Min latency(s): 0.00179735
Cleaning up (deleting benchmark objects)
Removed 23010 objects
Clean up completed and total clean up time :7.44664


Are there any good options to improve this?  It seems like the client
side is the bottleneck since the OSD servers are at like 15%
utilization.

Thanks,
Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How suitable is CEPH for....

2022-06-14 Thread Mark Lehrer
> I'm reading and trying to figure out how crazy
> is using Ceph for all of the above targets [MySQL]

Not crazy at all, it just depends on your performance needs.  16K I/O
is not the best Ceph use case, but the snapshot/qcow2 features may
justify it.
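
If you go that route, the usual snapshot/clone workflow is roughly the
following (pool and image names are placeholders):

  rbd snap create volumes/mysql-data@pre-upgrade
  rbd snap protect volumes/mysql-data@pre-upgrade
  rbd clone volumes/mysql-data@pre-upgrade volumes/mysql-data-test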

The biggest problem I have with MySQL is that each connection uses a
single CPU core.  Combine this with poor 16K performance, and it's
tough to get good performance unless there are a lot of users.

Mass loading of data is particularly agonizing on MySQL, especially on
Ceph.  The last time I had to do a mass import, it was much faster to
copy the RBD to a local drive partition, run my VM there for the
import, and then copy the block device back to the RBD.  This is
because you can use qemu-img to copy the block device with a large
block size and up to 16 threads, which can move multiple terabytes an
hour.
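
As a rough sketch (assuming qemu-img was built with rbd support; the
pool/image/device names are placeholders):

  qemu-img convert -p -m 16 -W -O raw rbd:volumes/mysql-vol /dev/vg0/mysql-scratch
  # ... run the VM and the import against the local copy, then push it back:
  qemu-img convert -p -m 16 -W -O raw /dev/vg0/mysql-scratch rbd:volumes/mysql-vol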

My MySQL database is almost always CPU bound and never more than ~20%
iowait, so it can run on Ceph fairly well.

Mark





On Tue, Jun 14, 2022 at 8:14 AM Kostadin Bukov
 wrote:
>
> Greetings to all great people from Ceph community,
> I'm currently digging and trying to collect pros and cons of using CEPH for
> below purposes:
>
> - for MySQL server datastore (InnoDB) using Cephfs or rbd. Let's say we
> have 1 running Mysql server (active) and in case it fails the same InnoDB
> datastore is accessed from a MySQL server 2 (started, access the InnoDB
> from MySQL server 1 and become the new active). Or better to use old-school
> 2 MySQL servers with replication and avoid Ceph altogether)?
> - storing application log files from different nodes (something like a
> central place for logs from different bare-metal servers or VMs or
> containers). By the way our applications under heavy load could generate
> gigabytes of log files per hour...
> - for configuration files (for different applications)
> - for etcd
> - for storing backup files from different nodes
>
> I'm reading and trying to figure out how crazy is using Ceph for all of the
> above targets.
> Kindly can you share your opinions if you think this is too complex and I
> can end up with a lot of troubles if Ceph cluster goes down.
> The applications and MySQL server are for a production/critical platform
> which requires high availability, redundancy and performance (sometimes apps
> and MySQL are quite hungry when writing to the disk)
> Log files and backup files are not so critical so maybe putting them on
> Ceph with replica x3 would just generate unnecessary ceph traffic between
> the ceph nodes.
> Application configurations are needed only when start/restart application.
> The most critical data of the whole setup is the MySQL InnoDB data
>
> Would be interesting to me if you share your thoughts/experience or I
> should look elsewhere
>
> Regards,
> Kosta
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best distro to run ceph.

2021-04-30 Thread Mark Lehrer
I've had good luck with the Ubuntu LTS releases - no need to add extra
repos.  20.04 uses Octopus.

On Fri, Apr 30, 2021 at 1:14 PM Peter Childs  wrote:
>
> I'm trying to set up a new ceph cluster, and I've hit a bit of a blank.
>
> I started off with centos7 and cephadm. Worked fine to a point, except I
> had to upgrade podman but it mostly worked with octopus.
>
> Since this is a fresh cluster and hence no data at risk, I decided to jump
> straight into Pacific when it came out and upgrade. Which is where my
> trouble began. Mostly because Pacific needs a version on lvm later than
> what's in centos7.
>
> I can't upgrade to centos8 as my boot drives are not supported by centos8
> due to the way Red Hat disabled lots of disk drivers. I think I'm looking at
> Ubuntu or Debian.
>
> Given cephadm has a very limited set of depends it would be good to have a
> supported matrix, it would also be good to have a check in cephadm on
> upgrade, that says no I won't upgrade if the version of lvm2 is too low on
> any host and lets the admin fix the issue and try again.
>
> I was thinking to upgrade to centos8 for this project anyway until I
> realised that centos8 can't support my hardware I've inherited. But
> currently I've got a broken cluster unless I can workout some way to
> upgrade lvm in centos7.
>
> Peter.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)

2021-04-30 Thread Mark Lehrer
Can you collect the output of this command on all 4 servers while your
test is running:

iostat -mtxy 1

This should show how busy the CPUs are as well as how busy each drive is.
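
Something along these lines would capture it on all four nodes for the
duration of the run (hostnames are placeholders):

  for h in node1 node2 node3 node4; do
      ssh "$h" 'iostat -mtxy 1 120' > "iostat-$h.log" &
  done
  wait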


On Thu, Apr 29, 2021 at 7:52 AM Schmid, Michael
 wrote:
>
> Hello folks,
>
> I am new to ceph and at the moment I am doing some performance tests with a 4 
> node ceph-cluster (pacific, 16.2.1).
>
> Node hardware (4 identical nodes):
>
>   *   DELL 3620 workstation
>   *   Intel Quad-Core i7-6700@3.4 GHz
>   *   8 GB RAM
>   *   Debian Buster (base system, installed on a dedicated Patriot Burst 120 
> GB SATA-SSD)
>   *   HP 530SFP+ 10 GBit dual-port NIC (tested with iperf to 9.4 GBit/s from 
> node to node)
>   *   1 x Kingston KC2500 M2 NVMe PCIe SSD (500 GB, NO power loss protection 
> !)
>   *   3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
>
> After bootstrapping a containerized (docker) ceph-cluster, I did some 
> performance tests on the NVMe storage by creating a storage pool called 
> „ssdpool“, consisting of 4 OSDs per (one) NVMe device (per node). A first 
> write-performance test yields
>
> =
> root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_78
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16        30        14    55.997        56    0.0209977    0.493427
>     2      16        53        37   73.9903        92    0.0264305    0.692179
>     3      16        76        60   79.9871        92     0.559505    0.664204
>     4      16        99        83   82.9879        92     0.609332    0.721016
>     5      16       116       100   79.9889        68     0.686093    0.698084
>     6      16       132       116   77.3224        64      1.19715    0.731808
>     7      16       153       137   78.2741        84     0.622646    0.755812
>     8      16       171       155    77.486        72      0.25409    0.764022
>     9      16       192       176   78.2076        84     0.968321    0.775292
>    10      16       214       198   79.1856        88     0.401339    0.766764
>    11       1       214       213   77.4408        60     0.969693    0.784002
> Total time run: 11.0698
> Total writes made:  214
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 77.3272
> Stddev Bandwidth:   13.7722
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 56
> Average IOPS:   19
> Stddev IOPS:3.44304
> Max IOPS:   23
> Min IOPS:   14
> Average Latency(s): 0.785372
> Stddev Latency(s):  0.49011
> Max latency(s): 2.16532
> Min latency(s): 0.0144995
> =
>
> ... and I think that 80 MB/s throughput is a very poor result in conjunction 
> with NVMe devices and 10 GBit nics.
>
> A bare write-test (with fsync=0 option) of the NVMe drives yields a write 
> throughput of round about 800 MB/s per device ... the second test (with 
> fsync=1) drops performance to 200 MB/s.
>
> =
> root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k 
> --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 
> --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
> IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, 
> (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32...
> fio-3.12
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
> IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
>   write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
> slat (usec): min=16, max=810, avg=106.48, stdev=30.48
> clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
>  lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
> clat percentiles (msec):
>  |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
>  | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
>  | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
>  | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
>  | 99.99th=[ 1036]
>bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, 
> stdev=113845.69, samples=240
>iops: min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
>   lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
>   lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
>   cpu  : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  

[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Mark Lehrer
> - The pattern is mainly write centric, so write latency is
>   probably the real factor
> - The HDD OSDs behind the raid controllers can cache / reorder
>   writes and reduce seeks potentially

OK that makes sense.

Unfortunately, re-ordering HDD writes without a battery backup is kind
of dangerous -- writes need to happen in order or the filesystem will
punish you when you least expect it.  This is the whole point of the
battery backup - to make sure that out-of-order writes get written to
disk even if there is a power loss in the middle of writing the
controller-write-cache data in an HDD-optimized order.

Your use case is ideal for an SSD-based WAL -- though it may be
difficult to beat the cost of H800s these days.
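
If you do go the SSD WAL/DB route, OSD creation looks roughly like this
(device paths are placeholders, not a recommendation for a specific layout):

  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2

If you only specify --block.db, the WAL lands on the DB device as well,
which is usually what you want.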


> In this context: is anyone here using HBAs with battery
> backed cache, and if yes, which controllers do you tend to use?

I almost always use MegaRAID-based controllers (such as the H800).


Good luck,
Mark


On Tue, Apr 20, 2021 at 2:28 PM Nico Schottelius
 wrote:
>
>
> Mark Lehrer  writes:
>
> >> One server has LSI SAS3008 [0] instead of the Perc H800,
> >> which comes with 512MB RAM + BBU. On most servers latencies are around
> >> 4-12ms (average 6ms), on the system with the LSI controller we see
> >> 20-60ms (average 30ms) latency.
> >
> > Are these reads, writes, or a mixed workload?  I would expect an
> > improvement in writes, but 512MB of cache isn't likely to help much on
> > reads with such a large data set.
>
> It's mostly write (~20MB/s), little read (1-5 MB/s) work load. This is
> probably due to many people using this storage for backup.
>
> > Just as a test, you could try removing the battery on one of the H800s to
> > disable the write cache -- or else disable write caching with megaraid
> > or equivalent.
>
> That is certainly an interesting idea - and rereading your message and
> my statement above might actually explain the behaviour:
>
> - The pattern is mainly write centric, so write latency is probably the
>   real factor
> - The HDD OSDs behind the raid controllers can cache / reorder writes
>   and reduce seeks potentially
>
> So while "a raid controller" per se does probably not improve or reduce
> speed for ceph, "a (disk/raid) controller with a battery backed cache",
> might actually.
>
> In this context: is anyone here using HBAs with battery backed cache,
> and if yes, which controllers do you tend to use?
>
> Nico
>
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Mark Lehrer
> One server has LSI SAS3008 [0] instead of the Perc H800,
> which comes with 512MB RAM + BBU. On most servers latencies are around
> 4-12ms (average 6ms), on the system with the LSI controller we see
> 20-60ms (average 30ms) latency.

Are these reads, writes, or a mixed workload?  I would expect an
improvement in writes, but 512MB of cache isn't likely to help much on
reads with such a large data set.

Just as a test, you could try removing the battery on one of the H800s to
disable the write cache -- or else disable write caching with megaraid
or equivalent.
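
With the usual LSI/Broadcom tools that would be something like the
following (controller and VD numbering are placeholders for whatever your
system reports):

  storcli64 /c0/vall set wrcache=WT       # force write-through on all virtual drives
  MegaCli64 -LDSetProp WT -LAll -aAll     # the same with the older MegaCli tool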





On Mon, Apr 19, 2021 at 12:21 PM Nico Schottelius
 wrote:
>
>
> Good evening,
>
> I've to tackle an old, probably recurring topic: HBAs vs. Raid
> controllers. While generally speaking many people in the ceph field
> recommend to go with HBAs, it seems in our infrastructure the only
> server we phased in with an HBA vs. raid controller is actually doing
> worse in terms of latency.
>
> For the background: we have many Perc H800+MD1200 [1] systems running with
> 10TB HDDs (raid0, read ahead, writeback cache).
> One server has LSI SAS3008 [0] instead of the Perc H800,
> which comes with 512MB RAM + BBU. On most servers latencies are around
> 4-12ms (average 6ms), on the system with the LSI controller we see
> 20-60ms (average 30ms) latency.
>
> Now, my question is, are we doing something inherently wrong with the
> SAS3008 or does in fact the cache help to possible reduce seek time?
>
> We were considering to move more towards LSI HBAs to reduce maintenance
> effort, however if we have a factor of 5 in latency between the two
> different systems, it might be better to stay on the H800 path for
> disks.
>
> Any input/experiences appreciated.
>
> Best regards,
>
> Nico
>
> [0]
> 05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 
> PCI-Express Fusion-MPT SAS-3 (rev 02)
> Subsystem: Dell 12Gbps HBA
> Kernel driver in use: mpt3sas
> Kernel modules: mpt3sas
>
> [1]
> 08:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 
> [Liberator] (rev 05)
> Subsystem: Dell PERC H800 Adapter
> Kernel driver in use: megaraid_sas
> Kernel modules: megaraid_sas
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Changing IP addresses

2021-04-06 Thread Mark Lehrer
I just did this recently.  The only painful part was using
"monmaptool" to change the monitor IP addresses on disk.  Once you do
that, and change the monitor IPs in ceph.conf everywhere, it should
come up just fine.
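
Roughly, the procedure looks like this (mon name and address are
placeholders; stop the monitors before injecting):

  ceph mon getmap -o /tmp/monmap           # or: ceph-mon -i mon1 --extract-monmap /tmp/monmap
  monmaptool --print /tmp/monmap
  monmaptool --rm mon1 /tmp/monmap
  monmaptool --add mon1 192.168.50.11:6789 /tmp/monmap
  ceph-mon -i mon1 --inject-monmap /tmp/monmap

Repeat per monitor, then update the monitor addresses in ceph.conf
everywhere.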

Mark



On Tue, Apr 6, 2021 at 8:08 AM Jean-Marc FONTANA
 wrote:
>
> Hello everyone,
>
> We have installed a Nautilus Ceph cluster with 3 monitors, 5 osd and 1
> RGW gateway.
> It works but now, we need to change the IP addresses of these machines
> to put them in DMZ.
> Are there any recommandations to go about doing this ?
>
> Best regards,
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread Mark Lehrer
> Yes, it is an NVMe, and one node has two NVMes as db/wal, one
> for ssd(0-2) and another for hdd(3-6).  I have no spare to try.
> ...
> I/O 517 QID 7 timeout, aborting
> Input/output error

If you are seeing errors like these, it is almost certainly a bad
drive, unless you are accessing it over a fabric (NVMe-oF).

Why are you putting the wal on an SSD in the first place?  Are you
sure it is even necessary, especially when one of your pools is
already SSD?

Adding this complexity just means that there are more things to break
when you least expect it. Putting the db/wal on a separate drive is
usually premature optimization that is only useful for benchmarkers.
My opinion of course.

Mark








On Sun, Feb 21, 2021 at 7:16 PM zxcs  wrote:
>
> Thanks for your reply!
>
> Yes, it is an NVMe, and one node has two NVMes as db/wal, one for ssd(0-2) and 
> another for hdd(3-6).
> I have no spare to try.
> It's very strange - the load was not very high at that time, and both the ssd and 
> nvme seem healthy.
>
> If I cannot fix it, I am afraid I will need to set up more nodes and then remove 
> the OSDs which use this NVMe?
>
> Thanks,
> zx
>
>
> > On 22 Feb 2021, at 10:07 AM, Mark Lehrer  wrote:
> >
> >> One nvme  sudden crash again. Could anyone please help shed some light 
> >> here?
> >
> > It looks like a flaky NVMe drive.  Do you have a spare to try?
> >
> >
> > On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
> >>
> >> One nvme  sudden crash again. Could anyone please help shed some light 
> >> here? Thank a ton!!!
> >> Below are syslog and ceph log.
> >>
> >> From  /var/log/syslog
> >> Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 
> >> 7 timeout, aborting
> >> Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 
> >> 18 timeout, aborting
> >> Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 
> >> 28 timeout, aborting
> >> Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 
> >> 2 timeout, aborting
> >> Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 
> >> 18 timeout, aborting
> >> Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
> >> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> >> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> >> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> >> ondisk+write+known_if_redirected+full_force e7868)
> >> Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 
> >> 9 timeout, aborting
> >> Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
> >> osd.16 7868 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread Mark Lehrer
> One NVMe suddenly crashed again. Could anyone please help shed some light here?

It looks like a flaky NVMe drive.  Do you have a spare to try?


On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
>
> One NVMe suddenly crashed again. Could anyone please help shed some light here? 
> Thanks a ton!!!
> Below are syslog and ceph log.
>
> From  /var/log/syslog
> Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 7 
> timeout, aborting
> Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 18 
> timeout, aborting
> Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 28 
> timeout, aborting
> Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 2 
> timeout, aborting
> Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 18 
> timeout, aborting
> Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 9 
> timeout, aborting
> Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:43 ip ceph-osd[3241]: 2021-02-21 19:38:43.258 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:44 ip ceph-osd[3241]: 2021-02-21 19:38:44.258 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:45 ip ntpd[3480]: Soliciting pool server 84.16.67.12
> Feb 21 19:38:45 ip ceph-osd[3241]: 2021-02-21 19:38:45.286 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:46 ip ceph-osd[3241]: 2021-02-21 19:38:46.254 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e7868)
> Feb 21 19:38:47 ip ceph-osd[3241]: 2021-02-21 19:38:47.226 7f023b58f700 -1 
> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
> [create,setxattr 

[ceph-users] Re: struggling to achieve high bandwidth on Ceph dev cluster - HELP

2021-02-10 Thread Mark Lehrer
> I am interested in benchmarking the cluster.

dstat is great, but can you send an example of the output of this
command on your OSD machine: iostat -mtxy 1

This will also show some basic CPU info and more detailed analysis of
the I/O pattern.

What kind of drives are you using?  Random access can be very slow on
spinning drives, especially if you have to do log structured merging
(double writes).

Mark


On Wed, Feb 10, 2021 at 1:31 AM Bobby  wrote:
>
> Hi,
>
> Hello, I am using the rados bench tool. Currently I am using this tool on the
> development cluster after running the vstart.sh script. It is working fine and
> I am interested in benchmarking the cluster. However, I am struggling to
> achieve good bandwidth (MB/sec).  My target throughput is
> at least 50 MB/sec or more, but mostly I am achieving around 15-20
> MB/sec. So, very poor.
>
> I am quite sure I am missing something. Either I have to change my cluster
> through the vstart.sh script, or I am not fully utilizing the rados bench tool.
> Or maybe both, i.e. not the right cluster and also not using the rados
> bench tool correctly.
>
> Some of the shell examples I have been using to build the cluster are
> below:
> MDS=0 RGW=1 ../src/vstart.sh -d -l -n --bluestore
> MDS=0 RGW=1 MON=1 OSD=4 ../src/vstart.sh -d -l -n --bluestore
>
> While using the rados bench tool I have been trying different block sizes:
> 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K. I have also been changing the
> -t parameter to increase the number of concurrent IOs.
>
>
> Looking forward to help.
>
> Bobby
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-05 Thread Mark Lehrer
> Redhat/Micron/Samsung/Supermicro have all put out white papers backing the 
> idea of 2 copies on NVMe's as safe for production.

It's not like you can just jump from "unsafe" to "safe" -- it is about
comparing the probability of losing data against how valuable that
data is.

A vendor's decision on size --  when they have a vested interest in
making the price lower vs the competition -- may be a different
decision than you would make as the person who stands to lose your
data and potentially your career.  And I say this as someone who works
for a hardware vendor... listen to their advice but make your own
decision.

I have lost data on a size 2 cluster before and learned first-hand how
easy it is for this to happen.  Luckily it was just my home NAS.  But
if anyone has Roger Federer's 2018 tennis matches archived we need to
talk :D

Mark



On Wed, Feb 3, 2021 at 8:50 AM Adam Boyhan  wrote:
>
> Isn't this somewhat reliant on the OSD type?
>
> Redhat/Micron/Samsung/Supermicro have all put out white papers backing the 
> idea of 2 copies on NVMe's as safe for production.
>
>
> From: "Magnus HAGDORN" 
> To: pse...@avalon.org.ua
> Cc: "ceph-users" 
> Sent: Wednesday, February 3, 2021 4:43:08 AM
> Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2
>
> On Wed, 2021-02-03 at 09:39 +, Max Krasilnikov wrote:
> > > if a OSD becomes unavailble (broken disk, rebooting server) then
> > > all
> > > I/O to the PGs stored on that OSD will block until replication
> > > level of
> > > 2 is reached again. So, for a highly available cluster you need a
> > > replication level of 3
> >
> >
> > AFAIK, with min_size 1 it is possible to write even to only active
> > OSD serving
> >
> yes, that's correct but then you seriously risk trashing your data
>
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Mark Lehrer
I have just one more suggestion for you:

> but even our Supermicro contact that we worked the
> config out with was in agreement with 2x on NVMe

These kinds of settings aren't set in stone; it is a one-line command
to rebalance (admittedly you wouldn't want to just do this casually).

I don't know your situation in any detail, but perhaps you could start
with size 3 and put off the size 2 decision until your cluster is
maybe 30% full... then you could make a final decision to either add
more storage or rebalance to size 2.

You can also have different size settings for different pools
depending on how important the data is.
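
For example (pool names are placeholders):

  ceph osd pool set volumes size 3
  ceph osd pool set volumes min_size 2
  ceph osd pool set scratch size 2     # e.g. a pool whose data you can re-generate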

Mark


On Thu, Feb 4, 2021 at 4:38 AM Adam Boyhan  wrote:
>
> I know there are already a few threads about 2x replication but I wanted to 
> start one dedicated to discussion on NVMe. There are some older threads, but 
> nothing recent that addresses how the vendors are now pushing the idea of 2x.
>
> We are in the process of considering Ceph to replace our Nimble setup. We 
> will have two completely separate clusters at two different sites that we are 
> using rbd-mirror snapshot replication. The plan would be to run 2x 
> replication on each cluster. 3x is still an option, but for obvious reasons 
> 2x is enticing.
>
> Both clusters will be spot on to the super micro example in the white paper 
> below.
>
> It seems all the big vendors feel 2x is safe with NVMe but I get the feeling 
> this community feels otherwise. Trying to wrap my head around where the 
> disconnect is between the big players and the community. I could be missing 
> something, but even our Supermicro contact that we worked the config out with 
> was in agreement with 2x on NVMe.
>
> Appreciate the input!
>
> [ https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf | 
> https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf ]
>
> [ 
> https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
>  ]
> [ 
> https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
>  | 
> https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
>  ]
>
> [ 
> https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
>  | 
> https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
>  ]
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Mark Lehrer
> It seems all the big vendors feel 2x is safe with NVMe but
> I get the feeling this community feels otherwise

Definitely!

As someone who works for a big vendor (and I have since I worked at
Fusion-IO way back in the old days), IMO the correct way to phrase
this would probably be that "someone in technical marketing at the big
vendors" was convinced that 2x was safe enough to put in a white paper
or sales document.  They (we, I guess, since I'm one of these types of
people) are focused on performance and cost numbers and as much as I
hate to admit it, it can get in the way of long-term reliability
settings sometimes.

This doesn't mean that they are "wrong" -- these documents are
primarily meant to show the capabilities of their hardware, with a
bill of materials containing their part numbers.  It is expected that
end users will adjust a few things when it comes to a production
environment.

The idea that NVMe is safer than spinning rust drives is not
necessarily true -- and it's beside the point.  You are just as likely
to run into a weird situation where an OSD or pg acts up or disappears
for non-hardware reasons.

Unless you can live with "nine fives" instead of "five nines" (say, a
caching type of application where you can re-generate the data), use a
size of at least 3 -- and if you can't afford this much storage then
look at erasure coding schemes.
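
A minimal erasure-coded pool looks something like this (profile name,
k/m values, PG counts, and pool name are all placeholders to size for
your own cluster):

  ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
  ceph osd pool create mypool-ec 128 128 erasure ec42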

All of this is IMO of course,
Mark



On Thu, Feb 4, 2021 at 4:38 AM Adam Boyhan  wrote:
>
> I know there are already a few threads about 2x replication but I wanted to 
> start one dedicated to discussion on NVMe. There are some older threads, but 
> nothing recent that addresses how the vendors are now pushing the idea of 2x.
>
> We are in the process of considering Ceph to replace our Nimble setup. We 
> will have two completely separate clusters at two different sites that we are 
> using rbd-mirror snapshot replication. The plan would be to run 2x 
> replication on each cluster. 3x is still an option, but for obvious reasons 
> 2x is enticing.
>
> Both clusters will be spot on to the super micro example in the white paper 
> below.
>
> It seems all the big vendors feel 2x is safe with NVMe but I get the feeling 
> this community feels otherwise. Trying to wrap my head around were the 
> disconnect is between the big players and the community. I could be missing 
> something, but even our Supermicro contact that we worked the config out with 
> was in agreement with 2x on NVMe.
>
> Appreciate the input!
>
> [ https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf | 
> https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf ]
>
> [ 
> https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
>  ]
> [ 
> https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
>  | 
> https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
>  ]
>
> [ 
> https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
>  | 
> https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
>  ]
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Limited performance

2020-02-25 Thread Mark Lehrer
Fabian said:

> The output of "ceph osd pool stats" shows ~100 op/s, but our disks are doing:

What does the iostat output look like on the old cluster?

Thanks,
Mark

On Mon, Feb 24, 2020 at 11:02 AM Fabian Zimmermann  wrote:
>
> Hi,
>
> we are currently creating a new cluster. This cluster is (as far as we can
> tell) a config-copy (ansible) of our existing cluster, just 5 years later
> - with new hardware (NVMe instead of SSD, bigger disks, ...)
>
> The setup:
>
> * NVMe for Journals and "Cache"-Pool
> * HDD with NVMe Journals for "Data"-Pool
> * Cache-Pool as writeback-Tier on Data-Pool
> * We are using 12.2.13 without bluestore.
>
> If we run a rados benchmark against this pool, everything seems fine, but
> as soon as we start a fio-benchmark
>
> -<-
> [global]
> ioengine=rbd
> clientname=cinder
> pool=cinder
> rbdname=fio_test
> rw=write
> bs=4M
>
> [rbd_iodepth32]
> iodepth=32
> ->-
>
> after some seconds the bandwidth drops to <15 MB/s and our hdd-disks are
> doing more IOs than our Journal-Disks.
> We also unconfigured the caching completely, but the issue remains.
>
> The output of "ceph osd pool stats" shows ~100 op/s, but our disks are
> doing:
> -<-
> Device:   rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1     0.00     0.00    0.00  278.50     0.00    34.07   250.51     0.14    0.50    0.00    0.50   0.03   0.80
> nvme1n1     0.00     0.00    0.00   64.00     0.00     7.77   248.50     0.01    0.22    0.00    0.22   0.03   0.20
> sda         0.00     1.50    0.00  557.00     0.00    29.49   108.45   180.57  160.59    0.00  160.59   1.80 100.00
> sdb         0.00    42.00    0.00  592.00     0.00    28.21    97.60   176.51 1105.79    0.00 1105.79   1.69 100.00
> sdc         0.00    14.50    0.00  528.50     0.00    27.95   108.31   183.02  179.47    0.00  179.47   1.89 100.00
> sde         0.00   134.50    0.00  223.50     0.00    14.05   128.72    17.38   60.05    0.00   60.05   0.89  20.00
> sdg         0.00    76.00    0.00  492.00     0.00    26.32   109.54   191.81 1474.96    0.00 1474.96   2.03 100.00
> sdf         0.00     0.00    0.00  491.50     0.00    26.76   111.49   176.55  326.05    0.00  326.05   2.03 100.00
> sdh         0.00     0.00    0.00  548.50     0.00    26.71    99.75   204.39  327.57    0.00  327.57   1.82 100.00
> sdi         0.00   112.00    0.00  526.00     0.00    23.15    90.14   158.32 1325.61    0.00 1325.61   1.90 100.00
> sdj         0.00    12.00    0.00  641.00     0.00    34.78   111.13   185.51  278.29    0.00  278.29   1.56 100.00
> sdk         0.00    23.50    0.00  399.50     0.00    20.38   104.46   166.77  461.67    0.00  461.67   2.50 100.00
> sdl         0.00   267.00    0.00  498.50     0.00    34.46   141.58   200.37  490.80    0.00  490.80   2.01 100.00
> ->-
>
> Any hints how to debug the issue?
>
> Thanks a lot,
>
>  Fabian
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io