Hi, yes, it has software bottlenecks :-)
https://yourcmc.ru/wiki/Ceph_performance
If you just need block storage - try Vitastor https://vitastor.io/
https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md - I made it
architecturally very similar to Ceph - or if you're fine
Hi!
I have an interesting question regarding SSDs and I'll try to ask about it here.
During my testing of Ceph & Vitastor & Linstor on servers equipped with Intel
D3-S4510 SSDs I discovered a very funny problem with these SSDs:
They don't like overwrites of the same sector.
That is, if you
One guy in the Russian Ceph chat had this problem when he had the "insights" mgr
module enabled. So try
to disable various mgr modules and see if it helps...
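For example, to rule out the "insights" module specifically, something like:
ceph mgr module ls                 # see which modules are enabled
ceph mgr module disable insights   # then watch the mon/mgr CPU and db size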
> Hello,
>
> we observed massive and sudden growth of the mon db size on disk, from
> 50MB to 20GB+ (GB!) and thus reaching 100% disk usage
Yeah, but you should divide the per-disk sysstat numbers by 5, which is roughly
Ceph's write amplification. 60k/5 = 12k external iops, pretty realistic.
> I did not see 10 cores, but 7 cores per osd over a long period on pm1725a
> disks with around 60k
> IO/s according to sysstat of each disk.
OK, I'll rerun my tests a few more times.
But I've never seen an OSD utilize 10 cores, so... I won't believe it until I see
it myself on my machine. :-))
I tried a fresh OSD on a block ramdisk ("brd"), for example. It was eating 658%
CPU and pushing only 4138 write iops...
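(For reference, the ramdisk was created with the brd module; the size below is
just an example:)
modprobe brd rd_nr=1 rd_size=16777216   # one 16 GiB /dev/ram0, rd_size is in KiB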
Hi Roman,
Yes, you're right - OSDs list all objects during peering and take the latest
full version of each object. A full version is a version that has at least
min_size parts for XOR/EC, or any version for replicated setups, which is OK
because writes are atomic. If there is a newer "committed"
I have no idea how you get 66k write iops with one OSD )
I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs
(all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128
over 4 RBD images. I got 17k iops.
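Each of the 4 fio jobs looked roughly like this (pool and image names here are
placeholders):
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite
-runtime=60 -pool=rpool -rbdname=testimg1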
OK, in fact that's not the worst result for
> https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing
I see that in your tests Octopus delivers more than twice the iops with 1 OSD.
Can I ask what my problem is, then? :-)
I have a 4-node Ceph cluster with 14 NVMe drives and fast CPUs
Sounds like you just want to create 2 OSDs per drive? It's OK, everyone does
that :) I tested Ceph with 2 OSDs per SATA SSD when comparing it to my
Vitastor, Micron also tested Ceph with 2 OSDs per SSD in their PDF and so on.
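(With ceph-volume it's as simple as, for example - device paths are just examples:)
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1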
> On 23/09/2020 10:54, Marc Roos wrote:
>
>>> Depends on your
Slow. https://yourcmc.ru/wiki/Ceph_performance :-)
> Hi,
>
> we're considering running KVM virtual machine images on Ceph RBD block
> devices. How does Ceph RBD perform with the synchronous writes of
> databases (MariaDB)?
>
> Best regards,
>
> Renne
Thanks Marc :)
It's easier to write code than to cooperate :) I can do whatever I want in my
own project.
Ceph is rather complex. For example, I failed to find bottlenecks in OSD when I
tried to profile it - I'm not an expert of course, but still... The only
bottleneck I found was
Hi
> We currently run an SSD cluster and HDD clusters and are looking at possibly
> creating a cluster for NVMe storage. For spinners and SSDs, it seemed the
> max recommended per OSD host server was 16 OSDs (I know it depends on the
> CPUs and RAM, like 1 CPU core and 2 GB memory).
What do you
Hi!
After almost a year of development in my spare time I present my own
software-defined block storage system: Vitastor - https://vitastor.io
I designed it similarly to Ceph in many ways: it also has Pools, PGs, OSDs,
different coding schemes, rebalancing and so on. However, it's much simpler
> we did test dm-cache, bcache and dm-writecache, we found the latter to be
> much better.
Did you set the bcache block size to 4096 during your tests? Without this setting
it's slow because 99.9% of SSDs don't handle 512 byte overwrites well. Otherwise I
don't think bcache should be worse than
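(For reference, the bcache block size is set at format time; a sketch with
placeholder device names:)
make-bcache --block 4k -C /dev/nvme0n1p1 -B /dev/sdX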
will also depend on whether your HDDs have internal SSD/media cache
(a lot of them do even if you're unaware of it).
+1 for hsbench, just be careful and use my repo
https://github.com/vitalif/hsbench because the original has at least 2 bugs for
now:
1) it only reads the first 64KB when benchmarking
There's also Micron 7300 Pro/Max. Please benchmark it as described here
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit
and send me the results if you get one
Samsung PM983 M.2
I want to have a separate disk for the bucket index pool, and all of my server
bays are full, so I have to use M.2 storage devices. Also, the bucket index
doesn't need much space, so I plan to have 6 devices with replica 3 for it.
Each disk could be 240GB to not waste space but
Yeah, of course... but RBD is primarily used for KVM VMs, so the results from a
VM are the thing that real clients see. So they do mean something... :)
I know. I tested fio before testing Ceph with fio. On the null ioengine fio can
handle up to 14M IOPS (on my dusty lab's R220). On blk_null it gets
Hi George
Author of Ceph_performance here! :)
I suspect you're running tests with 1 PG. Every PG's requests are always
serialized, which is why the OSD doesn't utilize all threads with 1 PG. You need
something like 8 PGs per OSD. More than 8 usually doesn't improve results.
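(For a quick single-OSD test that means something like this; the pool name is
just an example:)
ceph osd pool create testpool 8 8   # 8 PGs, 8 PGPs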
Also note that read
Create a pool with size=minsize=1 and use ceph-gobench
https://github.com/rumanzo/ceph-gobench
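A minimal sketch (pool name assumed; recent releases may additionally require a
confirmation flag to allow size 1):
ceph osd pool create bench 8 8
ceph osd pool set bench size 1
ceph osd pool set bench min_size 1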
Hi all.
Is there any way to completely health-check one OSD host or instance?
For example, rados bench just on that OSD, or some checks for the disk and
the front and back network?
Thanks.
Hi, https://yourcmc.ru/wiki/Ceph_performance author here %)
Disabling write cache is REALLY bad for SSDs without capacitors
[consumer SSDs], also it's bad for HDDs with firmware that doesn't have
this bug-o-feature. The bug is really common though. I have no idea
where it comes from, but it's
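(For reference, the volatile write cache can be checked and toggled per device,
e.g.:)
hdparm -W /dev/sdX     # show the current write cache state
hdparm -W 0 /dev/sdX   # disable it (only sensible on capacitor-backed drives)
hdparm -W 1 /dev/sdX   # re-enable it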
Update on my issue. It seems it was caused by the broken compression
that one of the 14.2.x releases (Ubuntu builds) probably had.
My OSD versions were mixed: five OSDs were 14.2.7, one was 14.2.4, and the
other 6 were 14.2.8.
I moved the same PG several more times. Space usage dropped when the PG
was
Hi,
The cluster is all-flash (NVMe), so the removal is fast and it's in fact
pretty noticeable, even on Prometheus graphs.
Also I've logged raw space usage from `ceph -f json df`:
1) before pg rebalance started the space usage was 32724002664448 bytes
2) just before the rebalance finished it
I have a question regarding this problem - is it possible to rebuild
bluestore allocation metadata? I could try it to test if it's an
allocator problem...
Hi Steve,
Thanks, it's an interesting discussion, however I don't think that it's
the same problem, because in my case bluestore eats additional space
during rebalance. And it doesn't seem that Ceph does small overwrites
during rebalance. As I understand it, it does the opposite: it reads and
Hi.
I'm experiencing some kind of a space leak in Bluestore. I use EC,
compression and snapshots. First I thought that the leak was caused by
"virtual clones" (issue #38184). However, then I got rid of most of the
snapshots, but continued to experience the problem.
I suspected something
Hi Victor,
1) RocksDB doesn't put L4 on the fast device if that device is smaller than
~286 GB, so no. But, anyway, there's usually no L4, so 30 GB is usually
sufficient. I had ~17 GB block.dbs even for 8 TB hard drives used for
RBD... RGW probably uses slightly more if stored objects are small...
but
Hi,
Can you test it slightly differently (and more simply)? Like in this
googledoc:
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
As we know it's a QLC drive, first let it fill the SLC cache:
fio -ioengine=libaio -direct=1 -name=test -bs=4M
Hi Stefan,
Do you mean more info than:
Yes, there's more... I don't remember exactly, I think some information
ends up included in the OSD perf counters and some information is dumped
into the OSD log, maybe there's even a 'ceph daemon' command to trigger
it...
There are 4 options that
Hi,
This helped to disable deferred writes in my case:
bluestore_min_alloc_size=4096
bluestore_prefer_deferred_size=0
bluestore_prefer_deferred_size_ssd=0
If you already deployed your OSDs with min_alloc_size=4K then you don't
need to redeploy them again.
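(As a sketch, these go into ceph.conf under the [osd] section; note that
min_alloc_size only takes effect for newly created OSDs:)
[osd]
bluestore_min_alloc_size = 4096
bluestore_prefer_deferred_size = 0
bluestore_prefer_deferred_size_ssd = 0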
Hi Vitaliy,
I completely
min_alloc_size can't be changed after formatting an OSD, and yes,
bluestore defers all writes that are < min_alloc_size. And default
min_alloc_size_ssd is 16KB.
The SSD (block.db) partition contains object metadata in RocksDB, so it
probably loads the metadata before modifying objects (if it's not in
cache yet). Also it sometimes performs compaction which also results in
disk reads and writes. There are other things going on that I'm not
completely aware
I think 800 GB of NVMe per 2 SSDs is overkill. 1 OSD usually only
requires 30 GB of block.db, so 400 GB per OSD is a lot. On the other
hand, does the 7300 have twice the iops of the 5300? In fact, I'm not sure if a
7300 + 5300 OSD will perform better than just a 5300 OSD at all.
It would be
Hi,
The results look strange to me...
To begin with, it's strange that read and write performance differs. But
the thing is that a lot of (if not most) large Seagate EXOS drives have an
internal SSD cache (~8 GB of it). I suspect the new EXOS drives also do, and
I'm not sure if Toshiba has it. It could
The worst part about the official repository is that it lacks Debian
packages
Also, of course, it would be very convenient to be able to install any
version from the repos, not just the latest one. It's certainly possible
with Debian repos...
Please never use dd for disk benchmarks.
Use fio. For linear write:
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M
-iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX
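And, if you also want a random I/O number from the same device, the 4k
counterpart would be something like:
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k
-iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdX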
on a new ceph cluster with the same software and config (ansible) on
the old hardware. 2 replica, 1 host, 4 osd.
=> New hardware : 32.6 MB/s READ / 10.5 MiB/s WRITE
=> Old hardware : 184 MiB/s READ / 46.9 MiB/s WRITE
No discussion ? I suppose I will keep the old hardware. What do you
think ? :D
fio -ioengine=libaio -name=test -bs=4k -iodepth=32 -rw=randread
-runtime=60 -filename=/dev/rbd/kube/bench
Now add -direct=1 because Linux async IO isn't async without O_DIRECT.
:)
+ Repeat the same for randwrite.
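I.e. something like:
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=32 -rw=randwrite
-runtime=60 -filename=/dev/rbd/kube/bench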
And once more you're checking random I/O with a 4 MB (!!!) block size.
Now recheck it with bs=4k.
- libaio randwrite
- libaio randread
- libaio randwrite on mapped rbd
- libaio randread on mapped rbd
- rbd read
- rbd write
recheck RBD with RAND READ / RAND WRITE
you're again comparing RANDOM and NON-RANDOM I/O
your SSDs aren't that bad, 3000 single-thread iops isn't the worst
Now to go for "apples to apples" either run
fio -ioengine=libaio -name=test -bs=4k -iodepth=1 -direct=1 -fsync=1
-rw=randwrite -runtime=60 -filename=/dev/nvmeX
to compare with the single-threaded RBD random write result (the test is
destructive, so use a separate partition without
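(The RBD-side counterpart, reusing the mapped device path from the earlier test
as a placeholder, would be something like:)
fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1
-rw=randwrite -runtime=60 -filename=/dev/rbd/kube/bench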
Hi.
It's not Ceph that's to blame!
Linux does not support cached asynchronous I/O, except for the new
io_uring! I.e. it supports aio calls, but they just block when you're
trying to issue them on an FD opened without O_DIRECT.
So basically what happens when you benchmark it with -ioengine=libaio