[ceph-users] Re: Ceph Performance very bad even in Memory?!

2022-01-30 Thread vitalif
Hi, yes, it has software bottlenecks :-) https://yourcmc.ru/wiki/Ceph_performance If you just need block storage - try Vitastor https://vitastor.io/ https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md - I made it very architecturally similar to Ceph - or if you're fine

[ceph-users] Intel SSD firmware guys contacts, if any

2020-11-02 Thread vitalif
Hi! I have an interesting question regarding SSDs and I'll try to ask about it here. During my testing of Ceph & Vitastor & Linstor on servers equipped with Intel D3-4510 SSDs I discovered a very funny problem with these SSDs: They don't like overwrites of the same sector. That is, if you
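A minimal fio sketch of such a same-sector overwrite pattern (my own illustration, not necessarily the exact test; the device path is a placeholder and the run is destructive): it limits the test region to a single 4k block and loops over it with an fsync after every write.

fio -ioengine=libaio -direct=1 -fsync=1 -name=samesector -bs=4k -size=4k -rw=write -time_based -runtime=60 -filename=/dev/sdX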

[ceph-users] Re: Massive Mon DB Size with noout on 14.2.11

2020-10-02 Thread vitalif
One guy in the Russian Ceph chat had this problem when he had "insights" mgr module enabled. So try to disable various mgr modules and see if it helps... > Hello, > > we observed massive and sudden growth of the mon db size on disk, from > 50MB to 20GB+ (GB!) and thus reaching 100% disk usage
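For reference, checking and disabling a single mgr module looks like this; the mon compaction afterwards is an extra suggestion of mine to reclaim the space:

ceph mgr module ls
ceph mgr module disable insights
ceph tell mon.<id> compact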

[ceph-users] Re: NVMe's

2020-09-24 Thread vitalif
Yeah, but you should divide the sysstat numbers of each disk by 5, which is Ceph's write amplification. 60k/5 = 12k external iops, pretty realistic. > I did not see 10 cores, but 7 cores per osd over a long period on pm1725a > disks with around 60k > IO/s according to sysstat of each disk.

[ceph-users] Re: NVMe's

2020-09-24 Thread vitalif
OK, I'll retry my tests several times more. But I've never seen OSD utilize 10 cores, so... I won't believe it until I see it myself on my machine. :-)) I tried a fresh OSD on a block ramdisk ("brd"), for example. It was eating 658% CPU and pushing only 4138 write iops...
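For reference, a brd ramdisk like the one in that test can be created this way (the 16 GiB size is just an example of mine; rd_size is in KiB):

modprobe brd rd_nr=1 rd_size=16777216   # creates /dev/ram0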

[ceph-users] Re: Vitastor, a fast Ceph-like block storage for VMs

2020-09-24 Thread vitalif
Hi Roman, Yes, you're right - OSDs list all objects during peering and take the latest full version of each object. A full version is a version that has at least min_size parts for XOR/EC, or any version for replicated setups, which is OK because writes are atomic. If there is a newer "committed"

[ceph-users] Re: NVMe's

2020-09-23 Thread vitalif
I have no idea how you get 66k write iops with one OSD ) I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128 over 4 RBD images. I got 17k iops. OK, in fact that's not the worst result for
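Roughly what that setup looks like, with the pool/image names and OSD ids as placeholders of mine (the pg-upmap-items call is repeated for each of the 8 PGs, and the fio run for each of the 4 images):

ceph osd pool create bench 8 8
ceph osd pg-upmap-items <pgid> <from-osd> <to-osd>
rbd create bench/img0 --size 10G
fio -ioengine=rbd -clientname=admin -pool=bench -rbdname=img0 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60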

[ceph-users] Re: NVMe's

2020-09-23 Thread vitalif
> https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing I see that in your tests Octopus delivers more than twice iops with 1 OSD. Can I ask you what's my problem then? :-) I have a 4-node Ceph cluster with 14 NVMe drives and fast CPUs

[ceph-users] Re: NVMe's

2020-09-23 Thread vitalif
Sounds like you just want to create 2 OSDs per drive? It's OK, everyone does that :) I tested Ceph with 2 OSDs per SATA SSD when comparing it to my Vitastor, Micron also tested Ceph with 2 OSDs per SSD in their PDF and so on. > On 23/09/2020 10:54, Marc Roos wrote: > >>> Depends on your
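For what it's worth, ceph-volume can do that split for you (the device path is just an example):

ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1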

[ceph-users] Re: Ceph RBD latency with synchronous writes?

2020-09-23 Thread vitalif
Slow. https://yourcmc.ru/wiki/Ceph_performance :-) > Hi, > > we're considering running KVM virtual machine images on Ceph RBD block > devices. How does Ceph RBD perform with the synchronous writes of > databases (MariaDB)? > > Best regards, > > Renne
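The quickest way to get a feel for it is a single-threaded sync random write test, which is roughly what a database commit looks like (destructive; the mapped RBD path is a placeholder):

fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/rbd0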

[ceph-users] Re: Vitastor, a fast Ceph-like block storage for VMs

2020-09-23 Thread vitalif
Thanks Marc :) It's easier to write code than to cooperate :) I can do whatever I want in my own project. Ceph is rather complex. For example, I failed to find bottlenecks in OSD when I tried to profile it - I'm not an expert of course, but still... The only bottleneck I found was

[ceph-users] Re: NVMe's

2020-09-23 Thread vitalif
Hi > We currently run a SSD cluster and HDD clusters and are looking at possibly > creating a cluster for NVMe storage. For spinners and SSDs, it seemed the > max recommended per osd host server was 16 OSDs ( I know it depends on the > CPUs and RAM, like 1 cpu core and 2GB memory ). What do you

[ceph-users] Vitastor, a fast Ceph-like block storage for VMs

2020-09-22 Thread vitalif
Hi! After almost a year of development in my spare time I present my own software-defined block storage system: Vitastor - https://vitastor.io I designed it similar to Ceph in many ways, it also has Pools, PGs, OSDs, different coding schemes, rebalancing and so on. However it's much simpler

[ceph-users] Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS

2020-09-18 Thread vitalif
> we did test dm-cache, bcache and dm-writecache, we found the latter to be > much better. Did you set the bcache block size to 4096 during your tests? Without this setting it's slow because 99.9% of SSDs don't handle 512 byte overwrites well. Otherwise I don't think bcache should be worse than
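If I remember the bcache-tools flags correctly, the block size is set at format time, something like:

make-bcache --block 4k -C /dev/nvme0n1p1 -B /dev/sdX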

[ceph-users] Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS

2020-09-17 Thread vitalif
will also depend on whether your HDDs have internal SSD/media cache (a lot of them do even if you're unaware of it). +1 for hsbench, just be careful and use my repo https://github.com/vitalif/hsbench because the original has at least 2 bugs for now: 1) it only reads the first 64KB when benchmarking

[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2020-09-14 Thread vitalif
There's also Micron 7300 Pro/Max. Please benchmark it as described here: https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit and send me the results if you get one
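The core of that methodology boils down to roughly two fio runs on the raw device (destructive; the path is a placeholder): single-threaded sync random writes for journal-style latency, and parallel random writes for peak iops.

fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdX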

[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2020-09-14 Thread vitalif
Samsung PM983 M.2. I want to have a separate disk for the bucket index pool and all of my server bays are full, so I should use M.2 storage devices. Also, the bucket index doesn't need much space, so I plan to have 6x devices with replica 3 for it. Each disk could be 240GB to not waste space but

[ceph-users] Re: ceph-osd performance on ram disk

2020-09-10 Thread vitalif
Yeah, of course... but RBD is primarily used for KVM VMs, so the results from a VM are the thing that real clients see. So they do mean something... :) I know. I tested fio before testing Ceph with fio. On the null ioengine fio can handle up to 14M IOPS (on my dusty lab's R220). On blk_null it gets
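For reference, the two baselines mentioned look roughly like this (the null_blk device name assumes default module settings):

fio -ioengine=null -name=test -bs=4k -iodepth=128 -size=10G -rw=randwrite
modprobe null_blk
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=30 -filename=/dev/nullb0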

[ceph-users] Re: ceph-osd performance on ram disk

2020-09-10 Thread vitalif
Hi George, author of Ceph_performance here! :) I suspect you're running tests with 1 PG. Every PG's requests are always serialized, that's why the OSD doesn't utilize all threads with 1 PG. You need something like 8 PGs per OSD. More than 8 usually doesn't improve results. Also note that read

[ceph-users] Re: Bench on specific OSD

2020-06-30 Thread vitalif
Create a pool with size=minsize=1 and use ceph-gobench https://github.com/rumanzo/ceph-gobench Hi all. Is there any way to completely health-check one OSD host or instance? For example, rados bench just on that OSD, or do some checks for the disk and the front and back network? Thanks.
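A sketch of the pool setup (the pool name and PG count are mine; newer releases may require --yes-i-really-mean-it when setting size 1):

ceph osd pool create benchpool 64 64
ceph osd pool set benchpool size 1
ceph osd pool set benchpool min_size 1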

[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2020-06-24 Thread vitalif
Hi, https://yourcmc.ru/wiki/Ceph_performance author here %) Disabling write cache is REALLY bad for SSDs without capacitors [consumer SSDs], also it's bad for HDDs with firmwares that don't have this bug-o-feature. The bug is really common though. I have no idea where it comes from, but it's
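The usual knobs for checking and toggling the volatile write cache (sdparm covers SAS drives):

hdparm -W /dev/sdX           # show current write cache state
hdparm -W 0 /dev/sdX         # disable the volatile write cache
sdparm --set WCE=0 /dev/sdX  # the SAS equivalent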

[ceph-users] Re: Space leak in Bluestore

2020-03-27 Thread vitalif
Update on my issue. It seems it was caused by the broken compression that one of the 14.2.x releases (Ubuntu builds) probably had. My OSD versions were mixed: five OSDs were 14.2.7, one was 14.2.4, the other 6 were 14.2.8. I moved the same pg several times more. Space usage dropped when the pg was

[ceph-users] Re: Space leak in Bluestore

2020-03-26 Thread vitalif
Hi, The cluster is all-flash (NVMe), so the removal is fast and it's in fact pretty noticeable, even on Prometheus graphs. Also I've logged raw space usage from `ceph -f json df`: 1) before pg rebalance started the space usage was 32724002664448 bytes 2) just before the rebalance finished it

[ceph-users] Re: Space leak in Bluestore

2020-03-25 Thread vitalif
I have a question regarding this problem - is it possible to rebuild bluestore allocation metadata? I could try it to test if it's an allocator problem... Hi. I'm experiencing some kind of a space leak in Bluestore. I use EC, compression and snapshots. First I thought that the leak was
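For reference, an offline check/repair with ceph-bluestore-tool looks like this (OSD stopped, path is the usual one); it's not clear that it rebuilds the allocation metadata, but it should at least validate it:

systemctl stop ceph-osd@<N>
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<N>
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<N>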

[ceph-users] Re: Space leak in Bluestore

2020-03-24 Thread vitalif
Hi Steve, Thanks, it's an interesting discussion, however I don't think that it's the same problem, because in my case bluestore eats additional space during rebalance. And it doesn't seem that Ceph does small overwrites during rebalance. As I understand it does the opposite: it reads and

[ceph-users] Space leak in Bluestore

2020-03-24 Thread vitalif
Hi. I'm experiencing some kind of a space leak in Bluestore. I use EC, compression and snapshots. First I thought that the leak was caused by "virtual clones" (issue #38184). However, then I got rid of most of the snapshots, but continued to experience the problem. I suspected something

[ceph-users] Re: Advice on sizing WAL/DB cluster for Optane and SATA SSD disks.

2020-03-16 Thread vitalif
Hi Victor, 1) RocksDB doesn't put L4 on the fast device if it's less than ~ 286 GB, so no. But, anyway, there's usually no L4, so 30 GB is usually sufficient. I had ~17 GB block.dbs even for 8 TB hard drives used for RBD... RGW probably uses slightly more if stored objects are small... but
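Those thresholds roughly follow from the default RocksDB level sizing (max_bytes_for_level_base = 256 MB, level multiplier = 10) and the rule that a level only goes to the fast device if it fits there entirely:

L1 ~ 256 MB, L2 ~ 2.56 GB, L3 ~ 25.6 GB, L4 ~ 256 GB
L1+L2+L3 ~ 28.4 GB, so a ~30 GB block.db holds everything up to L3
L1+L2+L3+L4 ~ 284 GB, hence the ~286 GB mark before L4 also fits (the exact figure depends on overhead)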

[ceph-users] Re: Ceph Performance of Micron 5210 SATA?

2020-03-13 Thread vitalif
Hi, Can you test it slightly differently (and more simply)? Like in this googledoc: https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0 As we know it's a QLC drive, first let it fill the SLC cache: fio -ioengine=libaio -direct=1 -name=test -bs=4M

[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-06 Thread vitalif
Hi Stefan, Do you mean more info than: Yes, there's more... I don't remember exactly, I think some information ends up included into OSD perf counters and some information is dumped into the OSD log, maybe there's even a 'ceph daemon' command to trigger it... There are 4 options that
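For reference, the per-OSD counters can be pulled like this (osd.0 and the jq filter are just examples):

ceph daemon osd.0 perf dump | jq '.bluestore'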

[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-05 Thread vitalif
Hi, This helped to disable deferred writes in my case: bluestore_min_alloc_size=4096 bluestore_prefer_deferred_size=0 bluestore_prefer_deferred_size_ssd=0 If you already deployed your OSDs with min_alloc_size=4K then you don't need to redeploy them again. Hi Vitaliy, I completely
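In ceph.conf form this would be something like the following, keeping in mind that min_alloc_size only takes effect when an OSD is created:

[osd]
bluestore_min_alloc_size = 4096
bluestore_prefer_deferred_size = 0
bluestore_prefer_deferred_size_ssd = 0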

[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-05 Thread vitalif
min_alloc_size can't be changed after formatting an OSD, and yes, bluestore defers all writes that are < min_alloc_size. And default min_alloc_size_ssd is 16KB.

[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-04 Thread vitalif
SSD (block.db) partition contains object metadata in RocksDB so it probably loads the metadata before modifying objects (if it's not in cache yet). Also it sometimes performs compaction which also results in disk reads and writes. There are other things going on that I'm not completely aware

[ceph-users] Re: Micron SSD/Basic Config

2020-01-31 Thread vitalif
I think 800 GB of NVMe per 2 SSDs is overkill. 1 OSD usually only requires a 30 GB block.db, so 400 GB per OSD is a lot. On the other hand, does the 7300 have twice the iops of the 5300? In fact, I'm not sure if a 7300 + 5300 OSD will perform better than just a 5300 OSD at all. It would be

[ceph-users] Re: Benchmark results for Seagate Exos2X14 Dual Actuator HDDs

2020-01-16 Thread vitalif
Hi, The results look strange to me... To begin with, it's strange that read and write performance differs. But the thing is that a lot (if not most) large Seagate EXOS drives have internal SSD cache (~8 GB of it). I suspect that new EXOS also does and I'm not sure if Toshiba has it. It could

[ceph-users] Re: download.ceph.com repository changes

2019-09-17 Thread vitalif
The worst part about the official repository is that it lacks Debian packages. Also, of course, it would be very convenient to be able to install any version from the repos, not just the latest one. It's certainly possible with Debian repos...

[ceph-users] Re: Strange hardware behavior

2019-09-03 Thread vitalif
Please never use dd for disk benchmarks. Use fio. For linear write: fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX

[ceph-users] Re: Mapped rbd is very slow

2019-08-16 Thread vitalif
on a new ceph cluster with the same software and config (ansible) on the old hardware. 2 replica, 1 host, 4 osd. => New hardware : 32.6MB/s READ / 10.5MiB WRITE => Old hardware : 184MiB/s READ / 46.9MiB WRITE No discussion ? I suppose I will keep the old hardware. What do you think ? :D In

[ceph-users] Re: Mapped rbd is very slow

2019-08-16 Thread vitalif
fio -ioengine=libaio -name=test -bs=4k -iodepth=32 -rw=randread -runtime=60 -filename=/dev/rbd/kube/bench Now add -direct=1 because Linux async IO isn't async without O_DIRECT. :) + Repeat the same for randwrite.
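I.e. something like this (the randwrite run overwrites data on the image, so be careful):

fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=32 -rw=randread -runtime=60 -filename=/dev/rbd/kube/bench
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=32 -rw=randwrite -runtime=60 -filename=/dev/rbd/kube/bench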

[ceph-users] Re: Mapped rbd is very slow

2019-08-16 Thread vitalif
And once more you're checking random I/O with 4 MB !!! block size. Now recheck it with bs=4k.

[ceph-users] Re: Mapped rbd is very slow

2019-08-16 Thread vitalif
- libaio randwrite - libaio randread - libaio randwrite on mapped rbd - libaio randread on mapped rbd - rbd read - rbd write Recheck RBD with RAND READ / RAND WRITE. You're again comparing RANDOM and NON-RANDOM I/O. Your SSDs aren't that bad, 3000 single-thread iops isn't the worst

[ceph-users] Re: Mapped rbd is very slow

2019-08-16 Thread vitalif
Now to go for "apples to apples" either run fio -ioengine=libaio -name=test -bs=4k -iodepth=1 -direct=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/nvmeX to compare with the single-threaded RBD random write result (the test is destructive, so use a separate partition without

[ceph-users] Re: Small HDD cluster, switch from Bluestore to Filestore

2019-08-16 Thread vitalif
Hi. It's not Ceph that's to blame! Linux does not support cached asynchronous I/O, except for the new io_uring! I.e. it supports aio calls, but they just block when you're trying to do them on an FD opened without O_DIRECT. So basically what happens when you benchmark it with -ioengine=libaio
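For comparison, fio has an io_uring engine (fio >= 3.13), so the same buffered test can be rerun without the silent blocking, e.g. against a test file:

fio -ioengine=io_uring -name=test -bs=4k -iodepth=32 -rw=randwrite -size=1G -runtime=60 -filename=/tmp/testfile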