[ceph-users] Namespace usability for multitenancy
Hello. Has anyone started using namespaces in real production for multi-tenancy? How good are they at isolating tenants from each other? Can tenants see each other's presence, quotas, etc.? Is it safe to give (possibly mutually hostile) users cephx access to the same pool with a 'user per namespace' restriction? How badly can one user affect the others? Quotas restrict space overuse, but what about IO and omap overuse? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
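For concreteness, a per-namespace cephx restriction looks roughly like this (pool and client names are made up for illustration; check the cap syntax against your release's documentation):

```shell
# Two hypothetical tenants sharing one pool, each confined to its own
# RADOS namespace by OSD caps:
ceph auth get-or-create client.alice mon 'allow r' \
    osd 'allow rw pool=shared namespace=alice'
ceph auth get-or-create client.bob mon 'allow r' \
    osd 'allow rw pool=shared namespace=bob'
```

Note that caps like these only restrict which objects a client can touch; they do not throttle IO or omap usage, which is part of what the question above is asking about.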
[ceph-users] Decoding pgmap
There is a command `ceph pg getmap`. It produces a binary file. Is there any utility to decode it?
[ceph-users] How to make HEALTH_ERR quickly and pain-free
I have a hell of a question: how do I produce a HEALTH_ERR status for a cluster without consequences? I'm working on CI tests and I need to check whether our reaction to HEALTH_ERR is good. For this I need to take an empty cluster with an empty pool and do something to it, preferably quick and reversible. For HEALTH_WARN the best thing I found is to change the pool size to 1: it raises the "1 pool(s) have no replicas configured" warning almost instantly, and it can be reverted very quickly for an empty pool. But HEALTH_ERR is a bit more tricky. Any ideas?
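The size=1 trick described above reads roughly like this (the pool name is an assumption; recent releases guard size 1 behind an extra confirmation flag and the mon_allow_pool_size_one setting):

```shell
ceph osd pool set test size 1 --yes-i-really-mean-it   # raises HEALTH_WARN: 1 pool(s) have no replicas configured
ceph osd pool set test size 3                          # revert (near-instant on an empty pool)
```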
[ceph-users] Re: How to make HEALTH_ERR quickly and pain-free
On 21/01/2021 13:02, Eugen Block wrote: >> But HEALTH_ERR is a bit more tricky. Any ideas? > I think if you set a very low quota for a pool (e.g. 1000 bytes or so) and fill it up, it should create a HEALTH_ERR status, IIRC. Cool idea. Unfortunately, even with a 1-byte quota (and some data in the pool), it's still HEALTH_WARN: 1 pool(s) full
[ceph-users] Re: How to make HEALTH_ERR quickly and pain-free
On 21/01/2021 12:57, George Shuklin wrote: > [...] > But HEALTH_ERR is a bit more tricky. Any ideas? I found the way: ceph osd set-full-ratio 0.0 instantly causes "health: HEALTH_ERR, full ratio(s) out of order", even on an empty cluster. Problem solved.
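The fix found above, as a reversible pair of commands (0.95 is the usual default full ratio; verify yours first with `ceph osd dump | grep ratio`):

```shell
ceph osd set-full-ratio 0.0    # health goes to HEALTH_ERR: full ratio(s) out of order
ceph osd set-full-ratio 0.95   # revert to the (usual) default
```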
[ceph-users] Permissions for OSD
Docs for permissions are super vague. What does each flag do? What does 'x' permit? What's the difference between class-write and write? And the last question: can we limit a user to reading/writing only existing objects in the pool? Thanks!
[ceph-users] List pg with heavily degraded objects
Hello. I wonder if there is a way to see how many replicas are available for each object (or, at least, PG-level statistics). Basically, if I have a damaged cluster, I want to see the scale of the damage, and I want to see the most degraded objects (those with 1 copy, then objects with 2 copies, etc). Is there a way? pg list is not very informative, as it does not show how badly 'unreplicated' the data are.
[ceph-users] Re: List pg with heavily degraded objects
On 10/09/2021 15:19, Janne Johansson wrote: > Den fre 10 sep. 2021 kl 13:55 skrev George Shuklin: > [...] > ceph pg dump should list all PGs and how many active OSDs they have, in lists like this: [12,34,78,56], [12,34,2134872348723,56] - the first being the four OSDs (in my example) that should hold a replica of this PG, and the second being those that actually hold one, with 2^31-1 as a placeholder for UNKNOWN-OSD-NUMBER where an OSD is missing. It's not about being undersized. Imagine a small cluster with three OSDs. Two OSDs die, then two more empty ones are added to the cluster. Normally you'll see that each PG has found a peer and there are no undersized PGs. But the data actually hasn't been replicated yet; replication is still in progress. Is there any way to see whether there are PGs 'holding a single data copy, but replicating now'? I'm curious about this transition time between 'found a peer and doing recovery' and 'got at least two copies of the data'.
[ceph-users] Re: List pg with heavily degraded objects
On 10/09/2021 14:49, George Shuklin wrote: > [...] Actually, the problem is more complicated than I expected. Here is an artificial cluster where a sizable chunk of the data is single-copy (a cluster of three servers with 2 OSDs each: put some data, shut down server #1, put some more data, kill server #3, start server #1; it's guaranteed that server #2 holds a single copy). This is a snapshot of ceph pg dump for it as soon as #2 booted, and I can't find any proof that some data is in a single copy: https://gist.github.com/amarao/fbc8ef3538f66a9f2c264f8555f5c29a
[ceph-users] Re: List pg with heavily degraded objects
On 10/09/2021 15:37, Janne Johansson wrote: > Den fre 10 sep. 2021 kl 14:27 skrev George Shuklin: > [...] > My view is that they actually would be "undersized" until backfill is done to the PGs on the new empty disks you just added. I've just created a counter-example for that. Each server has 2 OSDs, default replicated rules. There are 4 servers; pool size is 3. * Shut down srv1, wait for recovery; shut down srv2, wait for recovery. * Put in a big amount of data (enough to see replication traffic); all the data lands on srv3+srv4, with degradation. * Shut down srv3, start srv1 and srv2. srv4 is now a single server with all the data available. I can see no 'undersized' PGs, but the data IS in a single copy: https://gist.github.com/amarao/fbc8ef3538f66a9f2c264f8555f5c29a
[ceph-users] Re: List pg with heavily degraded objects
On 10/09/2021 15:54, Janne Johansson wrote: > Den fre 10 sep. 2021 kl 14:39 skrev George Shuklin: > [...] > In this case, where you have both made PGs undersized, and also degraded by letting one OSD pick up some changes and then removing it and getting another one back in (I didn't see where #2 stopped in your example), I guess you will have to take a deep dive into ceph pg query to see ALL the info about it. By the time you are stacking multiple error scenarios on top of each other, I don't think there is a simple "show me a short, understandable list of what is almost (but not quite) working". No, I'm worried about observability of the situation where data is in a single copy (which I consider a bit of an emergency). I've just created a scenario where only a single server (2 OSDs) has the data on it, and right after replication started, I can't detect that it's THAT bad. I've updated the gist: https://gist.github.com/amarao/fbc8ef3538f66a9f2c264f8555f5c29a with a snapshot taken after the cluster, with a single copy of the data available, found enough space to make all PGs 'well-sized'. Replication is underway, but the data is single-copy at the moment of the snapshot.
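One way to surface this, sketched below, is to look at the per-PG degraded-object counters instead of the undersized flag. The JSON field paths here are from memory and may differ between releases, so treat this as a starting point only:

```shell
# PGs that are active (peered) but still carry degraded objects:
ceph pg dump --format json 2>/dev/null | jq -r '
  .pg_map.pg_stats[]
  | select(.stat_sum.num_objects_degraded > 0)
  | [.pgid, .state, .stat_sum.num_objects_degraded] | @tsv'
```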
[ceph-users] ceph-osd performance on ram disk
I'm creating a benchmark suite for Ceph. While benchmarking the benchmark, I've checked how fast ceph-osd works. I decided to skip all the 'SSD mess' and use brd (block RAM disk, modprobe brd) as the underlying storage. Brd itself can yield up to 2.7M IOPS in fio. In single-threaded mode (iodepth=1) it can yield up to 750k IOPS. LVM over brd gives about 600k IOPS in single-threaded mode with iodepth=1 (16us latency). But as soon as I put ceph-osd (bluestore) on it, I see something very odd. No matter how much parallel load I push onto this OSD, it never gives more than 30 kIOPS, and I can't understand where the bottleneck is. CPU utilization: ~300%. There are 8 cores on my setup, so CPU is not the bottleneck. Network: I've moved the benchmark onto the same host as the OSD, so it's localhost. Even counting the network, it's still far from saturation: 30 kIOPS (4k) is about 1Gb/s, but I have 10G links. Anyway, the tests run on localhost, so the network is irrelevant (I've checked; the traffic stays on localhost). The test itself consumes about 70% CPU of one core, so there is plenty left. Replication: I've killed it (size=1, single OSD in the pool). Single-threaded latency: 200us, 4.8 kIOPS. iodepth=32: 2ms (15 kIOPS). iodepth=16, numjobs=8: 5ms (24 kIOPS). I'm running fio with the 'rados' ioengine, and putting in more workers doesn't change much, so it's not the rados ioengine. As there is plenty of CPU and IO left, there is only one possible place for the bottleneck: some time-consuming single-threaded code in ceph-osd. Are there any knobs to tweak to see higher performance from ceph-osd? I'm pretty sure it's not any kind of leveling, GC or other 'iops-related' issue (brd's performance is two orders of magnitude higher).
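For reproducibility, the fio invocation behind the iodepth=16/numjobs=8 numbers above would look roughly like this (the pool name and runtime are assumptions):

```shell
# 4k random writes against a pool "bench" through the rados ioengine:
fio --name=osd-bench --ioengine=rados --clientname=admin --pool=bench \
    --rw=randwrite --bs=4k --iodepth=16 --numjobs=8 \
    --time_based --runtime=60 --group_reporting
```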
[ceph-users] Re: ceph-osd performance on ram disk
Thank you! I know that article, but they promise 6 cores of use per OSD, and I got barely over three, and all this in a totally synthetic environment with no SSD to blame (brd is more than fast and has a very consistent latency under any kind of load). On Thu, Sep 10, 2020, 19:39 Marc Roos wrote: > Hi George, > > Very interesting and also a bit of an expected result. Some messages posted > here already indicate that getting expensive top-of-the-line > hardware does not really result in any performance increase above some > level. Vitaliy has documented something similar [1] > > [1] https://yourcmc.ru/wiki/Ceph_performance > > -Original Message- > To: ceph-users@ceph.io > Subject: [ceph-users] ceph-osd performance on ram disk > > [original post trimmed]
[ceph-users] Re: ceph-osd performance on ram disk
I know. I tested fio before testing Ceph with fio. With the null ioengine, fio can handle up to 14M IOPS (on my dusty lab's R220). On blk_null it goes down to 2.4-2.8M IOPS. On brd it drops to a sad 700k IOPS. BTW, never run synthetic high-performance benchmarks on KVM. My old server with 'makelinuxfastagain' fixes makes one IO request in 3.4us, and on a KVM VM it becomes 24us. Some guy said he got about 8.5us on vmware. This is all on a purely software stack without any hypervisor IO. 24us sounds like a small number, but if your synthetic test does 200k IOPS, that's 5us per request. You can't make 200k on a VM with a 24us syscall time. On Thu, Sep 10, 2020, 22:49 Виталий Филиппов wrote: > By the way, DON'T USE rados bench. It's an incorrect benchmark. ONLY use fio. > > On 10 September 2020, 22:35:53 GMT+03:00, vita...@yourcmc.ru wrote: >> Hi George, >> Author of Ceph_performance here! :) >> I suspect you're running tests with 1 PG. Every PG's requests are always serialized; that's why the OSD doesn't utilize all threads with 1 PG. You need something like 8 PGs per OSD. More than 8 usually doesn't improve results. >> Also note that read tests are meaningless after a full overwrite on small OSDs, because everything fits in cache. Restart the OSD to clear it. You can drop the cache via the admin socket too, but restarting is the simplest way. >> I've repeated your test with brd. My results with 8 PGs, after filling the RBD image, turning CPU powersave off and restarting the OSD, are: >> # fio -name=test -ioengine=rbd -bs=4k -iodepth=1 -rw=randread -pool=ramdisk -rbdname=testimg >> read: IOPS=3586, BW=14.0MiB/s (14.7MB/s)(411MiB/29315msec) >> lat (usec): min=182, max=5710, avg=277.41, stdev=90.16 >> # fio -name=test -ioengine=rbd -bs=4k -iodepth=1 -rw=randwrite -pool=ramdisk -rbdname=testimg >> write: IOPS=1247, BW=4991KiB/s (5111kB/s)(67.0MiB/13746msec); 0 zone resets >> lat (usec): min=555, max=4015, avg=799.45, stdev=142.92 >> # fio -name=test -ioengine=rbd -bs=4k -iodepth=128 -rw=randwrite -pool=ramdisk -rbdname=testimg >> write: IOPS=4138, BW=16.2MiB/s (16.9MB/s)(282MiB/17451msec); 0 zone resets >> 658% CPU >> # fio -name=test -ioengine=rbd -bs=4k -iodepth=128 -rw=randread -pool=ramdisk -rbdname=testimg >> read: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(979MiB/15933msec) >> 540% CPU >> Basically the same shit as on an NVMe. So even an "in-memory Ceph" is slow, haha. >> [...] > -- > With best regards, > Vitaliy Filippov
[ceph-users] Re: ceph-osd performance on ram disk
Latency from the client side is not an issue; it just combines with the other latencies in the stack. The more the client lags, the easier it is for the cluster. The point I'm making here is slightly different: when you want to establish baseline performance for the OSD daemon (disregarding block device and network latencies), a sudden order-of-magnitude delay on syscalls causes a disproportionate skew in the results. It doesn't relate to production in any way, only to ceph-osd benchmarks. On Thu, Sep 10, 2020, 23:21 wrote: > Yeah, of course... but RBD is primarily used for KVM VMs, so the results from a VM are the thing that real clients see. So they do mean something... :) > > [...]
[ceph-users] Re: ceph-osd performance on ram disk
On 10/09/2020 22:35, vita...@yourcmc.ru wrote: > Hi George. Author of Ceph_performance here! :) > I suspect you're running tests with 1 PG. Every PG's requests are always serialized; that's why the OSD doesn't utilize all threads with 1 PG. You need something like 8 PGs per OSD. > [...] > Basically the same shit as on an NVMe. So even an "in-memory Ceph" is slow, haha. Hello! Thank you for the feedback! The PG idea is a really good one. Unfortunately, the autoscaler had already set it to 32, and I get 30 kIOPS from a 32-PG size-1 pool on a ramdisk. :-/ I've checked read speed (I hadn't done this before; I have no idea why), and I got an amazing 160 kIOPS, but I suspect it's caching. Anyway, thank you for the data; I take it as ~600% CPU in exchange for ~16-17 kIOPS per OSD.
[ceph-users] Re: ceph-osd performance on ram disk
On 10/09/2020 19:37, Mark Nelson wrote: > On 9/10/20 11:03 AM, George Shuklin wrote: > [...] Are there any knobs to tweak to see higher performance for ceph-osd? I'm pretty sure it's not any kind of leveling, GC or other 'iops-related' issues (brd has performance of two orders of magnitude higher). So as you've seen, Ceph does a lot more than just write a chunk of data out to a block on disk. There's tons of encoding/decoding happening, crc checksums, crush calculations, onode lookups, write-ahead-logging, and other work involved that all adds latency. You can overcome some of that through parallelism, but 30K IOPS per OSD is probably pretty on-point for a nautilus era OSD. For octopus+ the cache refactor in bluestore should get you farther (40-50k+ for an OSD in isolation). The maximum performance we've seen in-house is around 70-80K IOPS on a single OSD using very fast NVMe and highly tuned settings. A couple of things you can try: - upgrade to octopus+ for the cache refactor - Make sure you are using the equivalent of the latency-performance or latency-network tuned profile. The most important part is disabling CPU cstate transitions. - increase osd_memory_target if you have a larger dataset (onode cache misses in bluestore add a lot of latency) - enable turbo if it's disabled (higher clock speed generally helps) On the write path you are correct that there is a limitation regarding a single kv sync thread. Over the years we've made this less of a bottleneck but it's possible you still could be hitting it. In our test lab we've managed to utilize up to around 12-14 cores on a single OSD in isolation with 16 tp_osd_tp worker threads and on a larger cluster about 6-7 cores per OSD. There's probably multiple factors at play, including context switching, cache thrashing, memory throughput, object creation/destruction, etc. If you decide to look into it further you may want to try wallclock profiling the OSD under load and seeing where it is spending its time.
Thank you for the feedback. I forgot to mention: it's Octopus, a fresh installation. I've disabled C-states (governor=performance); it makes no difference: same IOPS, same CPU use by ceph-osd. I just can't force Ceph to consume more than 330% of CPU. I can push reads up to 150k IOPS (both over the network and locally), hitting the CPU limit, but writes are somewhat restricted by Ceph itself.
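The tuning suggestions above map to commands roughly like these (the memory target value is an example, not a recommendation):

```shell
# Latency-oriented tuned profile (limits deep C-state transitions):
tuned-adm profile latency-performance
# Larger BlueStore cache to reduce onode cache misses (example: 8 GiB per OSD):
ceph config set osd osd_memory_target 8589934592
```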
[ceph-users] Re: Is it possible to assign osd id numbers?
On 11/09/2020 16:11, Shain Miley wrote: > Hello, I have been wondering for quite some time whether or not it is possible to influence the osd.id numbers that are assigned during an install. I have made an attempt to keep our osds in order over the last few years, but it is a losing battle without having some control over the osd assignment. I am currently using ceph-deploy to handle adding nodes to the cluster. You can reuse OSD numbers, but I strongly advise against focusing on precise IDs. The reason is that you can hit a combination of server faults that will swap IDs no matter what. It's a false sense of beauty to have 'OSD ID matches the ID in the server name'. How to reuse OSD numbers? An OSD number is used (and should be cleaned up if the OSD dies) in three places in Ceph: 1) Crush map: ceph osd crush rm osd.x 2) OSD list: ceph osd rm osd.x 3) auth: ceph auth rm osd.x The last one is often forgotten and is a usual reason for ceph-ansible to fail on a new disk in the server.
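Put together, the full cleanup for a dead OSD (using 7 as an example id) looks like this:

```shell
ceph osd crush rm osd.7   # 1) remove from the crush map
ceph osd rm osd.7         # 2) remove from the osd list
ceph auth rm osd.7        # 3) remove the cephx key (the oft-forgotten one)
```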
[ceph-users] Re: Is it possible to assign osd id numbers?
On 11/09/2020 22:43, Shain Miley wrote: > Thank you for your answer below. I'm not looking to reuse them as much as I am trying to control which unused number is actually used. For example, if I have 20 osds and 2 have failed... when I replace a disk in one server I don't want it to automatically use the next lowest number for the osd assignment. I understand what you mean about not focusing on the osd ids... but my OCD is making me ask the question. Well, technically, you can create fake OSDs to hold numbers and release those 'fake OSDs' when you need to use their numbers, but you'd really be complicating everything. I suggest you stop worrying about the numbers. If you are OK with every OSD on every server using /dev/sdb (OCD requires that server1 uses /dev/sda, server2 uses /dev/sdb, server3 /dev/sdc, etc.), then you should be fine with random OSD numbers. Moreover, you should be fine with a discrepancy between the sorting order of OSD uuids and their numbers, and a misalignment of IP addresses and OSD numbers (192.168.0.4 for osd.1). While it may be fun to play with numbers in a lab, if you are using Ceph in production you should avoid unnecessary changes, as they will surprise other people (and you!) trying to keep the thing running.
[ceph-users] Re: ceph-osd performance on ram disk
On 11/09/2020 17:44, Mark Nelson wrote: > On 9/11/20 4:15 AM, George Shuklin wrote: > [...] > Ok, can I assume block/db/wal are all on the ramdisk? I'd start a benchmark and attach gdbpmp to the OSD and see if you can get a callgraph (1000 samples is nice if you don't mind waiting a bit). That will tell us a lot more about where the code is spending time. It will slow the benchmark way down fwiw. Some other things you could try: try to tweak the number of osd worker threads to better match the number of cores in your system. Too many and you end up with context switching. Too few and you limit parallelism. You can also check rocksdb compaction stats in the osd logs using this tool: https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py Given that you are on ramdisk, the 1GB default WAL limit should be plenty to let you avoid WAL throttling during compaction, but just verifying that compactions are not taking a long time is good peace of mind. Thank you very much for the feedback. In my case all OSD data was on the brd device. (To test it, just create a ramdisk: modprobe brd rd_size=20971520, where rd_size is in KiB, so that's 20 GiB; then create a PV and VG for Ceph, and let ceph-ansible consume them as OSD devices.) The stuff you've given me here is really cool, but a bit beyond my skills right now. I've added it to my task list, and I'll continue to research this topic further. Thank you for the directions to look into.
[ceph-users] Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS
On 16/09/2020 07:26, Danni Setiawan wrote: > Hi all, I'm trying to find the performance penalty with HDD OSDs when using WAL/DB on a faster device (SSD/NVMe) vs WAL/DB on the same device (HDD), for different workloads (RBD, RGW with the bucket index on an SSD pool, and CephFS with metadata on an SSD pool). I want to know whether giving up a disk slot for a WAL/DB device is worth it vs adding more OSDs. Unfortunately I cannot find benchmarks for these kinds of workloads. Has anyone ever done this benchmark? For everything except CephFS, fio looks like the best tool for benchmarking. It can benchmark Ceph at all levels: rados, rbd, http/S3. Moreover, it has excellent configuration options, detailed metrics, and it can run multi-server workloads (one fio client driving many fio servers). Fio's own performance is about 15M IOPS (null engine, per fio server), and it scales horizontally.
[ceph-users] disk scheduler for SSD
I'm starting to wonder (again) which scheduler is better for Ceph on SSDs. My reasoning. None: 1. Reduces latency for requests. The lower the latency, the higher the perceived performance for an unbounded workload with a fixed queue depth (hello, benchmarks). 2. Causes possible latency spikes for requests because of 'unfair' request ordering (hello, deep scrub). Deadline-mq: 1. Reduces nr_requests (queue size) to 256 (noop shows me 916???). May introduce latency. 2. May reduce latency spikes due to different rates for different types of workloads. I'm doing some benchmarks, and they, of course, give higher marks to the 'none' scheduler. Nevertheless, I believe most normal workloads on Ceph do not drive it at an unbounded rate, so a bounded workload (e.g. an app doing IO based on external, independent events) can be hurt by the lack of a disk scheduler in the presence of an unbounded workload. Any ideas?
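For anyone wanting to experiment, the scheduler is switched per device via sysfs (sda is an example; on multiqueue kernels the choices are typically none, mq-deadline, kyber and bfq):

```shell
cat /sys/block/sda/queue/scheduler                 # current choice shown in [brackets]
echo mq-deadline > /sys/block/sda/queue/scheduler  # switch scheduler
cat /sys/block/sda/queue/nr_requests               # queue size mentioned above
```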
[ceph-users] Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS
On 17/09/2020 17:37, Mark Nelson wrote: > Does fio handle S3 objects spread across many buckets well? I think bucket listing performance was maybe missing too, but it's been a while since I looked at fio's S3 support. Maybe they have those use cases covered now. I wrote a go-based benchmark called hsbench based on the wasabi-tech benchmark a while back that tries to cover some of those cases, but I haven't touched it in a while: https://github.com/markhpc/hsbench The way to spread load across many buckets is to use a 'farm' of servers under one client's management: you just give each server a different bucket to torture in its jobfile. The iodepth=1 restriction of the http ioengine actually encourages this. > FWIW fio can be used for cephfs as well, and it works reasonably well if you give it a long enough run time and only expect hero-run scenarios from it. For metadata-intensive workloads you'll need to use mdtest or smallfile. At this point I mostly just use the io500 suite, which includes both ior for hero runs and mdtest for metadata (but you need MPI to coordinate it across multiple nodes). Yep, that's what I meant about metadata-intensive workloads. Romping within a file or two is not a true fs-specific benchmark.
[ceph-users] Re: Low level bluestore usage
As far as I know, bluestore doesn't like super-small sizes. Normally the OSD should stop doing funny things at the full mark, but if the device is too small it may be too late and bluefs runs out of space. Two things:
1. Don't use too-small OSDs.
2. Have a spare area on the drive. I usually reserve 1% for emergency extension (and to give the SSD firmware a bit of space to breathe).

On Wed, Sep 23, 2020, 01:03 Ivan Kurnosov wrote:
> Hi,
>
> this morning I woke up to a degraded test ceph cluster (managed by rook,
> but it does not really change anything for the question I'm about to ask).
>
> After checking logs I have found that bluestore on one of the OSDs ran out
> of space.
>
> Some cluster details:
>
> ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus
> (stable)
> it runs on 3 little OSDs 10Gb each
>
> `ceph osd df` returned RAW USE of about 4.5GB on every node, happily
> reporting about 5.5GB of AVAIL.
>
> Yet:
>
> ...
> So, my question would be: how could I have prevented that? From the monitoring
> I have (prometheus) - OSDs are healthy, have plenty of space, yet they are
> not.
>
> What command (and prometheus metric) would help me understand the actual
> real bluestore use? Or am I missing something?
>
> Oh, and I "fixed" the cluster by expanding the broken osd.0 with a larger
> 15GB volume. And 2 other OSDs still run on 10GB volumes.
>
> Thanks in advance for any thoughts.
>
> --
> With best regards, Ivan Kurnosov
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
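The 1% emergency reserve mentioned above is easy to compute when carving the LV. A minimal sketch (the VG name vg0 and LV name osd-data are assumptions, and the lvcreate command is only printed here, not executed):

```shell
# Hold back ~1% of the VG as an emergency reserve for the OSD LV.
VG_SIZE_MB=10240                      # e.g. a 10 GiB device
RESERVE_MB=$((VG_SIZE_MB / 100))      # 1% kept free
LV_SIZE_MB=$((VG_SIZE_MB - RESERVE_MB))
echo "lvcreate -L ${LV_SIZE_MB}m -n osd-data vg0"
```

`lvcreate -l 99%VG -n osd-data vg0` achieves the same thing without the arithmetic.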
[ceph-users] Re: NVMe's
I've just finished doing our own benchmarking, and I can say: you are about to build something very unbalanced and CPU-bound.

1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per ceph-osd at top performance (see the recent thread on 'ceph on brd'), with more realistic numbers around 300-400% CPU per device.
2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe a little more with a top-tier low-core high-frequency CPU, but not much). So a super-duper NVMe won't make a difference. (Btw, I have a stupid idea to try running two ceph-osd daemons from the same VG, one per LV, with a single PV underneath, but it is not tested.)
3. You will find that any given client's performance is heavily limited by the sum of all RTTs in the network, plus Ceph's own latencies, so very fast NVMe gives diminishing returns.
4. A CPU-bound ceph-osd completely wipes out any differences between underlying devices (except for desktop-class crawlers).

You can run your own tests, even without fancy 48-NVMe boxes - just run ceph-osd on brd (the block RAM disk driver). ceph-osd won't run any faster on anything else (ramdisk is the fastest), so the numbers you get from brd are a supremum (upper bound) for theoretical performance. Given the max of 400-500% CPU per ceph-osd, I'd say you need to keep the number of NVMe drives per server below 12, or maybe 15 (but sometimes you'll get CPU saturation). In my opinion, less fancy boxes with a smaller number of drives per server (but a larger number of servers) would make your (or your operation team's) life much less stressful. NEVER ever use RAID with ceph.

On 23/09/2020 08:39, Brent Kennedy wrote: We currently run an SSD cluster and HDD clusters and are looking at possibly creating a cluster for NVMe storage. For spinners and SSDs, it seemed the max recommended per OSD host server was 16 OSDs (I know it depends on the CPUs and RAM, like 1 CPU core and 2GB memory). Questions:
1. If we do a JBOD setup, the servers can hold 48 NVMes; if the servers were bought with 48 cores and 100+ GB of RAM, would this make sense?
2. Should we just RAID 5 groups of NVMe drives instead (and buy less CPU/RAM)? There is a reluctance to waste even a single drive on RAID because redundancy is basically Ceph's job.
3. The plan was to build this with Octopus (hopefully there are no issues we should know about; though I just saw one posted today, but this is a few months off).
4. Any feedback on max OSDs?
5. Right now they run 10Gb everywhere with 80Gb uplinks; I was thinking this would need at least 40Gb links to every node (the hope is to use these to speed up image processing at the application layer locally in the DC).

I haven't spoken to the Dell engineers yet, but my concern with NVMe is that the raid controller would end up being the bottleneck (next in line after network connectivity). Regards, -Brent

Existing Clusters:
Test: Nautilus 14.2.11 with 3 osd servers, 1 mon/man, 1 gateway, 2 iscsi gateways ( all virtual on nvme )
US Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
UK Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4 gateways
US Production(SSD): Nautilus 14.2.11 with 6 osd servers, 3 mons, 3 gateways, 2 iscsi gateways
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
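To put the CPU numbers from this thread against the proposed 48-drive box, a back-of-envelope sizing (the 5-cores-per-OSD figure is the peak value quoted above, not a guarantee):

```shell
# Rough OSD-count ceiling from CPU alone: ~400-500% CPU (4-5 cores)
# per ceph-osd at peak, per the measurements in this thread.
CORES=48
CORES_PER_OSD=5
MAX_OSDS=$((CORES / CORES_PER_OSD))
echo "CPU-sane NVMe/OSD count for a ${CORES}-core server: $MAX_OSDS"
```

That lands far below the 48 drives the chassis can physically hold, which is the mismatch the advice above is warning about.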
[ceph-users] Re: NVMe's
On 23/09/2020 10:54, Marc Roos wrote: Depends on your expected load, not? I have already read here numerous times that OSDs cannot keep up with NVMe's, and that is why people put 2 OSDs on a single NVMe. So on a busy node, you probably run out of cores? (But better verify this with someone that has an NVMe cluster ;))

Did you? I have just started to think about this idea too, as some devices can deliver about twice the performance of a single ceph-osd. How did they do it? I have an idea to create a new bucket type under host, and put two LVs from each Ceph OSD VG into that new bucket. The rules stay the same (different host), so redundancy won't be affected, but doubling the number of ceph-osd daemons can squeeze a bit more IOPS from the backend devices, at the expense of doubling the RocksDB size (reducing space for payload) and using more cores. And I really want to hear all the bad things about this setup before trying it. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
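A sketch of the proposed split, with the commands only printed, not executed (the device, VG and LV names are assumptions):

```shell
# Two OSDs on one NVMe device: split the VG into two equal LVs.
# Dry run: the LVM commands are echoed, not executed.
DEV=/dev/nvme0n1
VG=ceph-nvme0
echo "pvcreate $DEV"
echo "vgcreate $VG $DEV"
echo "lvcreate -l 50%VG    -n osd-a $VG"
echo "lvcreate -l 100%FREE -n osd-b $VG"
# Both resulting OSDs sit under the same CRUSH host, so a replicated
# rule that chooses leaves of type 'host' still never places two
# copies of a PG on this one physical device.
```

The extra bucket type under host mainly becomes important for rules that need to be device-aware; for plain per-host replication the split above is already safe.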
[ceph-users] Re: Low level bluestore usage
On 23/09/2020 04:09, Alexander E. Patrakov wrote: Sometimes this doesn't help. For data recovery purposes, the most helpful step if you get the "bluefs enospc" error is to add a separate db device, like this:

systemctl disable --now ceph-osd@${OSDID}
truncate -s 32G /junk/osd.${OSDID}-recover/block.db
sgdisk -n 0:0:0 /junk/osd.${OSDID}-recover/block.db
ceph-bluestore-tool \
    bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-${OSDID} \
    --dev-target /junk/osd.${OSDID}-recover/block.db \
    --bluestore-block-db-size=31G --bluefs-log-compact-min-size=31G

Of course you can use a real block device instead of just a file. After that, export all PGs using ceph-objectstore-tool and re-import them into a fresh OSD, then destroy or purge the full one. Here is why the options:
--bluestore-block-db-size=31G: ceph-bluestore-tool refuses to do anything if this option is not set to some value
--bluefs-log-compact-min-size=31G: makes absolutely sure that log compaction doesn't happen, because it would hit "bluefs enospc" again.

Oh, you went this way... I solved my 'pocket ceph' needs by exporting disk images (from files) via iSCSI and mounting them back to localhost. That gives me perfect 'scsi' devices which work exactly as in production. I have a little playbook (iscsi_loopback) to set this up on random scrap (including VMs) for development purposes. After iSCSI is loopback-mounted, all other code works exactly the same as it would in production.

I've hit this issue a few times on small 10GB OSDs, so I moved to 15GB and it became less frequent. I have never had it in real-hardware tests with real disk sizes (>>100G per OSD). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
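The export/re-import step mentioned above can be sketched like this (the OSD ids and dump path are assumptions; the commands are printed, not executed, and both OSDs must be stopped while the tool runs):

```shell
# Move a PG off the full OSD into a fresh one with
# ceph-objectstore-tool.  Dry run: commands are echoed only.
SRC=0; DST=5; PGID=1.0
echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$SRC --op export --pgid $PGID --file /junk/pg-$PGID.dump"
echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$DST --op import --file /junk/pg-$PGID.dump"
```

`--op list` against the source OSD's data path enumerates the PG ids to loop over.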
[ceph-users] Re: NVMe's
I've just finished doing our own benchmarking, and I can say you are about to build something very unbalanced and CPU-bound. 1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per ceph-osd at top performance (see the recent thread on 'ceph on brd'), with more realistic numbers around 300-400% CPU per device.

In fact, in isolation on the test setup that Intel donated for community Ceph R&D, we've pushed a single OSD to consume around 1400% CPU at 80K write IOPS! :) I agree though, we typically see a peak of about 500-600% CPU per OSD on multi-node clusters with a correspondingly lower write throughput. I do believe that in some cases the mix of IO we are doing causes us to be at least partially bound by disk write latency, with the single writer thread in the rocksdb WAL though.

I'd really like to see how they did this without offloading (their configuration).

2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe a little more with a top-tier low-core high-frequency CPU, but not much). So a super-duper NVMe won't make a difference. (Btw, I have a stupid idea to try running two ceph-osd daemons from the same VG with a single PV underneath, but it is not tested.)

I'm curious if you've tried Octopus+ yet? We refactored bluestore's caches, which internally has proven to help quite a bit with latency-bound workloads, as it reduces lock contention in onode cache shards and the impact of cache trimming (no more single trimming thread constantly grabbing the lock for long periods of time!). In a 64-NVMe-drive setup (P4510s), we were able to do a little north of 400K write IOPS with 3x replication, so about 19K IOPS per OSD once you factor rep in. Also, in Nautilus you can see real benefits with running multiple OSDs on a single device, but with Octopus and master we've pretty much closed the gap on our test setup.

It's Octopus. I was doing a single-osd benchmark, removing all movable parts (brd instead of nvme, no network, size=1, etc).
Moreover, I focused on the rados benchmark, as RBD performance is just a derivative of rados performance. Anyway, big thank you for the input. https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing

Generally speaking, using the latency-performance or latency-network tuned profiles helps (mostly by avoiding C-state CPU transitions), as do higher clock speeds. Not using replication helps too, but that's obviously not a realistic solution for most people. :)

I used size=1 and 'no ssd, no network' as an upper bound. It allows finding the limits of ceph-osd performance. Any real-life things (replication, network, real block devices) will make things worse, not better. Knowing the upper performance bound is really nice when you start to choose a server configuration.

3. You will find that any given client's performance is heavily limited by the sum of all RTTs in the network, plus Ceph's own latencies, so very fast NVMe gives diminishing returns. 4. A CPU-bound ceph-osd completely wipes out any differences between underlying devices (except for desktop-class crawlers). You can run your own tests, even without fancy 48-NVMe boxes - just run ceph-osd on brd (the block RAM disk driver). ceph-osd won't run any faster on anything else (ramdisk is the fastest), so the numbers you get from brd are a supremum (upper bound) for theoretical performance. Given the max of 400-500% CPU per ceph-osd, I'd say you need to keep the number of NVMe drives per server below 12, or maybe 15 (but sometimes you'll get CPU saturation). In my opinion, less fancy boxes with a smaller number of drives per server (but a larger number of servers) would make your (or your operation team's) life much less stressful.

That's pretty much the advice I've been giving people since the Inktank days. It costs more and is lower density, but the design is simpler, you are less likely to under-provision CPU, less likely to run into memory bandwidth bottlenecks, and you have less recovery to do when a node fails.
Especially now with how many NVMe drives you can fit in a single 1U server! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
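The upper-bound methodology described in this thread (brd instead of NVMe, size=1, no network) can be sketched as follows (the pool name, PG count and sizes are assumptions; the commands are printed, not executed; note that Octopus requires an explicit confirmation flag to set size=1):

```shell
# Upper-bound single-OSD test: back the OSD with a RAM block device
# and use an unreplicated pool.  Dry run: commands are echoed only.
RD_SIZE_KB=$((16 * 1024 * 1024))      # 16 GiB ramdisk; brd's rd_size is in KiB
echo "modprobe brd rd_nr=1 rd_size=$RD_SIZE_KB"
echo "ceph osd pool create bench 128"
echo "ceph osd pool set bench size 1 --yes-i-really-mean-it"
echo "rados bench -p bench 60 write -t 64"
```

Whatever numbers this produces are a ceiling: real devices, replication, and the network can only subtract from them.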