[ceph-users] Namespace usability for mutitenancy

2020-12-17 Thread George Shuklin
Hello.

Has anyone started using namespaces in real production for
multi-tenancy?

How good are they at isolating tenants from each other? Can tenants see each
other's presence, quotas, etc.?

Is it safe to give cephx access to the same pool to (possibly mutually
hostile) users, restricted to one user per namespace?

How badly can one user affect others? Quotas restrict space overuse, but
what about IO and omap overuse?
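
For reference, the kind of per-tenant restriction I have in mind (a sketch;
the pool, namespace and client names are placeholders):

ceph auth get-or-create client.tenant-a \
    mon 'allow r' \
    osd 'allow rw pool=shared namespace=tenant-a'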
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Decoding pgmap

2021-01-14 Thread George Shuklin

There is a command `ceph pg getmap`.

It produces a binary file. Is there any utility to decode it?
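
One thing that might work, assuming ceph-dencoder has the PGMap type
registered (check with `ceph-dencoder list_types`), is:

ceph pg getmap -o /tmp/pgmap
ceph-dencoder type PGMap import /tmp/pgmap decode dump_json

I haven't verified this, so treat it as a guess.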
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to make HEALTH_ERR quickly and pain-free

2021-01-21 Thread George Shuklin
I have a hell of a question: how do I put a cluster into HEALTH_ERR status 
without consequences?


I'm working on CI tests and I need to check whether our reaction to 
HEALTH_ERR is good. For this I need to take an empty cluster with an 
empty pool and do something to it. Preferably something quick and reversible.


For HEALTH_WARN the best thing I found is to change the pool size to 1: it 
raises the "1 pool(s) have no replicas configured" warning almost instantly, 
and it can be reverted very quickly for an empty pool.


But HEALTH_ERR is a bit more tricky. Any ideas?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to make HEALTH_ERR quickly and pain-free

2021-01-21 Thread George Shuklin

On 21/01/2021 13:02, Eugen Block wrote:

But HEALTH_ERR is a bit more tricky. Any ideas?


I think if you set a very low quota for a pool (e.g. 1000 bytes or so) 
and fill it up it should create a HEALTH_ERR status, IIRC. 

Cool idea. Unfortunately, even with a 1-byte quota (and some data in the 
pool), it's only HEALTH_WARN: "1 pool(s) full".

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to make HEALTH_ERR quickly and pain-free

2021-01-21 Thread George Shuklin

On 21/01/2021 12:57, George Shuklin wrote:
I have a hell of a question: how do I put a cluster into HEALTH_ERR 
status without consequences?


I'm working on CI tests and I need to check whether our reaction to 
HEALTH_ERR is good. For this I need to take an empty cluster with an 
empty pool and do something to it. Preferably something quick and reversible.


For HEALTH_WARN the best thing I found is to change the pool size to 1: it 
raises the "1 pool(s) have no replicas configured" warning almost 
instantly, and it can be reverted very quickly for an empty pool.


But HEALTH_ERR is a bit more tricky. Any ideas?


I found a way:

ceph osd set-full-ratio 0.0

It instantly causes

    health: HEALTH_ERR
    full ratio(s) out of order

even on an empty cluster. Problem solved.
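
To revert, set the ratio back to whatever it was before (the default is 0.95;
`ceph osd dump | grep full_ratio` shows the current values):

ceph osd set-full-ratio 0.0     # -> HEALTH_ERR: full ratio(s) out of order
ceph osd set-full-ratio 0.95    # back to the default, HEALTH_OK returns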
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Permissions for OSD

2021-01-25 Thread George Shuklin

The docs for permissions are super vague. What does each flag do?

What does 'x' permit?

What's the difference between class-write and write?

And the last question: can we limit a user to reading/writing only 
existing objects in the pool?
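
For context, the kind of caps I'm comparing (a sketch; pool and client names 
are placeholders):

ceph auth get-or-create client.rw      mon 'allow r' osd 'allow rw pool=testpool'
ceph auth get-or-create client.classrw mon 'allow r' osd 'allow class-read class-write pool=testpool'
ceph auth get-or-create client.rwx     mon 'allow r' osd 'allow rwx pool=testpool'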


Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] List pg with heavily degraded objects

2021-09-10 Thread George Shuklin

Hello.

I wonder if there is a way to see how many replicas are available for 
each object (or, at least, PG-level statistics). Basically, if I have a 
damaged cluster, I want to see the scale of the damage, and I want to see 
the most degraded objects (those with 1 copy, then objects with 2 copies, 
etc.).


Is there a way? pg list is not very informative, as it does not show 
how badly 'unreplicated' the data is.
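
Roughly the kind of per-PG report I mean (a sketch with jq; the JSON field 
layout may differ between releases, so the paths here are an assumption):

ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_map.pg_stats[] | [.pgid, .stat_sum.num_objects, .stat_sum.num_objects_degraded] | @tsv' \
  | sort -k3 -rn | head

But degraded/misplaced counters still don't tell me how many copies of each 
object actually exist.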




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin

On 10/09/2021 15:19, Janne Johansson wrote:

Den fre 10 sep. 2021 kl 13:55 skrev George Shuklin :

Hello.
I wonder if there is a way to see how many replicas are available for
each object (or, at least, PG-level statistics). Basically, if I have a
damaged cluster, I want to see the scale of the damage, and I want to see
the most degraded objects (those with 1 copy, then objects with 2 copies,
etc.).
Is there a way? pg list is not very informative, as it does not show
how badly 'unreplicated' the data is.

ceph pg dump should list all PGs and how many active OSDs they have in
a list like this:
[12,34,78,56], [12,34,2134872348723,56]

meaning which four OSDs (in my example) should hold a replica of this PG,
and the second list shows who actually holds one, with 2^31-1 as a
placeholder for UNKNOWN-OSD-NUMBER where an OSD is missing.



It's not about being undersized.

Imagine a small cluster with three OSDs. Two OSDs die, then two 
more empty ones are added to the cluster.


Normally you'll see that each PG has found a peer and there are no 
undersized PGs. But the data actually hasn't been replicated yet; the 
replication is still in progress.


Is there any way to see whether there are PGs that are 'holding a single 
copy of the data, but replicating now'? I'm curious about this transition time 
between 'found a peer and doing recovery' and 'got at least two copies 
of the data'.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin

On 10/09/2021 14:49, George Shuklin wrote:

Hello.

I wonder if there is a way to see how many replicas are available for 
each object (or, at least, PG-level statistics). Basically, if I have a 
damaged cluster, I want to see the scale of the damage, and I want to see 
the most degraded objects (those with 1 copy, then objects with 2 
copies, etc.).


Is there a way? pg list is not very informative, as it does not show 
how badly 'unreplicated' the data is. 



Actually, the problem is more complicated than I expected. Here is an 
artificial cluster where a sizable chunk of data is single-copy 
(a cluster of three servers with 2 OSDs each: put some data, shut down server 
#1, put some more data, kill server #3, start server #1; it's guaranteed 
that server #2 holds a single copy). This is a snapshot of ceph pg 
dump for it as soon as server #1 came back up, and I can't find any proof 
that some data is in a single copy: 
https://gist.github.com/amarao/fbc8ef3538f66a9f2c264f8555f5c29a



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin

On 10/09/2021 15:37, Janne Johansson wrote:

Den fre 10 sep. 2021 kl 14:27 skrev George Shuklin :

On 10/09/2021 15:19, Janne Johansson wrote:

Is there a way? pg list is not very informative, as it does not show
how badly 'unreplicated' the data is.

ceph pg dump should list all PGs and how many active OSDs they have in
a list like this:
[12,34,78,56], [12,34,2134872348723,56]


It's not about being undersized.
Imagine a small cluster with three OSDs. Two OSDs die, then two
more empty ones are added to the cluster.
Normally you'll see that each PG has found a peer and there are no
undersized PGs. But the data actually hasn't been replicated yet; the
replication is still in progress.

My view is that they actually would be "undersized" until backfill is
done to the PGs on the new empty disks you just added.


I've just created a counter-example for that.

Each server has 2 OSDs, with the default replicated rule.

There are 4 servers, and the pool size is 3.

* Shut down srv1, wait for recovery; shut down srv2, wait for recovery.

* Put a large amount of data in (enough to see replication traffic); all 
data lands on srv3+srv4, with degradation.


* Shut down srv3, start srv1 and srv2. srv4 is now a single server with all 
the data available.


I can see no 'undersized' PGs, but the data IS in a single copy: 
https://gist.github.com/amarao/fbc8ef3538f66a9f2c264f8555f5c29a



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin

On 10/09/2021 15:54, Janne Johansson wrote:

Den fre 10 sep. 2021 kl 14:39 skrev George Shuklin :

On 10/09/2021 14:49, George Shuklin wrote:

Hello.

I wonder if there is a way to see how many replicas are available for
each object (or, at least, PG-level statistics). Basically, if I have a
damaged cluster, I want to see the scale of the damage, and I want to see
the most degraded objects (those with 1 copy, then objects with 2
copies, etc.).

Is there a way? pg list is not very informative, as it does not show
how badly 'unreplicated' the data is.


Actually, the problem is more complicated than I expected. Here is an
artificial cluster where a sizable chunk of data is single-copy
(a cluster of three servers with 2 OSDs each: put some data, shut down server
#1, put some more data, kill server #3, start server #1; it's guaranteed
that server #2 holds a single copy). This is a snapshot of ceph pg
dump for it as soon as server #1 came back up, and I can't find any proof
that some data is in a single copy:
https://gist.github.com/amarao/fbc8ef3538f66a9f2c264f8555f5c29a


In this case, where you have both made PGs undersized, and also degraded
them by letting one OSD pick up some changes, then removing it and bringing
another one back in (I didn't see where #2 stopped in your example), I guess
you will have to take a deep dive into
ceph pg <pgid> query to see ALL the info about it.

By the time you are stacking multiple error scenarios on top of each other,
I don't think there is a simple "show me a short, understandable list of what
is almost-but-not-quite working".


No, I'm worried about the observability of the situation when data is in a 
single copy (which I consider a bit of an emergency). I've just created a 
scenario where only a single server (2 OSDs) has data on it, and right after 
replication started, I couldn't detect that it's THAT bad. I've updated the 
gist: https://gist.github.com/amarao/fbc8ef3538f66a9f2c264f8555f5c29a 
with a snapshot taken after the cluster, with only a single copy of the data 
available, found enough space to make all PGs 'well-sized'. Replication is 
underway, but the data is single-copy at the moment of the snapshot.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-osd performance on ram disk

2020-09-10 Thread George Shuklin

I'm creating a benchmark suite for Ceph.

While benchmarking the benchmark itself, I've checked how fast ceph-osd works. 
I decided to skip all the 'SSD mess' and use brd (block RAM disk, modprobe 
brd) as the underlying storage. Brd itself can yield up to 2.7M IOPS in fio. 
In single-threaded mode (iodepth=1) it can yield up to 750k IOPS. LVM over 
brd gives about 600k IOPS in single-threaded mode with iodepth=1 (16us 
latency).


But as soon as I put ceph-osd (bluestore) on it, I see something very 
odd. No matter how much parallel load I push onto this OSD, it never 
gives more than 30 kIOPS, and I can't understand where the bottleneck is.


CPU utilization: ~300%. There are 8 cores in my setup, so CPU is not the 
bottleneck.


Network: I've moved the benchmark onto the same host as the OSD, so it's 
localhost. Even counting the network, it's still far from saturation: 
30 kIOPS (4k) is about 1 Gb/s, but I have 10G links. Anyway, the tests run 
on localhost, so the network is irrelevant (I've checked: the traffic stays 
on localhost). The test itself consumes about 70% of one core, so there is 
plenty left.


Replication: I've killed it (size=1, a single OSD in the pool).

single-threaded latency: 200us, 4.8 kIOPS
iodepth=32: 2ms (15 kIOPS)
iodepth=16, numjobs=8: 5ms (24 kIOPS)

I'm running fio with the 'rados' ioengine, and it looks like adding more 
workers doesn't change much, so it's not the rados ioengine.
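
For reference, the shape of the jobfile (a sketch; the pool name and sizes 
are placeholders, option names as in fio's rados engine example):

[global]
ioengine=rados
clientname=admin
pool=bench
bs=4k
size=256m
rw=randwrite
time_based=1
runtime=60
group_reporting=1

[qd16x8]
iodepth=16
numjobs=8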


As there is plenty of CPU and IO left, there is only one possible place for 
the bottleneck: some time-consuming single-threaded code in ceph-osd.


Are there any knobs to tweak to get higher performance from ceph-osd? I'm 
pretty sure it's not any kind of levelling, GC or other 'iops-related' 
issue (brd's performance is two orders of magnitude higher).



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd performance on ram disk

2020-09-10 Thread George Shuklin
Thank you!

I know that article, but they promise 6 cores of usage per OSD, and I got barely
over three, and all this in a totally synthetic environment with no SSD to
blame (brd is more than fast and has very consistent latency under any
kind of load).

On Thu, Sep 10, 2020, 19:39 Marc Roos  wrote:

>
>
>
> Hi George,
>
> Very interesting, and also a somewhat expected result. Some messages posted
> here already indicate that getting expensive top-of-the-line
> hardware does not really result in any performance increase above some
> level. Vitaliy has documented something similar [1]
>
> [1]
> https://yourcmc.ru/wiki/Ceph_performance
>
>
>
> -Original Message-
> To: ceph-users@ceph.io
> Subject: [ceph-users] ceph-osd performance on ram disk
>
> I'm creating a benchmark suite for Ceph.
>
> While benchmarking the benchmark itself, I've checked how fast ceph-osd works.
> I decided to skip all the 'SSD mess' and use brd (block RAM disk, modprobe
> brd) as the underlying storage. Brd itself can yield up to 2.7M IOPS in fio.
> In single-threaded mode (iodepth=1) it can yield up to 750k IOPS. LVM over
> brd gives about 600k IOPS in single-threaded mode with iodepth=1 (16us
> latency).
>
> But as soon as I put ceph-osd (bluestore) on it, I see something very
> odd. No matter how much parallel load I push onto this OSD, it never
> gives more than 30 kIOPS, and I can't understand where the bottleneck is.
>
> CPU utilization: ~300%. There are 8 cores in my setup, so CPU is not the
> bottleneck.
>
> Network: I've moved the benchmark onto the same host as the OSD, so it's
> localhost. Even counting the network, it's still far from saturation:
> 30 kIOPS (4k) is about 1 Gb/s, but I have 10G links. Anyway, the tests run
> on localhost, so the network is irrelevant (I've checked: the traffic stays
> on localhost). The test itself consumes about 70% of one core, so there is
> plenty left.
>
> Replication: I've killed it (size=1, a single OSD in the pool).
>
> single-threaded latency: 200us, 4.8 kIOPS
> iodepth=32: 2ms (15 kIOPS)
> iodepth=16, numjobs=8: 5ms (24 kIOPS)
>
> I'm running fio with the 'rados' ioengine, and it looks like adding more
> workers doesn't change much, so it's not the rados ioengine.
>
> As there is plenty of CPU and IO left, there is only one possible place for
> the bottleneck: some time-consuming single-threaded code in ceph-osd.
>
> Are there any knobs to tweak to get higher performance from ceph-osd? I'm
> pretty sure it's not any kind of levelling, GC or other 'iops-related'
> issue (brd's performance is two orders of magnitude higher).
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd performance on ram disk

2020-09-10 Thread George Shuklin
I know. I tested fio before testing Ceph with fio. With the null ioengine fio
can handle up to 14M IOPS (on my dusty lab's R220). On blk_null it goes down
to 2.4-2.8M IOPS.

On brd it drops to a sad 700k IOPS.

BTW, never run synthetic high-performance benchmarks on KVM. My old server
with 'make-linux-fast-again' fixes makes one IO request in 3.4us, and on a KVM
VM it becomes 24us. Some guy said he got about 8.5us on VMware. This is all on
a purely software stack without any hypervisor IO.

24us sounds like a small number, but if your synthetic test makes 200k IOPS,
that's 4us per request. You can't make 200k on a VM with a 24us syscall time.

On Thu, Sep 10, 2020, 22:49 Виталий Филиппов  wrote:

> By the way, DON'T USE rados bench. It's an incorrect benchmark. ONLY use
> fio
>
> On 10 September 2020 at 22:35:53 GMT+03:00, vita...@yourcmc.ru wrote:
>>
>> Hi George
>>
>> Author of Ceph_performance here! :)
>>
>> I suspect you're running tests with 1 PG. Every PG's requests are always 
>> serialized, that's why OSD doesn't utilize all threads with 1 PG. You need 
>> something like 8 PGs per OSD. More than 8 usually doesn't improve results.
>>
>> Also note that read tests are meaningless after full overwrite on small OSDs 
>> because everything fits in cache. Restart the OSD to clear it. You can drop 
>> the cache via the admin socket too, but restarting is the simplest way.
>>
>> I've repeated your test with brd. My results with 8 PGs after filling the 
>> RBD image, turning CPU powersave off and restarting the OSD are:
>>
>> # fio -name=test -ioengine=rbd -bs=4k -iodepth=1 -rw=randread -pool=ramdisk 
>> -rbdname=testimg
>>   read: IOPS=3586, BW=14.0MiB/s (14.7MB/s)(411MiB/29315msec)
>>  lat (usec): min=182, max=5710, avg=277.41, stdev=90.16
>>
>> # fio -name=test -ioengine=rbd -bs=4k -iodepth=1 -rw=randwrite -pool=ramdisk 
>> -rbdname=testimg
>>   write: IOPS=1247, BW=4991KiB/s (5111kB/s)(67.0MiB/13746msec); 0 zone resets
>>  lat (usec): min=555, max=4015, avg=799.45, stdev=142.92
>>
>> # fio -name=test -ioengine=rbd -bs=4k -iodepth=128 -rw=randwrite 
>> -pool=ramdisk -rbdname=testimg
>>   write: IOPS=4138, BW=16.2MiB/s (16.9MB/s)(282MiB/17451msec); 0 zone resets
>>   658% CPU
>>
>> # fio -name=test -ioengine=rbd -bs=4k -iodepth=128 -rw=randread 
>> -pool=ramdisk -rbdname=testimg
>>   read: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(979MiB/15933msec)
>>   540% CPU
>>
>> Basically the same shit as on an NVMe. So even an "in-memory Ceph" is slow, 
>> haha.
>>
>> Thank you!
>>>
>>> I know that article, but they promise 6 cores of usage per OSD, and I got barely
>>> over three, and all this in a totally synthetic environment with no SSD to
>>> blame (brd is more than fast and has very consistent latency under any
>>> kind of load).
>>>
>> --
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
> --
> With best regards,
> Vitaliy Filippov
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd performance on ram disk

2020-09-10 Thread George Shuklin
Latency from the client side is not an issue. It just combines with the other
latencies in the stack. The more the client lags, the easier it is for the
cluster.

The thing I'm talking about here is slightly different. When you want to
establish a baseline performance for the OSD daemon (disregarding block device
and network latencies), a sudden order-of-magnitude delay on syscalls causes a
disproportionate skew in the results.

It does not relate to production in any way, only to ceph-osd
benchmarks.

On Thu, Sep 10, 2020, 23:21  wrote:

> Yeah, of course... but RBD is primarily used for KVM VMs, so the results
> from a VM are the thing that real clients see. So they do mean something...
> :)
>
> I know. I tested fio before testing Ceph
> with fio. With the null ioengine fio can handle up to 14M IOPS (on my dusty
> lab's R220). On blk_null it goes down to 2.4-2.8M IOPS.
> On brd it drops to a sad 700k IOPS.
> BTW, never run synthetic high-performance benchmarks on KVM. My old server
> with 'make-linux-fast-again' fixes makes one IO request in 3.4us, and on a
> KVM VM it becomes 24us. Some guy said he got about 8.5us on VMware. This is
> all on a purely software stack without any hypervisor IO.
> 24us sounds like a small number, but if your synthetic test makes 200k IOPS,
> that's 4us per request. You can't make 200k on a VM with a 24us syscall time.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd performance on ram disk

2020-09-10 Thread George Shuklin

On 10/09/2020 22:35, vita...@yourcmc.ru wrote:

Hi George

Author of Ceph_performance here! :)

I suspect you're running tests with 1 PG. Every PG's requests are always 
serialized, that's why OSD doesn't utilize all threads with 1 PG. You need 
something like 8 PGs per OSD. More than 8 usually doesn't improve results.

Also note that read tests are meaningless after full overwrite on small OSDs 
because everything fits in cache. Restart the OSD to clear it. You can drop the 
cache via the admin socket too, but restarting is the simplest way.

I've repeated your test with brd. My results with 8 PGs after filling the RBD 
image, turning CPU powersave off and restarting the OSD are:

# fio -name=test -ioengine=rbd -bs=4k -iodepth=1 -rw=randread -pool=ramdisk 
-rbdname=testimg
   read: IOPS=3586, BW=14.0MiB/s (14.7MB/s)(411MiB/29315msec)
  lat (usec): min=182, max=5710, avg=277.41, stdev=90.16

# fio -name=test -ioengine=rbd -bs=4k -iodepth=1 -rw=randwrite -pool=ramdisk 
-rbdname=testimg
   write: IOPS=1247, BW=4991KiB/s (5111kB/s)(67.0MiB/13746msec); 0 zone resets
  lat (usec): min=555, max=4015, avg=799.45, stdev=142.92

# fio -name=test -ioengine=rbd -bs=4k -iodepth=128 -rw=randwrite -pool=ramdisk 
-rbdname=testimg
   write: IOPS=4138, BW=16.2MiB/s (16.9MB/s)(282MiB/17451msec); 0 zone resets
   658% CPU

# fio -name=test -ioengine=rbd -bs=4k -iodepth=128 -rw=randread -pool=ramdisk 
-rbdname=testimg
   read: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(979MiB/15933msec)
   540% CPU

Basically the same shit as on an NVMe. So even an "in-memory Ceph" is slow, 
haha.


Hello!

Thank you for the feedback! The PG idea is really good. Unfortunately, the 
autoscaler had already made it 32, and I get 30 kIOPS from a 32-PG, size=1 
pool on a ramdisk. :-/
I've checked the read speed (I hadn't done this before, I have no idea why), 
and I got an amazing 160 kIOPS, but I suspect it's caching.
Anyway, thank you for the data; I assume ~600% CPU in exchange for 
~16-17 kIOPS per OSD.
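
For the record, a sketch of pinning the PG count by hand (assuming a pool 
named 'ramdisk' as in your example):

ceph osd pool set ramdisk pg_autoscale_mode off
ceph osd pool set ramdisk pg_num 8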

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd performance on ram disk

2020-09-11 Thread George Shuklin

On 10/09/2020 19:37, Mark Nelson wrote:

On 9/10/20 11:03 AM, George Shuklin wrote:


...
Are there any knobs to tweak to get higher performance from ceph-osd? 
I'm pretty sure it's not any kind of levelling, GC or other 
'iops-related' issue (brd's performance is two orders of magnitude 
higher).



So as you've seen, Ceph does a lot more than just write a chunk of 
data out to a block on disk.  There's tons of encoding/decoding 
happening, crc checksums, crush calculations, onode lookups, 
write-ahead-logging, and other work involved that all adds latency.  
You can overcome some of that through parallelism, but 30K IOPs per 
OSD is probably pretty on-point for a nautilus era OSD.  For octopus+ 
the cache refactor in bluestore should get you farther (40-50k+ for 
an OSD in isolation).  The maximum performance we've seen in-house is 
around 70-80K IOPs on a single OSD using very fast NVMe and highly 
tuned settings.



A couple of things you can try:


- upgrade to octopus+ for the cache refactor

- Make sure you are using the equivalent of the latency-performance or 
latency-network tuned profile.  The most important part is disabling 
CPU cstate transitions.


- increase osd_memory_target if you have a larger dataset (onode cache 
misses in bluestore add a lot of latency)


- enable turbo if it's disabled (higher clock speed generally helps)


On the write path you are correct that there is a limitation regarding 
a single kv sync thread.  Over the years we've made this less of a 
bottleneck but it's possible you still could be hitting it.  In our 
test lab we've managed to utilize up to around 12-14 cores on a single 
OSD in isolation with 16 tp_osd_tp worker threads and on a larger 
cluster about 6-7 cores per OSD.  There's probably multiple factors at 
play, including context switching, cache thrashing, memory throughput, 
object creation/destruction, etc.  If you decide to look into it 
further you may want to try wallclock profiling the OSD under load and 
seeing where it is spending its time. 


Thank you for the feedback.

I forgot to mention this: it's Octopus, a fresh installation.

I've disabled C-states (governor=performance); it makes no difference: 
same IOPS, same CPU usage by ceph-osd. I just can't force Ceph to 
consume more than 330% of CPU. I can push reads up to 150k IOPS (both over 
the network and locally), hitting the CPU limit, but writes are somehow 
restricted by Ceph itself.
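
For completeness, what I mean by disabling C-states (a sketch; assumes tuned 
or cpupower is installed):

tuned-adm profile latency-performance
# or, by hand:
cpupower frequency-set -g performance
cpupower idle-set -D 0    # disable idle states with latency above 0us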


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible to assign osd id numbers?

2020-09-11 Thread George Shuklin

On 11/09/2020 16:11, Shain Miley wrote:

Hello,
I have been wondering for quite some time whether or not it is possible to 
influence the osd.id numbers that are  assigned during an install.

I have made an attempt to keep our osds in order over the last few years, but 
it is a losing battle without having some control over the osd assignment.

I am currently using ceph-deploy to handle adding nodes to the cluster.

You can reuse OSD numbers, but I strongly advise you not to focus on 
precise IDs. The reason is that you can hit a combination of server 
faults that will swap IDs no matter what.


It's a false sense of beauty to have the OSD ID match an ID in the name of 
the server.


How do you reuse OSD numbers?

An OSD number is used (and should be cleaned up if the OSD dies) in three 
places in Ceph:


1) Crush map: ceph osd crush rm osd.x

2) osd list: ceph osd rm osd.x

3) auth: ceph auth rm osd.x

The last one is often forgotten and is a common reason for ceph-ansible to 
fail on a new disk in the server.
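
On recent releases (Luminous and later) there is also a single command that 
combines all three steps:

ceph osd purge osd.x --yes-i-really-mean-it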

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible to assign osd id numbers?

2020-09-14 Thread George Shuklin

On 11/09/2020 22:43, Shain Miley wrote:

Thank you for your answer below.

I'm not looking to reuse them as much as I am trying to control what unused 
number is actually used.

For example if I have 20 osds and 2 have failed...when I replace a disk in one 
server I don't want it to automatically use the next lowest number for the osd 
assignment.

I understand what you mean about not focusing on the osd ids...but my ocd is 
making me ask the question.

Well, technically, you can create fake OSDs to hold numbers and release 
those 'fake OSDs' when you need to use their numbers, but that really 
complicates everything. I suggest you stop worrying about the numbers. If 
you are OK with every OSD on every server using /dev/sdb (OCD would require 
that server1 uses /dev/sda, server2 uses /dev/sdb, server3 /dev/sdc, 
etc.), then you should be fine with random OSD numbers. Moreover, you 
should be fine with a discrepancy between the sorting order of OSD UUIDs 
and their numbers, or a misalignment of IP address and OSD number 
(192.168.0.4 for osd.1).


While it may be fun to play with numbers in a lab, if you are using 
Ceph in production you should avoid unnecessary changes, as they 
will surprise other people (and you!) trying to keep this thing running.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd performance on ram disk

2020-09-14 Thread George Shuklin

On 11/09/2020 17:44, Mark Nelson wrote:


On 9/11/20 4:15 AM, George Shuklin wrote:

On 10/09/2020 19:37, Mark Nelson wrote:

On 9/10/20 11:03 AM, George Shuklin wrote:


...
Are there any knobs to tweak to get higher performance from 
ceph-osd? I'm pretty sure it's not any kind of levelling, GC or 
other 'iops-related' issue (brd's performance is two orders of 
magnitude higher).




...

I've disabled C-states (governor=performance); it makes no difference: 
same IOPS, same CPU usage by ceph-osd. I just can't force Ceph to 
consume more than 330% of CPU. I can push reads up to 150k IOPS (both 
over the network and locally), hitting the CPU limit, but writes are 
somehow restricted by Ceph itself.



Ok, can I assume block/db/wal are all on the ramdisk?  I'd start a 
benchmark and attach gdbpmp to the OSD and see if you can get a 
callgraph (1000 samples is nice if you don't mind waiting a bit). That 
will tell us a lot more about where the code is spending time.  It 
will slow the benchmark way down fwiw.  Some other things you could 
try:  Try to tweak the number of osd worker threads to better match 
the number of cores in your system.  Too many and you end up with 
context switching.  Too few and you limit parallelism.  You can also 
check rocksdb compaction stats in the osd logs using this tool:



https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py


Given that you are on ramdisk the 1GB default WAL limit should be 
plenty to let you avoid WAL throttling during compaction, but just 
verifying that compactions are not taking a long time is good peace of 
mind. 



Thank you very much for the feedback. In my case all the OSD data was on the 
brd device. (To test it, just create a ramdisk: modprobe brd rd_size=20G, 
create a PV and VG for Ceph, and let ceph-ansible consume them as OSD 
devices.)


The stuff you've given me here is really cool, but a bit beyond my skills 
right now. I've written it into my task list, and I'll continue to research 
this topic further.


Thank you for directions to look into.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS

2020-09-17 Thread George Shuklin

On 16/09/2020 07:26, Danni Setiawan wrote:

Hi all,

I'm trying to find the performance penalty for HDD OSDs when using the 
WAL/DB on a faster device (SSD/NVMe) vs. the WAL/DB on the same device 
(HDD), for different workloads (RBD, RGW with the index bucket in an SSD 
pool, and CephFS with metadata in an SSD pool). I want to know whether 
giving up a disk slot for a WAL/DB device is worth it vs. adding more OSDs.


Unfortunately I cannot find benchmarks for these kinds of workloads. Has 
anyone ever done such a benchmark?


For everything except CephFS, fio looks like the best tool for 
benchmarking. It can benchmark Ceph at all levels: rados, RBD, HTTP/S3. 
Moreover, it has excellent configuration options, detailed metrics, and 
it can run a multi-server workload (one fio client driving many fio 
servers to do the benchmarking). fio's own performance is about 15M 
IOPS (null engine, per fio server), and it scales horizontally.
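
A sketch of an S3 jobfile against RGW (host, bucket and keys are placeholders; 
option names follow fio's http-engine S3 example, so double-check them against 
your fio version):

[global]
ioengine=http
https=off
http_mode=s3
http_host=rgw.example.local
http_s3_key=${S3_SECRET}
http_s3_keyid=${S3_ACCESS}
filename=/fio-bucket-1/object1
bs=1m
size=64m

[s3-write]
rw=write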

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] disk scheduler for SSD

2020-09-18 Thread George Shuklin

I'm starting to wonder (again) which scheduler is better for Ceph on SSD.

My reasoning.

None:

1. Reduces latency for requests. The lower the latency, the higher the 
perceived performance for an unbounded workload with a fixed queue depth 
(hello, benchmarks).
2. Can cause latency spikes for requests because of the 
'unfair' request ordering (hello, deep scrub).


mq-deadline:

1. Reduces nr_requests (the queue size) to 256 ('none' shows me 
916???). This may introduce latency.
2. May reduce latency spikes due to different rates for different types 
of workloads.


I'm doing some benchmarks, and they, of course, give higher marks 
to the 'none' scheduler. Nevertheless, I believe most normal workloads on 
Ceph do not drive it at an unbounded rate, so a bounded workload (e.g. an 
app making IO based on external, independent events) can be hurt by the 
lack of a disk scheduler in the presence of an unbounded workload.
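
For reference, a way to flip the scheduler per device between runs (a sketch; 
sdX is a placeholder):

cat /sys/block/sdX/queue/scheduler            # lists schedulers, current one in brackets
echo mq-deadline > /sys/block/sdX/queue/scheduler
# persist via udev, e.g. /etc/udev/rules.d/60-scheduler.rules:
# ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"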


Any ideas?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS

2020-09-18 Thread George Shuklin

On 17/09/2020 17:37, Mark Nelson wrote:
Does fio handle S3 objects spread across many buckets well? I think 
bucket listing performance was maybe missing too, but It's been a 
while since I looked at fio's S3 support.  Maybe they have those use 
cases covered now.  I wrote a go based benchmark called hsbench based 
on the wasabi-tech benchmark a while back that tries to cover some of 
those cases, but I haven't touched it in a while:



https://github.com/markhpc/hsbench


The way to spread load across many buckets is to use a 'farm' of fio servers 
under one client's management. You just give each server a different bucket 
to torture in its jobfile. The iodepth=1 restriction of the http ioengine 
actually encourages this.
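
How the client/server 'farm' is wired, roughly (hostnames and jobfiles are 
placeholders):

fio --server                                                   # once on every load-generator host
fio --client=gen1 s3-bucket1.fio --client=gen2 s3-bucket2.fio  # from the controlling box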





FWIW fio can be used for cephfs as well and it works reasonably well 
if you give it a long enough run time and only expect hero run 
scenarios from it.  For metadata intensive workloads you'll need to 
use mdtest or smallfile.  At this point I mostly just use the io500 
suite that includes both ior for hero runs and mdtest for metadata 
(but you need mpi to coordinate it across multiple nodes).
Yep, I was talking about metadata-intensive workloads. Romping around within 
a file or two is not a true fs-specific benchmark.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Low level bluestore usage

2020-09-22 Thread George Shuklin
As far as I know, bluestore doesn't like super-small sizes. Normally the OSD
should stop doing funny things at the full mark, but if the device is too
small it may be too late and bluefs runs out of space.

Two things:
1. Don't use too-small OSDs.
2. Keep a spare area on the drive. I usually reserve 1% for emergency
extension (and to give the SSD firmware a bit of space to breathe); see the
sketch below.
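
A minimal sketch of what I mean by the spare area when preparing an OSD by
hand (device and VG names are placeholders; with ceph-volume, the LV is what
you pass as --data):

pvcreate /dev/nvme0n1
vgcreate ceph-block-0 /dev/nvme0n1
lvcreate -l 99%VG -n osd-block-0 ceph-block-0    # leave ~1% of the VG unallocated
ceph-volume lvm create --data ceph-block-0/osd-block-0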


On Wed, Sep 23, 2020, 01:03 Ivan Kurnosov  wrote:

> Hi,
>
> this morning I woke up to a degraded test ceph cluster (managed by rook,
> but it does not really change anything for the question I'm about to ask).
>
> After checking the logs I found that bluestore on one of the OSDs ran out
> of space.
>
> Some cluster details:
>
> ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus
> (stable)
> it runs on 3 little OSDs 10Gb each
>
> `ceph osd df` returned RAW USE of about 4.5GB on every node, happily
> reporting about 5.5GB of AVAIL.
>
> Yet:
>
> ...
> So, my question would be: how could I have prevented that? From monitoring
> I have (prometheus) - OSDs are healthy, have plenty of space, yet they are
> not.
>
> What command (and prometheus metric) would help me understand the actual
> real bluestore use? Or am I missing something?
>
> Oh, and I "fixed" the cluster by expanding the broken osd.0 with a larger
> 15GB volume. And 2 other OSDs still run on 10GB volumes.
>
> Thanks in advance for any thoughts.
>
>
> --
> With best regards, Ivan Kurnosov
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe's

2020-09-23 Thread George Shuklin



I've just finished doing our own benchmarking, and I can say that what you 
want to do is very unbalanced and CPU-bound.


1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per 
ceph-osd at top performance (see the recent thread on 'ceph on brd'), 
with more realistic numbers around 300-400% CPU per device.
2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe a 
little more with a top-tier low-core high-frequency CPU, but not much). 
So a super-duper NVMe won't make a difference. (BTW, I have a stupid idea to 
try running two ceph-osd daemons from the same VG, two LVs on a single PV, 
but it's not tested.)
3. You will find that any given client's performance is heavily limited by 
the sum of all RTTs in the network, plus Ceph's own latencies, so very fast 
NVMe gives diminishing returns.
4. A CPU-bound ceph-osd completely wipes out any differences between 
underlying devices (except for desktop-class crawlers).


You can run your own tests, even without fancy 48-NVMe boxes: just run 
ceph-osd on brd (block RAM disk). ceph-osd won't run any faster on 
anything else (a ramdisk is the fastest), so the numbers you get from brd 
are a supremum (upper bound) on theoretical performance.
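
A minimal sketch of that setup (brd's rd_size is in KiB; assumes ceph-volume 
is available):

modprobe brd rd_nr=1 rd_size=20971520        # one 20 GiB RAM disk at /dev/ram0
pvcreate /dev/ram0
vgcreate ceph-ram /dev/ram0
lvcreate -l 100%FREE -n osd-ram0 ceph-ram
ceph-volume lvm create --data ceph-ram/osd-ram0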


Given a max of 400-500% CPU per ceph-osd, I'd say you need to keep the number 
of NVMe drives per server below 12, or maybe 15 (but sometimes you'll hit CPU 
saturation).


In my opinion, less fancy boxes with a smaller number of drives per server 
(but a larger number of servers) would make your (or your operations 
team's) life much less stressful.


NEVER ever use RAID with Ceph.


On 23/09/2020 08:39, Brent Kennedy wrote:

We currently run a SSD cluster and HDD clusters and are looking at possibly
creating a cluster for NVMe storage.  For spinners and SSDs, it seemed the
max recommended per osd host server was 16 OSDs ( I know it depends on the
CPUs and RAM, like 1 cpu core and 2GB memory ).

  


Questions:
1.  If we do a jbod setup, the servers can hold 48 NVMes, if the servers
were bought with 48 cores and 100+ GB of RAM, would this make sense?

2.  Should we just raid 5 by groups of NVMe drives instead ( and buy less
CPU/RAM )?  There is a reluctance to waste even a single drive on raid
because redundancy is basically cephs job.
3.  The plan was to build this with octopus ( hopefully there are no issues
we should know about ).  Though I just saw one posted today, but this is a
few months off.

4.  Any feedback on max OSDs?

5.  Right now they run 10Gb everywhere with 80Gb uplinks, I was thinking
this would need at least 40Gb links to every node ( the hope is to use these
to speed up image processing at the application layer locally in the DC ).
I haven't spoken to the Dell engineers yet but my concern with NVMe is that
the raid controller would end up being the bottleneck ( next in line after
network connectivity ).

  


Regards,

-Brent

  


Existing Clusters:

Test: Nautilus 14.2.11 with 3 osd servers, 1 mon/man, 1 gateway, 2 iscsi
gateways ( all virtual on nvme )

US Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4
gateways, 2 iscsi gateways

UK Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4 gateways

US Production(SSD): Nautilus 14.2.11 with 6 osd servers, 3 mons, 3 gateways,
2 iscsi gateways

  

  

  

  


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe's

2020-09-23 Thread George Shuklin

On 23/09/2020 10:54, Marc Roos wrote:
  
Depends on your expected load, no? I have already read here numerous times
that OSDs cannot keep up with NVMes, which is why people put 2 OSDs
on a single NVMe. So on a busy node, you'd probably run out of cores? (But
better verify this with someone who has an NVMe cluster ;))



Did you? I've just started thinking about this idea too, as some devices can 
deliver about twice the performance that one ceph-osd can.


How did they do it?

I have an idea: create a new bucket type under host and put two LVs 
from each Ceph OSD VG into that new bucket. The rules stay the same 
(different hosts), so redundancy won't be affected, but doubling the number 
of ceph-osd daemons could squeeze a bit more IOPS from the backend devices 
at the expense of doubling the RocksDB overhead (reducing payload space) and 
using more cores. A rough sketch is below.
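
A rough sketch of the two-OSDs-per-device part (device and names are 
placeholders; the custom crush bucket is a separate step):

pvcreate /dev/nvme0n1
vgcreate ceph-nvme0 /dev/nvme0n1
lvcreate -l 50%VG -n osd-a ceph-nvme0
lvcreate -l 100%FREE -n osd-b ceph-nvme0
ceph-volume lvm create --data ceph-nvme0/osd-a
ceph-volume lvm create --data ceph-nvme0/osd-b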


And I really want to hear all the bad things about this setup before trying it.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Low level bluestore usage

2020-09-23 Thread George Shuklin

On 23/09/2020 04:09, Alexander E. Patrakov wrote:


Sometimes this doesn't help. For data recovery purposes, the most
helpful step if you get the "bluefs enospc" error is to add a separate
db device, like this:

systemctl disable --now ceph-osd@${OSDID}
truncate -s 32G /junk/osd.${OSDID}-recover/block.db
sgdisk -n 0:0:0 /junk/osd.${OSDID}-recover/block.db
ceph-bluestore-tool \
 bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-${OSDID} \
 --dev-target /junk/osd.${OSDID}-recover/block.db \
 --bluestore-block-db-size=31G --bluefs-log-compact-min-size=31G

Of course you can use a real block device instead of just a file.

After that, export all PGs using ceph-objecttstore-tool and re-import
into a fresh OSD, then destroy or purge the full one.

Here is why the options:

--bluestore-block-db-size=31G: ceph-bluestore-tool refuses to do
anything if this option is not set to any value
--bluefs-log-compact-min-size=31G: make absolutely sure that log
compaction doesn't happen, because it would hit "bluefs enospc" again.



Oh, you went that way... I solved my 'pocket Ceph' needs by exporting 
disk images (backed by files) via iSCSI and mounting them back on localhost. 
That gives me perfect 'SCSI' devices which work exactly as in 
production. I have a little playbook (iscsi_loopback) to set it up on 
random scrap (including VMs) for development purposes. Once the iSCSI 
loopback is mounted, all the other code works exactly the same as it would in 
production.


I've hit this issue a few times on small 10 GB OSDs, so I moved to 15 GB and 
it became less frequent. I have never had it in real-hardware tests 
with real disk sizes (>>100G per OSD).

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe's

2020-09-23 Thread George Shuklin


I've just finished doing our own benchmarking, and I can say that what you 
want to do is very unbalanced and CPU-bound.


1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per 
ceph-osd at top performance (see the recent thread on 'ceph on brd'), 
with more realistic numbers around 300-400% CPU per device.



In fact in isolation on the test setup that Intel donated for 
community ceph R&D we've pushed a single OSD to consume around 1400% 
CPU at 80K write IOPS! :)  I agree though, we typically see a peak of 
about 500-600% CPU per OSD on multi-node clusters with a 
correspondingly lower write throughput.  I do believe that in some 
cases the mix of IO we are doing is causing us to at least be 
partially bound by disk write latency with the single writer thread in 
the rocksdb WAL though.


I'd really like to see how they did this without offloading (their 
configuration).




2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe 
a little more with a top-tier low-core high-frequency CPU, but not 
much). So a super-duper NVMe won't make a difference. (BTW, I have a 
stupid idea to try running two ceph-osd daemons from the same VG, two 
LVs on a single PV, but it's not tested.)



I'm curious if you've tried octopus+ yet?  We refactored bluestore's 
caches which internally has proven to help quite a bit with latency 
bound workloads as it reduces lock contention in onode cache shards 
and the impact of cache trimming (no more single trimming thread 
constantly grabbing the lock for long periods of time!).  In a 64 NVMe 
drive setup (P4510s), we were able to do a little north of 400K write 
IOPS with 3x replication, so about 19K IOPs per OSD once you factor 
rep in.  Also, in Nautilus you can see real benefits with running 
multiple OSDs on a single device but with Octopus and master we've 
pretty much closed the gap on our test setup:


It's Octopus. I was doing a single-OSD benchmark, removing all the moving 
parts (brd instead of NVMe, no network, size=1, etc.). Moreover, I've 
focused on the rados benchmark, as RBD performance is just a derivative of 
rados performance.


Anyway, a big thank you for the input.


https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing 




Generally speaking using the latency-performance or latency-network 
tuned profiles helps (mostly due to avoid C state CPU transitions) as 
does higher clock speeds.  Not using replication helps but that's 
obviously not a realistic solution for most people. :)


I used size=1 and 'no SSD, no network' as an upper bound. It allows finding 
the limits of ceph-osd performance. Any real-life things (replication, 
network, real block devices) will make things worse, not better. Knowing the 
upper performance bound is really nice when you start choosing a server 
configuration.





3. You will find that any given client's performance is heavily limited 
by the sum of all RTTs in the network, plus Ceph's own latencies, so very 
fast NVMe gives diminishing returns.
4. A CPU-bound ceph-osd completely wipes out any differences between 
underlying devices (except for desktop-class crawlers).


You can run your own tests, even without fancy 48-NVMe boxes: just 
run ceph-osd on brd (block RAM disk). ceph-osd won't run any faster 
on anything else (a ramdisk is the fastest), so the numbers you get from 
brd are a supremum (upper bound) on theoretical performance.


Given a max of 400-500% CPU per ceph-osd, I'd say you need to keep the 
number of NVMe drives per server below 12, or maybe 15 (but sometimes 
you'll hit CPU saturation).


In my opinion, less fancy boxes with a smaller number of drives per 
server (but a larger number of servers) would make your (or your 
operations team's) life much less stressful.



That's pretty much the advice I've been giving people since the 
Inktank days.  It costs more and is lower density, but the design is 
simpler, you are less likely to under provision CPU, less likely to 
run into memory bandwidth bottlenecks, and you have less recovery to 
do when a node fails.  Especially now with how many NVMe drives you 
can fit in a single 1U server!




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io