[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Murilo Morais
It makes sense.

On Thu, Aug 10, 2023 at 16:04, Zakhar Kirpichenko wrote:

> Hi,
>
> You can use the following formula to roughly calculate the IOPS you can
> get from a cluster: (Drive_IOPS * Number_of_Drives * 0.75) / Cluster_Size.
>
> For example, for 60 10K rpm SAS drives each capable of 200 4K IOPS and a
> replicated pool with size 3: (~200 * 60 * 0.75) / 3 = ~3000 IOPS with block
> size = 4K.
>
> That's what the OP is getting, give or take.
>
> /Z
>
> On Thu, 10 Aug 2023 at 20:20, Anthony D'Atri  wrote:
>
>>
>>
>> >
>> > Good afternoon everybody!
>> >
>> > I have the following scenario:
>> > Pool RBD replication x3
>> > 5 hosts with 12 SAS spinning disks each
>>
>> Old hardware?  SAS is mostly dead.
>>
>> > I'm using exactly the following line with FIO to test:
>> > fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
>> > -iodepth=16 -rw=write -filename=./test.img
>>
>> On what kind of client?
>>
>> > If I increase the blocksize I can easily reach 1.5 GBps or more.
>> >
>> > But when I use blocksize in 4K I get a measly 12 Megabytes per second,
>> > which is quite annoying. I achieve the same rate if rw=read.
>>
>> If your client is VM especially, check if you have IOPS throttling. With
>> small block sizes you'll throttle IOPS long before bandwidth.
>>
>> > Note: I tested it on another smaller cluster, with 36 SAS disks and got
>> the
>> > same result.
>>
>> SAS has a price premium over SATA, and still requires an HBA.  Many
>> chassis vendors really want you to buy an anachronistic RoC HBA.
>>
>> Eschewing SAS and the HBA helps close the gap to justify SSDs, the TCO
>> just doesn't favor spinners.
>>
>> > Maybe the 5 host cluster is not
>> > saturated by your current fio test. Try running 2 or 4 in parallel.
>>
>>
>> Agreed that Ceph is a scale out solution, not DAS, but note the
>> difference reported with a larger block size.
>>
>> >How is this related to 60 drives? His test is only on 3 drives at a time
>> not?
>>
>> RBD volumes by and large will live on most or all OSDs in the pool.
>>
>>
>>
>>
>> >
>> > I don't know exactly what to look for or configure to have any
>> improvement.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Murilo Morais
On Thu, Aug 10, 2023 at 13:01, Marc wrote:

> > I have the following scenario:
> > Pool RBD replication x3
> > 5 hosts with 12 SAS spinning disks each
> >
> > I'm using exactly the following line with FIO to test:
> > fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> > -iodepth=16 -rw=write -filename=./test.img
> >
> > If I increase the blocksize I can easily reach 1.5 GBps or more.
> >
> > But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> > which is quite annoying. I achieve the same rate if rw=read.
> >
> > If I use librbd's cache I get a considerable improvement in writing, but
> > reading remains the same.
> >
> > I already tested with rbd_read_from_replica_policy=balance but I didn't
> > notice any difference. I tried to leave readahead enabled by setting
> > rbd_readahead_disable_after_bytes=0 but I didn't see any difference in
> > sequential reading either.
> >
> > Note: I tested it on another smaller cluster, with 36 SAS disks and got
> the
> > same result.
> >
> > I don't know exactly what to look for or configure to have any
> improvement.
>
> What are you expecting?
>
I expected something a little better (at least for reads), since the other
cluster, with fewer disks, shows the same rates. :(

>
> This is what I have on a vm with an rbd from a hdd pool
>
>
> 
>
I'm using exactly this in libvirt.

>
> [@~]# fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k
> -size=1G -iodepth=16 -rw=write -filename=./test.img
> test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=16
> fio-3.7
> Starting 1 process
> Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=57.5MiB/s][r=0,w=14.7k IOPS][eta
> 00m:00s]
>
With writeback I get a constant 100 MB/s, which is pretty good. I can live
with writeback.
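
For reference, "writeback" here means roughly the following on the client side
(a sketch, assuming configuration through the MON config database rather than
ceph.conf):

  ceph config set client rbd_cache true
  ceph config set client rbd_cache_policy writeback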

>
>
> [@~]# fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k
> -size=1G -iodepth=1 -rw=write -filename=./test.img
> test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.7
> Starting 1 process
> Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=19.9MiB/s][r=0,w=5090 IOPS][eta
> 00m:00s]
>
>
Thanks for showing your results; it gives me something to compare against.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Murilo Morais
On Thu, Aug 10, 2023 at 12:47, Hans van den Bogert <hansbog...@gmail.com> wrote:

> On Thu, Aug 10, 2023, 17:36 Murilo Morais  wrote:
>
> > Good afternoon everybody!
> >
> > I have the following scenario:
> > Pool RBD replication x3
> > 5 hosts with 12 SAS spinning disks each
> >
> > I'm using exactly the following line with FIO to test:
> > fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> > -iodepth=16 -rw=write -filename=./test.img
> >
> > If I increase the blocksize I can easily reach 1.5 GBps or more.
> >
> > But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> >
> This is 3000iops. I would call that bad for 60 drives and a replication of
> 3. Which amount of iops did you expect?
>
> which is quite annoying. I achieve the same rate if rw=read.
> >
> > If I use librbd's cache I get a considerable improvement in writing, but
> > reading remains the same.
> >
> > I already tested with rbd_read_from_replica_policy=balance but I didn't
> > notice any difference. I tried to leave readahead enabled by setting
> > rbd_readahead_disable_after_bytes=0 but I didn't see any difference in
> > sequential reading either.
> >
> > Note: I tested it on another smaller cluster, with 36 SAS disks and got
> the
> > same result.
> >
> This I concur is a weird result compared to 60 disks. Are you using the
> same disks and all other parameters the same, like the replication factor?
> Is the performance really the same? Maybe the 5 host cluster is not
> saturated by your current fio test. Try running 2 or 4 in parallel.
>
Yes and yes. I will try running more jobs in parallel and compare the results.

>
> >
> > I don't know exactly what to look for or configure to have any
> improvement.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Zakhar Kirpichenko
Hi,

You can use the following formula to roughly calculate the IOPS you can get
from a cluster: (Drive_IOPS * Number_of_Drives * 0.75) / Cluster_Size.

For example, for 60 10K rpm SAS drives each capable of 200 4K IOPS and a
replicated pool with size 3: (~200 * 60 * 0.75) / 3 = ~3000 IOPS with block
size = 4K.
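
As a quick sanity check, the same estimate in plain shell arithmetic (using the
200 IOPS per 10K SAS drive figure from above):

  # (drive_IOPS * number_of_drives * 0.75) / replica_size
  echo $(( 200 * 60 * 75 / 100 / 3 ))   # prints 3000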

That's what the OP is getting, give or take.

/Z

On Thu, 10 Aug 2023 at 20:20, Anthony D'Atri  wrote:

>
>
> >
> > Good afternoon everybody!
> >
> > I have the following scenario:
> > Pool RBD replication x3
> > 5 hosts with 12 SAS spinning disks each
>
> Old hardware?  SAS is mostly dead.
>
> > I'm using exactly the following line with FIO to test:
> > fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> > -iodepth=16 -rw=write -filename=./test.img
>
> On what kind of client?
>
> > If I increase the blocksize I can easily reach 1.5 GBps or more.
> >
> > But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> > which is quite annoying. I achieve the same rate if rw=read.
>
> If your client is VM especially, check if you have IOPS throttling. With
> small block sizes you'll throttle IOPS long before bandwidth.
>
> > Note: I tested it on another smaller cluster, with 36 SAS disks and got
> the
> > same result.
>
> SAS has a price premium over SATA, and still requires an HBA.  Many
> chassis vendors really want you to buy an anachronistic RoC HBA.
>
> Eschewing SAS and the HBA helps close the gap to justify SSDs, the TCO
> just doesn't favor spinners.
>
> > Maybe the 5 host cluster is not
> > saturated by your current fio test. Try running 2 or 4 in parallel.
>
>
> Agreed that Ceph is a scale out solution, not DAS, but note the difference
> reported with a larger block size.
>
> >How is this related to 60 drives? His test is only on 3 drives at a time
> not?
>
> RBD volumes by and large will live on most or all OSDs in the pool.
>
>
>
>
> >
> > I don't know exactly what to look for or configure to have any
> improvement.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Anthony D'Atri



> 
> Good afternoon everybody!
> 
> I have the following scenario:
> Pool RBD replication x3
> 5 hosts with 12 SAS spinning disks each

Old hardware?  SAS is mostly dead.

> I'm using exactly the following line with FIO to test:
> fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> -iodepth=16 -rw=write -filename=./test.img

On what kind of client?  
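
If fio on that client was built with RBD support, you can also take the guest
and virtio layers out of the picture by hitting the image directly with the rbd
ioengine. A rough sketch, where pool, image and client names are placeholders:

  fio -ioengine=rbd -clientname=admin -pool=rbd -rbdname=testimg \
      -name=rbdtest -bs=4k -iodepth=16 -rw=write -size=1G

If the numbers look similar there, the bottleneck is the cluster rather than
the client stack.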

> If I increase the blocksize I can easily reach 1.5 GBps or more.
> 
> But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> which is quite annoying. I achieve the same rate if rw=read.

If your client is a VM especially, check whether you have IOPS throttling configured. With small
block sizes you'll hit an IOPS cap long before a bandwidth cap.
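
If the guest is managed through libvirt, a quick way to check is something like
this (hypothetical domain name):

  virsh dumpxml myvm | grep -A 6 '<iotune>'
  # look for total_iops_sec / read_iops_sec / write_iops_sec caps on the disk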

> Note: I tested it on another smaller cluster, with 36 SAS disks and got the
> same result.

SAS has a price premium over SATA, and still requires an HBA.  Many chassis 
vendors really want you to buy an anachronistic RoC HBA.

Eschewing SAS and the HBA helps close the gap to justify SSDs; the TCO just
doesn't favor spinners.

> Maybe the 5 host cluster is not
> saturated by your current fio test. Try running 2 or 4 in parallel.


Agreed that Ceph is a scale out solution, not DAS, but note the difference 
reported with a larger block size.

>How is this related to 60 drives? His test is only on 3 drives at a time not? 

RBD volumes by and large will live on most or all OSDs in the pool.




> 
> I don't know exactly what to look for or configure to have any improvement.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Marc
> > Good afternoon everybody!
> >
> > I have the following scenario:
> > Pool RBD replication x3
> > 5 hosts with 12 SAS spinning disks each
> >
> > I'm using exactly the following line with FIO to test:
> > fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> > -iodepth=16 -rw=write -filename=./test.img
> >
> > If I increase the blocksize I can easily reach 1.5 GBps or more.
> >
> > But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> >
> This is 3000iops. I would call that bad for 60 drives and a replication of
> 3. Which amount of iops did you expect?
> 

How is this related to 60 drives? His test only touches 3 drives at a time, no?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Marc
> I have the following scenario:
> Pool RBD replication x3
> 5 hosts with 12 SAS spinning disks each
> 
> I'm using exactly the following line with FIO to test:
> fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> -iodepth=16 -rw=write -filename=./test.img
> 
> If I increase the blocksize I can easily reach 1.5 GBps or more.
> 
> But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> which is quite annoying. I achieve the same rate if rw=read.
> 
> If I use librbd's cache I get a considerable improvement in writing, but
> reading remains the same.
> 
> I already tested with rbd_read_from_replica_policy=balance but I didn't
> notice any difference. I tried to leave readahead enabled by setting
> rbd_readahead_disable_after_bytes=0 but I didn't see any difference in
> sequential reading either.
> 
> Note: I tested it on another smaller cluster, with 36 SAS disks and got the
> same result.
> 
> I don't know exactly what to look for or configure to have any improvement.

What are you expecting?

This is what I get on a VM with an RBD image from an HDD pool:




[@~]# fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -size=1G 
-iodepth=16 -rw=write -filename=./test.img
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=16
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=57.5MiB/s][r=0,w=14.7k IOPS][eta 
00m:00s]


[@~]# fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -size=1G 
-iodepth=1 -rw=write -filename=./test.img
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=1
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=19.9MiB/s][r=0,w=5090 IOPS][eta 
00m:00s]



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-10 Thread Hans van den Bogert
On Thu, Aug 10, 2023, 17:36 Murilo Morais  wrote:

> Good afternoon everybody!
>
> I have the following scenario:
> Pool RBD replication x3
> 5 hosts with 12 SAS spinning disks each
>
> I'm using exactly the following line with FIO to test:
> fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> -iodepth=16 -rw=write -filename=./test.img
>
> If I increase the blocksize I can easily reach 1.5 GBps or more.
>
> But when I use blocksize in 4K I get a measly 12 Megabytes per second,
>
That's 3000 IOPS. I would call that bad for 60 drives and a replication factor
of 3. How many IOPS did you expect?

> which is quite annoying. I achieve the same rate if rw=read.
>
> If I use librbd's cache I get a considerable improvement in writing, but
> reading remains the same.
>
> I already tested with rbd_read_from_replica_policy=balance but I didn't
> notice any difference. I tried to leave readahead enabled by setting
> rbd_readahead_disable_after_bytes=0 but I didn't see any difference in
> sequential reading either.
>
> Note: I tested it on another smaller cluster, with 36 SAS disks and got the
> same result.
>
This, I concur, is a weird result compared to the 60-disk cluster. Are you using
the same disks, and are all the other parameters, like the replication factor,
the same? Is the performance really identical? Maybe the 5-host cluster is
simply not saturated by your current fio test. Try running 2 or 4 in parallel.
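
Something along these lines, for example (an untested sketch; adjust file names
and paths to your setup):

  fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -size=10G \
      -iodepth=16 -numjobs=4 -rw=write -filename=./test.img

or simply start several independent fio processes against different files or
RBD images.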

>
> I don't know exactly what to look for or configure to have any improvement.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] librbd 4k read/write?

2023-08-10 Thread Murilo Morais
Good afternoon everybody!

I have the following scenario:
Pool RBD replication x3
5 hosts with 12 SAS spinning disks each

I'm using exactly the following line with FIO to test:
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
-iodepth=16 -rw=write -filename=./test.img

If I increase the block size I can easily reach 1.5 GB/s or more.

But when I use a 4K block size I get a measly 12 megabytes per second,
which is quite annoying. I get the same rate with rw=read.

If I use librbd's cache I get a considerable improvement in writing, but
reading remains the same.

I already tested with rbd_read_from_replica_policy=balance but didn't notice
any difference. I also tried leaving readahead enabled by setting
rbd_readahead_disable_after_bytes=0, but I didn't see any difference in
sequential reads either.
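
For what it's worth, I set those options roughly like this (a sketch; I may
well be applying them in the wrong scope):

  ceph config set client rbd_read_from_replica_policy balance
  ceph config set client rbd_readahead_disable_after_bytes 0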

Note: I tested on another, smaller cluster with 36 SAS disks and got the
same result.

I don't know exactly what to look at or what to configure to get any improvement.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm new-db fails

2023-08-10 Thread Christian Rohmann



On 11/05/2022 23:21, Joost Nieuwenhuijse wrote:
After a reboot the OSD turned out to be corrupt. Not sure if 
ceph-volume lvm new-db caused the problem, or failed because of 
another problem.



I just ran into the same issue trying to add a DB to an existing OSD.
Apparently this is a known bug: https://tracker.ceph.com/issues/55260

It's already fixed in master, but the backports are all still pending ...
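
For reference, the invocation in question looks roughly like this (placeholder
OSD id, fsid and LV names; not something to run while the bug is still open):

  ceph-volume lvm new-db --osd-id 12 --osd-fsid <osd-fsid> --target ceph-db-vg/db-lv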



Regards

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph bucket notification events stop working

2023-08-10 Thread daniel . yordanov1
Hello Yuval, 

Thanks for your reply!
We continued digging into the problem and found out that it was caused by a
recent change in our infrastructure.
Load balancer pods were added in front of the RGW ones, and those were logging
an SSL error.
As we weren't immediately aware of that change, we weren't checking the logs of
those pods.
We have fixed it and it works now.

Thanks, 
Daniel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how to set load balance on multi active mds?

2023-08-10 Thread Eugen Block
Okay, you didn't mention that in your initial question. There was an  
interesting talk [3] at the Cephalocon in Amsterdam about an approach  
to combine dynamic and static pinning. But I don't know what the  
current status is. Regarding tuning options for the existing balancer  
I would hope that Gregory or Patrick could chime in here.


[3] https://www.youtube.com/watch?v=pDURll6Y-Ug
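
In the meantime, one middle ground you could experiment with is ephemeral
distributed pinning on the parent of the hot directories. A sketch, assuming
Pacific or later and a mount where you can set xattrs; paths are placeholders:

  # may already be enabled by default, depending on the release
  ceph config set mds mds_export_ephemeral_distributed true
  # spread the immediate children of this directory across the active ranks
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/path/to/parent-dir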

Quoting zxcs:


Thanks a lot, Eugen!

We are using dynamic subtree pinning. We have another cluster that uses manual
pinning, but we have many directories and would need to pin each one per
request, so on this new cluster we want to try dynamic subtree pinning; we
don't want a human to have to step in every time, because sometimes directory A
is hot and sometimes directory B is, and each directory has many subdirectories
and sub-subdirectories.

But we found the load is not balanced across all MDSs when using dynamic
subtree pinning, so we would like to know whether there is any config we can
tune for it. Thanks again!


Thanks,
xz


On Aug 9, 2023 at 17:40, Eugen Block wrote:

Hi,

you could benefit from directory pinning [1] or dynamic subtree  
pinning [2]. We had great results with manual pinning in an older  
Nautilus cluster, didn't have a chance to test the dynamic subtree  
pinning yet though. It's difficult to tell in advance which option  
would suit best your use-case, so you'll probably have to try.


Regards,
Eugen

[1]  
https://docs.ceph.com/en/reef/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]  
https://docs.ceph.com/en/reef/cephfs/multimds/#dynamic-subtree-partitioning-with-balancer-on-specific-ranks


Quoting zxcs <zhuxion...@163.com>:


Hi experts,

We have a production environment built with Ceph version 16.2.11 Pacific,
using CephFS.
We also enabled multiple active MDSs (more than 10), but we usually see the
client request load unbalanced across these MDSs: the top MDS has 32.2k client
requests while the last one has only 331 (screenshot omitted).

This always puts our cluster into a very bad situation, with e.g. many MDSs
reporting slow requests:

...
 7 MDSs report slow requests
 1 MDSs behind on trimming
...

So our question is: how do we balance the load across these MDSs? Could anyone
please help shed some light here?

Thanks a ton!


Thanks,
xz

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io