[ceph-users] Re: CephFS thrashing through the page cache

2023-04-03 Thread Ashu Pachauri
Hi Xiubo,

Did you get a chance to work on this? I am curious to test out the
improvements.

Thanks and Regards,
Ashu Pachauri


On Fri, Mar 17, 2023 at 3:33 PM Frank Schilder  wrote:

> Hi Ashu,
>
> thanks for the clarification. That's not an option that is easy to change.
> I hope that the modifications to the fs clients Xiubo has in mind will
> improve that. Thanks for flagging this performance issue. Would be great if
> this becomes part of a test suite.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Ashu Pachauri 
> Sent: 17 March 2023 09:55:25
> To: Xiubo Li
> Cc: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: CephFS thrashing through the page cache
>
> Hi Xiubo,
>
> As you have correctly pointed out, I was talking about the stripe_unit
> setting in the file layout configuration. Here is the documentation for
> that for anyone else's reference:
> https://docs.ceph.com/en/quincy/cephfs/file-layouts/
>
> As with any RAID0 setup, the stripe_unit is definitely workload dependent.
> Our use case requires us to read somewhere from a few kilobytes to a few
> hundred kilobytes at once. Having a 4MB default stripe_unit definitely
> hurts quite a bit. We were able to achieve almost 2x improvement in terms
> of average latency and overall throughput (for useful data) by reducing the
> stripe_unit. The rule of thumb is that you want to align the stripe_unit to
> your most common IO size.
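
A layout change of this kind is applied per file or directory via extended
attributes, as described in the file-layouts doc linked above; a minimal
sketch (the path and the 64 KiB value are just examples):

```
# New files created under this directory inherit the smaller stripe_unit
# (existing files keep their current layout).
setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/small-io
# Inspect the resulting layout
getfattr -n ceph.dir.layout /mnt/cephfs/small-io
```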
>
> > BTW, have you tried to set 'rasize' option to a small size instead of 0
> > ? Won't this work ?
>
> No this won't work. I have tried it already. Since rasize simply impacts
> readahead, your minimum io size to the cephfs client will still be at the
> maximum of (rasize, stripe_unit).  rasize is a useful configuration only if
> it is required to be larger than the stripe_unit, otherwise it's not. Also,
> it's worth pointing out that simply setting rasize is not sufficient; one
> needs to change the corresponding configurations that control
> maximum/minimum readahead for ceph clients.
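
For reference, on the kernel client the readahead cap is the rasize mount
option, while the userspace client has its own readahead settings; a rough
sketch (monitor address, mount point and values are illustrative):

```
# Kernel client: cap readahead at 64 KiB for this mount
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,rasize=65536

# ceph-fuse / libcephfs: the readahead knobs are client-side options such as
#   client_readahead_min, client_readahead_max_bytes, client_readahead_max_periods
ceph config set client client_readahead_max_bytes 65536
```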
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li <xiu...@redhat.com> wrote:
>
> On 15/03/2023 17:20, Frank Schilder wrote:
> > Hi Ashu,
> >
> > are you talking about the kernel client? I can't find "stripe size"
> anywhere in its mount-documentation. Could you possibly post exactly what
> you did? Mount fstab line, config setting?
>
> There is no mount option for this in either the userspace or the kernel
> client. You need to change the file layout instead, which by default is
> (4MB stripe_unit, 1 stripe_count and 4MB object_size).
>
> A smaller stripe_unit will certainly work. But IMO it depends, and you
> should be careful: changing the layout may cause other performance issues
> in some cases; for example, a stripe_unit that is too small may split a
> sync read into more OSD requests to different OSDs.
>
> I will generate a patch to make the kernel client wiser, instead of
> always blindly setting the read size to the stripe_unit.
>
> Thanks
>
> - Xiubo
>
>
> >
> > Thanks!
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Ashu Pachauri <ashu210...@gmail.com>
> > Sent: 14 March 2023 19:23:42
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: CephFS thrashing through the page cache
> >
> > Got the answer to my own question; posting here if someone else
> > encounters the same problem. The issue is that the default stripe size
> in a
> > cephfs mount is 4 MB. If you are doing small reads (like 4k reads in the
> > test I posted) inside the file, you'll end up pulling at least 4MB to the
> > client (and then discarding most of the pulled data) even if you set
> > readahead to zero. So, the solution for us was to set a lower stripe
> size,
> > which aligns better with our workloads.
> >
> > Thanks and Regards,
> > Ashu Pachauri
> >
> >
> > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri wrote:
> >
> >> Also, I am able to reproduce the network read amplification when I try
> to
> >> do very small reads from larger files. e.g.
> >>
> >> for i in $(seq 1 1); do
> >>dd if=test_${i} of=/dev/null bs=5k count=10
> >> done
> >>
> >>
> >> This piece of code generates a network traffic of 3.3 GB while it
> actually
> >> reads approx 500 MB of data.
> >>
> >>
> >> Thanks and Regards,
> >> Ashu Pachauri
> >>
> >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri wrote:
> >>
> >>> We have an internal use case where we back the storage of a proprietary
> >>> database by a shared file system. We noticed something very odd when
> >>> testing some workload with a local block device backed file system vs
> >>> cephfs. We noticed that the amount of network IO done by cephfs is
> almost
> >>> double compared to the IO d

[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-03 Thread Anthony D'Atri
Any chance you ran `rados bench` but didn’t fully clean up afterward?
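
For reference, leftover bench objects are easy to spot and remove per pool; a
quick sketch (the pool name is a placeholder):

```
# rados bench leaves objects named benchmark_data_<host>_<pid>_object<n>
rados -p <pool> ls | grep benchmark_data | head
# remove any leftovers
rados -p <pool> cleanup
```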

> On Apr 3, 2023, at 9:25 PM, Work Ceph  
> wrote:
> 
> Hello guys!
> 
> 
> We noticed an unexpected situation. In a recently deployed Ceph cluster we
> are seeing raw usage that is a bit odd. We have the following setup:
>
>
> We have a new cluster with 5 nodes with the following setup:
>
>   - 128 GB of RAM
>   - 2 Intel Xeon Silver 4210R CPUs
>   - 1 NVMe of 2 TB for RocksDB caching
>   - 5 HDDs of 14 TB
>   - 1 dual-port 25 Gb NIC in bond mode.
> 
> 
> Right after deploying the Ceph cluster, we see a raw usage of about 9TiB.
> However, no load has been applied onto the cluster. Have you guys seen such
> a situation? Or, can you guys help understand it?
> 
> 
> We are using Ceph Octopus, and we have set the following configurations:
> 
> ```
> 
> ceph_conf_overrides:
> 
>  global:
> 
>osd pool default size: 3
> 
>osd pool default min size: 1
> 
>osd pool default pg autoscale mode: "warn"
> 
>perf: true
> 
>rocksdb perf: true
> 
>  mon:
> 
>mon osd down out interval: 120
> 
>  osd:
> 
>bluestore min alloc size hdd: 65536
> 
> 
> ```
> 
> 
> Any tip or help on how to explain this situation is welcome!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-03 Thread Work Ceph
To add more information, in case that helps:
```
# ceph -s
  cluster:
id: 
health: HEALTH_OK

 

  task status:

  data:
pools:   6 pools, 161 pgs
objects: 223 objects, 7.0 KiB
usage:   9.3 TiB used, 364 TiB / 373 TiB avail
pgs: 161 active+clean

# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    373 TiB  364 TiB  9.3 TiB   9.3 TiB       2.50
TOTAL  373 TiB  364 TiB  9.3 TiB   9.3 TiB       2.50

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0    115 TiB
.rgw.root               2   32  3.6 KiB        8  1.5 MiB      0    115 TiB
default.rgw.log         3   32  3.4 KiB      207    6 MiB      0    115 TiB
default.rgw.control     4   32      0 B        8      0 B      0    115 TiB
default.rgw.meta        5   32      0 B        0      0 B      0    115 TiB
rbd                     6   32      0 B        0      0 B      0    115 TiB
```
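
For completeness, the per-OSD and per-pool breakdowns are usually the quickest
way to see where such raw usage sits; a short sketch (commands only, no
cluster-specific arguments):

```
ceph osd df tree     # per-OSD / per-host RAW USE, DATA, OMAP and META breakdown
ceph df detail       # per-pool stored vs. used, including replication overhead
```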

On Mon, Apr 3, 2023 at 10:25 PM Work Ceph 
wrote:

> Hello guys!
>
>
> We noticed an unexpected situation. In a recently deployed Ceph cluster we
> are seeing raw usage that is a bit odd. We have the following setup:
>
>
> We have a new cluster with 5 nodes with the following setup:
>
>    - 128 GB of RAM
>    - 2 Intel Xeon Silver 4210R CPUs
>    - 1 NVMe of 2 TB for RocksDB caching
>    - 5 HDDs of 14 TB
>    - 1 dual-port 25 Gb NIC in bond mode.
>
>
> Right after deploying the Ceph cluster, we see a raw usage of about 9TiB.
> However, no load has been applied onto the cluster. Have you guys seen such
> a situation? Or, can you guys help understand it?
>
>
> We are using Ceph Octopus, and we have set the following configurations:
>
> ```
>
> ceph_conf_overrides:
>
>   global:
>
> osd pool default size: 3
>
> osd pool default min size: 1
>
> osd pool default pg autoscale mode: "warn"
>
> perf: true
>
> rocksdb perf: true
>
>   mon:
>
> mon osd down out interval: 120
>
>   osd:
>
> bluestore min alloc size hdd: 65536
>
>
> ```
>
>
> Any tip or help on how to explain this situation is welcome!
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-03 Thread Work Ceph
Hello guys!


We noticed an unexpected situation. In a recently deployed Ceph cluster we
are seeing raw usage that is a bit odd. We have the following setup:


We have a new cluster with 5 nodes with the following setup:

   - 128 GB of RAM
   - 2 Intel Xeon Silver 4210R CPUs
   - 1 NVMe of 2 TB for RocksDB caching
   - 5 HDDs of 14 TB
   - 1 dual-port 25 Gb NIC in bond mode.


Right after deploying the Ceph cluster, we see a raw usage of about 9TiB.
However, no load has been applied onto the cluster. Have you guys seen such
a situation? Or, can you guys help understand it?


We are using Ceph Octopus, and we have set the following configurations:

```

ceph_conf_overrides:

  global:

osd pool default size: 3

osd pool default min size: 1

osd pool default pg autoscale mode: "warn"

perf: true

rocksdb perf: true

  mon:

mon osd down out interval: 120

  osd:

bluestore min alloc size hdd: 65536


```


Any tip or help on how to explain this situation is welcome!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Read and write performance on distributed filesystem

2023-04-03 Thread David Cunningham
Hello,

We are considering CephFS as an alternative to GlusterFS, and have some
questions about performance. Is anyone able to advise us please?

This would be for file systems between 100GB and 2TB in size, average file
size around 5MB, and a mixture of reads and writes. I may not be using the
correct terminology in the Ceph world, but in my parlance a node is a Linux
server running the Ceph storage software. Multiple nodes make up the whole
Ceph storage solution. Someone correct me if I should be using different
terms!

In our normal scenario the nodes in the replicated filesystem would be
around 0.3ms apart, but we're also interested in geographically remote
nodes which would be say 20ms away. We are using third party software which
relies on a traditional Linux filesystem, so we can't use an object storage
solution directly.

So my specific questions are:

1. When reading a file from CephFS, does it read from just one node, or
from all nodes?

2. If reads are from one node then does it choose the node with the fastest
response to optimise performance, or if from all nodes then will reads be
no faster than latency to the furthest node?

3. When writing to CephFS, are all nodes written to synchronously, or are
writes to one node which then replicates that to other nodes asynchronously?

4. Can anyone give a recommendation on maximum latency between nodes to
have decent performance?

5. How does CephFS handle a node which suddenly becomes unavailable on the
network? Is the block time configurable, and how good is the healing
process after the lost node rejoins the network?

6. I have read that CephFS is more complicated to administer than
GlusterFS. What does everyone think? Are things like healing after a net
split difficult for administrators new to Ceph to handle?

Thanks very much in advance.

-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.6 QE Validation status

2023-04-03 Thread Yuri Weinstein
Josh, the release is ready for your review and approval.

Adam, can you please update the LRC upgrade to 17.2.6 RC?

Thx


On Wed, Mar 29, 2023 at 3:07 PM Yuri Weinstein  wrote:

> The release has been approved.
>
> And the gibba cluster upgraded.
>
> We are awaiting the LRC upgrade and then/or in parallel will publish an RC
> for testing.
>
> ETA for the release publishing is 04/07/23
>
> On Tue, Mar 28, 2023 at 2:59 PM Neha Ojha  wrote:
>
>> upgrades approved!
>>
>> Thanks,
>> Neha
>>
>> On Tue, Mar 28, 2023 at 12:09 PM Radoslaw Zarzynski 
>> wrote:
>>
>>> rados: approved!
>>>
>>> On Mon, Mar 27, 2023 at 7:02 PM Laura Flores  wrote:
>>>
 Rados review, second round:

 Failures:
 1. https://tracker.ceph.com/issues/58560
 2. https://tracker.ceph.com/issues/58476
 3. https://tracker.ceph.com/issues/58475 -- pending Q backport
 4. https://tracker.ceph.com/issues/49287
 5. https://tracker.ceph.com/issues/58585

 Details:
 1. test_envlibrados_for_rocksdb.sh failed to subscribe to repo -
 Infrastructure
 2. test_non_existent_cluster: cluster does not exist - Ceph -
 Orchestrator
 3. test_dashboard_e2e.sh: Conflicting peer dependency:
 postcss@8.4.21 - Ceph - Mgr - Dashboard
 4. podman: setting cgroup config for procHooks process caused: Unit
 libpod-$hash.scope not found - Ceph - Orchestrator
 5. rook: failed to pull kubelet image - Ceph - Orchestrator

 @Radoslaw Zarzynski  will give final approval for
 rados.

 On Mon, Mar 27, 2023 at 10:02 AM Casey Bodley 
 wrote:

> On Fri, Mar 24, 2023 at 3:46 PM Yuri Weinstein 
> wrote:
> >
> > Details of this release are updated here:
> >
> > https://tracker.ceph.com/issues/59070#note-1
> > Release Notes - TBD
> >
> > The slowness we experienced seemed to be self-cured.
> > Neha, Radek, and Laura please provide any findings if you have them.
> >
> > Seeking approvals/reviews for:
> >
> > rados - Neha, Radek, Travis, Ernesto, Adam King (rerun on Build 2
> with
> > PRs merged on top of quincy-release)
> > rgw - Casey (rerun on Build 2 with PRs merged on top of
> quincy-release)
>
> rgw approved
>
> > fs - Venky
> >
> > upgrade/octopus-x - Neha, Laura (package issue Adam Kraitman any
> updates?)
> > upgrade/pacific-x - Neha, Laura, Ilya see
> https://tracker.ceph.com/issues/58914
> > upgrade/quincy-p2p - Neha, Laura
> > client-upgrade-octopus-quincy-quincy - Neha, Laura (package issue
> Adam
> > Kraitman any updates?)
> > powercycle - Brad
> >
> > Please reply to this email with approval and/or trackers of known
> > issues/PRs to address them.
> >
> > Josh, Neha - gibba and LRC upgrades pending major suites approvals.
> > RC release - pending major suites approvals.
> >
> > On Tue, Mar 21, 2023 at 1:04 PM Yuri Weinstein 
> wrote:
> > >
> > > Details of this release are summarized here:
> > >
> > > https://tracker.ceph.com/issues/59070#note-1
> > > Release Notes - TBD
> > >
> > > The reruns were in the queue for 4 days because of some slowness
> issues.
> > > The core team (Neha, Radek, Laura, and others) are trying to narrow
> > > down the root cause.
> > >
> > > Seeking approvals/reviews for:
> > >
> > > rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
> test
> > > and merge at least one PR https://github.com/ceph/ceph/pull/50575
> for
> > > the core)
> > > rgw - Casey
> > > fs - Venky (the fs suite has an unusually high amount of failed
> jobs,
> > > any reason to suspect it in the observed slowness?)
> > > orch - Adam King
> > > rbd - Ilya
> > > krbd - Ilya
> > > upgrade/octopus-x - Laura is looking into failures
> > > upgrade/pacific-x - Laura is looking into failures
> > > upgrade/quincy-p2p - Laura is looking into failures
> > > client-upgrade-octopus-quincy-quincy - missing packages, Adam
> Kraitman
> > > is looking into it
> > > powercycle - Brad
> > > ceph-volume - needs a rerun on merged
> > > https://github.com/ceph/ceph-ansible/pull/7409
> > >
> > > Please reply to this email with approval and/or trackers of known
> > > issues/PRs to address them.
> > >
> > > Also, share any findings or hypotheses about the slowness in the
> > > execution of the suite.
> > >
> > > Josh, Neha - gibba and LRC upgrades pending major suites approvals.
> > > RC release - pending major suites approvals.
> > >
> > > Thx
> > > YuriW
> > ___
> > Dev mailing list -- d...@ceph.io
> > To unsubscribe send an email to dev-le...@ceph.io
> ___
> Dev mailing list -- d...@ceph.io
> To unsub

[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-03 Thread Anthony D'Atri
Mark Nelson's space amp sheet visualizes this really well.  A nuance here is 
that Ceph always writes a full stripe, so with a 9,6 profile, on conventional 
media, a minimum of 15x4KB=60KB of underlying storage will be consumed, even for 
a 1KB object.  A 22 KB object would similarly tie up a full stripe's worth of 
storage.  As the object size increases, the remainder factor drops off quite 
quickly.  This is an important consideration when using, say, QLC SSDs with an 
8, 16, or even 64KB IU size, where there are good reasons to set min_alloc_size 
to match.
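
To make the arithmetic concrete, here is a small worked example assuming a 4+2
profile, a 4KB min_alloc_size, and the full-stripe-write behaviour described
above (illustrative numbers, not taken from the thread):

```
1 KiB object, EC 4+2, min_alloc_size = 4 KiB:
  shards written  = k + m = 6
  space consumed  = 6 x 4 KiB = 24 KiB   -> ~24x amplification
4 MiB object, same profile:
  data per shard  = 4 MiB / 4 = 1 MiB    (far above min_alloc_size)
  space consumed  ~ 6 MiB                -> ~1.5x, i.e. just the (k+m)/k overhead
```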

If compression is enabled, this can be exacerbated as well.  

Large parity groups also can result in lower overall write performance.





https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
Bluestore Space Amplification Cheat Sheet

> 
>> As you can see, the larger N the smaller the overhead. The downside is 
>> larger stripes, meaning that larger N only make sense

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-03 Thread Michel Jouvin

Hi Frank,

Thanks for this detailed answer. About your point of 4+2 or similar schemes 
defeating the purpose of a 3-datacenter configuration, you're right in 
principle. In our case, the goal is to avoid any impact for replicated 
pools (in particular RBD for the cloud), but it may be acceptable for some 
pools to be read-only during a short period. But I'll explore your 
alternative k+m scenarios as some may be interesting.


I'm also interested in experience feedback with LRC EC, even if I don't 
think it changes the problem of resilience to a DC failure.


Best regards,

Michel
Sent from my mobile
On 3 April 2023 at 21:57:41, Frank Schilder wrote:


Hi Michel,

failure domain = datacenter doesn't work, because crush wants to put 1 
shard per failure domain and you have 3 data centers and not 6. The 
modified crush rule you wrote should work. I believe equally well with x=0 
or 2 -- but try it out before doing anything to your cluster.


The easiest way for non-destructive testing is to download the osdmap from 
your cluster and from that map extract the crush map. You can then 
*without* modifying your cluster update the crush map in the (off-line) 
copy of the OSD map and let it compute mappings (commands for all this are 
in the ceph docs, look for osdmaptool). These mappings you can check for if 
they are as you want. There was an earlier case where someone posted a 
script to confirm mappings automatically. I used some awk magic, its not 
that difficult.


As a note of warning, if you want to be failure resistant, don't use 4+2. 
It's not worth the effort of having 3 data centers. In case you lose one 
DC, you have only 4 shards left, in which case the pool becomes read-only. 
Don't even consider setting min_size=4; it again completely defeats the 
purpose of having 3 DCs in the first place.


The smallest profile you can use that will ensure RW access in case of a DC 
failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards, 
which is equivalent to 5+1. Here, you have RW access in case of 1 DC down. 
However, k=5 is a prime number with negative performance impact, ideal are 
powers of 2 for k. The alternative is k=4, m=5 (44% usable capacity) with 
good performance but higher redundancy overhead.


You can construct valid schemes by looking at all N multiples of 3 and 
trying k<=(2N/3-1):


N=6 -> k=2 m=4
N=9 -> k=4, m=5 ; k=5, m=4
N=12 -> k=6, m=6; k=7, m=5
N=15 -> k=8, m=7 ; k=9, m=6

As you can see, the larger N the smaller the overhead. The downside is 
larger stripes, meaning that larger N only makes sense if you have large 
files/objects. An often overlooked advantage of profiles with large m is 
that you can defeat tail latencies for read operations by setting 
fast_read=true for the pool. This is really great when you have silently 
failing disks. Unfortunately, there is no fast_write counterpart (which 
would not be useful in your use case any ways).


There are only very few useful profiles with k a power of 2 (4+5, 8+7). 
Some people use 7+5 with success and 5+4 does look somewhat OK as well. If 
you use latest ceph with bluestore min alloc size = 4K, stripe size is less 
of an issue and 8+7 is a really good candidate that I would give a shot in 
a benchmark.


You should benchmark a number of different profiles on your system to get 
an idea of how important the profile is for performance and how much 
replication overhead you can afford. Remember to benchmark also in degraded 
condition. While as an admin you might be happy that stuff is up, users 
will still complain if things are suddenly unbearably slow. Make 
long-running tests in degraded state to catch all the pitfalls of MONs not 
trimming logs etc. to have a reliable configuration that doesn't let you 
down the first time it rains.


Good luck and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michel Jouvin 
Sent: Monday, April 3, 2023 6:40 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Crushmap rule for multi-datacenter erasure coding

Hi,

We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise the resilience in case of 1
datacenter being down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with a
failure domain = datacenter but it doesn't work as, I guess, it would
like to ensure it has always 5 OSDs up (to ensure that the pools remains
R/W) where with a failure domain = datacenter, the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:

step choose indep 3 datacenter
step chooseleaf indep x host
step emit

Is it the right approach? If yes, what should be 'x'? Would 0 work?

From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to

[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-03 Thread Frank Schilder
Hi Michel,

failure domain = datacenter doesn't work, because crush wants to put 1 shard 
per failure domain and you have 3 data centers and not 6. The modified crush 
rule you wrote should work. I believe equally well with x=0 or 2 -- but try it 
out before doing anything to your cluster.

The easiest way for non-destructive testing is to download the osdmap from your 
cluster and from that map extract the crush map. You can then *without* 
modifying your cluster update the crush map in the (off-line) copy of the OSD 
map and let it compute mappings (commands for all this are in the ceph docs, 
look for osdmaptool). These mappings you can check for if they are as you want. 
There was an earlier case where someone posted a script to confirm mappings 
automatically. I used some awk magic; it's not that difficult.
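
A rough sketch of that offline workflow (file names and the pool id are
placeholders):

```
ceph osd getmap -o osdmap.bin                 # grab a copy of the current osdmap
osdmaptool osdmap.bin --export-crush crush.bin
crushtool -d crush.bin -o crush.txt           # decompile, then edit the rule in crush.txt
crushtool -c crush.txt -o crush-new.bin       # recompile
osdmaptool osdmap.bin --import-crush crush-new.bin \
    --test-map-pgs-dump --pool <pool-id>      # dump the PG -> OSD mappings that would result
```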

As a note of warning, if you want to be failure resistant, don't use 4+2. It's 
not worth the effort of having 3 data centers. In case you lose one DC, you 
have only 4 shards left, in which case the pool becomes read-only. Don't even 
consider setting min_size=4; it again completely defeats the purpose of having 3 
DCs in the first place.

The smallest profile you can use that will ensure RW access in case of a DC 
failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards, which 
is equivalent to 5+1. Here, you have RW access in case of 1 DC down. However, 
k=5 is a prime number, which has a negative performance impact; powers of 2 are 
ideal for k. The alternative is k=4, m=5 (44% usable capacity) with good performance 
but higher redundancy overhead.

You can construct valid schemes by looking at all N multiples of 3 and trying 
k<=(2N/3-1):

N=6 -> k=2 m=4
N=9 -> k=4, m=5 ; k=5, m=4
N=12 -> k=6, m=6; k=7, m=5
N=15 -> k=8, m=7 ; k=9, m=6

As you can see, the larger N the smaller the overhead. The downside is larger 
stripes, meaning that larger N only makes sense if you have large 
files/objects. An often overlooked advantage of profiles with large m is that 
you can defeat tail latencies for read operations by setting fast_read=true for 
the pool. This is really great when you have silently failing disks. 
Unfortunately, there is no fast_write counterpart (which would not be useful in 
your use case any ways).

There are only very few useful profiles with k a power of 2 (4+5, 8+7). Some 
people use 7+5 with success and 5+4 does look somewhat OK as well. If you use 
latest ceph with bluestore min alloc size = 4K, stripe size is less of an issue 
and 8+7 is a really good candidate that I would give a shot in a benchmark.

You should benchmark a number of different profiles on your system to get an 
idea of how important the profile is for performance and how much replication 
overhead you can afford. Remember to benchmark also in degraded condition. 
While as an admin you might be happy that stuff is up, users will still 
complain if things are suddenly unbearably slow. Make long-running tests in 
degraded state to catch all the pitfalls of MONs not trimming logs etc. to have 
a reliable configuration that doesn't let you down the first time it rains.

Good luck and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michel Jouvin 
Sent: Monday, April 3, 2023 6:40 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Crushmap rule for multi-datacenter erasure coding

Hi,

We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise the resilience in case of 1
datacenter being down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with a
failure domain = datacenter but it doesn't work as, I guess, it would
like to ensure it always has 5 OSDs up (to ensure that the pool remains
R/W) where with a failure domain = datacenter, the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:

step choose indep 3 datacenter
step chooseleaf indep x host
step emit

Is it the right approach? If yes, what should be 'x'? Would 0 work?

 From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current CRUSHMAP, edit
it and upload the modified version. Am I right?

Thanks in advance for your help or suggestions. Best regards,

Michel

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Misplaced objects greater than 100%

2023-04-03 Thread Johan Hattne
Thanks Mehmet; I took a closer look at what I sent you and the problem 
appears to be in the CRUSH map.  At some point since anything was last 
rebooted, I created rack buckets and moved the OSD nodes in under them:


  # ceph osd crush add-bucket rack-0 rack
  # ceph osd crush add-bucket rack-1 rack

  # ceph osd crush move bcgonen-r0h0 rack=rack-0
  # ceph osd crush move bcgonen-r0h1 rack=rack-0
  # ceph osd crush move bcgonen-r1h0 rack=rack-1

All seemed fine at the time; it was not until bcgonen-r1h0 was rebooted 
that stuff got weird.  But as per "ceph osd tree" output, those rack 
buckets were sitting next to the default root as opposed to under it.
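
For anyone hitting the same thing, re-parenting the buckets is a one-liner per
rack; a sketch using the bucket names from the tree output further down:

```
ceph osd crush move rack-0 root=default
ceph osd crush move rack-1 root=default
```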


Now that's fixed, and the cluster is backfilling remapped PGs.

// J

On 2023-03-31 16:01, Johan Hattne wrote:

Here goes:

# ceph -s
   cluster:
     id: e1327a10-8b8c-11ed-88b9-3cecef0e3946
     health: HEALTH_OK

   services:
     mon: 5 daemons, quorum 
bcgonen-a,bcgonen-b,bcgonen-c,bcgonen-r0h0,bcgonen-r0h1 (age 16h)

     mgr: bcgonen-b.furndm(active, since 8d), standbys: bcgonen-a.qmmqxj
     mds: 1/1 daemons up, 2 standby
     osd: 36 osds: 36 up (since 16h), 36 in (since 3d); 1041 remapped pgs

   data:
     volumes: 1/1 healthy
     pools:   3 pools, 1041 pgs
     objects: 5.42M objects, 6.5 TiB
     usage:   19 TiB used, 428 TiB / 447 TiB avail
     pgs: 27087125/16252275 objects misplaced (166.667%)
  1039 active+clean+remapped
  2    active+clean+remapped+scrubbing+deep

# ceph osd tree
ID   CLASS  WEIGHT TYPE NAME  STATUS  REWEIGHT  PRI-AFF
-14 149.02008  rack rack-1
  -7 149.02008  host bcgonen-r1h0
  20    hdd   14.55269  osd.20 up   1.0  1.0
  21    hdd   14.55269  osd.21 up   1.0  1.0
  22    hdd   14.55269  osd.22 up   1.0  1.0
  23    hdd   14.55269  osd.23 up   1.0  1.0
  24    hdd   14.55269  osd.24 up   1.0  1.0
  25    hdd   14.55269  osd.25 up   1.0  1.0
  26    hdd   14.55269  osd.26 up   1.0  1.0
  27    hdd   14.55269  osd.27 up   1.0  1.0
  28    hdd   14.55269  osd.28 up   1.0  1.0
  29    hdd   14.55269  osd.29 up   1.0  1.0
  34    ssd    1.74660  osd.34 up   1.0  1.0
  35    ssd    1.74660  osd.35 up   1.0  1.0
-13 298.04016  rack rack-0
  -3 149.02008  host bcgonen-r0h0
   0    hdd   14.55269  osd.0  up   1.0  1.0
   1    hdd   14.55269  osd.1  up   1.0  1.0
   2    hdd   14.55269  osd.2  up   1.0  1.0
   3    hdd   14.55269  osd.3  up   1.0  1.0
   4    hdd   14.55269  osd.4  up   1.0  1.0
   5    hdd   14.55269  osd.5  up   1.0  1.0
   6    hdd   14.55269  osd.6  up   1.0  1.0
   7    hdd   14.55269  osd.7  up   1.0  1.0
   8    hdd   14.55269  osd.8  up   1.0  1.0
   9    hdd   14.55269  osd.9  up   1.0  1.0
  30    ssd    1.74660  osd.30 up   1.0  1.0
  31    ssd    1.74660  osd.31 up   1.0  1.0
  -5 149.02008  host bcgonen-r0h1
  10    hdd   14.55269  osd.10 up   1.0  1.0
  11    hdd   14.55269  osd.11 up   1.0  1.0
  12    hdd   14.55269  osd.12 up   1.0  1.0
  13    hdd   14.55269  osd.13 up   1.0  1.0
  14    hdd   14.55269  osd.14 up   1.0  1.0
  15    hdd   14.55269  osd.15 up   1.0  1.0
  16    hdd   14.55269  osd.16 up   1.0  1.0
  17    hdd   14.55269  osd.17 up   1.0  1.0
  18    hdd   14.55269  osd.18 up   1.0  1.0
  19    hdd   14.55269  osd.19 up   1.0  1.0
  32    ssd    1.74660  osd.32 up   1.0  1.0
  33    ssd    1.74660  osd.33 up   1.0  1.0
  -1 0  root default

# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 31 flags 
hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 2 
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 
9833 lfor 0/0/584 flags hashpspool stripe_width 0 pg_autoscale_bias 4 
pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.cephfs.data' replicated size 

[ceph-users] Re: compiling Nautilus for el9

2023-04-03 Thread Marc
I am building with a centos9 stream container currently. I have been adding 
some rpms that were missing and not in the dependencies. 

Currently, with these cmake options, these binaries are not built. Does anyone 
have an idea what could cause this?

cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=/usr/lib64 
-DCMAKE_INSTALL_LIBEXECDIR=/usr/lib -DCMAKE_INSTALL_LOCALSTATEDIR=/var 
-DCMAKE_INSTALL_SYSCONFDIR=/etc -DCMAKE_INSTALL_MANDIR=/usr/share/man 
-DCMAKE_INSTALL_DOCDIR=/usr/share/doc/ceph 
-DCMAKE_INSTALL_INCLUDEDIR=/usr/include -DWITH_MANPAGE=ON -DWITH_PYTHON3=3.9 
-DWITH_MGR_DASHBOARD_FRONTEND=OFF -DWITH_PYTHON2=OFF -DMGR_PYTHON_VERSION=3 
-DWITH_SELINUX=ON -DWITH_LTTNG=ON -DWITH_BABELTRACE=ON -DWITH_OCF=ON 
-DWITH_BOOST_CONTEXT=ON -DWITH_LIBRADOSSTRIPER=ON 
-DWITH_RADOSGW_AMQP_ENDPOINT=ON -DWITH_RADOSGW_KAFKA_ENDPOINT=ON 
-DWITH_GRAFANA=ON -DWITH_SYSTEM_BOOST=ON -DWITH_TESTS=OFF 
-DWITH_MGR_DASHBOARD_FRONTEND=OFF -DWITH_SYSTEM_NPM=OFF 
-DWITH_RADOSGW_KAFKA_ENDPOINT=OFF -DWITH_RADOSGW=ON -DWITH_GRAFANA=OFF 
-DBOOST_J=6


RPM build errors:
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph-client-debug
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_bench_log
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_kvstorebench
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_multi_stress_watch
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_erasure_code
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_erasure_code_benchmark
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_omapbench
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_objectstore_bench
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_perf_objectstore
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_perf_local
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_perf_msgr_client
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_perf_msgr_server
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_psim
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_radosacl
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_rgw_jsonparser
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_rgw_multiparser
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_scratchtool
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_scratchtoolpp
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph_test_*
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph-coverage
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/ceph-debugpack
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/bin/cephdeduptool
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/share/man/man8/ceph-debugpack.8*
File not found: 
/root/rpmbuild/BUILDROOT/ceph-14.2.22-0.el9.x86_64/usr/lib64/ceph/ceph-monstore-update-crush.sh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Crushmap rule for multi-datacenter erasure coding

2023-04-03 Thread Michel Jouvin

Hi,

We have a 3-site Ceph cluster and would like to create a 4+2 EC pool 
with 2 chunks per datacenter, to maximise the resilience in case of 1 
datacenter being down. I have not found a way to create an EC profile 
with this 2-level allocation strategy. I created an EC profile with a 
failure domain = datacenter but it doesn't work as, I guess, it would 
like to ensure it always has 5 OSDs up (to ensure that the pool remains 
R/W) where with a failure domain = datacenter, the guarantee is only 4. 
My idea was to create a 2-step allocation and a failure domain=host to 
achieve our desired configuration, with something like the following in 
the crushmap rule:


step choose indep 3 datacenter
step chooseleaf indep x host
step emit

Is it the right approach? If yes, what should be 'x'? Would 0 work?

From what I have seen, there is no way to create such a rule with the 
'ceph osd crush' commands: I have to download the current CRUSHMAP, edit 
it and upload the modified version. Am I right?
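
For reference, in a decompiled crushmap the complete rule would look roughly
like this (rule name and id are illustrative, shown with x=2; note the 'type'
keyword in the choose steps):

```
rule ec_3dc_4plus2 {
        id 2
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type datacenter
        step chooseleaf indep 2 type host
        step emit
}
```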


Thanks in advance for your help or suggestions. Best regards,

Michel

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How mClock profile calculation works, and IOPS

2023-04-03 Thread Sridhar Seshasayee
Responses inline.

I have one last question: why is the bench performed using writes of 4 KiB?
> Is there any reason to choose that over another value?
>
> Yes, the mClock scheduler considers this as a baseline in order to
estimate costs for operations involving other block sizes.
This is again an internal implementation detail.

In my lab, I tested with various values, and I have mainly two types of
> disks: some Seagates and some Toshibas.
>
> If I bench with 4 KiB, what I get from the Seagates is a result around 2000
> IOPS, while the Toshibas are more around 600.
>
> If I bench with 128 KiB, I still get results around 2000 IOPS for the
> Seagates, but the Toshibas also bench around 2000 IOPS. And from the rados
> experiments I did, setting osd_mclock_max_capacity_iops_hdd to 2000 on
> that lab setup is the value that gives me the most performance, with both
> the Seagate and Toshiba disks.
>
> I would currently suggest setting osd_mclock_max_capacity_iops_hdd to
values you measured with fio as that is more realistic.
Like I mentioned, there are some improvements coming around this area that
would allow users to have greater control on
setting a realistic benchmark value.
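
A minimal sketch of pinning such a measured value (the 600 figure and osd.0
are just examples):

```
# apply to a single OSD
ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 600
# or to all OSDs at once (the _hdd option only affects HDD-backed OSDs)
ceph config set osd osd_mclock_max_capacity_iops_hdd 600
# verify what the OSD actually uses
ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
```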

-Sridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How mClock profile calculation works, and IOPS

2023-04-03 Thread Luis Domingues
Hi,

Thanks a lot for the information.

I have one last question: why is the bench performed using writes of 4 KiB? Is 
there any reason to choose that over another value?

In my lab, I tested with various values, and I have mainly two types of disks: 
some Seagates and some Toshibas.

If I bench with 4 KiB, what I get from the Seagates is a result around 2000 IOPS, 
while the Toshibas are more around 600.

If I bench with 128 KiB, I still get results around 2000 IOPS for the Seagates, 
but the Toshibas also bench around 2000 IOPS. And from the rados experiments I did, 
setting osd_mclock_max_capacity_iops_hdd to 2000 on that lab setup is the value 
that gives me the most performance, with both the Seagate and Toshiba disks.

Luis Domingues
Proton AG


--- Original Message ---
On Monday, April 3rd, 2023 at 08:44, Sridhar Seshasayee  
wrote:


> Why was it done that way? I do not understand the reason why distributing
> 
> > the IOPS across different disks, when the measurement we have is for one
> > disk alone. This means with default parameters we will always be far from
> > reaching OSD limit right?
> > 
> > It's not on different disks. We distribute the IOPS across shards on a
> 
> given OSD/disk. This is an internal implementation detail.
> This means in your case, 450 IOPS is distributed across 5 shards on the
> same OSD/disk. You can think of it as 5 threads
> being allocated a share of the total IOPS on a given OSD.
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How mClock profile calculation works, and IOPS

2023-04-03 Thread Sridhar Seshasayee
Why was it done that way? I do not understand the reason why distributing
> the IOPS across different disks, when the measurement we have is for one
> disk alone. This means with default parameters we will always be far from
> reaching OSD limit right?
>
> It's not on different disks. We distribute the IOPS across shards on a
given OSD/disk. This is an internal implementation detail.
This means in your case, 450 IOPS is distributed across 5 shards on the
same OSD/disk. You can think of it as 5 threads
being allocated a share of the total IOPS on a given OSD.
-Sridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How mClock profile calculation works, and IOPS

2023-04-03 Thread Luis Domingues
Hi Sridhar

Thanks for the information.

> 
> The above values are a result of distributing the IOPS across all the OSD
> shards as defined by the
> osd_op_num_shards_[hdd|ssd] option. For HDDs, this is set to 5 and
> therefore the IOPS will be
> distributed across the 5 shards (i.e. for e.g., 675/5 for
> osd_mclock_scheduler_background_recovery_lim
> and so on for other reservation and limit options).

Why was it done that way? I do not understand the reason for distributing the 
IOPS across different disks, when the measurement we have is for one disk 
alone. This means that with the default parameters we will always be far from 
reaching the OSD limit, right?

Luis Domingues
Proton AG


--- Original Message ---
On Monday, April 3rd, 2023 at 07:43, Sridhar Seshasayee  
wrote:


> Hi Luis,
> 
> 
> I am reading reading some documentation about mClock and have two questions.
> 
> > First, about the IOPS. Are those IOPS disk IOPS or other kind of IOPS? And
> > what the assumption of those? (Like block size, sequential or random
> > reads/writes)?
> 
> 
> This is the result of running the OSD bench with random writes at a 4 KiB
> block size.
> 
> > But what I get is:
> > 
> > "osd_mclock_scheduler_background_best_effort_lim": "99",
> > "osd_mclock_scheduler_background_best_effort_res": "18",
> > "osd_mclock_scheduler_background_best_effort_wgt": "2",
> > "osd_mclock_scheduler_background_recovery_lim": "135",
> > "osd_mclock_scheduler_background_recovery_res": "36",
> > "osd_mclock_scheduler_background_recovery_wgt": "1",
> > "osd_mclock_scheduler_client_lim": "90",
> > "osd_mclock_scheduler_client_res": "36",
> > "osd_mclock_scheduler_client_wgt": "1",
> > 
> > Which seems very low according to what my disk seems to be able to handle.
> > 
> > Is this calculation the expected one? Or did I miss something on how those
> > profiles are populated?
> 
> 
> The above values are a result of distributing the IOPS across all the OSD
> shards as defined by the
> osd_op_num_shards_[hdd|ssd] option. For HDDs, this is set to 5 and
> therefore the IOPS will be
> distributed across the 5 shards (i.e. for e.g., 675/5 for
> osd_mclock_scheduler_background_recovery_lim
> and so on for other reservation and limit options).
> 
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io