[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-04 Thread Frédéric Nass
Hello Michel,

What you need is:

step choose indep 0 type datacenter
step chooseleaf indep 2 type host
step emit

I think you're right about the need to tweak the crush rule by editing the 
crushmap directly.
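
For reference, a minimal sketch of that workflow (file names, the rule name and the rule id are placeholders; the rule body is just the fragment above wrapped into a complete rule):

```
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and add an EC rule along these lines:
#   rule ec_3dc {
#       id 10                  # pick an unused rule id
#       type erasure
#       step set_chooseleaf_tries 5
#       step set_choose_tries 100
#       step take default
#       step choose indep 0 type datacenter
#       step chooseleaf indep 2 type host
#       step emit
#   }
crushtool -c crushmap.txt -o crushmap.new
# sanity-check the mappings offline before injecting anything
# (--num-rep 6 because k+m=6 for a 4+2 profile)
crushtool -i crushmap.new --test --rule 10 --num-rep 6 --show-mappings
ceph osd setcrushmap -i crushmap.new
```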

Regards
Frédéric.

- On 3 Apr 23, at 18:34, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

> Hi,
> 
> We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
> with 2 chunks per datacenter, to maximise the resilience in case of 1
> datacenter being down. I have not found a way to create an EC profile
> with this 2-level allocation strategy. I created an EC profile with a
> failure domain = datacenter but it doesn't work because, I guess, it wants
> to ensure that 5 OSDs are always up (so that the pool remains R/W), whereas
> with a failure domain = datacenter the guarantee is only 4.
> My idea was to create a 2-step allocation and a failure domain=host to
> achieve our desired configuration, with something like the following in
> the crushmap rule:
> 
> step choose indep 3 type datacenter
> step chooseleaf indep x type host
> step emit
> 
> Is it the right approach? If yes, what should be 'x'? Would 0 work?
> 
> From what I have seen, there is no way to create such a rule with the
> 'ceph osd crush' commands: I have to download the current CRUSHMAP, edit
> it and upload the modified version. Am I right?
> 
> Thanks in advance for your help or suggestions. Best regards,
> 
> Michel


[ceph-users] Upgrading to 16.2.11 timing out on ceph-volume due to raw list performance bug, downgrade isn't possible due to new OP code in bluestore

2023-04-04 Thread Mikael Öhman
Trying to upgrade a containerized setup from 16.2.10 to 16.2.11 gave us two
big surprises, which I wanted to share in case anyone else encounters the
same. I don't see any nice solution to this apart from a new release that
fixes the performance regression, which completely breaks the container
setup in cephadm due to timeouts:

After some digging, we found that it was the "ceph-volume" command that
kept timing out, and after a ton more digging, found that it does so because
of
https://github.com/ceph/ceph/commit/bea9f4b643ce32268ad79c0fc257b25ff2f8333c#diff-29697ff230f01df036802c8b2842648267767b3a7231ea04a402eaf4e1819d29R104
which was introduced in 16.2.11.
Unfortunately, the vital fix for this,
https://github.com/ceph/ceph/commit/8d7423c3e75afbe111c91e699ef3cb1c0beee61b,
was not included in 16.2.11.

So, in a setup like ours, with *many* devices, a simple "ceph-volume raw
list" now takes over 10 minutes to run (instead of 5 seconds in 16.2.10).
As a result, the service files that cephadm generates

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.run
ExecStop=-/bin/bash -c '/bin/podman stop
ceph-5406fed0-d52b-11ec-beff-7ed30a54847b-%i ; bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.stop'
ExecStopPost=-/bin/bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
ExecStartPre=-/bin/rm -f %t/%n-pid %t/%n-cid
ExecStopPost=-/bin/rm -f %t/%n-pid %t/%n-cid
Type=forking
PIDFile=%t/%n-pid
Delegate=yes

will repeatedly be marked as failed, as they now take over 2 minutes to run.
This tells systemd to restart, and we end up with an infinite loop: since the
5 restarts take over 50 minutes, the StartLimitInterval is never even
triggered, leaving this OSD endlessly looping over listing the n^2 devices
(which, as a bonus, also fills up the root disk with an enormous amount of
repeated logging in ceph-volume.log as it tries over and over to figure out
whether a block device is a bluestore).
And trying to fix the service or unit files manually, to at least stop this
container from being incorrectly restarted over and over, is also a dead end:
the orchestration layer just overwrites them automatically and restarts the
services again.
I found that it seemed to be
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2
on my system that generated these files, so I tried tweaking that to have the
necessary 1200 second TimeoutStart, and that finally got the darn container
to stop restarting endlessly. (I admit I'm very fuzzy on how these services
and the orchestration are triggered, as I usually don't work on our storage
stuff.)
Still though, it takes 11 minutes to start each OSD service now, so this
isn't great.

We wanted to revert back to 16.2.10, but that also turns out to be a no-go,
as a new operation was added to bluefs (https://github.com/ceph/ceph/pull/42750)
in 16.2.11 (though this isn't mentioned in the changelogs; I had to compare
the source code to see that it was in fact added in 16.2.11). So trying to
revert an OSD then fails with:

debug 2023-04-04T11:42:45.927+ 7f2c12f6a200 -1 bluefs _replay 0x10:
stop: unrecognized op 12
debug 2023-04-04T11:42:45.927+ 7f2c12f6a200 -1 bluefs mount failed to
replay log: (5) Input/output error
debug 2023-04-04T11:42:45.927+ 7f2c12f6a200 -1
bluestore(/var/lib/ceph/osd/ceph-10) _open_bluefs failed bluefs mount: (5)
Input/output error
debug 2023-04-04T11:42:45.927+ 7f2c12f6a200 -1
bluestore(/var/lib/ceph/osd/ceph-10) _open_db failed to prepare db
environment:
debug 2023-04-04T11:42:45.927+ 7f2c12f6a200  1 bdev(0x5590e80a0400
/var/lib/ceph/osd/ceph-10/block) close
debug 2023-04-04T11:42:46.153+ 7f2c12f6a200 -1 osd.10 0 OSD:init:
unable to mount object store
debug 2023-04-04T11:42:46.153+ 7f2c12f6a200 -1  ** ERROR: osd init
failed: (5) Input/output error

Ouch
Best regards, Mikael


[ceph-users] Crushmap rule for multi-datacenter erasure coding

2023-04-04 Thread Michel Jouvin

Hi,

We have a 3-site Ceph cluster and would like to create a 4+2 EC pool 
with 2 chunks per datacenter, to maximise the resilience in case of 1 
datacenter being down. I have not found a way to create an EC profile 
with this 2-level allocation strategy. I created an EC profile with a 
failure domain = datacenter but it doesn't work because, I guess, it wants 
to ensure that 5 OSDs are always up (so that the pool remains R/W), whereas 
with a failure domain = datacenter the guarantee is only 4. 
My idea was to create a 2-step allocation and a failure domain=host to 
achieve our desired configuration, with something like the following in 
the crushmap rule:


step choose indep 3 type datacenter
step chooseleaf indep x type host
step emit

Is it the right approach? If yes, what should be 'x'? Would 0 work?

From what I have seen, there is no way to create such a rule with the 
'ceph osd crush' commands: I have to download the current CRUSHMAP, edit 
it and upload the modified version. Am I right?


Thanks in advance for your help or suggestions. Best regards,

Michel


[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-04 Thread Michel Jouvin
Answering my own question, I found the reason for 2147483647: it is 
documented as CRUSH failing to find enough OSDs (2147483647 is 2^31 - 1, the 
placeholder CRUSH uses for a missing OSD). And it is expected in my case, as 
I asked for different hosts for the 15 OSDs but I only have 12 hosts!


I'm still interested by an "expert" to confirm that LRC  k=9, m=3, l=4 
configuration is equivalent, in terms of redundancy, to a jerasure 
configuration with k=9, m=6.


Michel

On 04/04/2023 at 15:26, Michel Jouvin wrote:

Hi,

As discussed in another thread (Crushmap rule for multi-datacenter 
erasure coding), I'm trying to create an EC pool spanning 3 
datacenters (datacenters are present in the crushmap), with the 
objective of being resilient to 1 DC down, at least keeping read-only 
access to the pool and, if possible, read-write access, and having a 
storage efficiency better than 3-replica (let's say a storage overhead 
<= 2).


In the discussion, somebody mentioned LRC plugin as a possible 
jerasure alternative to implement this without tweaking the crushmap 
rule to implement the 2-step OSD allocation. I looked at the 
documentation 
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) 
but I have some questions if someone has experience/expertise with 
this LRC plugin.


I tried to create a rule for using 5 OSDs per datacenter (15 in 
total), with 3 (9 in total) being data chunks and others being coding 
chunks. For this, based on my understanding of the examples, I used k=9, 
m=3, l=4. Is that right? Is this configuration equivalent, in terms of 
redundancy, to a jerasure configuration with k=9, m=6?


The resulting rule, which looks correct to me, is:



{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -4,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "chooseleaf_indep",
    "num": 5,
    "type": "host"
    },
    {
    "op": "emit"
    }
    ]
}



Unfortunately, it doesn't work as expected: a pool created with this 
rule ends up with its PGs active+undersized, which is unexpected to 
me. Looking at 'ceph health detail' output, I see for each PG 
something like:


pg 52.14 is stuck undersized for 27m, current state active+undersized, 
last acting 
[90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]


For each PG, there are 3 '2147483647' entries and I guess this is the 
reason for the problem. What are these entries about? Clearly they are not 
OSD entries... It looks like a negative number, -1, which in terms of 
crushmap IDs is the crushmap root (named "default" in our 
configuration). Is there any trivial mistake I could have made?


Thanks in advance for any help, or for sharing any successful 
configuration!


Best regards,

Michel


[ceph-users] Help needed to configure erasure coding LRC plugin

2023-04-04 Thread Michel Jouvin

Hi,

As discussed in another thread (Crushmap rule for multi-datacenter 
erasure coding), I'm trying to create an EC pool spanning 3 datacenters 
(datacenters are present in the crushmap), with the objective of being 
resilient to 1 DC down, at least keeping read-only access to the pool 
and, if possible, read-write access, and having a storage efficiency 
better than 3-replica (let's say a storage overhead <= 2).


In the discussion, somebody mentioned LRC plugin as a possible jerasure 
alternative to implement this without tweaking the crushmap rule to 
implement the 2-step OSD allocation. I looked at the documentation 
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) but 
I have some questions if someone has experience/expertise with this LRC 
plugin.


I tried to create a rule for using 5 OSDs per datacenter (15 in total), 
with 3 (9 in total) being data chunks and others being coding chunks. 
For this, based on my understanding of the examples, I used k=9, m=3, l=4. 
Is that right? Is this configuration equivalent, in terms of redundancy, 
to a jerasure configuration with k=9, m=6?
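
(For reference, this is roughly how such a profile and pool would be created — a sketch only; profile and pool names are placeholders, and whether k=9, m=3, l=4 really gives the intended redundancy is exactly the open question above:

```
ceph osd erasure-code-profile set lrc_test \
    plugin=lrc k=9 m=3 l=4 \
    crush-locality=datacenter crush-failure-domain=host
ceph osd pool create test_lrc_pool 32 32 erasure lrc_test
```
)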


The resulting rule, which looks correct to me, is:



{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -4,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "chooseleaf_indep",
    "num": 5,
    "type": "host"
    },
    {
    "op": "emit"
    }
    ]
}



Unfortunately, it doesn't work as expected: a pool created with this 
rule ends up with its PGs active+undersized, which is unexpected to 
me. Looking at 'ceph health detail' output, I see for each PG 
something like:


pg 52.14 is stuck undersized for 27m, current state active+undersized, 
last acting 
[90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]


For each PG, there are 3 '2147483647' entries and I guess this is the 
reason for the problem. What are these entries about? Clearly they are not 
OSD entries... It looks like a negative number, -1, which in terms of 
crushmap IDs is the crushmap root (named "default" in our configuration). 
Is there any trivial mistake I could have made?


Thanks in advance for any help, or for sharing any successful configuration!

Best regards,

Michel


[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-04 Thread Michel Jouvin

Hi Frank,

Thanks for this additional information. Currently, I'd like to 
experiment with LRC, which provides a "natural" way to implement the 
multi-step OSD allocation ensuring the distribution across datacenters 
without tweaking the crushmap rule. Configuring the LRC plugin is far 
from obvious to me from the documentation, so for clarity I'll open a 
separate email thread to report what I have done and the problems I'm 
facing, hoping that this plugin is used by somebody reading the list!


Thanks again for your detailed answers. Best regards,

Michel

On 04/04/2023 at 10:10, Frank Schilder wrote:

Hi Michel,

I don't have experience with LRC profiles. They may reduce cross-site traffic 
at the expense of extra overhead. But this might actually be unproblematic with 
EC profiles that have a large m anyway. If you do experiments with this, 
please let the list know.

I would like to add here a remark that might be of general interest. One 
particular advantage of the 4+5 and 8+7 profile in addition to k being a power 
of 2 is the following. In case of DC failure, one has 6 or 10 shards available, 
respectively. This computes to being equivalent to 4+2 and 8+2. In other words, 
the 4+5 and 8+7 profiles allow maintenance under degraded conditions, because 
one can lose 1 DC + 1 other host and still have RW access. This is an advantage 
over 3-times replicated (with less overhead!), where if 1 DC is down one really 
needs to keep everything else running at all cost.

In addition to that, the 4+5 profile tolerates 4 and the 8+7 profile 6 hosts 
down. This means that one can use DCs that are independently administrated. A 
ceph upgrade would require only minimal coordination to synchronize the major 
steps. Specifically, after MONs and MGRs are upgraded, the OSD upgrade can 
proceed independently host by host on each DC without service outage. Even if 
something goes wrong in one DC, the others could proceed without service outage.

In many discussions I miss the impact of replication on maintainability (under 
degraded conditions). Just adding that here, because often the value of 
maintainability greatly outweighs the cost of the extra redundancy. For 
example, I gave up on trying to squeeze out the last byte of disks that are 
actually very cheap compared to my salary and rather have a system that runs 
without me doing overtime on incidents. It's much cheaper in the long run. A 
50-60% usable-capacity cluster is very easy and cheap to administrate.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michel Jouvin 
Sent: Monday, April 3, 2023 10:19 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Crushmap rule for multi-datacenter erasure coding

Hi Frank,

Thanks for this detailed answer. About your point of 4+2 or similar schemes 
defeating the purpose of a 3-datacenter configuration, you're right in 
principle. In our case, the goal is to avoid any impact for replicated pools 
(in particular RBD for the cloud) but it may be acceptable for some pools to be 
read-only during a short period. But I'll explore your alternative k+m scenarios 
as some may be interesting.

I'm also interested in experience feedback with LRC EC, even if I don't think 
it changes the problem for resilience to a DC failure.

Best regards,

Michel
Sent from my mobile

On 3 April 2023 21:57:41, Frank Schilder wrote:

Hi Michel,

failure domain = datacenter doesn't work, because crush wants to put 1 shard 
per failure domain and you have 3 data centers and not 6. The modified crush 
rule you wrote should work. I believe equally well with x=0 or 2 -- but try it 
out before doing anything to your cluster.

The easiest way for non-destructive testing is to download the osdmap from your 
cluster and from that map extract the crush map. You can then *without* 
modifying your cluster update the crush map in the (off-line) copy of the OSD 
map and let it compute mappings (commands for all this are in the ceph docs, 
look for osdmaptool). You can then check whether these mappings are as you want. 
There was an earlier case where someone posted a script to confirm mappings 
automatically. I used some awk magic, it's not that difficult.

As a note of warning, if you want to be failure resistant, don't use 4+2. Its 
not worth the effort of having 3 data centers. In case you lose one DC, you 
have only 4 shards left, in which case the pool becomes read-only. Don't even 
consider to set min_size=4, it again completely defeats the purpose of having 3 
DCs in the first place.

The smallest profile you can use that will ensure RW access in case of a DC 
failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards, which 
is equivalent to 5+1. Here, you have RW access in case of 1 DC down. However, 
k=5 is a prime number with negative performance impact, ideal are powers of 2 
for k. The alternative is k=4, m=5 (44% usable capacity) with good performance 
but higher redundancy overhead.

[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Work Ceph
The disks are 14.9, but the exact size of them does not matter much in this
context. We figured out the issue. The raw used space accounts for the
Rocks.DB and WAL space. Therefore, as we dedicated an NVME device in each
host for them, Ceph is showing that space as used space already. It is
funny though that it is still decrementing/discounting that space from the
total space we have for HDDs when presenting the status. That seems to be a
bug or incorrect information that is shown.

On Tue, Apr 4, 2023 at 8:31 AM Igor Fedotov  wrote:

> Originally you mentioned 14TB HDDs not 15TB. Could this be a trick?
>
> If not - please share "ceph osd df tree" output?
>
>
> On 4/4/2023 2:18 PM, Work Ceph wrote:
>
> Thank you guys for your replies. The "used space" there is exactly that.
> It is the accounting for Rocks.DB and WAL.
> ```
>
> RAW USED: The sum of USED space and the space allocated the db and wal 
> BlueStore partitions.
>
> ```
>
> There is one detail I do not understand. We are off-loading WAL and
> RocksDB to an NVME device; however, Ceph still seems to think that we use
> our data plane disks to store those elements. We have about 375TB (5 * 5 *
> 15) in HDD disks, and Ceph seems to be discounting from the usable space
> the volume (space) dedicated to WAL and Rocks.DB, which are applied into
> different disks; therefore, it shows as usable space 364 TB (after removing
> the space dedicated to WAL and Rocks.DB, which are in another device). Is
> that a bug of some sort?
>
>
> On Tue, Apr 4, 2023 at 6:31 AM Igor Fedotov  wrote:
>
>> Please also note that total cluster size reported below as SIZE
>> apparently includes DB volumes:
>>
>> # ceph df
>> --- RAW STORAGE ---
>> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
>> hdd    373 TiB  364 TiB  9.3 TiB   9.3 TiB       2.50
>>
>> On 4/4/2023 12:22 PM, Igor Fedotov wrote:
>> > Do you have standalone DB volumes for your OSD?
>> >
>> > If so then highly likely RAW usage is that high due to DB volumes
>> > space is considered as in-use one already.
>> >
>> > Could you please share "ceph osd df tree" output to prove that?
>> >
>> >
>> > Thanks,
>> >
>> > Igor
>> >
>> > On 4/4/2023 4:25 AM, Work Ceph wrote:
>> >> Hello guys!
>> >>
>> >>
>> >> We noticed an unexpected situation. In a recently deployed Ceph
>> >> cluster we
>> >> are seeing a raw usage, that is a bit odd. We have the following setup:
>> >>
>> >>
>> >> We have a new cluster with 5 nodes with the following setup:
>> >>
>> >> - 128 GB of RAM
>> >> - 2 cpus Intel(R) Intel Xeon Silver 4210R
>> >> - 1 NVME of 2 TB for the rocks DB caching
>> >> - 5 HDDs of 14TB
>> >> - 1 NIC dual port of 25GiB in BOND mode.
>> >>
>> >>
>> >> Right after deploying the Ceph cluster, we see a raw usage of about
>> >> 9TiB.
>> >> However, no load has been applied onto the cluster. Have you guys
>> >> seen such
>> >> a situation? Or, can you guys help understand it?
>> >>
>> >>
>> >> We are using Ceph Octopus, and we have set the following
>> configurations:
>> >>
>> >> ```
>> >>
>> >> ceph_conf_overrides:
>> >>
>> >>global:
>> >>
>> >>  osd pool default size: 3
>> >>
>> >>  osd pool default min size: 1
>> >>
>> >>  osd pool default pg autoscale mode: "warn"
>> >>
>> >>  perf: true
>> >>
>> >>  rocksdb perf: true
>> >>
>> >>mon:
>> >>
>> >>  mon osd down out interval: 120
>> >>
>> >>osd:
>> >>
>> >>  bluestore min alloc size hdd: 65536
>> >>
>> >>
>> >> ```
>> >>
>> >>
>> >> Any tip or help on how to explain this situation is welcome!
>> >
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>
>> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
>


[ceph-users] Re: Read and write performance on distributed filesystem

2023-04-04 Thread Xiubo Li


On 4/4/23 07:59, David Cunningham wrote:

Hello,

We are considering CephFS as an alternative to GlusterFS, and have some
questions about performance. Is anyone able to advise us please?

This would be for file systems between 100GB and 2TB in size, average file
size around 5MB, and a mixture of reads and writes. I may not be using the
correct terminology in the Ceph world, but in my parlance a node is a Linux
server running the Ceph storage software. Multiple nodes make up the whole
Ceph storage solution. Someone correct me if I should be using different
terms!

In our normal scenario the nodes in the replicated filesystem would be
around 0.3ms apart, but we're also interested in geographically remote
nodes which would be say 20ms away. We are using third party software which
relies on a traditional Linux filesystem, so we can't use an object storage
solution directly.

So my specific questions are:

1. When reading a file from CephFS, does it read from just one node, or
from all nodes?


Different objects could be on different nodes; if your read size is 
large enough, the crush rule will probably distribute the reads across 
different nodes.



2. If reads are from one node then does it choose the node with the fastest
response to optimise performance, or if from all nodes then will reads be
no faster than latency to the furthest node?


It chooses the primary OSD to issue the reads, but sometimes it will 
choose a replica OSD instead, for example for balancing reasons or if 
the primary OSD is not up, etc.




3. When writing to CephFS, are all nodes written to synchronously, or are
writes to one node which then replicates that to other nodes asynchronously?


My understanding is that RADOS will reply to the client only when all 
the replica OSDs have successfully written the data to cache or disk.



4. Can anyone give a recommendation on maximum latency between nodes to
have decent performance?

5. How does CephFS handle a node which suddenly becomes unavailable on the
network? Is the block time configurable, and how good is the healing
process after the lost node rejoins the network?


When a node goes down or comes up, the MON will push an updated osdmap to 
the clients, and the clients will rely on it to issue their requests. As I 
remember, there should be some options controlling this, but I need to 
confirm that later.
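
(For reference, the kind of options involved here — just a sketch, and defaults differ between releases, so please verify against your own version:

```
# grace period before peers report an unresponsive OSD as down
ceph config get osd osd_heartbeat_grace
# how long a down OSD stays "in" before it is marked out and recovery starts
ceph config get mon mon_osd_down_out_interval
```
)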


Thanks,

- Xiubo



6. I have read that CephFS is more complicated to administer than
GlusterFS. What does everyone think? Are things like healing after a net
split difficult for administrators new to Ceph to handle?

Thanks very much in advance.




[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Igor Fedotov

Originally you mentioned 14TB HDDs not 15TB. Could this be a trick?

If not - please share "ceph osd df tree" output?


On 4/4/2023 2:18 PM, Work Ceph wrote:
Thank you guys for your replies. The "used space" there is exactly 
that. It is the accounting for Rocks.DB and WAL.

```
RAW USED: The sum of USED space and the space allocated the db and wal 
BlueStore partitions.
```

There is one detail I do not understand. We are off-loading WAL and 
RocksDB to an NVME device; however, Ceph still seems to think that we 
use our data plane disks to store those elements. We have about 375TB 
(5 * 5 * 15) in HDD disks, and Ceph seems to be discounting from the 
usable space the volume (space) dedicated to WAL and Rocks.DB, which 
are applied into different disks; therefore, it shows as usable space 
364 TB (after removing the space dedicated to WAL and Rocks.DB, which 
are in another device). Is that a bug of some sort?



On Tue, Apr 4, 2023 at 6:31 AM Igor Fedotov  wrote:

Please also note that total cluster size reported below as SIZE
apparently includes DB volumes:

# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    373 TiB  364 TiB  9.3 TiB   9.3 TiB       2.50

On 4/4/2023 12:22 PM, Igor Fedotov wrote:
> Do you have standalone DB volumes for your OSD?
>
> If so then highly likely RAW usage is that high due to DB volumes
> space is considered as in-use one already.
>
> Could you please share "ceph osd df tree" output to prove that?
>
>
> Thanks,
>
> Igor
>
> On 4/4/2023 4:25 AM, Work Ceph wrote:
>> Hello guys!
>>
>>
>> We noticed an unexpected situation. In a recently deployed Ceph
>> cluster we
>> are seeing a raw usage, that is a bit odd. We have the
following setup:
>>
>>
>> We have a new cluster with 5 nodes with the following setup:
>>
>>     - 128 GB of RAM
>>     - 2 cpus Intel(R) Intel Xeon Silver 4210R
>>     - 1 NVME of 2 TB for the rocks DB caching
>>     - 5 HDDs of 14TB
>>     - 1 NIC dual port of 25GiB in BOND mode.
>>
>>
>> Right after deploying the Ceph cluster, we see a raw usage of
about
>> 9TiB.
>> However, no load has been applied onto the cluster. Have you guys
>> seen such
>> a situation? Or, can you guys help understand it?
>>
>>
>> We are using Ceph Octopus, and we have set the following
configurations:
>>
>> ```
>>
>> ceph_conf_overrides:
>>
>>    global:
>>
>>  osd pool default size: 3
>>
>>  osd pool default min size: 1
>>
>>  osd pool default pg autoscale mode: "warn"
>>
>>  perf: true
>>
>>  rocksdb perf: true
>>
>>    mon:
>>
>>  mon osd down out interval: 120
>>
>>    osd:
>>
>>  bluestore min alloc size hdd: 65536
>>
>>
>> ```
>>
>>
>> Any tip or help on how to explain this situation is welcome!
>
-- 
Igor Fedotov

Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at
https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us athttps://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web:https://croit.io  | YouTube:https://goo.gl/PGE1Bx


[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Work Ceph
Thank you guys for your replies. The "used space" there is exactly that. It
is the accounting for Rocks.DB and WAL.
```

RAW USED: The sum of USED space and the space allocated the db and wal
BlueStore partitions.

```

There is one detail I do not understand. We are off-loading WAL and RocksDB
to an NVMe device; however, Ceph still seems to think that we use our data
plane disks to store those elements. We have about 375 TB (5 * 5 * 15 TB) in
HDD disks, and Ceph seems to be subtracting from the usable space the
space dedicated to WAL and RocksDB, even though they live on different
disks; therefore, it shows 364 TB as usable space (after removing the space
dedicated to WAL and RocksDB, which are on another device). Is that a bug of
some sort?


On Tue, Apr 4, 2023 at 6:31 AM Igor Fedotov  wrote:

> Please also note that total cluster size reported below as SIZE
> apparently includes DB volumes:
>
> # ceph df
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    373 TiB  364 TiB  9.3 TiB   9.3 TiB       2.50
>
> On 4/4/2023 12:22 PM, Igor Fedotov wrote:
> > Do you have standalone DB volumes for your OSD?
> >
> > If so then highly likely RAW usage is that high due to DB volumes
> > space is considered as in-use one already.
> >
> > Could you please share "ceph osd df tree" output to prove that?
> >
> >
> > Thanks,
> >
> > Igor
> >
> > On 4/4/2023 4:25 AM, Work Ceph wrote:
> >> Hello guys!
> >>
> >>
> >> We noticed an unexpected situation. In a recently deployed Ceph
> >> cluster we
> >> are seeing a raw usage, that is a bit odd. We have the following setup:
> >>
> >>
> >> We have a new cluster with 5 nodes with the following setup:
> >>
> >> - 128 GB of RAM
> >> - 2 cpus Intel(R) Intel Xeon Silver 4210R
> >> - 1 NVME of 2 TB for the rocks DB caching
> >> - 5 HDDs of 14TB
> >> - 1 NIC dual port of 25GiB in BOND mode.
> >>
> >>
> >> Right after deploying the Ceph cluster, we see a raw usage of about
> >> 9TiB.
> >> However, no load has been applied onto the cluster. Have you guys
> >> seen such
> >> a situation? Or, can you guys help understand it?
> >>
> >>
> >> We are using Ceph Octopus, and we have set the following configurations:
> >>
> >> ```
> >>
> >> ceph_conf_overrides:
> >>
> >>global:
> >>
> >>  osd pool default size: 3
> >>
> >>  osd pool default min size: 1
> >>
> >>  osd pool default pg autoscale mode: "warn"
> >>
> >>  perf: true
> >>
> >>  rocksdb perf: true
> >>
> >>mon:
> >>
> >>  mon osd down out interval: 120
> >>
> >>osd:
> >>
> >>  bluestore min alloc size hdd: 65536
> >>
> >>
> >> ```
> >>
> >>
> >> Any tip or help on how to explain this situation is welcome!
> >
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
>


[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Igor Fedotov
Please also note that total cluster size reported below as SIZE 
apparently includes DB volumes:


# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    373 TiB  364 TiB  9.3 TiB   9.3 TiB       2.50

On 4/4/2023 12:22 PM, Igor Fedotov wrote:

Do you have standalone DB volumes for your OSD?

If so then highly likely RAW usage is that high due to DB volumes 
space is considered as in-use one already.


Could you please share "ceph osd df tree" output to prove that?


Thanks,

Igor

On 4/4/2023 4:25 AM, Work Ceph wrote:

Hello guys!


We noticed an unexpected situation. In a recently deployed Ceph 
cluster we

are seeing a raw usage, that is a bit odd. We have the following setup:


We have a new cluster with 5 nodes with the following setup:

    - 128 GB of RAM
    - 2 cpus Intel(R) Intel Xeon Silver 4210R
    - 1 NVME of 2 TB for the rocks DB caching
    - 5 HDDs of 14TB
    - 1 NIC dual port of 25GiB in BOND mode.


Right after deploying the Ceph cluster, we see a raw usage of about 
9TiB.
However, no load has been applied onto the cluster. Have you guys 
seen such

a situation? Or, can you guys help understand it?


We are using Ceph Octopus, and we have set the following configurations:

```

ceph_conf_overrides:

   global:

 osd pool default size: 3

 osd pool default min size: 1

 osd pool default pg autoscale mode: "warn"

 perf: true

 rocksdb perf: true

   mon:

 mon osd down out interval: 120

   osd:

 bluestore min alloc size hdd: 65536


```


Any tip or help on how to explain this situation is welcome!



--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Igor Fedotov

Do you have standalone DB volumes for your OSD?

If so, then RAW usage is most likely that high because the DB volume space 
is already counted as in use.


Could you please share "ceph osd df tree" output to prove that?


Thanks,

Igor

On 4/4/2023 4:25 AM, Work Ceph wrote:

Hello guys!


We noticed an unexpected situation. In a recently deployed Ceph cluster we
are seeing a raw usage, that is a bit odd. We have the following setup:


We have a new cluster with 5 nodes with the following setup:

- 128 GB of RAM
- 2 cpus Intel(R) Intel Xeon Silver 4210R
- 1 NVME of 2 TB for the rocks DB caching
- 5 HDDs of 14TB
- 1 NIC dual port of 25GiB in BOND mode.


Right after deploying the Ceph cluster, we see a raw usage of about 9TiB.
However, no load has been applied onto the cluster. Have you guys seen such
a situation? Or, can you guys help understand it?


We are using Ceph Octopus, and we have set the following configurations:

```

ceph_conf_overrides:

   global:

 osd pool default size: 3

 osd pool default min size: 1

 osd pool default pg autoscale mode: "warn"

 perf: true

 rocksdb perf: true

   mon:

 mon osd down out interval: 120

   osd:

 bluestore min alloc size hdd: 65536


```


Any tip or help on how to explain this situation is welcome!


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


[ceph-users] Re: CephFS thrashing through the page cache

2023-04-04 Thread Xiubo Li

Hi Ashu,

Yeah, please see 
https://patchwork.kernel.org/project/ceph-devel/list/?series=733010.


Sorry, I forgot to reply here.

- Xiubo

On 4/4/23 13:58, Ashu Pachauri wrote:

Hi Xiubo,

Did you get a chance to work on this? I am curious to test out the 
improvements.


Thanks and Regards,
Ashu Pachauri


On Fri, Mar 17, 2023 at 3:33 PM Frank Schilder  wrote:

Hi Ashu,

thanks for the clarification. That's not an option that is easy to
change. I hope that the modifications to the fs clients Xiubo has
in mind will improve that. Thanks for flagging this performance
issue. Would be great if this becomes part of a test suite.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Ashu Pachauri 
Sent: 17 March 2023 09:55:25
To: Xiubo Li
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: CephFS thrashing through the page cache

Hi Xiubo,

As you have correctly pointed out, I was talking about the
stipe_unit setting in the file layout configuration. Here is the
documentation for that for anyone else's reference:
https://docs.ceph.com/en/quincy/cephfs/file-layouts/

As with any RAID0 setup, the stripe_unit is definitely workload
dependent. Our use case requires us to read somewhere from a few
kilobytes to a few hundred kilobytes at once. Having a 4MB default
stripe_unit definitely hurts quite a bit. We were able to achieve
almost 2x improvement in terms of average latency and overall
throughput (for useful data) by reducing the stripe_unit. The rule
of thumb is that you want to align the stripe_unit to your most
common IO size.
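
(For reference, file layouts are set via virtual extended attributes, as described on the file-layouts page above — a sketch only; the directory path and the 64 KiB value are placeholders, and the layout only applies to files created afterwards:

```
# set a 64 KiB stripe unit on a directory; new files created there inherit it
setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/small-reads
# verify the resulting layout
getfattr -n ceph.dir.layout /mnt/cephfs/small-reads
```
)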

> BTW, have you tried to set 'rasize' option to a small size
instead of 0
> ? Won't this work ?

No, this won't work; I have tried it already. Since rasize simply
impacts readahead, your minimum io size to the cephfs client will
still be at the maximum of (rasize, stripe_unit). rasize is a
useful configuration only if it is required to be larger than the
stripe_unit, otherwise it's not. Also, it's worth pointing out
that simply setting rasize is not sufficient; one needs to change
the corresponding configurations that control maximum/minimum
readahead for ceph clients.

Thanks and Regards,
Ashu Pachauri


On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li
mailto:xiu...@redhat.com>> wrote:

On 15/03/2023 17:20, Frank Schilder wrote:
> Hi Ashu,
>
> are you talking about the kernel client? I can't find "stripe
size" anywhere in its mount-documentation. Could you possibly post
exactly what you did? Mount fstab line, config setting?

There is no mount option for this in either the userspace or the kernel
client. You need to change the file layout instead, which is (4MB
stripe_unit, 1 stripe_count and 4MB object_size) by default.

Certainly a smaller stripe_unit will work. But IMO it depends, and be
careful: changing the layout may cause other performance issues in some
cases, for example a too-small stripe_unit may split a sync read into
more requests to different OSDs.

I will prepare a patch to make the kernel client wiser instead of always
blindly using the stripe_unit.

Thanks

- Xiubo


>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Ashu Pachauri
mailto:ashu210...@gmail.com>>
> Sent: 14 March 2023 19:23:42
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: CephFS thrashing through the page cache
>
> Got the answer to my own question; posting here if someone else
> encounters the same problem. The issue is that the default
stripe size in a
> cephfs mount is 4 MB. If you are doing small reads (like 4k
reads in the
> test I posted) inside the file, you'll end up pulling at least
4MB to the
> client (and then discarding most of the pulled data) even if you set
> readahead to zero. So, the solution for us was to set a lower
stripe size,
> which aligns better with our workloads.
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri
mailto:ashu210...@gmail.com>> wrote:
>
>> Also, I am able to reproduce the network read amplification
when I try to
>> do very small reads from larger files. e.g.
>>
>> for i in $(seq 1 1); do
>>    dd if=test_${i} of=/dev/null bs=5k count=10
>> done
>>
>>
>> This piece of code generates a network traffic of 3.3 GB while
it actually
>> reads approx 500 MB of data.
>>
>>
>> Thanks and Regards,
>> Ashu Pachauri
>>
>> On Fri, Mar 10, 2023 at

[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-04 Thread Frank Schilder
Hi Michel,

I don't have experience with LRC profiles. They may reduce cross-site traffic 
at the expense of extra overhead. But this might actually be unproblematic with 
EC profiles that have a large m anyway. If you do experiments with this, 
please let the list know.

I would like to add here a remark that might be of general interest. One 
particular advantage of the 4+5 and 8+7 profile in addition to k being a power 
of 2 is the following. In case of DC failure, one has 6 or 10 shards available, 
respectively. This computes to being equivalent to 4+2 and 8+2. In other words, 
the 4+5 and 8+7 profiles allow maintenance under degraded conditions, because 
one can lose 1 DC + 1 other host and still have RW access. This is an advantage 
over 3-times replicated (with less overhead!), where if 1 DC is down one really 
needs to keep everything else running at all cost.

In addition to that, the 4+5 profile tolerates 4 and the 8+7 profile 6 hosts 
down. This means that one can use DCs that are independently administrated. A 
ceph upgrade would require only minimal coordination to synchronize the major 
steps. Specifically, after MONs and MGRs are upgraded, the OSD upgrade can 
proceed independently host by host on each DC without service outage. Even if 
something goes wrong in one DC, the others could proceed without service outage.

In many discussions I miss the impact of replication on maintainability (under 
degraded conditions). Just adding that here, because often the value of 
maintainability greatly outweighs the cost of the extra redundancy. For 
example, I gave up on trying to squeeze out the last byte of disks that are 
actually very cheap compared to my salary and rather have a system that runs 
without me doing overtime on incidents. It's much cheaper in the long run. A 
50-60% usable-capacity cluster is very easy and cheap to administrate.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michel Jouvin 
Sent: Monday, April 3, 2023 10:19 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Crushmap rule for multi-datacenter erasure coding

Hi Frank,

Thanks for this detailed answer. About your point of 4+2 or similar schemes 
defeating the purpose of a 3-datacenter configuration, you're right in 
principle. In our case, the goal is to avoid any impact for replicated pools 
(in particular RBD for the cloud) but it may be acceptable for some pools to be 
read-only during a short period. But I'll explore your alternative k+m scenarios 
as some may be interesting.

I'm also interested in experience feedback with LRC EC, even if I don't think 
it changes the problem for resilience to a DC failure.

Best regards,

Michel
Sent from my mobile

On 3 April 2023 21:57:41, Frank Schilder wrote:

Hi Michel,

failure domain = datacenter doesn't work, because crush wants to put 1 shard 
per failure domain and you have 3 data centers and not 6. The modified crush 
rule you wrote should work. I believe equally well with x=0 or 2 -- but try it 
out before doing anything to your cluster.

The easiest way for non-destructive testing is to download the osdmap from your 
cluster and from that map extract the crush map. You can then *without* 
modifying your cluster update the crush map in the (off-line) copy of the OSD 
map and let it compute mappings (commands for all this are in the ceph docs, 
look for osdmaptool). You can then check whether these mappings are as you want. 
There was an earlier case where someone posted a script to confirm mappings 
automatically. I used some awk magic, it's not that difficult.
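
(A rough sketch of that workflow — file names, the rule id and the pool id are placeholders:

```
ceph osd getmap -o osdmap.bin                      # grab the current osdmap
osdmaptool osdmap.bin --export-crush crush.bin     # extract its crush map
crushtool -d crush.bin -o crush.txt                # decompile, then edit crush.txt
crushtool -c crush.txt -o crush.new
osdmaptool osdmap.bin --import-crush crush.new     # update the offline copy only
osdmaptool osdmap.bin --test-map-pgs-dump --pool 52    # dump the resulting PG mappings
```
)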

As a note of warning, if you want to be failure resistant, don't use 4+2. Its 
not worth the effort of having 3 data centers. In case you lose one DC, you 
have only 4 shards left, in which case the pool becomes read-only. Don't even 
consider to set min_size=4, it again completely defeats the purpose of having 3 
DCs in the first place.

The smallest profile you can use that will ensure RW access in case of a DC 
failure is 5+4 (55% usable capacity). If 1 DC fails, you have 6 shards, which 
is equivalent to 5+1. Here, you have RW access in case of 1 DC down. However, 
k=5 is a prime number with negative performance impact, ideal are powers of 2 
for k. The alternative is k=4, m=5 (44% usable capacity) with good performance 
but higher redundancy overhead.
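
(As an illustration only — the profile name is a placeholder, and the two-step rule is the same modified rule discussed in this thread, here with 3 shards per datacenter for a 4+5 profile:

```
ceph osd erasure-code-profile set ec45 plugin=jerasure k=4 m=5 crush-failure-domain=host
# the matching crush rule then needs the two-step placement, e.g.:
#   step take default
#   step choose indep 3 type datacenter
#   step chooseleaf indep 3 type host
#   step emit
```
)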

You can construct valid schemes by looking at all N multiples of 3 and trying 
k<=(2N/3-1):

N=6 -> k=2 m=4
N=9 -> k=4, m=5 ; k=5, m=4
N=12 -> k=6, m=6; k=7, m=5
N=15 -> k=8, m=7 ; k=9, m=6

As you can see, the larger N the smaller the overhead. The downside is larger 
stripes, meaning that larger N only make sense if you have a large 
files/objects. An often overlooked advantage of profiles with large m is that 
you can defeat tail latencies for read operations by setting fast_read=true for 
the pool. This is really great when you have silently failing