[ceph-users] Crushmap rule for multi-datacenter erasure coding

2023-04-03 Thread Michel Jouvin
Hi, We have a 3-site Ceph cluster and would like to create a 4+2 EC pool with 2 chunks per datacenter, to maximise the resilience in case of 1 datacenter being down. I have not found a way to create an EC profile with this 2-level allocation strategy. I created an EC profile with a failure do
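For reference, a minimal sketch of the kind of profile plus hand-written CRUSH rule this thread is asking about: a jerasure 4+2 profile and a rule that picks 3 datacenters and 2 hosts in each. The profile, rule and pool names and the rule id are made up for illustration, not taken from the thread.

    ceph osd erasure-code-profile set ec42 plugin=jerasure k=4 m=2 crush-failure-domain=host
    # rule added to a decompiled crushmap (crushtool -d), then recompiled and re-injected:
    rule ec42_multidc {
        id 99
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type datacenter     # pick 3 datacenters
        step chooseleaf indep 2 type host       # 2 chunks per DC, on distinct hosts
        step emit
    }
    ceph osd pool create ecpool 128 128 erasure ec42 ec42_multidc

Note that with k=4, m=2 the default min_size of k+1=5 means a whole DC (2 chunks) down still blocks I/O, which is presumably why the thread goes on to discuss other layouts.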

[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-03 Thread Michel Jouvin
=== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Michel Jouvin Sent: Monday, April 3, 2023 6:40 PM To: ceph-users@ceph.io Subject: [ceph-users] Crushmap rule for multi-datacenter erasure coding Hi, We have a 3-site Ceph cluster and would like t

[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-04 Thread Michel Jouvin
without me doing overtime on incidents. It's much cheaper in the long run. A 50-60% usable-capacity cluster is very easy and cheap to administer. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Michel Jouvin Sent: Mon

[ceph-users] Help needed to configure erasure coding LRC plugin

2023-04-04 Thread Michel Jouvin
Hi, As discussed in another thread (Crushmap rule for multi-datacenter erasure coding), I'm trying to create an EC pool spanning 3 datacenters (datacenters are present in the crushmap), with the objective of being resilient to 1 DC down, at least keeping read-only access to the pool and if po
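For context, a profile along the lines discussed later in the thread (k=9, m=3, l=4, i.e. three groups of l+1=5 chunks, one group per datacenter) might look like the sketch below; the names are placeholders and only the parameters actually mentioned in the thread are used.

    ceph osd erasure-code-profile set lrc3dc plugin=lrc k=9 m=3 l=4 \
        crush-locality=datacenter crush-failure-domain=host
    ceph osd pool create lrcpool 256 256 erasure lrc3dc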

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-04 Thread Michel Jouvin
, l=4 configuration is equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6. Michel On 04/04/2023 at 15:26, Michel Jouvin wrote: Hi, As discussed in another thread (Crushmap rule for multi-datacenter erasure coding), I'm trying to create an EC pool spanning 3

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-05 Thread Michel Jouvin
. Thanks in advance if somebody could provide some sort of authoritative answer on these 2 questions. Best regards, Michel On 04/04/2023 at 15:53, Michel Jouvin wrote: Answering myself, I found the reason for 2147483647: it's documented as a failure to find enough OSDs (missing OSDs). And

[ceph-users] Pacific dashboard: unable to get RGW information

2023-04-11 Thread Michel Jouvin
Hi, Our cluster is running Pacific 16.2.10. We have a problem using the dashboard to display information about RGWs configured in the cluster. When clicking on "Object Gateway", we get an error 500. Looking in the mgr logs, I found that the problem is that the RGW is accessed by its IP addres

[ceph-users] Re: Pacific dashboard: unable to get RGW information

2023-04-11 Thread Michel Jouvin
rsion 16.2.11 (which was just recently released) contains a fix for that. But it still doesn’t work with wildcard certificates; that’s still an issue for us. Quoting Michel Jouvin: Hi, Our cluster is running Pacific 16.2.10. We have a problem using the dashboard to display information about
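If the 500 error really comes from TLS verification of the RGW endpoint (an assumption; it may not match every setup), the workaround usually mentioned on the list is to disable certificate verification for the dashboard's RGW client:

    ceph dashboard set-rgw-api-ssl-verify False
    # often followed by a dashboard restart to make sure the change is picked up:
    ceph mgr module disable dashboard && ceph mgr module enable dashboard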

[ceph-users] Re: 17.2.6 Dashboard/RGW Signature Mismatch

2023-04-13 Thread Michel Jouvin
Hi, For what it's worth, we have a similar problem in 16.2.10 that I have not had time to troubleshoot yet. It happened after adding a haproxy in front of rgw to manage https and switch rgw to http (to work around the other problem mentioned when using https in rgw). The access/secret key is refused despite
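When the dashboard reports the RGW access/secret key as refused, one generic check (not a confirmed fix for the haproxy case) is to verify the RGW system user and re-set the dashboard credentials from files:

    radosgw-admin user info --uid=<dashboard-system-user>   # confirm the keys and that "system" is true
    echo -n "<access-key>" > /tmp/rgw-access
    echo -n "<secret-key>" > /tmp/rgw-secret
    ceph dashboard set-rgw-api-access-key -i /tmp/rgw-access
    ceph dashboard set-rgw-api-secret-key -i /tmp/rgw-secret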

[ceph-users] Re: upgrading from el7 / nautilus

2023-04-19 Thread Michel Jouvin
Hi Marc, I can share what we did a few months ago. As a remark, I am not sure Nautilus is available on EL8, but maybe I missed it. In our case we followed this path: - Nautilus to Octopus on EL7, traditionally managed - Conversion of the cluster to a cephadm cluster as it makes every up
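The conversion step mentioned above is essentially the documented cephadm adoption procedure; a rough outline (daemon names are examples):

    cephadm adopt --style legacy --name mon.$(hostname -s)
    cephadm adopt --style legacy --name mgr.$(hostname -s)
    cephadm adopt --style legacy --name osd.12   # repeat for each OSD id on the host
    cephadm ls                                   # check the daemons are now containerized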

[ceph-users] 17.2.6 dashboard: unable to get RGW dashboard working

2023-04-20 Thread Michel Jouvin
Hi, I just upgraded to 17.2.6 but in fact I had the same problem in 16.2.10. I'm trying to configure the Ceph dashboard to monitor the RGWs (object gateways used as S3 gw). Our cluster has 2 RGW realms (eros, fink) with 1 zonegroup per realm (p2io-eros and p2io-fink respectively) and 1 zone p

[ceph-users] Re: 17.2.6 dashboard: unable to get RGW dashboard working

2023-04-21 Thread Michel Jouvin
at 12:55, Michel Jouvin wrote: Hi, I just upgraded to 17.2.6 but in fact I had the same problem in 16.2.10. I'm trying to configure the Ceph dashboard to monitor the RGWs (object gateways used as S3 gw). Our cluster has 2 RGW realms (eros, fink) with 1 zonegroup per realm (p2io-eros and

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-24 Thread Michel Jouvin
down (5 for R/W access) so I'm wondering if there is something wrong in our pool configuration (k=9, m=6, l=5). Cheers, Michel On 06/04/2023 at 08:51, Michel Jouvin wrote: Hi, Is somebody using the LRC plugin? I came to the conclusion that LRC k=9, m=3, l=4 is not the same as jerasure

[ceph-users] Re: v16.2.12 Pacific (hot-fix) released

2023-04-24 Thread Michel Jouvin
Hi Wesley, I can only answer your second question and give an opinion on the last one! - Yes, the OSD activation problem (in cephadm clusters only) was introduced by an unfortunate change (indentation problem in Python code) in 16.2.11. The issue doesn't exist in 16.2.10 and is one of the fixed

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-28 Thread Michel Jouvin
that as I was using the LRC plugin, I had the guarantee that I could lose a site without impact, thus the possibility to lose 1 OSD server. Am I wrong? Best regards, Michel On 24/04/2023 at 13:24, Michel Jouvin wrote: Hi, I'm still interested in getting feedback from those using

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-04-29 Thread Michel Jouvin
t wrote: Hello, What is your current setup, 1 server per data center with 12 OSDs each? What is your current crush rule and LRC crush rule? On Fri, Apr 28, 2023, 12:29 Michel Jouvin wrote: Hi, I think I found a possible cause of my PG down but still don't understand why. As explai

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Michel Jouvin
SD), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation. Regards, Eugen Quoting Michel Jouvin: Hi, No... our current setup is 3 datacenters with the same configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-16 Thread Michel Jouvin
ck wrote: Hi, I don't think you've shared your osd tree yet, could you do that? Apparently nobody else but us reads this thread or nobody reading this uses the LRC plugin. ;-) Thanks, Eugen Quoting Michel Jouvin: Hi, I had to restart one of my OSD servers today and the prob

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-21 Thread Michel Jouvin
e only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center min, but that could be my lack of knowledge on some of the step rules. Thanks Curt On Tue, May 16, 2023 at 11:09 AM Michel Jouvin < michel.jou...@ijclab.in2p3.fr> wrote: Hi

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-26 Thread Michel Jouvin
interested. Best regards, Michel On 21/05/2023 at 16:07, Michel Jouvin wrote: Hi Eugen, My LRC pool is also somewhat experimental so nothing really urgent. If you manage to do some tests that help me understand the problem I remain interested. I propose to keep this thread for that

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-05-26 Thread Michel Jouvin
Hi Patrick, It is weird; we have a couple of clusters with cephadm running Pacific or Quincy and ceph orch device ls works well. Have you looked at the cephadm logs (ceph log last cephadm)? Unless you are using very specific hardware, I suspect Ceph is suffering from a problem outside it.
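A few generic checks for this kind of problem (nothing specific to Patrick's hardware):

    ceph log last cephadm            # the cephadm/mgr log mentioned above
    ceph orch device ls --refresh    # force a new inventory scan instead of cached data
    cephadm ceph-volume inventory    # run the inventory directly on the affected host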

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-05-26 Thread Michel Jouvin
e to Pacific? Could this be a workaround for this sort of regression from Octopus to Pacific? Maybe updating the BIOS from 1.7.1 to 1.8.1? All this is a little bit confusing for me as I'm trying to discover Ceph 😁 Thanks Patrick On 26/05/2023 at 17:19, Michel Jouvin wrote: Hi Patrick

[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

2023-05-30 Thread Michel Jouvin
+1 Michel On 30/05/2023 at 11:23, Frank Schilder wrote: What I have in mind is the case where the command is already in the shell history. A wrong history reference can execute a command with "--yes-i-really-mean-it" even though you really don't mean it. Been there. For an OSD this is maybe tolerable, but fo

[ceph-users] Re: Converting to cephadm : Error EINVAL: Failed to connect

2023-06-02 Thread Michel Jouvin
Hi David, Normally cephadm connection issues are not that difficult to solve. It is just a matter of having the appropriate SSH configuration in the root account, mainly the public key used by cephadm (extracted with the command you used in a shell) added to the root account's .ssh/authorized_k
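For reference, the usual sequence to distribute the cephadm SSH key to a new host (host name and address are placeholders):

    ceph cephadm get-pub-key > ~/ceph.pub
    ssh-copy-id -f -i ~/ceph.pub root@<new-host>
    # the orchestrator should then be able to reach the host:
    ceph orch host add <new-host> <ip-addr>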

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Michel Jouvin
ove to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirements are to sustain the loss of an entire DC. Thanks, Eugen [1] https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plu

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-06-28 Thread Michel Jouvin
, Michel On 26/05/2023 at 18:50, Michel Jouvin wrote: Patrick, I can only say that I would not expect a specific problem due to your hardware. Upgrading the firmware is generally a good idea but I wouldn't expect it to help in your case if the OS (lsblk) sees the disk. As for starting

[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-16 Thread Michel Jouvin
Hi Niklas, I am not sure why you are surprised. In a large cluster, you should expect some rebalancing on every crush map or crush map rule change. Ceph doesn't just enforce the failure domain, it also wants to have a "perfect" pseudo-random distribution across the cluster based on the crus

[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-20 Thread Michel Jouvin
Hi Niklas, As I said, Ceph placement is based on more than fulfilling the failure domain constraint. This is a core feature of Ceph's design. There is no reason for a rebalancing on a cluster with a few hundred OSDs to last a month. Note that before 17 you have to adjust the max backfills parameter
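A sketch of the throttling being referred to (values are illustrative; before Quincy these settings are honoured directly, while with mClock in 17+ they are largely managed by the scheduler):

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    # on recent Quincy/Reef point releases with the mClock scheduler, overriding
    # them additionally needs:
    ceph config set osd osd_mclock_override_recovery_settings true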

[ceph-users] Re: Impacts on doubling the size of pgs in a rbd pool?

2023-10-03 Thread Michel Jouvin
Hi Herve, Why don't you use the automatic adjustment of the number of PGs (the PG autoscaler)? This makes life much easier and works well. Cheers, Michel On 03/10/2023 at 17:06, Hervé Ballans wrote: Hi all, Sorry for the reminder, but does anyone have any advice on how to deal with this? Many thanks! Her
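Enabling the PG autoscaler on a pool looks like this (the pool name is an example):

    ceph osd pool set rbd_pool pg_autoscale_mode on
    ceph osd pool autoscale-status      # shows current vs. suggested PG counts per pool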

[ceph-users] Quincy: failure to enable mgr rgw module if not --force

2023-10-24 Thread Michel Jouvin
Hi, I'm trying to use the rgw mgr module to configure RGWs. Unfortunately it is not present in 'ceph mgr module ls' list and any attempt to enable it suggests that one mgr doesn't support it and that --force should be added. Adding --force effectively enabled it. It is strange as it is a bra
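For reference, the commands involved (the --force behaviour is exactly what the message describes; whether using it is safe here is the open question):

    ceph mgr module ls | grep -i rgw      # the module does not show up as enable-able
    ceph mgr module enable rgw --force    # bypasses the "one mgr doesn't support it" check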

[ceph-users] cephadm failing to add hosts despite a working SSH connection

2023-10-25 Thread Michel Jouvin
Hi, I'm struggling with a problem adding some hosts to cephadm in our Quincy cluster. "ceph orch host add host addr" fails with the famous "missing 2 required positional arguments: 'hostname' and 'addr'" because of bug https://tracker.ceph.com/issues/59081 but looking at cephadm messages with "c

[ceph-users] Re: cephadm failing to add hosts despite a working SSH connection

2023-10-25 Thread Michel Jouvin
1 Gb management network so no point in using Jumbo frames). I cannot think of anything that Ceph could have mentioned to help diagnose this. Best regards, Michel On 25/10/2023 at 14:42, Michel Jouvin wrote: Hi, I'm struggling with a problem adding some hosts to cephadm in our Quincy cluster.

[ceph-users] Re: cephadm vs ceph.conf

2023-11-23 Thread Michel Jouvin
Hi Albert, You should never edit any file in the containers; cephadm takes care of them. Most of the parameters described in the doc you mentioned are better managed with the "ceph config" command, in the Ceph configuration database. If you want to run the ceph command on a Ceph machine outside a c
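A few examples of what "managed with ceph config" means in practice (option and daemon names are illustrative):

    ceph config set osd osd_memory_target 4294967296          # stored in the mon config database
    ceph config get osd.12 osd_memory_target                  # what one daemon will actually use
    ceph config generate-minimal-conf > /etc/ceph/ceph.conf   # minimal ceph.conf for hosts outside the cluster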

[ceph-users] Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-20 Thread Michel Jouvin
Hi, We have a Reef cluster that started to complain a couple of weeks ago about ~20 PGs (out of ~10K) not scrubbed/deep-scrubbed in time. Looking at it for a few days, I saw this affects only those PGs that could not be scrubbed since mid-February. All the other PGs are regularly scrubbed. I d
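Some generic commands for digging into this kind of report (the PG id is a placeholder):

    ceph health detail | grep -i scrub    # which PGs are flagged
    ceph pg 12.1f query | grep -i stamp   # last (deep-)scrub timestamps of one PG
    ceph pg deep-scrub 12.1f              # explicitly queue a deep scrub of that PG
    ceph config get osd osd_max_scrubs    # how many concurrent scrubs an OSD allows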

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-20 Thread Michel Jouvin
Hi Rafael, Good to know I am not alone! Additional information ~6h after the OSD restart: of the 20 PGs impacted, 2 have been processed successfully... I don't have a clear picture of how Ceph prioritizes the scrub of one PG over another; I had thought that the oldest/expired scrubs are take

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-21 Thread Michel Jouvin
low Universitätsrechenzentrum (URZ) Universität Greifswald Felix-Hausdorff-Straße 18 17489 Greifswald Germany Tel.: +49 3834 420 1450 --- Original Message --- *Subject:* [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month *From:* "Michel Jouvin" <mailto:michel.j

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin
still flooded with "scrub starts" and I have no clue why these OSDs are causing the problems. Will investigate further. Best regards, Gunnar === Gunnar Bandelow Universitätsrechenzentrum (URZ) Universität Greifswald Felix-Hausdorff-Straße 18 17489 Gr

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin
onfiguration setting if you haven't already, but note that it can impact client I/O performance. Also, if the delays appear to be related to a single OSD, have you checked the health and performance of this device? On Fri, 22 Mar 2024 at 09:29, Michel Jouvin wrote: Hi, As I s

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin
en setting osd_op_queue back to wpq for this one OSD would probably reveal it. Not sure about the implication of having a single OSD running a different scheduler in the cluster though. - On 22 Mar 24, at 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote: Pierre, Yes, as ment

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin
rvers at some points. It is unclear why it is not enough to just rerun the benchmark and why a crazy value for an HDD is found... Best regards, Michel On 22/03/2024 at 14:44, Michel Jouvin wrote: Hi Frédéric, I think you raise the right point, sorry if I misunderstood Pierre's suggesti

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin
OSD would probably reveal it. Not sure about the implication of having a single OSD running a different scheduler in the cluster though. > > > - On 22 Mar 24, at 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote: > >> Pierre, >>

[ceph-users] Re: 18.8.2: osd_mclock_iops_capacity_threshold_hdd untypical values

2024-03-22 Thread Michel Jouvin
ing osd_op_queue back to wpq for this one OSD would probably reveal it. Not sure about the implication of having a single OSD running a different scheduler in the cluster though. > > > - On 22 Mar 24, at 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote
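For reference, the settings this sub-thread revolves around; the values are illustrative and the right numbers depend on the drives:

    ceph config show osd.0 osd_mclock_max_capacity_iops_hdd      # capacity measured by the OSD's own bench at startup
    ceph config rm osd.0 osd_mclock_max_capacity_iops_hdd        # drop a bogus measured value...
    ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350   # ...or pin a sane one instead
    ceph config get osd osd_mclock_iops_capacity_threshold_hdd   # ceiling above which a measured value is rejected (Reef)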

[ceph-users] Re: Upgrading Ceph Cluster OS

2024-05-13 Thread Michel Jouvin
Nima, Can you also specify the Ceph version you are using and whether your current configuration is cephadm-based? Michel On 13/05/2024 at 15:19, Götz Reinicke wrote: Hi, On 11.05.2024 at 15:54, Nima AbolhassanBeigi wrote: Hi, We want to upgrade the OS version of our production ceph

[ceph-users] 15.2.17: RGW deploy through cephadm exits immediately with exit code 5/NOTINSTALLED

2022-09-28 Thread Michel Jouvin
Hi, We have a cephadm-based Octopus (upgraded to 15.2.17 today but the problem started with 15.2.16) cluster where we try to deploy a RGW in multisite configuration. We followed the documentation at https://docs.ceph.com/en/octopus/radosgw/multisite/ to do the basic realm, zonegroup, zone and
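For readers following along, the "basic realm, zonegroup, zone" setup referred to looks roughly like this (names, endpoint and placement are made-up examples; the cephadm service command shown is the Octopus-era positional form):

    radosgw-admin realm create --rgw-realm=myrealm --default
    radosgw-admin zonegroup create --rgw-zonegroup=myzg --endpoints=http://rgw1:8080 --master --default
    radosgw-admin zone create --rgw-zonegroup=myzg --rgw-zone=myzone --endpoints=http://rgw1:8080 --master --default
    radosgw-admin period update --commit
    ceph orch apply rgw myrealm myzone --placement="1 rgw1"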

[ceph-users] Re: 15.2.17: RGW deploy through cephadm exits immediately with exit code 5/NOTINSTALLED

2022-09-28 Thread Michel Jouvin
One additional information that may be relevant for the problem: the server hosting the RGW has 2 networks configured on the same interface and the default one is not the Ceph public network but another network not related to Ceph. I suspect a Podman configuration issue where the default networ

[ceph-users] Re: 15.2.17: RGW deploy through cephadm exits immediately with exit code 5/NOTINSTALLED

2022-09-29 Thread Michel Jouvin
Hi, Unfortunately it was a wrong track. The problem remains the same, with the same error messages, on another host with only one network address in the Ceph cluster public network. BTW, "cephadm shell --name rgw_daemon" works and from the shell I can use radosgw-admin and the ceph command, suggesti

[ceph-users] Re: 15.2.17: RGW deploy through cephadm exits immediately with exit code 5/NOTINSTALLED

2022-10-04 Thread Michel Jouvin
ter starting) but it makes diagnosing such a trivial misconfiguration difficult... Cheers, Michel On 29/09/2022 at 14:04, Michel Jouvin wrote: Hi, Unfortunately it was a wrong track. The problem remains the same, with the same error messages, on another host with only one network address

[ceph-users] 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-16 Thread Michel Jouvin
Hi, We have a production cluster made of 12 OSD servers with 16 OSDs each (all the same HW) which has been running fine for 5 years (initially installed with Luminous) and which has been running Octopus (15.2.16) for 1 year and was recently upgraded to 15.2.17 (1 week before the problem starte

[ceph-users] Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-16 Thread Michel Jouvin
CSI bus when failing. Stopping the osd revives the rest of the disks on the box). Cheers, Dan On Sun, Oct 16, 2022, 22:08 Michel Jouvin wrote: Hi, We have a production cluster made of 12 OSD servers with 16 OSDs each (all the same HW) which has been running fine for 5 years

[ceph-users] Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-17 Thread Michel Jouvin
received. Cheers, Michel On 16/10/2022 at 22:49, Michel Jouvin wrote: Hi, In fact, a very stupid mistake. This is a CentOS 8 system where smartd was not installed. After installing and starting it, the OSD device is indeed in bad shape... Sorry for the noise. Cheers, Michel On 16/10/2022 at 22

[ceph-users] Re: Re-install host OS on Ceph OSD node

2022-10-19 Thread Michel Jouvin
Hi, Eugen is right, no need to drain and re-add your OSDs; this would be very long and lead to a lot of unnecessary rebalancing. You should decouple replacing the failed OSD (where you should try to drain it, after setting its primary-affinity to 0 to limit the I/O redirected to it, and read
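Roughly, the commands behind that advice (the OSD id and host are placeholders; the activate step applies to cephadm-managed clusters):

    ceph osd primary-affinity osd.42 0    # stop serving reads as primary from the failing OSD
    ceph osd out osd.42                   # drain only the failed OSD
    # after the OS reinstall, the untouched OSDs can simply be brought back:
    ceph cephadm osd activate <host>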