[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Andres Rojas Guerrero
But in another cluster with version 14.2.16 it's working ... it seems to be a problem with version 14.2.6 ...? On 6/5/21 at 18:28, Clyso GmbH - Ceph Foundation Member wrote: Hi Andres, does the command work with the original rule/crushmap? ___ Clyso Gmb

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Andres Rojas Guerrero
No, it doesn't work with an unedited crush map file. On 6/5/21 at 18:28, Clyso GmbH - Ceph Foundation Member wrote: Hi Andres, does the command work with the original rule/crushmap? ___ Clyso GmbH - Ceph Foundation Member supp...@clyso.com https://www.cl

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Clyso GmbH - Ceph Foundation Member
Hi Andres, does the command work with the original rule/crushmap? ___ Clyso GmbH - Ceph Foundation Member supp...@clyso.com https://www.clyso.com On 06.05.2021 at 15:21, Andres Rojas Guerrero wrote: Yes, my ceph version is Nautilus: # ceph -v ceph version 14

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Andres Rojas Guerrero
Yes, my ceph version is Nautilus: # ceph -v ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable) First dump the crush map: # ceph osd getcrushmap -o crush_map Then, decompile the crush map: # crushtool -d crush_map -o crush_map_d Now, edit the crush rule and co
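
The message is truncated above; for reference, the remaining steps of this workflow are typically the following (a sketch, assuming the file names used above and rule id 2 with 7 chunks as elsewhere in this thread):

# ceph osd getcrushmap -o crush_map
# crushtool -d crush_map -o crush_map_d
# (edit the rule in crush_map_d)
# crushtool -c crush_map_d -o crush_map_new
# crushtool -i crush_map_new --test --rule 2 --num-rep 7 --show-bad-mappings
# ceph osd setcrushmap -i crush_map_new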

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Eugen Block
Interesting, I haven't seen that yet with crushtool. Your ceph version is Nautilus, right? And you did decompile the binary crushmap with crushtool, correct? I don't know how to reproduce that. Quoting Andres Rojas Guerrero: I have this error when I try to show mappings with crushtool: # c

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Andres Rojas Guerrero
I have this error when I try to show mappings with crushtool: # crushtool -i crush_map_new --test --rule 2 --num-rep 7 --show-mappings CRUSH rule 2 x 0 [-5,-45,-49,-47,-43,-41,-29] *** Caught signal (Segmentation fault) ** in thread 7f7f7a0ccb40 thread_name:crushtool On 6/5/21 at 13:47, Euge

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Andres Rojas Guerrero
Ok, thank you very much for the answer. On 6/5/21 at 13:47, Eugen Block wrote: > Yes it is possible, but you should validate it with crushtool before > injecting it to make sure the PGs land where they belong. > > crushtool -i crushmap.bin --test --rule 2 --num-rep 7 --show-mappings > crush

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Eugen Block
Yes it is possible, but you should validate it with crushtool before injecting it to make sure the PGs land where they belong. crushtool -i crushmap.bin --test --rule 2 --num-rep 7 --show-mappings crushtool -i crushmap.bin --test --rule 2 --num-rep 7 --show-bad-mappings If you don't get bad ma
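
If the test reports no bad mappings, the compiled map can then be injected and the cluster watched while data moves to the new layout (a sketch; crushmap.bin stands for the compiled map tested above):

# ceph osd setcrushmap -i crushmap.bin
# ceph -s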

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-06 Thread Andres Rojas Guerrero
Hi, I am trying to make a new crush rule (Nautilus) in order to move the failure domain to hosts: "rule_id": 2, "rule_name": "nxtcloudAFhost", "ruleset": 2, "type": 3, "min_size": 3, "max_size": 7, "steps": [ { "op":
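
The rule JSON is cut off above; for illustration only, an erasure rule with a host failure domain typically looks like this in decompiled crushmap syntax (the name and the take step are assumptions based on the "nxtcloudAFhost" rule above; the key line is the chooseleaf step with type host):

rule nxtcloudAFhost {
    id 2
    type erasure
    min_size 3
    max_size 7
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}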

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Joachim Kraftmayer
Create a new crush rule with the correct failure domain, test it properly and assign it to the pool(s). -- Best regards, Joachim Kraftmayer ___ Clyso GmbH On 05.05.2021 at 15:11, Andres Rojas Guerrero wrote: Nice observation, how can I avoid this problem? El 5
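
A minimal sketch of that sequence, assuming an EC profile with k+m=7 to match the num-rep 7 used elsewhere in this thread (the profile, rule and pool names here are placeholders, not taken from the cluster):

# ceph osd erasure-code-profile set nxtcloud_host k=4 m=3 crush-failure-domain=host
# ceph osd crush rule create-erasure nxtcloudAFhost nxtcloud_host
# ceph osd pool set <poolname> crush_rule nxtcloudAFhost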

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
Thanks, I will test it. On 5/5/21 at 16:37, Joachim Kraftmayer wrote: Create a new crush rule with the correct failure domain, test it properly and assign it to the pool(s). -- *** Andrés Rojas Guerrero Unidad Sistemas Linux Area Arqu

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Joachim Kraftmayer
Hi Andres, the crush rule with ID 1 distributes your EC chunks over the osds without considering the ceph host. As Robert already suspected. Greetings, Joachim ___ Clyso GmbH Homepage: https://www.clyso.com On 05.05.2021 at 13:16, Andres Rojas Guerrero wrote

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
Nice observation, how can I avoid this problem? On 5/5/21 at 14:54, Robert Sander wrote: Hi, On 05.05.21 at 13:39, Joachim Kraftmayer wrote: the crush rule with ID 1 distributes your EC chunks over the osds without considering the ceph host. As Robert already suspected. Yes, the "nxt

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Robert Sander
Hi, On 05.05.21 at 13:39, Joachim Kraftmayer wrote: > the crush rule with ID 1 distributes your EC chunks over the osds > without considering the ceph host. As Robert already suspected. Yes, the "nxtcloudAF" rule is not fault tolerant enough. Having the OSD as failure zone will lead to data los

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
# ceph osd crush rule dump [ { "rule_id": 0, "rule_name": "replicated_rule", "ruleset": 0, "type": 1, "min_size": 1, "max_size": 10, "steps": [ { "op": "take", "item": -1, "item_n

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Robert Sander
On 05.05.21 at 12:34, Andres Rojas Guerrero wrote: > Thanks for the answer. > >> For the default redundancy rule and pool size 3 you need three separate >> hosts. > > I have 24 separate server nodes, each with 32 OSDs, 768 OSDs in total; > my question is why the MDS suffers when only 4%

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
Thanks for the answer. > For the default redundancy rule and pool size 3 you need three separate > hosts. I have 24 separate server nodes, each with 32 OSDs, 768 OSDs in total; my question is why the MDS suffers when only 4% of the OSDs go down (all in the same node). I need to modify the cr

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Robert Sander
Hi, On 05.05.21 at 11:44, Andres Rojas Guerrero wrote: > The cluster has 768 OSDs; it is enough for 32 of them (~4%, all in > the same node) to go down for the information to become inaccessible. Is it > possible to improve this behavior? You need to spread your failure zone in the crush map. It loo
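
A quick way to check which failure domain a rule currently uses is to dump it and look at the type in its choose/chooseleaf step, and to confirm which rule the pool is using (a sketch; the rule name is the one mentioned in this thread, the pool name is a placeholder):

# ceph osd crush rule dump nxtcloudAF | grep '"type"'
# ceph osd pool get <poolname> crush_rule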

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
The cluster has 768 OSDs; it is enough for 32 of them (~4%, all in the same node) to go down for the information to become inaccessible. Is it possible to improve this behavior? # ceph status cluster: id: c74da5b8-3d1b-483e-8b3a-739134db6cf8 health: HEALTH_WARN 1 clients fail

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
They are located on a single node ... On 5/5/21 at 11:17, Burkhard Linke wrote: > Hi, > > On 05.05.21 11:07, Andres Rojas Guerrero wrote: >> Sorry, I have not understood the problem well, the problem I see is that >> once the OSD fails, the cluster recovers but the MDS remains faulty: > >

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Burkhard Linke
Hi, On 05.05.21 11:07, Andres Rojas Guerrero wrote: Sorry, I have not understood the problem well, the problem I see is that once the OSD fails, the cluster recovers but the MDS remains faulty: *snipsnap* pgs: 1.562% pgs not active 16128 active+clean 238

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
I see this problem: when the OSDs fail, the MDS fails with errors of type "slow metadata, slow requests" but does not recover once the cluster has recovered ... Why? On 5/5/21 at 11:07, Andres Rojas Guerrero wrote: > Sorry, I have not understood the problem well, the problem I see is that
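
One way to see what the MDS is actually stuck on is to check the health details and the operations in flight on the MDS itself (a sketch; <name> is a placeholder for the MDS daemon id, and the daemon command has to be run on the host where that MDS runs):

# ceph health detail
# ceph fs status
# ceph daemon mds.<name> dump_ops_in_flight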

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread David Caro
I think that the recovery might be blocked due to all those PGs in inactive state: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/administration_guide/monitoring-a-ceph-storage-cluster#identifying-stuck-placement-groups_admin """ Inactive: Placement groups cannot proc
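
To identify those PGs and where they map, something like the following can be used (a sketch; <pgid> is a placeholder taken from the dump output):

# ceph pg dump_stuck inactive
# ceph pg <pgid> query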

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
Sorry, I have not understood the problem well, the problem I see is that once the OSD fails, the cluster recovers but the MDS remains faulty: # ceph status cluster: id: c74da5b8-3d1b-483e-8b3a-739134db6cf8 health: HEALTH_WARN 3 clients failing to respond to capability rel

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread Andres Rojas Guerrero
Yes, the main problem is that the MDS starts to report slow requests, the information is no longer accessible, and the cluster never recovers. # ceph status cluster: id: c74da5b8-3d1b-483e-8b3a-739134db6cf8 health: HEALTH_WARN 2 clients failing to respond to capability release

[ceph-users] Re: Ceph cluster not recover after OSD down

2021-05-05 Thread David Caro
Can you share more information? The output of 'ceph status' when the OSDs are down would help, and 'ceph health detail' could also be useful. On 05/05 10:48, Andres Rojas Guerrero wrote: > Hi, I have a Nautilus cluster version 14.2.6, and I have noted that > when some OSDs go down the cluster doesn't
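
Besides 'ceph status' and 'ceph health detail', it can also help to confirm whether the down OSDs all sit under a single host in the CRUSH tree, for example (a sketch):

# ceph osd tree | grep down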