[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)
Hi,
The last two OSDs I recreated were on December 30 and February 8. I totally agree that SSD caches are a terrible SPOF. I think it's an option if you use 1 SSD/NVMe for 1 or 2 OSDs, but the cost is then very high. Using 1 SSD for 10 OSDs increases the risk for almost no gain, because the SSD is 10 times faster but gets 10 times more accesses! Indeed, we did some benchmarks with NVMe for the WAL/DB (1 NVMe for ~10 OSDs), and the gain was not tremendous, so we decided not to use them!
F.

On 08/03/2022 at 11:57, Boris Behrens wrote:
Hi Francois, thanks for the reminder. We offline compacted all of the OSDs when we reinstalled the hosts with the new OS. But actually reinstalling them was never on my list. I could try that, and in the same go I can remove all the cache SSDs (when one SSD shares the cache for 10 OSDs it is a horrible SPOF) and reuse the SSDs as OSDs for the smaller pools in RGW (like log and meta).
How long ago did you recreate the earliest OSD?
Cheers
Boris

On Tue, 8 Mar 2022 at 10:03, Francois Legrand wrote:
Hi, We also had this kind of problem after upgrading to octopus. Maybe you can play with the heartbeat grace time ( https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/ ) to tell OSDs to wait a little longer before declaring another OSD down! We also tried to fix the problem by manually compacting the down OSD (something like: systemctl stop ceph-osd@74; sleep 10; ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact; systemctl start ceph-osd@74). This worked a few times, but some OSDs went down again, so we simply waited for the data to be reconstructed elsewhere and then reinstalled the dead OSD:
ceph osd destroy 74 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sde --destroy
ceph-volume lvm create --osd-id 74 --data /dev/sde
This seems to have fixed the issue for us (up to now).
F.

On 08/03/2022 at 09:35, Boris Behrens wrote:
Yes, this is something we know, and we disabled it because we ran into the problem that PGs went unavailable when two or more OSDs went offline.
I am searching for the reason WHY this happens. Currently we have set the service file to restart=always and removed the StartLimitBurst from the service file.
We just don't understand why the OSDs don't answer the heartbeat. The OSDs that are flapping are random in terms of host, disk size, and having an SSD block.db or not. Network connectivity issues are something that I would rule out, because the cluster went from "nothing ever happens except IOPS" to "random OSDs are marked DOWN until they kill themselves" with the update from nautilus to octopus.
I am out of ideas and hoped this was a bug in 15.2.15, but after the update things got worse (it happens more often). We tried to:
* disable swap
* more swap
* disable bluefs_buffered_io
* disable write cache for all disks
* disable scrubbing
* reinstall with new OS (from CentOS 7 to Ubuntu 20.04)
* disable cluster_network (so there is only one way to communicate)
* increase txqueuelen on the network interfaces
* everything together

What we try next: add more SATA controllers, so there are not 24 disks attached to a single controller, but I doubt this will help.
Cheers
Boris

On Tue, 8 Mar 2022 at 09:10, Dan van der Ster <dvand...@gmail.com> wrote:
Here's the reason they exit:
7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.00 seconds, shutting down
If an OSD flaps (marked down, then up) 6 times in 10 minutes, it exits. (This is a safety measure.)
It's normally caused by a network issue -- other OSDs are telling the mon that he is down, but then the OSD himself tells the mon that he's up!
Cheers, Dan

On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens wrote:
Hi,
we've had the problem with OSDs marked as offline since we updated to octopus and hoped the problem would be fixed with the latest patch. We have this kind of problem only with octopus, and there only with the big s3 cluster.
* Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
* Network interfaces are 20gbit
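For reference, the two mechanisms discussed in this thread (the heartbeat grace period and the markdown safety limit Dan quotes) are ordinary config options and can be adjusted at runtime with `ceph config set`. The values below are placeholders to show the knobs, not recommendations for any particular cluster:

# Give peers longer before they report an OSD down (default 20 seconds).
ceph config set osd osd_heartbeat_grace 30
# Allow more down/up flaps before an OSD shuts itself down
# (defaults: 5 markdowns within a 600 second window).
ceph config set osd osd_max_markdown_count 10
ceph config set osd osd_max_markdown_period 600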
[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)
Hi, We also had this kind of problem after upgrading to octopus. Maybe you can play with the heartbeat grace time ( https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/ ) to tell OSDs to wait a little longer before declaring another OSD down! We also tried to fix the problem by manually compacting the down OSD (something like: systemctl stop ceph-osd@74; sleep 10; ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact; systemctl start ceph-osd@74). This worked a few times, but some OSDs went down again, so we simply waited for the data to be reconstructed elsewhere and then reinstalled the dead OSD:
ceph osd destroy 74 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sde --destroy
ceph-volume lvm create --osd-id 74 --data /dev/sde
This seems to have fixed the issue for us (up to now).
F.

On 08/03/2022 at 09:35, Boris Behrens wrote:
Yes, this is something we know, and we disabled it because we ran into the problem that PGs went unavailable when two or more OSDs went offline.
I am searching for the reason WHY this happens. Currently we have set the service file to restart=always and removed the StartLimitBurst from the service file.
We just don't understand why the OSDs don't answer the heartbeat. The OSDs that are flapping are random in terms of host, disk size, and having an SSD block.db or not. Network connectivity issues are something that I would rule out, because the cluster went from "nothing ever happens except IOPS" to "random OSDs are marked DOWN until they kill themselves" with the update from nautilus to octopus.
I am out of ideas and hoped this was a bug in 15.2.15, but after the update things got worse (it happens more often). We tried to:
* disable swap
* more swap
* disable bluefs_buffered_io
* disable write cache for all disks
* disable scrubbing
* reinstall with new OS (from CentOS 7 to Ubuntu 20.04)
* disable cluster_network (so there is only one way to communicate)
* increase txqueuelen on the network interfaces
* everything together

What we try next: add more SATA controllers, so there are not 24 disks attached to a single controller, but I doubt this will help.
Cheers
Boris

On Tue, 8 Mar 2022 at 09:10, Dan van der Ster <dvand...@gmail.com> wrote:
Here's the reason they exit:
7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.00 seconds, shutting down
If an OSD flaps (marked down, then up) 6 times in 10 minutes, it exits. (This is a safety measure.)
It's normally caused by a network issue -- other OSDs are telling the mon that he is down, but then the OSD himself tells the mon that he's up!
Cheers, Dan

On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens wrote:
Hi,
we've had the problem with OSDs marked as offline since we updated to octopus and hoped the problem would be fixed with the latest patch. We have this kind of problem only with octopus, and there only with the big s3 cluster.
* Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
* Network interfaces are 20gbit (2x10 in an 802.3ad encap3+4 bond)
* We only use the frontend network.
* All disks are spinning, some have block.db devices.
* All disks are bluestore.
* Configs are mostly defaults.
* We've set the OSDs to restart=always without a limit, because we had the problem with unavailable PGs when two OSDs are marked as offline and they share PGs.
But since we installed the latest patch we are experiencing more OSD downs and even crashes. I tried to remove as many duplicated lines as possible.
Is the numa error a problem? Why do OSD daemons not respond to heartbeats? I mean, even when the disk is totally loaded with IO, the system itself should answer heartbeats, or am I missing something? I really hope some of you can point me in the right direction to solve this nasty problem.
This is what the latest crash looks like:
Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+ 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
...
Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+ 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 thread_name:tp_osd_tp
Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) [0x7f5f0d4623c0]
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) [0x7f5f0d45ef08]
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x471) [0x55a699a01201]
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long
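One way to check whether the heartbeat failures are really network-related is to ask the flapping OSD itself, via its admin socket on the host that carries it (osd.161 below is taken from the log above; the 1000 is a reporting threshold in milliseconds):

# Ping times this OSD measured to its heartbeat peers, listing entries above the threshold.
ceph daemon osd.161 dump_osd_network 1000
# Slowest recent ops on the same OSD, to see whether the disk or the OSD threads are stalling.
ceph daemon osd.161 dump_historic_ops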
[ceph-users] Re: cephfs removing multiple snapshots
No, our OSDs are HDD (no SSD) and we have everything (data and metadata) on them (no NVMe).

On 17/11/2021 at 16:49, Arthur Outhenin-Chalandre wrote:
Hi,
On 11/17/21 16:09, Francois Legrand wrote:
Now we are investigating this snapshot issue and I noticed that as long as we remove one snapshot alone, things seem to go well (only some PGs in "unknown state", but no global warning nor slow ops, OSD down or crash). But if we remove several snapshots at the same time (I tried with 2 for the moment), then we start to have some slow ops. I guess that if I remove 4 or 5 snapshots at the same time I will end up with OSDs marked down and/or crashing, as we had just after the upgrade (I am not sure I want to try that with our production cluster).
Maybe you want to try to tweak `osd_snap_trim_sleep`. On Octopus/Pacific with hybrid OSDs the snapshot deletions seem pretty stable in our testing. Out of curiosity, are your OSDs on SSD? I suspect that the default setting of `osd_snap_trim_sleep` for SSD OSDs could affect performance [1].
Cheers,
[1]: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/FPRB2DW4N427U25LEHYICOKI4C37BKSO/
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
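The `osd_snap_trim_sleep` knob Arthur mentions can be inspected and changed with the usual config commands; something along these lines (the value of 2 seconds is only an example):

# What one running OSD currently uses.
ceph config show osd.0 osd_snap_trim_sleep
# Sleep between snap trim operations to throttle trimming (seconds; 0 means no throttling).
ceph config set osd osd_snap_trim_sleep 2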
[ceph-users] cephfs removing multiple snapshots
Hello,
We recently upgraded our ceph+cephfs cluster from nautilus to octopus. After the upgrade, we noticed that removal of snapshots was causing a lot of problems (lots of slow ops, OSDs marked down, crashes etc...), so we suspended the snapshots for a while; the cluster has been stable again for more than one week now. We did not have these problems under nautilus.
Now we are investigating this snapshot issue and I noticed that as long as we remove one snapshot alone, things seem to go well (only some PGs in "unknown state", but no global warning nor slow ops, OSD down or crash). But if we remove several snapshots at the same time (I tried with 2 for the moment), then we start to have some slow ops. I guess that if I remove 4 or 5 snapshots at the same time I will end up with OSDs marked down and/or crashing, as we had just after the upgrade (I am not sure I want to try that with our production cluster).
So my questions are: has someone noticed this kind of problem, has the snapshot management changed between nautilus and octopus, and is there a way to solve it (apart from removing one snap at a time and waiting for the snaptrim to end before removing the next one)?
We also changed bluefs_buffered_io from false to true (it was set to false a long time ago because of the bug https://tracker.ceph.com/issues/45337) because it seems that it can help (cf. https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/S4ZW7D5J5OAI76F44NNXMTKWNZYYYUJY/). Do the OSDs need to be restarted to make this change effective?
Thanks.
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
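On the question of whether a restart is needed for bluefs_buffered_io: one way to find out is to compare the value stored in the monitor config database with what a running OSD actually reports; if the running value does not change after the `config set`, the daemon still has to be restarted to pick it up. A sketch:

# Store the new value centrally.
ceph config set osd bluefs_buffered_io true
# Ask a running OSD which value it is actually using right now.
ceph config show osd.0 bluefs_buffered_io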
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
Hi Frank,
I totally agree with your point 3 (and also with 1 and 2, indeed). Generally speaking, the release cycle of many software products tends to become faster and faster (not only for ceph, but also openstack etc...), and it's really hard and tricky to keep an infrastructure up to date in such conditions, even more so when you deal with storage. As a result, as you perfectly explained, this gives the impression that the product is not that robust, contains a lot of bugs and needs a lot of patches, etc. A few times, upgrades have been released with obvious bugs or regressions (e.g. the DNS problem in 14.2.12, ...), and this gives the impression that there is a rush to release, even if the corrections are not fully tested... which leads to a loss of confidence from the users. And I am personally going through this process!! We wanted to upgrade our Nautilus cluster. First we decided to go directly to Pacific, but looking at the list it appeared to us that Pacific is absolutely not stable enough to be considered a production release. We thus decided to go to octopus... maybe we will go to pacific when v17 is out. I thus feel that the "last stable release" (currently pacific) is in fact a development release (and the community is the "testing pool" for that release), and the truly stable release is the n-1 one (octopus). Thus I fully support your request for an LTS release with stability as a main goal.
F.

On 08/11/2021 at 13:21, Frank Schilder wrote:
Hi all,
I followed this thread with great interest and would like to add my opinion/experience/wishes as well. I believe the question of packages versus containers needs a bit more context to be really meaningful. This was already mentioned several times with regard to documentation. I see the following three topics as tightly connected (my opinion/answers included):
1. Distribution: Packages are compulsory, containers are optional.
2. Deployment: Ceph adm (yet another deployment framework) and ceph (the actual storage system) should be strictly different projects.
3. Release cycles: The release cadence is way too fast; I very much miss a ceph LTS branch with at least 10 years of back-port support.
These are my short answers/wishes/expectations in this context. I will add below some more reasoning as optional reading (warning: wall of text ahead).
1. Distribution - I don't think the question is about packages versus containers, because even if a distribution should decide not to package ceph any more, other distributors certainly will, and the user community will just move away from distributions without ceph packages. In addition, unless Red Hat plans to move to a source-only container where I run the good old configure - make - make install, it will be package based anyway, so packages are here to stay. Therefore, the way I understand this question, it is about ceph-adm versus other deployment methods. Here, I think the push to a container-based, ceph-adm-only deployment is unlikely to become the no. 1 choice for everyone, for good reasons already mentioned in earlier messages. In addition, I also believe that development of a general deployment tool is currently not sustainable, as was mentioned by another user. My reasons for this are given in the next section.
2. Deployment - In my opinion, it is really important to distinguish three components of any open-source project: development (release cycles), distribution and deployment.
Following the good old philosophy that every tool does exactly one job and does it well, each of these components is a separate project, because they correspond to different tools. This implies immediately that ceph documentation should not contain documentation about packaging and deployment tools. Each of these ought to be strictly separate. If I have a low-level problem with ceph and go to the ceph documentation, I do not want to see ceph-adm commands. Ceph documentation should be about ceph (the storage system) only. Such a mix-up leads to problems, and there were already ceph-user cases where people could not use the documentation for troubleshooting, because it showed ceph-adm commands but their cluster was not ceph-adm deployed. In this context, I would prefer if there was a separate ceph-adm-users list so that ceph-users can focus on actual ceph problems again.
Now to the point that ceph-adm might be an unsustainable project. Although at first glance the idea of a generic deployment tool that solves all problems with a single command might look appealing, it is likely doomed to fail for a simple reason that was already indicated in an earlier message: ceph deployment is subject to a complexity paradox. Ceph has a very large configuration space, and implementing and using a generic tool that covers and understands this configuration space is more complex than deploying any specific ceph cluster, each of which uses only a tiny subset of the entire
[ceph-users] Re: snaptrim blocks IO on ceph nautilus
On 06/11/2021 at 16:57, Francois Legrand wrote:
Hi,
Can you confirm that changing bluefs_buffered_io to true solved your problem? Because I have a rather similar problem. My Nautilus cluster was running with bluefs_buffered_io = false. It was working (even with snaptrim lasting a long time, i.e. several hours). I upgraded to octopus, and it seems that creating/deleting snapshots now creates a lot of instabilities (leading to OSDs marked down or crashing, mgr and mds crashing, MON_DISK_BIG warnings, mons out of quorum and tons of slow ops and MOSDScrubReserve messages in the logs). Compaction of the failed OSDs seems to more or less solve the problem (the OSDs stop crashing). So I have disabled the snapshots for the moment.
F.

On 27/07/2020 at 15:59, Manuel Lausch wrote:
Hi,
For some days I have been trying to debug a problem with snaptrimming under nautilus. I have a cluster with Nautilus (v14.2.10), 44 nodes with 24 OSDs of 14 TB each. I create a snapshot every day and keep 7 days of them. Every time the old snapshot is deleted I have bad IO performance and blocked requests for several seconds until the snaptrim is done. Settings like snaptrim_sleep and osd_pg_max_concurrent_snap_trims don't affect this behavior.
In the debug_osd 10/10 log I see the following:
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edda20 prio 196 cost 0 latency 0.019545 osd_repop_reply(client.22731418.0:615257 3.636 e22457/22372) v2 pg pg[3.636( v 22457'100855 (21737'97756,22457'100855] local-lis/les=22372/22374 n=27762 ec=2842/2839 lis/c 22372/22372 les/c/f 22374/22374/0 22372/22372/22343) [411,36,956,763] r=0 lpr=22372 luod=22457'100854 crt=22457'100855 lcod 22457'100853 mlcod 22457'100853 active+clean+snaptrim_wait trimq=[1d~1]]
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edda20 finish
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edc2c0 prio 127 cost 0 latency 0.043165 MOSDScrubReserve(2.2645 RELEASE e22457) v1 pg pg[2.2645( empty local-lis/les=22359/22364 n=0 ec=2403/2403 lis/c 22359/22359 les/c/f 22364/22367/0 22359/22359/22359) [379,411,884,975] r=1 lpr=22359 crt=0'0 active mbc={}]
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edc2c0 finish
2020-07-27 11:45:50.039 7fd8b8404700 10 osd.411 pg_epoch: 22457 pg[3.278e( v 22457'99491 (21594'96426,22457'99491] local-lis/les=22359/22362 n=27669 ec=2859/2839 lis/c 22359/22359 les/c/f 22362/22365/0 22359/22359/22343) [411,379,848,924] r=0 lpr=22359 crt=22457'99491 lcod 22457'99489 mlcod 22457'99489 active+clean+snaptrim trimq=[1d~1]] snap_trimmer posting
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 pg_epoch: 22457 pg[3.278e( v 22457'99493 (21594'96426,22457'99493] local-lis/les=22359/22362 n=27669 ec=2859/2839 lis/c 22359/22359 les/c/f 22362/22365/0 22359/22359/22343) [411,379,848,924] r=0 lpr=22359 luod=22457'99491 crt=22457'99493 lcod 22457'99489 mlcod 22457'99489 active+clean+snaptrim trimq=[1d~1]] snap_trimmer complete
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557880ac3760 prio 127 cost 663 latency 7.761823 osd_repop(osd.217.0:3025 3.1ca5 e22457/22378) v2 pg pg[3.1ca5( v 22457'100370 (21716'97357,22457'100370] local-lis/les=22378/22379 n=27532 ec=2855/2839 lis/c 22378/22378 les/c/f 22379/22379/0 22378/22378/22378) [217,411,551,1055] r=1 lpr=22378 luod=0'0 lua=22294'16 crt=22457'100370 lcod 22457'100369 active mbc={}]
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557880ac3760 finish
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x5578813e1e40 prio 127 cost 0 latency 7.494296 MOSDScrubReserve(2.37e2 REQUEST e22457) v1 pg pg[2.37e2( empty local-lis/les=22355/22356 n=0 ec=2412/2412 lis/c 22355/22355 les/c/f 22356/22356/0 22355/22355/22355) [245,411,834,768] r=1 lpr=22355 crt=0'0 active mbc={}]
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x5578813e1e40 finish
The dequeueing of ops works without pauses until the „snap_trimmer posting“ and „snap_trimmer complete“ log lines. In this example this task takes about 7 seconds. The operations dequeued after that now have a latency of about this time.
I tried to drill down into this in the code (developers are asked here). It seems that the PG is locked for every operation. The snap_trimmer posting and complete messages come from „osd/PrimaryLogPG.cc“ on line 4700. This indicates to me that the process of deleting a snapshot object will sometimes take some time.
After further poking around, I see in „osd/SnapMapper.cc“ the method „SnapMapper::get_next_objects_to_trim“, which takes several seconds to finish. I followed this further to „common/map_cacher.hpp“, line 94: „int r = driver->get_next(key, &store);“ From there I lost the pa
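For watching this behaviour on a live cluster, the PGs currently trimming (or waiting to trim) are visible in the PG dump, and the per-OSD trim concurrency mentioned above is a plain config option; the value of 1 below is just an example:

# PGs in snaptrim or snaptrim_wait state.
ceph pg dump pgs_brief | grep snaptrim
# Limit how many PGs a single OSD trims in parallel (default 2).
ceph config set osd osd_pg_max_concurrent_snap_trims 1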
[ceph-users] Re: Upgrade to 16.2.6 and osd+mds crash after bluestore_fsck_quick_fix_on_mount true
Hello,
Can you confirm that the bug only affects pacific and not octopus?
Thanks.
F.

On 29/10/2021 at 16:39, Neha Ojha wrote:
On Thu, Oct 28, 2021 at 8:11 AM Igor Fedotov wrote:
On 10/28/2021 12:36 AM, mgrzybowski wrote:
Hi Igor
I'm very happy that you were able to reproduce and find the bug. Nice one! In my opinion, at the moment the first priority should be to warn other users in the official upgrade docs: https://docs.ceph.com/en/latest/releases/pacific/#upgrading-from-octopus-or-nautilus .
This has been escalated to the Ceph dev community, hopefully to be done shortly.
We have added a warning in our docs: https://ceph--43706.org.readthedocs.build/en/43706/releases/pacific/#upgrading-from-octopus-or-nautilus.
Thanks, Neha
Please also note the tracker: https://tracker.ceph.com/issues/53062 and the fix: https://github.com/ceph/ceph/pull/43687
In my particular case (I have a home storage server based on cephfs and a bunch of random HDDs - SMRs too :( ) I restarted the OSDs one at a time after all RADOS objects were repaired. Unfortunately, four disks showed bad sectors due to the recovery strain, so I have a small number of unfound objects. Bad disks were removed one by one. Now I'm waiting for backfill, then scrubs. Making a crashed OSD work again would be nice but should not be necessary. What about some kind of export and import of PGs? Could this work on crashed OSDs with a failed omap format upgrade?
I can't say for sure what the results would be - export/import should probably work, but the omaps in the restored PGs would still be broken. Highly likely the OSDs (and other daemons) would get stuck on that invalid data... Converting ill-formatted omaps back to their regular form (either the new or the legacy one) looks like a more straightforward and predictable task...
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
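The warning added to the release notes boils down to not letting the automatic omap repair run when the OSDs first start on the new version. Expressed as a config command, that is roughly the following (to be re-enabled only once a release containing the fix is installed):

# Keep the quick-fix/repair from running automatically at OSD startup during the upgrade.
ceph config set osd bluestore_fsck_quick_fix_on_mount false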
[ceph-users] Re: when mds_all_down open "file system" page provoque dashboard crash
The crash report is : { "backtrace": [ "/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f86044313c0]", "gsignal()", "abort()", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911) [0x7f86042d2911]", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c) [0x7f86042de38c]", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7) [0x7f86042de3f7]", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9) [0x7f86042de6a9]", "(std::__throw_out_of_range(char const*)+0x41) [0x7f86042d537e]", "(Client::resolve_mds(std::__cxx11::basic_stringstd::char_traits, std::allocator > const&, std::vector >*)+0x1306) [0x563db199e076]", "(Client::mds_command(std::__cxx11::basic_stringstd::char_traits, std::allocator > const&, std::vector, std::allocator >, std::allocatorstd::char_traits, std::allocator > > > const&, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*, std::__cxx11::basic_string, std::allocator >*, Context*)+0x179) [0x563db19baa69]", "/usr/bin/ceph-mgr(+0x1d185d) [0x563db17db85d]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a170e) [0x7f860d5e770e]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "_PyEval_EvalCodeWithName()", "_PyFunction_Vectorcall()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8daa) [0x7f860d5eedaa]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "_PyEval_EvalCodeWithName()", "_PyFunction_Vectorcall()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "PyVectorcall_Call()", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8e2b) [0x7f860d5eee2b]", "PyVectorcall_Call()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x116c01) [0x7f860d45cc01]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x17d51b) [0x7f860d4c351b]", "/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f8604425609]", "clone()" ], "ceph_version": "16.2.5", "os_id": "ubuntu", "os_name": "Ubuntu", "os_version": "20.04.3 LTS (Focal Fossa)", "os_version_id": "20.04", "process_name": "ceph-mgr", "stack_sig": "9a65d0019b8102fdaee8fd29c30e3aef3b86660d33fc6cd9bd51f57844872b2a", "timestamp": "2021-09-23T12:27:29.137868Z", "utsname_machine": "x86_64", "utsname_release": "5.4.0-86-generic", "utsname_sysname": "Linux", "utsname_version": "#97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021" } Le 23/09/2021 à 14:55, Francois Legrand a écrit : Hi, I am testing an upgrade (from 14.2.16 to 16.2.5) on my ceph test cluster (bar metal). I noticed (when reaching the mds upgrade) that after I stopped all the mds, opening the "file system" page on the dashboard result in a crash of the dashboard (and also of the mgr). Does someone had this issue ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] when mds_all_down open "file system" page provoque dashboard crash
Hi, I am testing an upgrade (from 14.2.16 to 16.2.5) on my ceph test cluster (bare metal). I noticed (when reaching the mds upgrade) that after I stopped all the mds, opening the "file system" page on the dashboard results in a crash of the dashboard (and also of the mgr). Has someone else had this issue? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Why set osd flag to noout during upgrade ?
Hello everybody, I have a "stupid" question. Why is it recommended in the docs to set the noout OSD flag during an upgrade/maintenance (and especially during an OSD upgrade/maintenance)? In my understanding, if an OSD goes down, after a while (600 s by default) it's marked out and the cluster will start to rebuild its content elsewhere in the cluster to maintain the redundancy of the data. This generates some transfer and load on other OSDs, but that's not a big deal! As soon as the OSD is back, it's marked in again and ceph is able to determine which data is back and stop the recovery, reusing the unchanged data that has returned. Generally, the recovery is as fast as with the noout flag (because with noout, the data modified during the down period still has to be copied to the returning OSD). So is there another reason, apart from limiting the data movement and the load on the other OSDs during the downtime? F ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
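The flag itself is just set and cleared around the maintenance window; the pattern the docs refer to is essentially:

# Before taking nodes down for upgrade/maintenance:
ceph osd set noout
# ... upgrade or reboot the hosts one by one ...
# Once everything is back up:
ceph osd unset noout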
[ceph-users] Re: usable size for replicated pool with custom rule in pacific dashboard
You are probably right! But this "verification" seems "stupid"! I created an additional room (with no OSDs), and then the dashboard doesn't complain anymore! Indeed, the rule does what we want, because "step choose firstn 0 type room" will select the different rooms (2 in our case), and for the first one it will put 2 copies on different hosts (step chooseleaf firstn 2 type host), and then it goes to the remaining room and puts the third copy there (and eventually a fourth if we choose replica 4). Forcing the first step (step choose firstn 0 type room) to offer as many choices (rooms) as replicas means that the second step is then rather useless! That's why it appears to me that this verification is somewhat "stupid"... The check should be that the number of replicas is not greater than the number of rooms times the number of leaves chosen in the second step (2 in my case)... but maybe I missed something!
F.

On 09/09/2021 at 13:23, Ernesto Puerta wrote:
Hi Francois,
I'm not an expert on CRUSH rule internals, but I checked the code and it assumes that the failure domain (first choose/chooseleaf step) there is "room": since there are just 2 rooms vs. 3 replicas, it doesn't allow you to create a pool with a rule that might not work optimally (keep in mind that the Dashboard tries to perform some extra validations compared to the Ceph CLI).
Kind Regards, Ernesto

On Thu, Sep 9, 2021 at 12:29 PM Francois Legrand <f...@lpnhe.in2p3.fr> wrote:
Hi all,
I have a test ceph cluster with 4 OSD servers, each containing 3 OSDs. The crushmap uses 2 rooms with 2 servers in each room. We use replica 3 for pools. I have the following custom crush rule to ensure that I have at least one copy of each piece of data in each room:
rule replicated3over2rooms {
    id 1
    type replicated
    min_size 3
    max_size 4
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 2 type host
    step emit
}
Everything was working well in nautilus/centos7 (I could create pools using the dashboard and my custom rule). I upgraded to pacific/ubuntu 20.04 in containers with cephadm. Now, I cannot create a new pool with replicated3over2rooms using the dashboard! If I choose Pool type = replicated, Replicated size = 3, Crush ruleset = replicated3over2rooms, the dashboard says:
Minimum: 3 Maximum: 2
The size specified is out of range. A value from 3 to 2 is usable.
And inspecting the replicated3over2rooms ruleset in the dashboard shows the parameters:
max_size 4
min_size 3
rule_id 1
usable_size 2
Where does that usable_size come from? How can I correct it? If I run the command line
ceph osd pool create test 16 replicated replicated3over2rooms 3
it works!!
Thanks.
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] usable size for replicated pool with custom rule in pacific dashboard
Hi all,
I have a test ceph cluster with 4 OSD servers, each containing 3 OSDs. The crushmap uses 2 rooms with 2 servers in each room. We use replica 3 for pools. I have the following custom crush rule to ensure that I have at least one copy of each piece of data in each room:
rule replicated3over2rooms {
    id 1
    type replicated
    min_size 3
    max_size 4
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 2 type host
    step emit
}
Everything was working well in nautilus/centos7 (I could create pools using the dashboard and my custom rule). I upgraded to pacific/ubuntu 20.04 in containers with cephadm. Now, I cannot create a new pool with replicated3over2rooms using the dashboard! If I choose Pool type = replicated, Replicated size = 3, Crush ruleset = replicated3over2rooms, the dashboard says:
Minimum: 3 Maximum: 2
The size specified is out of range. A value from 3 to 2 is usable.
And inspecting the replicated3over2rooms ruleset in the dashboard shows the parameters:
max_size 4
min_size 3
rule_id 1
usable_size 2
Where does that usable_size come from? How can I correct it? If I run the command line
ceph osd pool create test 16 replicated replicated3over2rooms 3
it works!!
Thanks.
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
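Independently of what the dashboard computes, the rule itself can be checked offline with crushtool against the live CRUSH map; a quick sketch (the file name is arbitrary, and rule id 1 matches the rule above):

# Export the current CRUSH map in binary form.
ceph osd getcrushmap -o crushmap.bin
# Show where 3 replicas would be placed for a sample of inputs using rule 1.
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings
# Summarize how often each OSD gets picked, which makes impossible mappings obvious.
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-utilization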
[ceph-users] Re: Howto upgrade AND change distro
Thanks,
My point is how to safely reattach the OSDs from the previous install to the newly installed distro! Is there a detailed howto to completely reinstall a server (or a cluster)?
F.

On 27/08/2021 at 19:47:
Message: 1
Date: Fri, 27 Aug 2021 16:43:12 +0100
From: Matthew Vernon
Subject: [ceph-users] Re: Howto upgrade AND change distro
To: ceph-users@ceph.io
Message-ID: <654262bf-b621-d534-7067-62a3a2abb...@wikimedia.org>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi,
On 27/08/2021 16:16, Francois Legrand wrote:
We are running a ceph nautilus cluster under centos 7. To upgrade to pacific we need to change to a more recent distro (probably debian or ubuntu because of the recent announcement about centos 8, but the distro doesn't matter very much). However, I couldn't find a clear procedure to upgrade ceph AND the distro! As we have more than 100 osds and ~600TB of data, we would like to avoid as far as possible wiping the disks and rebuilding/rebalancing. It seems to be possible to reinstall a server and reuse the osds, but the exact procedure remains quite unclear to me.
It's going to be least pain to do the operations separately, which means you may need to build a set of packages for one or other "end" of the operation, if you see what I mean? The Debian and Ubuntu installers both have an "expert mode" which gives you quite a lot of control, which should enable you to upgrade the OS without touching the OSD disks - but make sure you have backups of all your Ceph config! If you're confident (and have enough redundancy), you can set noout while you upgrade a machine, which will reduce the amount of rebalancing you have to do when it rejoins the cluster post upgrade.
Regards, Matthew
[one good thing about Ubuntu's cloud archive is that e.g. you can get the same version that's default in 20.04 available as packages for 18.04 via UCA, meaning you can upgrade Ceph first, and then do the distro upgrade, and it's pretty painless]
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
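Assuming the OSDs are BlueStore on LVM as created by ceph-volume, the usual way to reattach them after the new OS is installed (with the ceph packages, /etc/ceph/ceph.conf and the bootstrap-osd keyring restored) is to let ceph-volume rediscover and start the existing OSDs from their LVM metadata; a sketch:

# List the OSDs ceph-volume can see on the reinstalled host.
ceph-volume lvm list
# Activate (and enable at boot) every OSD found on the local disks.
ceph-volume lvm activate --all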
[ceph-users] Howto upgrade AND change distro
Hello,
We are running a ceph nautilus cluster under centos 7. To upgrade to pacific we need to change to a more recent distro (probably debian or ubuntu because of the recent announcement about centos 8, but the distro doesn't matter very much). However, I couldn't find a clear procedure to upgrade ceph AND the distro! As we have more than 100 osds and ~600TB of data, we would like to avoid as far as possible wiping the disks and rebuilding/rebalancing. It seems to be possible to reinstall a server and reuse the osds, but the exact procedure remains quite unclear to me. What is the best way to proceed? Has someone done that, and do they have a rather detailed doc on how to proceed?
Thanks for your help!
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
1 | 34 osd.77 1 6 23 1 0 0 2 1 1 | 35 osd.78 1 6 24 1 0 0 1 1 1 | 35 osd.79 1 6 22 1 1 0 2 1 1 | 35 osd.80 1 6 22 1 1 0 1 1 1 | 34 osd.81 1 6 24 1 1 0 2 1 2 | 38 osd.82 1 6 23 1 0 1 1 1 0 | 34 osd.83 0 6 23 1 1 0 1 1 0 | 33 osd.84 1 6 25 1 1 1 2 1 2 | 40 osd.85 1 6 22 1 0 0 2 0 2 | 34 osd.86 0 6 22 1 0 0 1 0 1 | 31 osd.87 1 6 22 0 0 0 2 1 2 | 34 osd.88 1 8 34 1 1 0 2 1 3 | 51 osd.89 1 7 22 1 1 1 2 1 2 | 38 osd.90 1 6 25 0 1 1 2 1 2 | 39 osd.91 1 8 32 0 1 1 2 1 1 | 47 osd.92 1 6 22 0 1 2 1 1 2 | 36 osd.93 1 7 22 1 1 1 2 1 2 | 38 osd.94 1 6 27 0 1 1 1 1 1 | 39 osd.95 1 7 30 0 1 1 2 1 1 | 44 osd.96 1 10 35 1 1 1 3 1 3 | 56 osd.97 1 6 28 1 1 1 1 1 1 | 41 osd.98 1 6 22 0 1 1 2 0 1 | 34 osd.99 1 6 29 1 1 1 2 1 1 | 43 osd.100 1 6 26 1 1 0 2 0 2 | 39 osd.101 0 6 24 1 0 1 2 1 1 | 36 osd.102 0 6 22 1 0 1 2 0 2 | 34 osd.103 1 6 22 0 1 1 2 1 2 | 36 osd.104 0 6 30 1 1 1 2 1 2 | 44 osd.105 0 6 26 1 1 1 1 0 1 | 37 osd.106 1 11 34 1 1 1 1 1 2 | 53 osd.107 1 8 38 1 1 0 2 1 2 | 54 osd.108 1 8 34 1 1 2 2 1 3 | 53 osd.109 1 9 34 1 1 1 1 1 3 | 52 osd.110 1 8 37 1 1 0 3 1 3 | 55 osd.111 1 8 40 1 1 0 2 1 1 | 55 osd.112 1 8 37 1 1 2 3 1 1 | 55 osd.113 1 8 34 1 1 0 1 1 1 | 48 osd.114 1 11 34 1 1 1 1 1 2 | 53 osd.115 1 11 34 1 1 0 1 1 1 | 51 SUM : 96 768 3072 96 96 96 192 96 192 | F. Le 01/02/2021 à 10:26, Dan van der Ster a écrit : On Mon, Feb 1, 2021 at 10:03 AM Francois Legrand wrote: Hi, Actually we have no EC pools... all are replica 3. And we have only 9 pools. The average number og pg/osd is not very high (40.6). Here is the detail of the pools : pool 2 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 623105 lfor 0/608315/608313 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 31 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 0/0/171563 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 32 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 436085/436085/436085 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 33 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 0/0/171554 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 34 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 623470 lfor 0/0/171558 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 35 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 last_change 621529 lfor 0/598286/598284 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs pool 36 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 624174 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs pool 43 replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 624174 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs pool 44 replicated size 3 min_size 3 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 622177 lfor 0/0/449412 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 400 target_size_bytes 17592186044416 application rbd Pools 35 (meta), 36 and 43 (datas) 
are for cephfs. How does the distribution for pool 36 look? This pool has the best chance to be balanced -- the others have too few PGs so you shouldn't even be wo
[ceph-users] Re: Balancing with upmap
retty non-uniform distribution, because this example pool id 38 has up to 4 PGs on some OSDs but 1 or 2 on most. (this is a cluster with the balancer disabled). The other explanation I can think of is that you have relatively wide EC pools and few hosts. In that case there would be very little that the balancer could do to flatten the distribution. If in doubt, please share your pool details and crush rules so we can investigate further. Cheers, Dan On Sun, Jan 31, 2021 at 5:10 PM Francois Legrand wrote: Hi, After 2 days, the recovery ended. The situation is clearly better (but still not perfect) with 339.8 Ti available in pools (for 575.8 Ti available in the whole cluster). The balancing remains not perfect (31 to 47 pgs on 8TB disks). And the ceph osd df tree returns : ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAPMETA AVAIL %USE VAR PGS STATUS TYPE NAME -1 1018.65833- 466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 251 TiB 00 -root default -15465.66577- 466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 251 TiB 46.04 1.06 -room 1222-2-10 -3116.41678- 116 TiB 53 TiB 53 TiB 24 GiB 152 GiB 64 TiB 45.45 1.05 -host lpnceph01 0 hdd7.27599 1.0 7.3 TiB 3.7 TiB 3.7 TiB 2.5 GiB 16 GiB 3.5 TiB 51.34 1.18 38 up osd.0 4 hdd7.27599 1.0 7.3 TiB 3.2 TiB 3.2 TiB 2.4 GiB 8.7 GiB 4.1 TiB 44.12 1.01 36 up osd.4 8 hdd7.27699 1.0 7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.3 GiB 3.7 TiB 48.52 1.12 39 up osd.8 12 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.5 GiB 3.9 TiB 46.69 1.07 37 up osd.12 16 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.4 TiB 38 MiB 9.7 GiB 3.8 TiB 47.49 1.09 37 up osd.16 20 hdd7.27599 1.0 7.3 TiB 3.1 TiB 3.0 TiB 2.4 GiB 8.7 GiB 4.2 TiB 41.95 0.96 34 up osd.20 24 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.8 GiB 3.8 TiB 48.45 1.11 38 up osd.24 28 hdd7.27599 1.0 7.3 TiB 3.0 TiB 3.0 TiB 55 MiB 8.2 GiB 4.2 TiB 41.74 0.96 32 up osd.28 32 hdd7.27599 1.0 7.3 TiB 3.2 TiB 3.1 TiB 32 MiB 8.4 GiB 4.1 TiB 43.33 1.00 34 up osd.32 36 hdd7.27599 1.0 7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB 11 GiB 3.6 TiB 50.50 1.16 35 up osd.36 40 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.3 TiB 2.4 GiB 9.1 GiB 3.9 TiB 46.15 1.06 37 up osd.40 44 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.2 GiB 3.9 TiB 46.28 1.06 36 up osd.44 48 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 92 MiB 8.8 GiB 4.0 TiB 44.88 1.03 33 up osd.48 52 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.0 GiB 4.0 TiB 44.86 1.03 33 up osd.52 56 hdd7.27599 1.0 7.3 TiB 2.9 TiB 2.9 TiB 23 MiB 8.3 GiB 4.4 TiB 39.79 0.92 34 up osd.56 60 hdd7.27599 1.0 7.3 TiB 3.0 TiB 3.0 TiB 40 MiB 8.3 GiB 4.3 TiB 41.12 0.95 30 up osd.60 -5116.41600- 116 TiB 54 TiB 54 TiB 30 GiB 150 GiB 63 TiB 46.12 1.06 -host lpnceph02 1 hdd7.27599 1.0 7.3 TiB 3.2 TiB 3.2 TiB 2.2 GiB 8.9 GiB 4.0 TiB 44.53 1.02 37 up osd.1 5 hdd7.27599 1.0 7.3 TiB 3.1 TiB 3.1 TiB 24 MiB 8.3 GiB 4.2 TiB 42.56 0.98 34 up osd.5 9 hdd7.27599 1.0 7.3 TiB 3.8 TiB 3.8 TiB 42 MiB 11 GiB 3.4 TiB 52.61 1.21 38 up osd.9 13 hdd7.27599 1.0 7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 9.7 GiB 4.2 TiB 42.89 0.99 36 up osd.13 17 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.1 GiB 3.9 TiB 46.80 1.08 36 up osd.17 21 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 41 MiB 9.2 GiB 4.0 TiB 44.90 1.03 33 up osd.21 25 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.5 TiB 2.4 GiB 9.4 GiB 3.7 TiB 48.75 1.12 38 up osd.25 29 hdd7.27599 1.0 7.3 TiB 3.0 TiB 3.0 TiB 2.3 GiB 8.7 GiB 4.2 TiB 41.91 0.96 34 up osd.29 33 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.4 GiB 3.9 TiB 46.60 1.07 36 up osd.33 37 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.5 TiB 4.6 GiB 10 GiB 3.8 TiB 47.90 1.10 34 up osd.37 41 hdd7.27599 1.0 7.3 TiB 3.3 
TiB 3.3 TiB 2.2 GiB 11 GiB 3.9 TiB 45.91 1.06 33 up osd.41 45 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.3 GiB 3.9 TiB 46.85 1.08 35 up osd.45 49 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 8.9 GiB 4.0 TiB 45.35 1.04 36 up osd.49 53 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 36 MiB 9.0 GiB 4.0 TiB 44.85 1.03 33 up osd.53 57 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 9.0 GiB 4.0 TiB 45.67 1.05 36 up osd.57 61 hdd7.27599 1.0 7.3 TiB 3.6 TiB 3.6 TiB 2.4 GiB 9.8 GiB 3.7 TiB 49.75 1.14 36 up osd.61 -9116.41600- 116 TiB 56 TiB 56 TiB 35 GiB 159 GiB 61 TiB 48.03 1.10 -host l
[ceph-users] Re: Balancing with upmap
d": "Sun Jan 31 17:07:47 2021" } Can the crush rules for placement be blamed for the inequal repartition ? F. Le 29/01/2021 à 23:44, Dan van der Ster a écrit : Thanks, and thanks for the log file OTR which simply showed: 2021-01-29 23:17:32.567 7f6155cae700 4 mgr[balancer] prepared 0/10 changes This indeed means that balancer believes those pools are all balanced according to the config (which you have set to the defaults). Could you please also share the output of `ceph osd df tree` so we can see the distribution and OSD weights? You might need simply to decrease the upmap_max_deviation from the default of 5. On our clusters we do: ceph config set mgr mgr/balancer/upmap_max_deviation 1 Cheers, Dan On Fri, Jan 29, 2021 at 11:25 PM Francois Legrand wrote: Hi Dan, Here is the output of ceph balancer status : /ceph balancer status// //{// //"last_optimize_duration": "0:00:00.074965", // //"plans": [], // //"mode": "upmap", // //"active": true, // //"optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", // //"last_optimize_started": "Fri Jan 29 23:13:31 2021"// //}/ F. Le 29/01/2021 à 10:57, Dan van der Ster a écrit : Hi Francois, What is the output of `ceph balancer status` ? Also, can you increase the debug_mgr to 4/5 then share the log file of the active mgr? Best, Dan On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand wrote: Thanks for your suggestion. I will have a look ! But I am a bit surprised that the "official" balancer seems so unefficient ! F. Le 28/01/2021 à 12:00, Jonas Jelten a écrit : Hi! We also suffer heavily from this so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer After you run it, it echoes the PG movements it suggests. You can then just run those commands the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :) -- Jonas On 27/01/2021 17.15, Francois Legrand wrote: Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in december and the rest of 8TB) running nautilus 14.2.16. I moved (8 month ago) from crush_compat to upmap balancing. But the cluster seems not well balanced, with a number of pgs on the 8TB disks varying from 26 to 52 ! And an occupation from 35 to 69%. The recent 16 TB disks are more homogeneous with 48 to 61 pgs and space between 30 and 43%. Last week, I realized that some osd were maybe not using upmap because I did a ceph osd crush weight-set ls and got (compat) as result. Thus I ran a ceph osd crush weight-set rm-compat which triggered some rebalancing. Now there is no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB and around 50% usage on all. Which is not the case (by far). The problem is that it impact the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on space available before the first osd will be full ! Is it normal ? Did I missed something ? What could I do ? F. 
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
32.567 7f6155cae700 4 mgr[balancer] prepared 0/10 changes This indeed means that balancer believes those pools are all balanced according to the config (which you have set to the defaults). Could you please also share the output of `ceph osd df tree` so we can see the distribution and OSD weights? You might need simply to decrease the upmap_max_deviation from the default of 5. On our clusters we do: ceph config set mgr mgr/balancer/upmap_max_deviation 1 Cheers, Dan On Fri, Jan 29, 2021 at 11:25 PM Francois Legrand wrote: Hi Dan, Here is the output of ceph balancer status : /ceph balancer status// //{// //"last_optimize_duration": "0:00:00.074965", // //"plans": [], // //"mode": "upmap", // //"active": true, // //"optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", // //"last_optimize_started": "Fri Jan 29 23:13:31 2021"// //}/ F. Le 29/01/2021 à 10:57, Dan van der Ster a écrit : Hi Francois, What is the output of `ceph balancer status` ? Also, can you increase the debug_mgr to 4/5 then share the log file of the active mgr? Best, Dan On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand wrote: Thanks for your suggestion. I will have a look ! But I am a bit surprised that the "official" balancer seems so unefficient ! F. Le 28/01/2021 à 12:00, Jonas Jelten a écrit : Hi! We also suffer heavily from this so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer After you run it, it echoes the PG movements it suggests. You can then just run those commands the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :) -- Jonas On 27/01/2021 17.15, Francois Legrand wrote: Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in december and the rest of 8TB) running nautilus 14.2.16. I moved (8 month ago) from crush_compat to upmap balancing. But the cluster seems not well balanced, with a number of pgs on the 8TB disks varying from 26 to 52 ! And an occupation from 35 to 69%. The recent 16 TB disks are more homogeneous with 48 to 61 pgs and space between 30 and 43%. Last week, I realized that some osd were maybe not using upmap because I did a ceph osd crush weight-set ls and got (compat) as result. Thus I ran a ceph osd crush weight-set rm-compat which triggered some rebalancing. Now there is no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB and around 50% usage on all. Which is not the case (by far). The problem is that it impact the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on space available before the first osd will be full ! Is it normal ? Did I missed something ? What could I do ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
Hi Dan,
Here is the output of ceph balancer status:
ceph balancer status
{
    "last_optimize_duration": "0:00:00.074965",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Fri Jan 29 23:13:31 2021"
}
F.

On 29/01/2021 at 10:57, Dan van der Ster wrote:
Hi Francois,
What is the output of `ceph balancer status` ? Also, can you increase the debug_mgr to 4/5 then share the log file of the active mgr?
Best, Dan

On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand wrote:
Thanks for your suggestion. I will have a look! But I am a bit surprised that the "official" balancer seems so inefficient!
F.

On 28/01/2021 at 12:00, Jonas Jelten wrote:
Hi! We also suffer heavily from this, so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer
After you run it, it echoes the PG movements it suggests. You can then just run those commands and the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :)
-- Jonas

On 27/01/2021 17.15, Francois Legrand wrote:
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with the number of PGs on the 8TB disks varying from 26 to 52, and a usage from 35 to 69%! The recent 16TB disks are more homogeneous, with 48 to 61 PGs and usage between 30 and 43%. Last week, I realized that some OSDs were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as the result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of PGs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 PGs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264 TiB while there is more than 578 TiB free in the cluster), because free space seems to be based on the space available before the first OSD becomes full! Is this normal? Did I miss something? What could I do? F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
Thanks for your suggestion. I will have a look! But I am a bit surprised that the "official" balancer seems so inefficient!
F.

On 28/01/2021 at 12:00, Jonas Jelten wrote:
Hi! We also suffer heavily from this, so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer
After you run it, it echoes the PG movements it suggests. You can then just run those commands and the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :)
-- Jonas

On 27/01/2021 17.15, Francois Legrand wrote:
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with the number of PGs on the 8TB disks varying from 26 to 52, and a usage from 35 to 69%! The recent 16TB disks are more homogeneous, with 48 to 61 PGs and usage between 30 and 43%. Last week, I realized that some OSDs were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as the result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of PGs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 PGs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264 TiB while there is more than 578 TiB free in the cluster), because free space seems to be based on the space available before the first OSD becomes full! Is this normal? Did I miss something? What could I do? F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
Nope!

On 27/01/2021 at 17:40, Anthony D'Atri wrote:
Do you have any override reweights set to values less than 1.0? The REWEIGHT column when you run `ceph osd df`.

On Jan 27, 2021, at 8:15 AM, Francois Legrand wrote:
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with the number of PGs on the 8TB disks varying from 26 to 52, and a usage from 35 to 69%! The recent 16TB disks are more homogeneous, with 48 to 61 PGs and usage between 30 and 43%. Last week, I realized that some OSDs were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as the result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of PGs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 PGs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264 TiB while there is more than 578 TiB free in the cluster), because free space seems to be based on the space available before the first OSD becomes full! Is this normal? Did I miss something? What could I do? F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Balancing with upmap
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with a number of pgs on the 8TB disks varying from 26 to 52 ! And an occupation from 35 to 69%. The recent 16TB disks are more homogeneous, with 48 to 61 pgs and usage between 30 and 43%. Last week, I realized that some osds were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (weighted by their capacity). Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on the space available before the first osd becomes full ! Is this normal ? Did I miss something ? What could I do ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
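One way to see what upmap could still do, independently of the mgr balancer, is to run the optimizer offline against a copy of the osdmap; the pool name and the limit of 50 below are placeholders:
  ceph osd getmap -o om
  osdmaptool om --upmap out.sh --upmap-pool <pool> --upmap-max 50 --upmap-deviation 1
  # review out.sh, then apply it with: bash out.sh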
[ceph-users] Re: add server in crush map before osd
Thanks for your advice. It was exactly what I needed. Indeed, I did a : ceph osd crush add-bucket <host> host ; ceph osd crush move <host> room=<room> But also set the norecover, nobackfill and norebalance flags :-) It worked perfectly as expected. F. Le 03/12/2020 à 01:50, Reed Dier a écrit : Just to piggyback on this, the below are the correct answers. However, here is how I do it, which is admittedly not the best way, but it is the easy way. I set the norecover, nobackfill flags. I run my osd creation script against the first disk on the new host to make sure that everything is working correctly, and also so that I can then manually move my new host bucket where I need it in the crush map with ceph osd crush move {bucket-name} {bucket-type}={bucket-name} Then I proceed with my script for the rest of the OSDs on that host and know that they will fall into the correct crush location. And then of course I unset the norecover, nobackfill flags so that data starts moving. I only mention this because it ensures that you don't fat-finger the hostname on manual bucket creation, or that the hostname syntax doesn't match as expected, and it allows you to course correct after a single OSD is added, rather than all N OSDs. Hope that's also helpful. Reed On Dec 2, 2020, at 4:38 PM, Dan van der Ster wrote: Hi Francois! If I've understood your question, I think you have two options. 1. You should be able to create an empty host then move it into a room before creating any osd: ceph osd crush add-bucket <host> host ; ceph osd crush mv <host> room=<room> 2. Add a custom crush location to ceph.conf on the new server so that its osds are placed in the correct room/rack/host when they are first created, e.g. [osd] crush location = room=0513-S-0034 rack=SJ04 host=cephdata20b-b7e4a773b6 Does that help? Cheers, Dan On Wed, Dec 2, 2020 at 11:29 PM Francois Legrand wrote: Hello, I have a ceph nautilus cluster. The crushmap is organized with 2 rooms, servers in these rooms and osds in these servers. I have a crush rule to replicate data over the servers in different rooms. Now, I want to add a new server in one of the rooms. My point is that I would like to specify the room of this new server BEFORE creating osds in this server (so the data added to the osds will go directly to the right location). My problem is that servers seem to appear in the crushmap only when they have osds... and when you create a first osd, the server is inserted in the crushmap under the default bucket (so not in a room, and then the first data stored in this osd will not be at the correct location). I could move it afterwards (if I do it quickly, there will not be that much data to move), but I was wondering if there is a way to either define the position of a server in the crushmap hierarchy before creating osds, or to specify the room when creating the first osd ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
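Putting the above together, a minimal sketch of the whole procedure; the host and room names are placeholders, and the new OSDs are assumed to be created with your usual tooling (ceph-volume, a deployment script, etc.):
  ceph osd set norecover ; ceph osd set nobackfill ; ceph osd set norebalance
  ceph osd crush add-bucket cephserver07 host
  ceph osd crush move cephserver07 room=room1
  # create the OSDs on the new server; they land under the pre-created host bucket
  ceph osd unset norecover ; ceph osd unset nobackfill ; ceph osd unset norebalance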
[ceph-users] add server in crush map before osd
Hello, I have a ceph nautilus cluster. The crushmap is organized with 2 rooms, servers in these rooms and osds in these servers. I have a crush rule to replicate data over the servers in different rooms. Now, I want to add a new server in one of the rooms. My point is that I would like to specify the room of this new server BEFORE creating osds in this server (so the data added to the osds will go directly to the right location). My problem is that servers seem to appear in the crushmap only when they have osds... and when you create a first osd, the server is inserted in the crushmap under the default bucket (so not in a room, and then the first data stored in this osd will not be at the correct location). I could move it afterwards (if I do it quickly, there will not be that much data to move), but I was wondering if there is a way to either define the position of a server in the crushmap hierarchy before creating osds, or to specify the room when creating the first osd ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osd regularly wrongly marked down
Hello, During the night the osd.16 crashed after hitting a suicide timout. Thus this morning I did a ceph-kvstore-tool compact and restarted the osd. I thus compared the results of ceph daemon osd.16 perf dump I had before (i.e. yesterday) and now (after compaction). I noticed a interresting difference in msgr_active_connections. Before the compaction it was, for all AsyncMessenger::Worker-0, 1 and 2 at a crasy value (18446744073709550998) and get back to something comparable to what I have for other osds (72). Does this helps you to identify the problem ? F. Le 31/08/2020 à 15:59, Wido den Hollander a écrit : On 31/08/2020 15:44, Francois Legrand wrote: Thanks Igor for your answer, We could try do a compaction of RocksDB manually, but it's not clear to me if we have to compact on the mon with something like ceph-kvstore-tool rocksdb /var/lib/ceph/mon/mon01/store.db/ compact or on the concerned osd with ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-16/ compact (or for all osd with a script like in https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 ) You would compact the OSDs, not the MONs. So the last command or my script which you linked there. For my culture, how does compaction works ? Is it done automatically in background, regularly, at startup ? Usually it's done by the OSD in the background, but sometimes an offline compact works best. Because in the logs of the osd we have every 10mn some reports about compaction (which suggests that compaction occurs regularly), like : Yes, that is normal. But the offline compaction is sometimes more effective than the online ones are. 2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:777] --- DUMPING STATS --- 2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:778] ** DB Stats ** Uptime(secs): 449404.8 total, 600.0 interval Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0 writes per commit group, ingest: 0.28 GB, 0.00 MB/s Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync, written: 0.28 GB, 0.00 MB/s Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes per commit group, ingest: 0.22 MB, 0.00 MB/s Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written: 0.00 MB, 0.00 MB/s Interval stall: 00:00:0.000 H:M:S, 0.0 percent ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop L0 1/0 60.48 MB 0.2 0.0 0.0 0.0 0.1 0.1 0.0 1.0 0.0 163.7 0.52 0.40 2 0.258 0 0 L1 0/0 0.00 KB 0.0 0.1 0.1 0.0 0.1 0.1 0.0 0.5 48.2 26.1 2.32 0.64 1 2.319 920K 197K L2 17/0 1.00 GB 0.8 1.1 0.1 1.1 1.1 0.0 0.0 18.3 69.8 67.5 16.38 4.97 1 16.380 4747K 82K L3 81/0 4.50 GB 0.9 0.6 0.1 0.5 0.3 -0.2 0.0 4.3 66.9 36.6 9.23 4.95 2 4.617 9544K 802K L4 285/0 16.64 GB 0.1 2.4 0.3 2.0 0.2 -1.8 0.0 0.8 110.3 11.7 21.92 4.37 5 4.384 12M 12M Sum 384/0 22.20 GB 0.0 4.2 0.6 3.6 1.8 -1.8 0.0 21.8 85.2 36.6 50.37 15.32 11 4.579 28M 13M Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 ** Compaction Stats [default] ** Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop --- Low 0/0 0.00 KB 0.0 4.2 0.6 3.6 1.7 -1.9 0.0 0.0 86.0 35.3 49.86 14.92 9 5.540 28M 13M High 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.1 0.1 0.0 0.0 0.0 150.2 0.40 0.40 1 0.403 0 0 User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 211.7 0.11 0.00 1 0.114 0 0 Uptime(secs): 449404.8 total, 600.0 interval Flush(GB): cumulative 0.083, interval 0.000 AddFile(Total Files): cumulative 0, interval 0 AddFile(L0 Files): cumulative 0, interval 0 AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB r
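A quick way to read the counter mentioned above directly, assuming jq is installed; Worker-0 is just one of the three AsyncMessenger workers, so repeat for Worker-1 and Worker-2:
  ceph daemon osd.16 perf dump | jq '."AsyncMessenger::Worker-0".msgr_active_connections'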
[ceph-users] Re: osd regularly wrongly marked down
idays time (only a few KB/s of io and no recover). We have no standalone fast drive for DB/WAL and nothing in the osds (nor mons) logs suggesting any problem (apart the heartbeat_map is_healthy timeout). Thanks F. Le 31/08/2020 à 12:15, Igor Fedotov a écrit : Hi Francois, given that slow operations are observed for collection listings you might want to manually compact RocksDB using ceph-kvstore-tool. The observed slowdown tends to happen after massive data removals. I've seen multiple compains about this issue including some post in this mailing list. BTW I can see your post from Jun 24 about slow pool removal - couldn't this be a trigger? Also wondering whether you have standalone fast(SSD/NVMe) drive for DB/WAL? Aren't there any BlueFS spillovers which might be relevant? Thanks, Igor On 8/28/2020 11:33 AM, Francois Legrand wrote: Hi all, We have a ceph cluster in production with 6 osds servers (with 16x8TB disks), 3 mons/mgrs and 3 mdss. Both public and cluster networks are in 10GB and works well. After a major crash in april, we turned the option bluefs_buffered_io to false to workaround the large write bug when bluefs_buffered_io was true (we were in version 14.2.8 and the default value at this time was true). Since that time, we regularly have some osds wrongly marked down by the cluster after heartbeat timeout (heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15). Generally the osd restart and the cluster is back healthy, but several time, after many of these kick-off the osd reach the osd_op_thread_suicide_timeout and goes down definitely. We increased the osd_op_thread_timeout and osd_op_thread_suicide_timeout... The problems still occurs (but less frequently). Few days ago, we upgraded to 14.2.11 and revert the timeout to their default value, hoping that it will solve the problem (we thought that it should be related to this bug https://tracker.ceph.com/issues/45943), but it didn't. We still have some osds wrongly marked down. Can somebody help us to fix this problem ? Thanks. Here is an extract of an osd log at failure time: - 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub starts 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 this last line is repeated more than 1000 times ... 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:20:17.551 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 67.3532s, lat = 67s cid =44.7d_head start GHMAX end GHMAX max 25 ... 
2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.774 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 63.223s, lat = 63s cid =44.7d_head start #44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end #MAX# max 2147483647 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub ok 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.16 down, but it is still running 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG] : map e609411 wrongly marked me down at e609410 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411 start_waiting_for_healthy 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410 pg[36.3d0( v 609409'481293 (449368'478292,609409'481293] local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72] r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198 crt=609409'481293 lcod 609409'481292 active mbc={}] start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] -> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1, features acting 4611087854031667199 upacting 4611087854031667199 ... 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411 pg[36.56( v 609409'480511 (449368'477424,609409'480511] local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102] r=-1 lpr=609410 pi=[609403,609410)/1 crt=60940
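For reference, a rough sketch of offline compaction for every OSD on one host, one at a time; the osd ids and the sleep are only illustrative, and it is safer to wait for the cluster to return to HEALTH_OK before moving on to the next host:
  for id in 16 17 18 ; do          # the osd ids on this host
    systemctl stop ceph-osd@$id
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$id compact
    systemctl start ceph-osd@$id
    sleep 60
  done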
[ceph-users] Re: osd regularly wrongly marked down
We tried to rise the osd_memory_target from 4 to 8G but the problem still occurs (osd wrongly marked down few times a day). Does somebody have any clue ? F. On Fri, Aug 28, 2020 at 10:34 AM Francois Legrand mailto:f...@lpnhe.in2p3.fr>> wrote: Hi all, We have a ceph cluster in production with 6 osds servers (with 16x8TB disks), 3 mons/mgrs and 3 mdss. Both public and cluster networks are in 10GB and works well. After a major crash in april, we turned the option bluefs_buffered_io to false to workaround the large write bug when bluefs_buffered_io was true (we were in version 14.2.8 and the default value at this time was true). Since that time, we regularly have some osds wrongly marked down by the cluster after heartbeat timeout (heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15). Generally the osd restart and the cluster is back healthy, but several time, after many of these kick-off the osd reach the osd_op_thread_suicide_timeout and goes down definitely. We increased the osd_op_thread_timeout and osd_op_thread_suicide_timeout... The problems still occurs (but less frequently). Few days ago, we upgraded to 14.2.11 and revert the timeout to their default value, hoping that it will solve the problem (we thought that it should be related to this bug https://tracker.ceph.com/issues/45943), but it didn't. We still have some osds wrongly marked down. Can somebody help us to fix this problem ? Thanks. Here is an extract of an osd log at failure time: - 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub starts 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 this last line is repeated more than 1000 times ... 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:20:17.551 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 67.3532s, lat = 67s cid =44.7d_head start GHMAX end GHMAX max 25 ... 
2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.774 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 63.223s, lat = 63s cid =44.7d_head start #44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end #MAX# max 2147483647 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub ok 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.16 down, but it is still running 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG] : map e609411 wrongly marked me down at e609410 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411 start_waiting_for_healthy 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410 pg[36.3d0( v 609409'481293 (449368'478292,609409'481293] local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72] r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198 crt=609409'481293 lcod 609409'481292 active mbc={}] start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] -> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1, features acting 4611087854031667199 upacting 4611087854031667199 ... 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411 pg[36.56( v 609409'480511 (449368'477424,609409'480511] local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/60
[ceph-users] osd regularly wrongly marked down
Hi all, We have a ceph cluster in production with 6 osd servers (with 16x8TB disks), 3 mons/mgrs and 3 mdss. Both public and cluster networks are 10Gb and work well. After a major crash in april, we turned the option bluefs_buffered_io to false to work around the large-write bug present when bluefs_buffered_io was true (we were on version 14.2.8 and the default value at this time was true). Since that time, we regularly have some osds wrongly marked down by the cluster after a heartbeat timeout (heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15). Generally the osd restarts and the cluster is back healthy, but several times, after many of these kick-offs, the osd reaches the osd_op_thread_suicide_timeout and goes down for good. We increased the osd_op_thread_timeout and osd_op_thread_suicide_timeout... The problem still occurs (but less frequently). A few days ago, we upgraded to 14.2.11 and reverted the timeouts to their default values, hoping that it would solve the problem (we thought it might be related to this bug https://tracker.ceph.com/issues/45943), but it didn't. We still have some osds wrongly marked down. Can somebody help us to fix this problem ? Thanks. Here is an extract of an osd log at failure time: - 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub starts 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 this last line is repeated more than 1000 times ... 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:20:17.551 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 67.3532s, lat = 67s cid =44.7d_head start GHMAX end GHMAX max 25 ...
2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.774 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 63.223s, lat = 63s cid =44.7d_head start #44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end #MAX# max 2147483647 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub ok 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.16 down, but it is still running 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG] : map e609411 wrongly marked me down at e609410 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411 start_waiting_for_healthy 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410 pg[36.3d0( v 609409'481293 (449368'478292,609409'481293] local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72] r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198 crt=609409'481293 lcod 609409'481292 active mbc={}] start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] -> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1, features acting 4611087854031667199 upacting 4611087854031667199 ... 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411 pg[36.56( v 609409'480511 (449368'477424,609409'480511] local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102] r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511 lcod 609409'480510 unknown NOTIFY mbc={}] state: transitioning to Stray 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity public network em1 numa node 0 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity cluster network em2 numa node 0 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity objectstore and network numa nodes do not match 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity not setting numa affinity 2020-08-28 02:21:21.566 7f040a435700 1 osd.16 609413 tick checking mon for new map 2020-08-28 02:21:22.515 7f03fd997700 1 osd.16 609414 state: booting -> active 2020-08-28 02:21:22.515 7f03f0382700 1 osd.16 pg_epoch: 609414 pg[36.20( v 609409'483167 (449368'480117,609409'483167] local-lis/les=609403/609404 n=155171 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609414/609414/609361) [97,16,72] r=1 lpr=609414 pi=[609403,609414)/1 crt=609409'483167 lcod 609409'483166 unknown NO
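A sketch of what can be inspected while such an osd is stalled, plus how the two timeouts discussed above could be raised centrally; the values are examples only, and these options may only take effect after a daemon restart:
  ceph daemon osd.16 dump_ops_in_flight
  ceph daemon osd.16 dump_historic_ops      # look for long _collection_list / scrub ops
  ceph config set osd osd_op_thread_timeout 60
  ceph config set osd osd_op_thread_suicide_timeout 300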
[ceph-users] osd crashing and rocksdb corruption
Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" ? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost osds. Operations went fine until Saturday 25, when some osds in the 5 remaining servers started to crash for no apparent reason. We tried to restart them, but they crashed again. We ended with 18 osds down (+ 16 in the dead server, so 34 osds down out of 100). Looking at the logs we found, for all the crashed osds : -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" message was present a few days before the crash. We also have some osds with this error but still up. We tried to repair with : ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea to fix this, or at least know whether it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors ? Thanks for your help (we are desperate because we will lose data and are fighting to save something) !!! F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
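For what it's worth, a sketch of the usual next diagnostics on an OSD in that state; these are standard tools (a deep fsck can also be requested with the --deep option), but there is no guarantee they can recover from checksum corruption caused by a power loss. The osd id 3 is taken from the log above:
  systemctl stop ceph-osd@3
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3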
[ceph-users] ceph nautilus repository index is incomplete
Hello, It seems that the index of https://download.ceph.com/rpm-nautilus/el7/x86_64/ repository is wrong. Only the 14.2.10-0.el7 version is available (all previous versions are missing despite the fact that the rpms are present in the repository). It thus seems that the index needs to be corrected. Who can I contact for that ? Thanks. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Thanks. I also added osd_op_queue_cut_off to high in global (as you mentioned in a previous thread that osd and mds should use it). F. Le 26/06/2020 à 16:35, Frank Schilder a écrit : I never tried "prio" out, but the reports I have seen claim that prio is inferior. However, as far as I know it is safe to change these settings. Unfortunately, you need to restart services to apply the changes. Before you do, check if *all* daemons are using the same setting. Contrary to the naming (osd_*), this setting applies to all daemons. I added it to the global options and, most notably, performance of the MDS was improved a lot. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 26 June 2020 15:03:23 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow I changed osd_op_queue_cut_off to high and rebooted all the osds. But the result is more or less the same (storage is still extremely slow, 2h30 to rdb extract a 64GB image !). The only improvement is that it seems that degraded pgs have disapeared (which is at least a good point). It seems that there is a problem in priority of operations. Thus do you think (and also others on the list) that changing the osd_op_queue setting could help (change to prio or mclock_client). What are the risks or secondary effects of trying mclock_client on a production cluster (is it safe) ? F. Le 26/06/2020 à 09:46, Frank Schilder a écrit : I'm using osd_op_queue = wpq osd_op_queue_cut_off = high and these settings are recommended. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 26 June 2020 09:44:00 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
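A small sketch for applying and then verifying the setting; note that the op queue options are only read at daemon startup, so `ceph config get` shows the stored value while the admin socket shows what a running daemon is actually using:
  ceph config set global osd_op_queue_cut_off high
  ceph config get osd osd_op_queue_cut_off
  ceph daemon osd.0 config get osd_op_queue_cut_off    # on an osd host, after restart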
[ceph-users] Re: ceph qos
Does somebody use mclock in a production cluster ? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
I changed osd_op_queue_cut_off to high and rebooted all the osds. But the result is more or less the same (storage is still extremely slow, 2h30 to rdb extract a 64GB image !). The only improvement is that it seems that degraded pgs have disapeared (which is at least a good point). It seems that there is a problem in priority of operations. Thus do you think (and also others on the list) that changing the osd_op_queue setting could help (change to prio or mclock_client). What are the risks or secondary effects of trying mclock_client on a production cluster (is it safe) ? F. Le 26/06/2020 à 09:46, Frank Schilder a écrit : I'm using osd_op_queue = wpq osd_op_queue_cut_off = high and these settings are recommended. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 26 June 2020 09:44:00 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Thanks. I will try to change osd_op_queue_cut_off to high and restart everything (and use this downtime to upgrade the servers). F. Le 26/06/2020 à 09:46, Frank Schilder a écrit : I'm using osd_op_queue = wpq osd_op_queue_cut_off = high and these settings are recommended. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 26 June 2020 09:44:00 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
I think he means that after a disk failure he waits for the cluster to get back to ok (so all data on the lost disk has been reconstructed elsewhere) and then the disk is changed. In that case it's normal to have misplaced objects (because with the new disk some pgs need to be migrated to populate this new space), but degraded pgs do not seem to be the right behaviour ! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
For sure, If I could downgrade to mimic I would probably do it !!! So I understand that you plan not to upgrade ! F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
I also had this kind of symptom with nautilus. Replacing a failed disk (from a healthy cluster) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I did some operation on the ceph cluster, like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remounted their system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access the ceph storage using rbd (it fails with a timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Thanks for the hint. I tried it but it doesn't seem to change anything... Moreover, as the osds seem quite loaded, I regularly had some osds marked down, which triggered new peering and thus more load !!! I set the osd nodown flag, but I still have some osds reported (wrongly) as down (and back up within the minute), which generates peering and remapping. I don't really understand the effect of the nodown parameter ! Is there a way to tell ceph not to peer immediately after an osd is reported down (let's say, wait for 60s) ? I am thinking about restarting all osds (or maybe the whole cluster) to get osd_op_queue_cut_off changed to high and osd_op_thread_timeout to something higher than 15 (but I don't think it will really improve the situation). F. Le 25/06/2020 à 14:26, Wout van Heeswijk a écrit : Hi Francois, Have you already looked at the option "osd_delete_sleep"? It will not speed up the process but it will give you some control over your cluster performance. Something like: ceph tell osd.\* injectargs '--osd_delete_sleep 1' kind regards, Wout 42on On 25-06-2020 09:57, Francois Legrand wrote: Does someone have an idea ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
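As a sketch, the two settings touched on above, applied persistently rather than with injectargs; the value of 1 second is only an example, and nodown suppresses down markings entirely, so unset it as soon as possible:
  ceph config set osd osd_delete_sleep 1
  ceph osd set nodown
  ceph osd unset nodown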
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Does someone have an idea ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Removing pool in nautilus is incredibly slow
Hello, I am running ceph nautilus 14.2.8. I had to remove 2 pools (old cephfs data and metadata pools with 1024 pgs). The removal of the pools seems to take an incredible amount of time to free the space (the data pool I deleted was more than 100 TB and in 36h I got back only 10TB). In the meantime, the cluster is extremely slow (an rbd extract takes ~1h30 for a 32 GB image and writing 10MB in cephfs takes half a minute !!), which makes the cluster almost unusable. It seems that the removal of deleted pgs is done by deep-scrubs according to https://medium.com/opsops/a-very-slow-pool-removal-7089e4ac8301 Also it has been reported that this could be a regression in nautilus: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/W4M5XQRDBLXFGJGDYZALG6TQ4QBVGGAJ/#W4M5XQRDBLXFGJGDYZALG6TQ4QBVGGAJ But I couldn't find a fix or a way to speed up (or slow down) the process and bring the cluster back to decent responsiveness. Is there a way ? Thanks F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to remove one of two filesystems
Thanks a lot. It works. I could delete the filesystem and remove the pools (data and metadata). But now I am facing another problem, which is that the removal of the pools seems to take an incredible amount of time to free the space (the pool I deleted was about 100TB and in 36h I got back only 10TB). In the meantime, the cluster is extremely slow (an rbd extract takes ~30 mn for a 9 GB image and writing 10MB in cephfs takes half a minute !!), which makes the cluster almost unusable. It seems that the removal of deleted pgs is done by deep-scrubs according to https://medium.com/opsops/a-very-slow-pool-removal-7089e4ac8301 But I couldn't find a way to speed up the process or bring the cluster back to decent responsiveness. Do you have a suggestion ? F. Le 22/06/2020 à 16:40, Patrick Donnelly a écrit : On Mon, Jun 22, 2020 at 7:29 AM Frank Schilder wrote: Use ceph fs set <fs_name> down true ; after this all mdses of fs <fs_name> will become standbys. Now you can cleanly remove everything. Wait for the fs to be shown as down in ceph status, the command above is non-blocking but the shutdown takes a long time. Try to disconnect all clients first. If you're planning to delete the file system, it is faster to just do: ceph fs fail <fs_name> which will remove all the MDS and mark the cluster as not joinable. See also: https://docs.ceph.com/docs/master/cephfs/administration/#taking-the-cluster-down-rapidly-for-deletion-or-disaster-recovery ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] How to remove one of two filesystems
Hello, I have a ceph cluster (nautilus 14.2.8) with 2 filesystems and 3 mds. mds1 is managing fs1 mds2 manages fs2 mds3 is standby I want to completely remove fs1. It seems that the command to use is ceph fs rm fs1 --yes-i-really-mean-it and then delete the data and metadata pools with ceph osd pool delete but in many threads I noticed that you must shutdown the mds before running ceph fs rm. Is it still the case ? What happens in my configuration (I have 2 fs) ? If I stop mds1, the mds3 will take the management. If I stop mds3 what will mds2 do (try to manage the 2 fs or continue only with fs2) ? Thanks for your advices. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
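For reference, a minimal sketch of the sequence that ended up working in this thread; the filesystem and pool names are placeholders, and pool deletion also requires mon_allow_pool_delete to be enabled:
  ceph fs fail fs1
  ceph fs rm fs1 --yes-i-really-mean-it
  ceph config set mon mon_allow_pool_delete true
  ceph osd pool delete fs1_data fs1_data --yes-i-really-really-mean-it
  ceph osd pool delete fs1_metadata fs1_metadata --yes-i-really-really-mean-it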
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi, Actually I let the mds managing the damaged filesystem as it is because the files can be read (despite of the warning and errors). Thus I restarted the rsyncs to transfer everything to the new filesystem (thus on different PG because it's a different cephfs with different pools) but without deleting the olds files to avoid killing definitively the old mds and the old fs. The number of segment is then more or less stable (very high ~123611 but not increasing too much). I guess that we will have enought space to copy the remaining datas (it will be short but I think it will pass). Once everything will be transfered and checked, I will destroy the old FS and the damaged pool. F. Le 09/06/2020 à 19:50, Frank Schilder a écrit : Looks like an answer to your other thread takes its time. Is it a possible option for you to - copy all readable files using this PG to some other storage, - remove or clean up the broken PG and - copy the files back in? This might lead to a healthy cluster. I don't know a proper procedure though. Somehow the ceph fs must play along as files using this will also use other PGs and get partly broken. Have you found other options? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 08 June 2020 16:38:18 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted I already had some discussion on the list about this problem. But I should ask again. We really lost some objects and there are not enought shards to reconstruct them (it's an erasure coding data pool)... so it cannot be fixed anymore and we know we have data loss ! I did not marked the PG out because there are still some parts (objects) which are still present and we hope to be able to copy them and save a few bytes more ! It would be great to be able to flush only broken objects, but I don't know how to do that, even if it's possible ! I thus run some cephfs-data-scan pg_files to identify the files with data on this pg and the I run a grep -q -m 1 "." "/path_to_damaged_file" to identify the ones which are really empty (we tested different way to do this and it seems that's the fastest). F. Le 08/06/2020 à 16:07, Frank Schilder a écrit : OK, now we are talking. It is very well possible that trimming will not start until this operation is completed. If there are enough shards/copies to recover the lost objects, you should try a pg repair first. If you did loose too many replicas, there are ways to flush this PG out of the system. You will loose data this way. I don't know how to repair or flush only broken objects out of a PG, but would hope that this is possible. Before you do anything destructive, open a new thread in this list specifically for how to repair/remove this PG with the least possible damage. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 08 June 2020 16:00:28 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted There is no recovery going on, but indeed we have a pg damaged (with some lost objects due to a major crash few weeks ago)... and there are some shards of this pg on osd 27 ! That's also why we are migrating all the data out of this FS ! It's certainly related and I guess that it's trying to remove some datas that are already lost and it get stuck ! I don't know if there is a way to tell ceph to forget about these ops ! I guess no. 
I thus think that there is not that much to do apart from reading as much data as we can to save as much as possible. F. Le 08/06/2020 à 15:48, Frank Schilder a écrit : That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 08 June 2020 15:27:59 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients
[ceph-users] Broken PG in cephfs data_pool (lost objects)
Hi all, We have a cephfs with its data_pool in erasure coding (3+2) and 1024 pgs (nautilus 14.2.8). One of the pgs is partially destroyed (we lost 3 osds, thus 3 shards); it has 143 objects unfound and is stuck in state "active+recovery_unfound+undersized+degraded+remapped". We then lost some data (we are using cephfs-data-scan pg_files... to identify files with data on the bad pg). We thus created a new filesystem (this time with the data_pool in replica 3) and we are copying all the data from the broken FS to the new one. But we need to remove files from the broken FS after the copy to free space (because there will not be enough space on the cluster). To avoid problems with strays we removed the snapshots on the broken FS before deleting files. The point is that the mds managing the broken FS is now "Behind on trimming (123036/128) max_segments: 128, num_segments: 123036" and has "1 slow metadata IOs are blocked > 30 secs, oldest blocked for 83645 secs". The slow IO corresponds to osd 27, which is acting_primary for the broken PG, and the broken pg has a long "snap_trimq": "[1e0c~1,1e0e~1,1e12~1,1e16~1,1e18~1,1e1a~1," and "snap_trimq_len": 460. It then seems that cephfs is not able to trim ops corresponding to the deletion of objects and snaps which have data on the broken PG, probably because the pg is not healthy. Is there a way to tell ceph/cephfs to flush or forget about (only) the lost objects on the broken pg and get this pg healthy enough to perform trimming ? Thanks for your help F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
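For completeness, the standard (and destructive) way to tell ceph to give up on the unfound objects of one PG, once everything readable has been copied out; the pgid is a placeholder, and 'delete' can be used instead of 'revert' when no prior version should be restored:
  ceph pg <pgid> list_unfound          # list_missing on older releases
  ceph pg <pgid> mark_unfound_lost revert
  # or: ceph pg <pgid> mark_unfound_lost delete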
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
I already had some discussion on the list about this problem. But I should ask again. We really lost some objects and there are not enought shards to reconstruct them (it's an erasure coding data pool)... so it cannot be fixed anymore and we know we have data loss ! I did not marked the PG out because there are still some parts (objects) which are still present and we hope to be able to copy them and save a few bytes more ! It would be great to be able to flush only broken objects, but I don't know how to do that, even if it's possible ! I thus run some cephfs-data-scan pg_files to identify the files with data on this pg and the I run a grep -q -m 1 "." "/path_to_damaged_file" to identify the ones which are really empty (we tested different way to do this and it seems that's the fastest). F. Le 08/06/2020 à 16:07, Frank Schilder a écrit : OK, now we are talking. It is very well possible that trimming will not start until this operation is completed. If there are enough shards/copies to recover the lost objects, you should try a pg repair first. If you did loose too many replicas, there are ways to flush this PG out of the system. You will loose data this way. I don't know how to repair or flush only broken objects out of a PG, but would hope that this is possible. Before you do anything destructive, open a new thread in this list specifically for how to repair/remove this PG with the least possible damage. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Francois Legrand Sent: 08 June 2020 16:00:28 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted There is no recovery going on, but indeed we have a pg damaged (with some lost objects due to a major crash few weeks ago)... and there are some shards of this pg on osd 27 ! That's also why we are migrating all the data out of this FS ! It's certainly related and I guess that it's trying to remove some datas that are already lost and it get stuck ! I don't know if there is a way to tell ceph to forget about these ops ! I guess no. I thus think that there is not that much to do apart from reading as much data as we can to save as much as possible. F. Le 08/06/2020 à 15:48, Frank Schilder a écrit : That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 08 June 2020 15:27:59 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is, that the MDS is replaying some operations, which is very slow due to swap. 
While doing so, the segments to trim will probably keep increasing for a while until it can start trimming. The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSDs OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out. If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful, if in doubt ask the list first. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 08 June 2020 14:45:13 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
There is no recovery going on, but indeed we have a pg damaged (with some lost objects due to a major crash few weeks ago)... and there are some shards of this pg on osd 27 ! That's also why we are migrating all the data out of this FS ! It's certainly related and I guess that it's trying to remove some datas that are already lost and it get stuck ! I don't know if there is a way to tell ceph to forget about these ops ! I guess no. I thus think that there is not that much to do apart from reading as much data as we can to save as much as possible. F. Le 08/06/2020 à 15:48, Frank Schilder a écrit : That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Francois Legrand Sent: 08 June 2020 15:27:59 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is, that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming. The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSDs OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out. If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful, if in doubt ask the list first. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Francois Legrand Sent: 08 June 2020 14:45:13 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-07 04:10:36.802 7ff866e2e700 0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy! 
2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271) and also 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363 but at the end the mds went active ! :-) I let it at rest from sunday afternoon until this morning. Indeed I was able to connect clients (in read-only for now) and read the datas. I checked the clients connected with ceph tell mds.lpnceph-mds02.in2p3.fr client ls and disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command. But I still have the following warnings MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs MDS_TRIM 1 MDSs behind on trimming mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836 and the number of segments is still rising (slowly). F. Le 08/06/2020 à 12:00, Frank Schilder a écrit : Hi Francois, did you manage to get any further with this? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 15:21:59 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay unti
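For anyone hitting the same MDS_SLOW_METADATA_IO warning, the checks discussed in this exchange boil down to roughly the following (daemon names are the ones from this thread, the "ceph daemon" commands have to be run on the host where that daemon lives, and this is a sketch rather than a verified recipe):

  # Which OSD is the MDS waiting on? Look for requests with a large age.
  ceph health detail
  ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests

  # Inspect the suspect OSD's op queue (osd.27 in this thread).
  ceph daemon osd.27 ops
  ceph daemon osd.27 dump_ops_in_flight

  # Kick the OSD so it re-peers, or restart it if the stuck op is the only one left.
  ceph osd down 27
  systemctl restart ceph-osd@27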
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is, that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming. The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSDs OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out. If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful, if in doubt ask the list first. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 08 June 2020 14:45:13 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-07 04:10:36.802 7ff866e2e700 0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy! 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271) and also 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363 but at the end the mds went active ! :-) I let it at rest from sunday afternoon until this morning. Indeed I was able to connect clients (in read-only for now) and read the datas. I checked the clients connected with ceph tell mds.lpnceph-mds02.in2p3.fr client ls and disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command. But I still have the following warnings MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs MDS_TRIM 1 MDSs behind on trimming mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836 and the number of segments is still rising (slowly). F. Le 08/06/2020 à 12:00, Frank Schilder a écrit : Hi Francois, did you manage to get any further with this? 
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 15:21:59 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS. It stopped reporting to the MONs due to laggy connection. This laggyness is a result of swapping: 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 06 June 2020 11:11 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks for the tip, I w
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-07 04:10:36.802 7ff866e2e700 0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy! 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271) and also 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363 but at the end the mds went active ! :-) I let it at rest from sunday afternoon until this morning. Indeed I was able to connect clients (in read-only for now) and read the datas. I checked the clients connected with ceph tell mds.lpnceph-mds02.in2p3.fr client ls and disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command. But I still have the following warnings MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs MDS_TRIM 1 MDSs behind on trimming mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836 and the number of segments is still rising (slowly). F. Le 08/06/2020 à 12:00, Frank Schilder a écrit : Hi Francois, did you manage to get any further with this? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 15:21:59 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS. It stopped reporting to the MONs due to laggy connection. This laggyness is a result of swapping: 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 06 June 2020 11:11 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks for the tip, I will try that. For now vm.min_free_kbytes = 90112 Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0 but this didn't change anything... -27> 2020-06-06 06:15:07.373 7f83e3626700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon Which is the same time since last acked beacon I had before changing the parameter. As mds beacon interval is 4 s setting mds_beacon_grace to 240 should lead to 960 s (16mn). 
Thus I think that the bottleneck is elsewhere. F. Le 06/06/2020 à 09:47, Frank Schilder a écrit : Hi Francois, there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value for the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with sysctl vm.min_free_kbytes In your case with heavy swap usage, this value should probably be somewhere between 2-4GB. Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM kill your machine. Make sure that plenty of pages are unused. Drop page cache if necessary or reboot the machine before setting this value. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 00:36:13 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Francois, yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60s will do. For this particular recovery operation, you might want to go much higher (1h) and wat
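Condensed from the exchange above, the settings Francois ended up applying look roughly like this (values are the ones quoted in the thread; as Frank warns further down, only touch vm.min_free_kbytes on a machine that is not already starved for memory):

  # Give the MDS more slack before the monitors declare it laggy/down.
  ceph config set global mds_beacon_grace 60

  # Reserve more memory for the kernel on the MDS host.
  echo "vm.min_free_kbytes = 4194303" > /etc/sysctl.d/sysctl-ceph.conf
  sysctl --system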
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Thanks for the tip, I will try that. For now vm.min_free_kbytes = 90112 Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0 but this didn't change anything... -27> 2020-06-06 06:15:07.373 7f83e3626700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon Which is the same time since last acked beacon I had before changing the parameter. As mds beacon interval is 4 s setting mds_beacon_grace to 240 should lead to 960 s (16mn). Thus I think that the bottleneck is elsewhere. F. Le 06/06/2020 à 09:47, Frank Schilder a écrit : Hi Francois, there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value for the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with sysctl vm.min_free_kbytes In your case with heavy swap usage, this value should probably be somewhere between 2-4GB. Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM kill your machine. Make sure that plenty of pages are unused. Drop page cache if necessary or reboot the machine before setting this value. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 00:36:13 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Francois, yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60s will do. For this particular recovery operation, you might want to go much higher (1h) and watch the cluster health closely. Good luck and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 23:51:04 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted Hi, Unfortunately adding swap did not solve the problem ! I added 400 GB of swap. It used about 18GB of swap after consuming all the ram and stopped with the following logs : 2020-06-05 21:33:31.967 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1 2020-06-05 21:33:40.355 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1 2020-06-05 21:33:59.787 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:33:59.787 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy! 2020-06-05 21:34:00.287 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:34:00.287 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy! 
2020-06-05 21:39:05.991 7f251bfe6700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.838 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning 2020-06-05 21:39:19.870 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr respawn! --- begin dump of recent events --- -> 2020-06-05 19:28:07.982 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951 -9998> 2020-06-05 19:28:11.053 7f251b7e5700 5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132 -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0 -9996> 2020-06-05 19:28:12.176 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294 -9995> 2020-06-05 19:28:12.176 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1 -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995) ... 2020-06-05 21:39:31.092 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.
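Since the 240s grace apparently did not take effect, it may be worth double-checking what the running daemons actually use; assuming access to the admin sockets, something like:

  # Value stored in the cluster configuration database (Nautilus and later).
  ceph config get mds mds_beacon_grace

  # Values the running daemons actually use (run on the respective hosts).
  ceph daemon mds.lpnceph-mds04.in2p3.fr config get mds_beacon_grace
  ceph daemon mon.lpnceph-mon02 config get mds_beacon_grace

  # Current kernel reserve on the MDS host.
  sysctl vm.min_free_kbytes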
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi, Unfortunately adding swap did not solve the problem ! I added 400 GB of swap. It used about 18GB of swap after consuming all the ram and stopped with the following logs : 2020-06-05 21:33:31.967 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1 2020-06-05 21:33:40.355 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1 2020-06-05 21:33:59.787 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:33:59.787 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy! 2020-06-05 21:34:00.287 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:34:00.287 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy! 2020-06-05 21:39:05.991 7f251bfe6700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.838 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning 2020-06-05 21:39:19.870 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr respawn! --- begin dump of recent events --- -> 2020-06-05 19:28:07.982 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951 -9998> 2020-06-05 19:28:11.053 7f251b7e5700 5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132 -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0 -9996> 2020-06-05 19:28:12.176 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294 -9995> 2020-06-05 19:28:12.176 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1 -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995) ... 2020-06-05 21:39:31.092 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby However, the mons doesn't seems particularly loaded ! So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I did it both for mds and mons daemons because it's seems to be present in both conf). I will tells you if it works. Any other clue ? F. Le 05/06/2020 à 14:44, Frank Schilder a écrit : Hi Francois, thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It will write the cache after rejoin to a file, but not drop the cache. This will not help you. 
I think this was implemented recently to make it possible to send a cache dump file to developers after an MDS crash before the restarting MDS changes the cache. In your case, I would set osd_op_queue_cut_off during the next regular cluster service or upgrade. My best bet right now is to try to add swap. Maybe someone else reading this has a better idea or you find a hint in one of the other threads. Good luck! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 05 June 2020 14:34:06 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted Le 05/06/2020 à 14:18, Frank Schilder a écrit : Hi Francois, I was also wondering if setting mds dump cache after rejoin could help ? Haven't heard of that option. Is there some documentation? I found it on : https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/ mds dump cache after rejoin Description Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery). Type Boolean Default false but I don't think it can help in my case, because rejoin occurs after replay and in my case replay never ends ! I have : osd_op_queue=wpq osd_op_queue_cut_off=low I can try to set osd_op_queue_cut_off to high, but
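Providing temporary swap, as suggested above, is just the standard procedure; a minimal sketch (file name and size are arbitrary examples, and this only makes sense on a reasonably fast SSD):

  # Create and enable a temporary swap file on the MDS host.
  dd if=/dev/zero of=/var/swapfile bs=1M count=102400   # 100 GB
  chmod 600 /var/swapfile
  mkswap /var/swapfile
  swapon /var/swapfile

  # Watch usage while the MDS replays.
  swapon --show; free -h

  # Remove it again once the MDS is active and stable.
  swapoff /var/swapfile && rm /var/swapfile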
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Le 05/06/2020 à 14:18, Frank Schilder a écrit : Hi Francois, I was also wondering if setting mds dump cache after rejoin could help ? Haven't heard of that option. Is there some documentation? I found it on : https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/ mds dump cache after rejoin Description Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery). Type Boolean Default false but I don't think it can help in my case, because rejoin occurs after replay and in my case replay never ends ! I have : osd_op_queue=wpq osd_op_queue_cut_off=low I can try to set osd_op_queue_cut_off to high, but it will be useful only if the mds get active, true ? I think so. If you have no clients connected, there should not be queue priority issues. Maybe it is best to wait until your cluster is healthy again as you will need to restart all daemons. Make sure you set this in [global]. When I applied that change and after re-starting all OSDs my MDSes had reconnect issues until I set it on them too. I think all daemons use that option (the prefix osd_ is misleading). For sure I would prefer not to restart all daemons because the second filesystem is up and running (with production clients). For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB which seems reasonable for a mds server with 32/48GB). This sounds bad. 8GB should not cause any issues. Maybe you are hitting a bug, I believe there is a regression in Nautilus. There were recent threads on absurdly high memory use by MDSes. Maybe its worth searching for these in the list. I will have a look. I already force the clients to unmount (and even rebooted the ones from which the rsync and the rmdir .snaps were launched). I don't know when the MDS acknowledges this. If is was a clean unmount (i.e. without -f or forced by reboot) the MDS should have dropped the clients already. If it was an unclean unmount it might not be that easy to get the stale client session out. However, I don't know about that. Moreover when I did that, the mds was already not active but in replay, so for sure the unmount was not acknowledged by any mds ! I think that providing more swap maybe the solution ! I will try that if I cannot find another way to fix it. If the memory overrun is somewhat limited, this should allow the MDS to trim the logs. Will take a while, but it will do eventually. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 13:46:03 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted I was also wondering if setting mds dump cache after rejoin could help ? Le 05/06/2020 à 12:49, Frank Schilder a écrit : Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories). How many rsync processes are you running in parallel? Do you have these settings enabled: osd_op_queue=wpq osd_op_queue_cut_off=high WPQ should be default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind trimming problems we have seen before. You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. 
If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism): - reduce the MDS cache memory limit to force recall of caps much earlier than now - reduce client cach size - set "osd_op_queue_cut_off=high" if not already done so, I think this requires restart of OSDs, so be careful At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required. If you have good SSDs, you could try to provide temporarily some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward. Harder measures: - stop all I/O from the FS clients, throw users out if necessary - ideally, try to cleanly (!) shut down clients or force trimming the cache by either * umount or * sync; echo 3 > /proc/sys/vm/drop_caches Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time. At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions. My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems
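The queue settings Frank recommends are applied roughly like this (either in ceph.conf under [global] or, on Nautilus, via the config database; OSD/MDS restarts are needed for the change to take effect, as noted above):

  # ceph.conf, [global] section:
  #   osd_op_queue = wpq
  #   osd_op_queue_cut_off = high

  # Or via the monitor config database:
  ceph config set global osd_op_queue wpq
  ceph config set global osd_op_queue_cut_off high

  # Verify what a running daemon uses (run on that daemon's host).
  ceph daemon osd.27 config get osd_op_queue_cut_off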
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
I was also wondering if setting mds dump cache after rejoin could help ? Le 05/06/2020 à 12:49, Frank Schilder a écrit : Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories). How many rsync processes are you running in parallel? Do you have these settings enabled: osd_op_queue=wpq osd_op_queue_cut_off=high WPQ should be default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind trimming problems we have seen before. You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism): - reduce the MDS cache memory limit to force recall of caps much earlier than now - reduce client cach size - set "osd_op_queue_cut_off=high" if not already done so, I think this requires restart of OSDs, so be careful At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required. If you have good SSDs, you could try to provide temporarily some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward. Harder measures: - stop all I/O from the FS clients, throw users out if necessary - ideally, try to cleanly (!) shut down clients or force trimming the cache by either * umount or * sync; echo 3 > /proc/sys/vm/drop_caches Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time. At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions. My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again. Hope that helps. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 11:42:42 To: ceph-users Subject: [ceph-users] mds behind on trimming - replay until memory exhausted Hi all, We have a ceph nautilus cluster (14.2.8) with two cephfs filesystem and 3 mds (1 active for each fs + one failover). We are transfering all the datas (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3). On the old FS we first removed the snapshots (to avoid strays problems when removing files) and the ran some rsync deleting the files after the transfer. The operation should last a few weeks more to complete. But few days ago, we started to have some warning mds behind on trimming from the mds managing the old FS. Yesterday, I restarted the active mds service to force the takeover by the standby mds (basically because the standby is more powerfull and have more memory, i.e 48GB over 32). The standby mds took the rank 0 and started to replay... the mds behind on trimming came back and the number of segments rised as well as the memory usage of the server. Finally, it exhausted the memory of the mds and the service stopped and the previous mds took rank 0 and started to replay... until memory exhaustion and a new switch of mds etc... It thus seems that we are in a never ending loop ! 
And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked. I stopped all the rsync and unmount the clients. My questions are : - Does the mds trim during the replay so we could hope that after a while it will purge everything and the mds will be able to become active at the end ? - Is there a way to accelerate the operation or to fix this situation ? Thanks for you help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
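The "harder measures" above translate to roughly the following on each client, one client at a time (the mount point is just an example; these commands can hang for a long while, as Frank warns):

  # Flush dirty data and ask the kernel to drop caches (and with them CephFS caps).
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # Or unmount cleanly (no -f, no lazy unmount) so the MDS can drop the session.
  umount /mnt/cephfs

  # On the MDS side, check which sessions are still present.
  ceph tell mds.lpnceph-mds02.in2p3.fr client ls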
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi, Thanks for your answer. I have : osd_op_queue=wpq osd_op_queue_cut_off=low I can try to set osd_op_queue_cut_off to high, but it will be useful only if the mds get active, true ? For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB which seems reasonable for a mds server with 32/48GB). I already force the clients to unmount (and even rebooted the ones from which the rsync and the rmdir .snaps were launched). I think that providing more swap maybe the solution ! I will try that if I cannot find another way to fix it. F. Le 05/06/2020 à 12:49, Frank Schilder a écrit : Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories). How many rsync processes are you running in parallel? Do you have these settings enabled: osd_op_queue=wpq osd_op_queue_cut_off=high WPQ should be default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind trimming problems we have seen before. You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism): - reduce the MDS cache memory limit to force recall of caps much earlier than now - reduce client cach size - set "osd_op_queue_cut_off=high" if not already done so, I think this requires restart of OSDs, so be careful At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required. If you have good SSDs, you could try to provide temporarily some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward. Harder measures: - stop all I/O from the FS clients, throw users out if necessary - ideally, try to cleanly (!) shut down clients or force trimming the cache by either * umount or * sync; echo 3 > /proc/sys/vm/drop_caches Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time. At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions. My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again. Hope that helps. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 11:42:42 To: ceph-users Subject: [ceph-users] mds behind on trimming - replay until memory exhausted Hi all, We have a ceph nautilus cluster (14.2.8) with two cephfs filesystem and 3 mds (1 active for each fs + one failover). We are transfering all the datas (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3). On the old FS we first removed the snapshots (to avoid strays problems when removing files) and the ran some rsync deleting the files after the transfer. The operation should last a few weeks more to complete. But few days ago, we started to have some warning mds behind on trimming from the mds managing the old FS. 
Yesterday, I restarted the active mds service to force the takeover by the standby mds (basically because the standby is more powerfull and have more memory, i.e 48GB over 32). The standby mds took the rank 0 and started to replay... the mds behind on trimming came back and the number of segments rised as well as the memory usage of the server. Finally, it exhausted the memory of the mds and the service stopped and the previous mds took rank 0 and started to replay... until memory exhaustion and a new switch of mds etc... It thus seems that we are in a never ending loop ! And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked. I stopped all the rsync and unmount the clients. My questions are : - Does the mds trim during the replay so we could hope that after a while it will purge everything and the mds will be able to become active at the end ? - Is there a way to accelerate the operation or to fix this situation ? Thanks for you help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
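For reference, the cache limit discussed above is set and inspected like this (8 GiB is the value from this thread; Frank's rule of thumb is to keep it below roughly half of the physical RAM):

  # Set the MDS cache target to 8 GiB.
  ceph config set mds mds_cache_memory_limit 8589934592

  # Check what the running MDS actually uses (run on the MDS host).
  ceph daemon mds.lpnceph-mds02.in2p3.fr cache status
  ceph daemon mds.lpnceph-mds02.in2p3.fr config get mds_cache_memory_limit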
[ceph-users] mds behind on trimming - replay until memory exhausted
Hi all, We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and 3 mds (1 active for each fs + one failover). We are transferring all the data (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3). On the old FS we first removed the snapshots (to avoid stray problems when removing files) and then ran some rsync jobs, deleting the files after the transfer. The operation should last a few weeks more to complete. But a few days ago, we started to get "mds behind on trimming" warnings from the mds managing the old FS. Yesterday, I restarted the active mds service to force a takeover by the standby mds (basically because the standby is more powerful and has more memory, i.e. 48GB vs 32). The standby mds took rank 0 and started to replay... the "mds behind on trimming" warning came back and the number of segments rose, as did the memory usage of the server. Finally, it exhausted the memory of the mds, the service stopped, the previous mds took rank 0 and started to replay... until memory exhaustion and another mds switch, etc. It thus seems that we are in a never-ending loop! And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked. I stopped all the rsync jobs and unmounted the clients. My questions are : - Does the mds trim during replay, so we could hope that after a while it will purge everything and become active in the end ? - Is there a way to accelerate the operation or to fix this situation ? Thanks for your help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
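A way to keep an eye on the state of the MDS and the trimming backlog while it loops (daemon name as in the rest of the thread; to my knowledge the segment count shows up under the mds_log perf counters, but treat that as an assumption):

  # Which rank is in replay, which MDS is standby.
  ceph fs status
  ceph health detail | grep -A 2 TRIM

  # Segment backlog of the (re)playing MDS, via its admin socket.
  ceph daemon mds.lpnceph-mds04.in2p3.fr perf dump mds_log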
[ceph-users] Remove or recreate damaged PG in erasure coding pool
Hello, We run nautilus 14.2.8 ceph cluster. After a big crash in which we lost some disks we had a PG down (Erasure coding 3+2 pool) and trying to fix it we followed this https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1 As the PG was reported with 0 objects we first marked a shard as complete with ceph-objectstore-tool and restart the osd. The pg thus went active but reported lost objects ! As we consider the datas on this pg as lost, we try to get rid of this with ceph pg 30.3 mark_unfound_lost delete. This produced some logs like (~3 lines/hour): 2020-05-12 14:45:05.251830 osd.103 (osd.103) 886 : cluster [ERR] 30.3s0 Unexpected Error: recovery ending with 41: {30:c000e27d:::rbd_data.34.c963b6314efb84.0 100:head=435293'2 flags = delete,30:c01f1248:::rbd_data.34.7f0c0d1df22f45.0325:head=435293'3 flags = delete,30:c05e82b2:::rbd_data.34.674d063bdc66d2.0 015:head=435293'4 flags = delete,30:c0b2d8e7:::rbd_data.34.6bc88749c741cb.07d0:head=435293'5 flags = delete,30:c0c3e20e:::rbd_data.34.674d063b dc66d2.00fb:head=435293'6 flags = delete,30:c0c89740:::rbd_data.34.a7f2202210bb39.0bbc:head=435293'7 flags = delete,30:c0e59ffa:::rbd_data.34. 7f0c0d1df22f45.02fb:head=435293'8 flags = delete,30:c0e72bf4:::rbd_data.34.7f0c0d1df22f45.00fa:head=435293'9 flags = delete,30:c10ab507:::rbd_ data.34.80695c646d9535.0327:head=435293'10 flags = delete,30:c219e412:::rbd_data.34.a7f2202210bb39.0fa0:head=435293'11 flags = delete,30:c29ae ba3:::rbd_data.34.8038585a0eb9f6.0eb2:head=435293'12 flags = delete,30:c29fae09:::rbd_data.34.674d063bdc66d2.148a:head=435293'13 flags = delet e,30:c2b77a99:::rbd_data.34.7f0c0d1df22f45.031d:head=435293'14 flags = delete,30:c2c8598f:::rbd_data.34.674d063bdc66d2.02f5:head=435293'15 fla gs = delete,30:c2dd39fe:::rbd_data.34.6494fb1b0f88bf.030b:head=435293'16 flags = delete,30:c2f6ce39:::rbd_data.34.806ab864459ae5.0109:head=435 293'17 flags = delete,30:c2f8a62f:::rbd_data.34.ed0c58ebdc770f.002a:head=435293'18 flags = delete,30:c306cd86:::rbd_data.34.ed0c58ebdc770f.020 5:head=435293'19 flags = delete,30:c30f5230:::rbd_data.34.7f0c0d1df22f45.02f5:head=435293'20 flags = delete,30:c32b81df:::rbd_data.34.c79f6d1f78a707.0 100:head=435293'21 flags = delete,30:c3374080:::rbd_data.34.7f217e33dd742c.07d0:head=435293'22 flags = delete,30:c3cdbeb5:::rbd_data.34.674dcefe97 f606.0109:head=435293'23 flags = delete,30:c3cdd149:::rbd_data.34.674dcefe97f606.0019:head=435293'24 flags = delete,30:c40946c0:::rbd_data.34. 
ded8d21a9d3d8f.02a8:head=435293'25 flags = delete,30:c42ed4fd:::rbd_data.34.a6985314ad8dad.0200:head=435293'26 flags = delete,30:c483a99b:::rb d_data.34.ed0c58ebdc770f.0a00:head=435293'27 flags = delete,30:c49f09d6:::rbd_data.34.7e1c1abf436885.0bb8:head=435293'28 flags = delete,30:c51 5a4e8:::rbd_data.34.ed0c58ebdc770f.0106:head=435293'29 flags = delete,30:c5181a8e:::rbd_data.34.9385d45172fa0f.020c:head=435293'30 flags = del ete,30:c531de44:::rbd_data.34.6bc88749c741cb.0102:head=435293'31 flags = delete,30:c5427518:::rbd_data.34.806ab864459ae5.06db:head=435293'32 f lags = delete,30:c5693b53:::rbd_data.34.6494fb1b0f88bf.148a:head=435293'33 flags = delete,30:c5804bc9:::rbd_data.34.ed0cb8730e020c.0105:head=4 35293'34 flags = delete,30:c598117e:::rbd_data.34.7f0811fbac0b9d.0327:head=435293'35 flags = delete,30:c5a64fbd:::rbd_data.34.c963b6314efb84.0 010:head=435293'36 flags = delete,30:c5f9e0e5:::rbd_data.34.ed0c58ebdc770f.0f01:head=435293'37 flags = delete,30:c5ffe1d8:::rbd_data.34.6bc88749c741cb.000 00abe:head=435293'38 flags = delete,30:c6ecfaa1:::rbd_data.34.9385d45172fa0f.0002:head=435293'39 flags = delete,30:c70f:::rbd_data.34.6494fb1b 0f88bf.0106:head=435293'40 flags = delete,30:c7a730f4:::rbd_data.34.7f217e33dd742c.06e1:head=435293'41 flags = delete,30:c7aa79f7:::rbd_data.3 4.674dcefe97f606.0108:head=435293'42 flags = delete} But yesterday it started to flood the logs (~9 GB of logs/day !) with lines like : 2020-05-14 10:36:03.851258 osd.29 [ERR] Error -2 reading object 30:c24a0173:::rbd_data.34.806ab864459ae5.022d:head 2020-05-14 10:36:03.851333 osd.29 [ERR] Error -2 reading object 30:c4a41972:::rbd_data.34.6bc88749c741cb.0320:head 2020-05-14 10:36:03.851382 osd.29 [ERR] Error -2 reading object 30:c543da6f:::rbd_data.34.80695c646d9535.0dce:head 2020-05-14 10:36:03.859900 osd.29 [ERR] Error -2 reading object 30:c24a0173:::rbd_data.34.806ab864459ae5.
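For context, the commands involved in the sequence described above are roughly the following (pg id is the one from this mail; the ceph-objectstore-tool mark-complete step that preceded it is sketched under the "Recover datas from pg incomplete" message below):

  # See what the PG considers unfound/missing.
  ceph pg 30.3 query
  ceph pg 30.3 list_unfound

  # Give up on the unfound objects (this is what produced the 'delete' log lines above).
  ceph pg 30.3 mark_unfound_lost delete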
[ceph-users] Recover datas from pg incomplete
Hi, After a major crash in which we lost few osds, we are stucked with incomplete pgs. At first, peering was blocked with peering_blocked_by_history_les_bound. Thus we set osd_find_best_info_ignore_history_les true for all osds involved in the pg and set the primary osd down to force repeering. It worked for one pg which is in a replica 3 pool, but for the 2 others pgs which are in a erasurce coding (3+2) pool, it didn't worked... and the pgs are still incomplete. We know that we will have data lost, but we would like to minimize it and save as much as possible. Also because this pg is part of the data pool of a cephfs filesystem and it seems that files are spread among a lot of pgs and loosing objects in a pg of the datapool means the loss of a huge number of files ! According to https://www.spinics.net/lists/ceph-devel/msg41665.html a way would be to : - stop each osd involved in that pg - export the shards with ceph-objectstore-tool - compare the size of the shards and select the biggest one (alternatively maybe we can also look at the num_objects returned by ceph pg query ?) - Mark it as complete - restart the osd - Wait for recover and finally get rid of the missing objects with ceph pg 10.2 mark_unfound_lost delete But on this other source https://github.com/TheJJ/ceph-cheatsheet/blob/master/README.md or here https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1 it's suggested to remove the other parts (but I am not sure these threads are really related to EC pools). Could you confirm that we could follow this procedure (or correct it or suggests anything else) ? Thanks for your advices. F. PS: Here is a part of the ceph pg 10.2 query return : "state": "incomplete", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 434321, "up": [ 78, 105, 90, 4, 41 ], "acting": [ 78, 105, 90, 4, 41 ], "info": { "pgid": "10.2s0", "state": "incomplete", "last_peered": "2020-04-22 09:58:42.505638", "last_became_peered": "2020-04-20 11:06:07.701833", "num_objects": 161314, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 161314, "num_objects_recovered": 1290285, "peer_info": [ "peer": "4(3)", "pgid": "10.2s3", "state": "active+undersized+degraded+remapped+backfilling", "last_peered": "2020-04-25 13:25:12.860435", "last_became_peered": "2020-04-22 10:45:45.520125", "num_objects": 162869, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 85071, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 162869, "num_objects_recovered": 1368082, "peer": "9(2)", "pgid": "10.2s2", "state": "down", "last_peered": "2020-04-25 13:25:12.860435", "last_became_peered": "2020-04-22 10:45:45.520125", "num_objects": 162869, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 162869, "num_objects_recovered": 1368082, "peer": "41(4)", "pgid": "10.2s4", "state": "unknown", "last_peered": "0.00", "last_became_peered": "0.00", "num_objects": 0, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 0, "num_objects_recovered": 0, "peer": "46(4)", "pgid": "10.2s4", "state": "down", "last_peered": "2020-04-25 13:25:12.860435", "last_became_peered": "2020-04-22 10:45:45.520125", 
"num_objects": 162869, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 162869,
[ceph-users] pg incomplete blocked by destroyed osd
Hi all, During a crash disaster we destroyed a few osds and recreated them with different ids. As an example, osd 3 was destroyed and recreated with id 101 by running: ceph osd purge 3 --yes-i-really-mean-it + ceph osd create (to block id 3) + ceph-deploy osd create --data /dev/sdxx and finally ceph osd rm 3. Some of our pgs are now incomplete (which can be understood) but blocked by some of the removed osds. For example, here is a part of the ceph pg 30.3 query output: { "state": "incomplete", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 384075, "up": [ 103, 43, 29, 2, 66 ], "acting": [ 103, 43, 29, 2, 66 ], "peer_info": [ { "peer": "2(3)", "pgid": "30.3s3", "last_update": "373570'105925965", "last_complete": "373570'105925965", ... }, "up": [ 103, 43, 29, 2, 66 ], "acting": [ 103, 43, 29, 2, 66 ], "avail_no_missing": [], "object_location_counts": [], "blocked_by": [ 3, 49 ], "down_osds_we_would_probe": [ 3 ], "peering_blocked_by": [], "peering_blocked_by_detail": [ { "detail": "peering_blocked_by_history_les_bound" } ] I don't understand why the removed osds are still considered and present in the pg info. Is there a way to get rid of that ? Moreover, we have tons of slow ops (more than 15 000), but I guess the two problems are linked. Thanks for your help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
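The workaround that got the replicated pg past peering_blocked_by_history_les_bound in the previous message would look roughly like this for pg 30.3 (OSD ids from the acting set above; ignoring the les/bound check can select an older copy of the data, so set the option back to false afterwards):

  # Show what is blocking peering.
  ceph pg 30.3 query | grep -A 4 blocked_by

  # Let the OSDs in the acting set ignore the last-epoch-started check ...
  ceph config set osd.103 osd_find_best_info_ignore_history_les true
  # (repeat for 43, 29, 2 and 66; if it does not seem to apply, restart those OSDs)

  # ... and force re-peering by bouncing the primary.
  ceph osd down 103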
[ceph-users] repairing osd rocksdb
Hi, We had a major crash which ended with ~1/3 of our osds down. Trying to fix it we reinstalled a few of the down osds (that was a mistake, I agree) and destroyed the data on them. Finally, we could fix the problem (thanks to Igor Fedotov) and restart almost all of our osds except one, for which the rocksdb seems corrupted (at least for one file). Unfortunately, we now have 4 pgs down (all involving the dead osd) and 8 pgs incomplete (some of them also involving the down osd). Before considering data loss, we would like to try to restart the down osd, hoping to recover the down pgs and maybe some of the incomplete ones. Does someone have an idea on how to do that (maybe by removing the file corrupting the rocksdb, or forcing it to ignore the data in error) ? If it's not possible, how can we fix (even with data loss) the down and incomplete pgs ? Thanks for your advice. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
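Before writing the OSD off completely, the usual things to try on a corrupted RocksDB are along these lines, with the OSD stopped (osd.74 is just a placeholder id here; fsck/repair are safe to attempt but may not recover anything, and destructive-repair, where available, can itself lose data):

  systemctl stop ceph-osd@74

  # Check and, if possible, repair the BlueStore metadata / RocksDB.
  ceph-bluestore-tool fsck   --path /var/lib/ceph/osd/ceph-74
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-74

  # RocksDB-level compaction, and (last resort) destructive repair.
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 destructive-repair

  systemctl start ceph-osd@74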
[ceph-users] Re: ceph crash hangs forever and recovery stop
Is there a way to purge the crashes ? For example, is it safe and sufficient to delete everything in /var/lib/ceph/crash on the nodes ? F. Le 30/04/2020 à 17:14, Paul Emmerich a écrit : Best guess: the recovery process doesn't really stop, but it's just that the mgr is dead and it no longer reports the progress. And yeah, I can confirm that having a huge number of crash reports is a problem (had a case where a monitoring script crashed due to a radosgw-admin bug... lots of crash reports) Paul -- Paul Emmerich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
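When the crash module itself still responds, the supported way to clean up is through it rather than by deleting files by hand; on Nautilus that is roughly:

  ceph crash ls
  ceph crash archive-all      # hide current crashes from the health warning
  ceph crash prune 0          # remove reports older than 0 days, i.e. all of them

If the crash commands hang, as in this thread, the reports live under /var/lib/ceph/crash on each node, which is what the question above about deleting them directly refers to.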
[ceph-users] ceph crash hangs forever and recovery stop
Hi everybody (again), We recently had a lot of osd crashes (more than 30 osds crashed). This is now fixed, but it triggered a huge rebalancing+recovery. More or less at the same time, we noticed that ceph crash ls (or any other ceph crash command) hangs forever and never returns. And finally, the recovery process stops regularly (after ~1 hour) but can be restarted by restarting the mgr daemon (systemctl restart ceph-mgr.target on the active manager). There is nothing in the logs (the manager still works, the service is up, the dashboard is accessible, but the recovery simply stops). We also tried to reboot the managers, but it doesn't solve the problem. I guess these two problems are linked, but I'm not sure. Does anybody have a clue ? Thanks. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osd crashing and rocksdb corruption
Thanks again for your reactivity and your advices. You saved our lives ! We reactivate recovery/backfilling/rebalancing and it starts the recovery. We now have to wait to see how it will evolve. Last question : We noticed (a few days ago and it still occurs) that after ~1h the recovery was stopping (Recovery Throughput drop to 0). We could restart it by restarting the ceph-mgr.target on the active manager... for another ~1 hour ! It's strange because I cannot see any crash or relevant info in the logs ! Moreover the ceph crash command hangs and no way to get output. Maybe it's because of the huge number of failures on the osds ! Do you think that this two problems could be related to the osd crashing ? I will continue to investigate and maybe open a new different thread on this topic. F. Le 30/04/2020 à 10:57, Igor Fedotov a écrit : I created the following ticket and PR to track/fix the issue with incomplete large writes when bluefs_buffereed_io=1. https://tracker.ceph.com/issues/45337 https://github.com/ceph/ceph/pull/34836 But In fact setting bluefs_buffered_io to false is the mainstream for now, see https://github.com/ceph/ceph/pull/34224 Francois, you can proceed with OSD.21 &.49 I can reproduce the issue locally hence no much need in them now. Still investigating what's happening with OSD.8... As for reactivating recovery/backfill/rebalancing - I can say for sure whether it's safe or not. Thanks, Igor On 4/30/2020 1:39 AM, Francois Legrand wrote: Hello, We set bluefs_buffered_io to false for the whole cluster except 2 osd (21 and 49) for which we decided to keep the value to true for future experiments/troubleshooting as you asked. We then restarted all the 25 downs osd and they started... except one (number 8) which still continue to crash with the same kind of errors. I tryed a fsck on this osd which ended by a success. I set the debug to 20 and recorded the logs. You will find the logs there if you want to have a look : https://we.tl/t-GDvvvi2Gmm Now we plan to reactivate the recovery, backfill and rebalancing if you think it's safe. F. Le 29/04/2020 à 16:45, Igor Fedotov a écrit : So the crash seems to be caused by the same issue - big (and presumably incomplete) write and subsequent read failure. I've managed to repro this locally. So bluefs_buffered_io seems to be a remedy for now. But additionally I can observe multiple slow ops indications in this new log and I think they cause those big writes. And I presume some RGW-triggered ops are in flight - bucket resharding or removal or something.. I've seen multiple reports about this stuff causing OSD slowdown, high memory utilization and finally huge reads and/or writes from RocksDB. Don't know how to deal with this at the moment... Thanks, Igor On 4/29/2020 5:33 PM, Francois Legrand wrote: Here are the logs of the newly crashed osd. F. Le 29/04/2020 à 16:21, Igor Fedotov a écrit : Sounds interesting - could you please share the crash log for these new OSDs? They presumably suffer from another issue. At least that first crash is caused by something else. "bluefs buffered io" can be injected on the fly but I expect it to help when OSD isn't starting up only. On 4/29/2020 5:17 PM, Francois Legrand wrote: Ok we will try that. Indeed, restarting osd.5 triggered the falling down of two other osds in the cluster. Thus we will set bluefs buffered io = false for all osds and force bluefs buffered io = true for one of the downs osds. 
Is that modification needs to use injectargs or changing it in the configuration is enougth to have it applied on the fly ? F. Le 29/04/2020 à 15:56, Igor Fedotov a écrit : That's bluefs buffered io = false which did the trick. It modified write path and this presumably has fixed large write(s). Trying to reproduce locally but please preserve at least one failing OSD (i.e. do not start it with the disabled buffered io) for future experiments/troubleshooting for a while if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with theses options. The osd is now up since 10mn without crashing (before it was rebooting after ~1mn). F. Le 29/04/2020 à 15:16, Igor Fedotov a écrit : Hi Francois, I'll write a more thorough response a bit later. Meanwhile could you please try OSD startup with the following settings now: debug-bluefs abd debug-bdev = 20 bluefs sync write = false bluefs buffered io = false Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did : First, as other osd were falling down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid other crashs ! Then we moved to your recommandations (still testing on osd 5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev
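For completeness, reactivating recovery/backfilling/rebalancing is just the reverse of the flags set earlier in this thread, followed by watching progress:

  ceph osd unset pause
  ceph osd unset nobackfill
  ceph osd unset norebalance
  ceph osd unset norecover

  ceph -s        # overall state
  ceph -w        # follow recovery progress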
[ceph-users] Re: osd crashing and rocksdb corruption
Hello, We set bluefs_buffered_io to false for the whole cluster except 2 osd (21 and 49) for which we decided to keep the value to true for future experiments/troubleshooting as you asked. We then restarted all the 25 downs osd and they started... except one (number 8) which still continue to crash with the same kind of errors. I tryed a fsck on this osd which ended by a success. I set the debug to 20 and recorded the logs. You will find the logs there if you want to have a look : https://we.tl/t-GDvvvi2Gmm Now we plan to reactivate the recovery, backfill and rebalancing if you think it's safe. F. Le 29/04/2020 à 16:45, Igor Fedotov a écrit : So the crash seems to be caused by the same issue - big (and presumably incomplete) write and subsequent read failure. I've managed to repro this locally. So bluefs_buffered_io seems to be a remedy for now. But additionally I can observe multiple slow ops indications in this new log and I think they cause those big writes. And I presume some RGW-triggered ops are in flight - bucket resharding or removal or something.. I've seen multiple reports about this stuff causing OSD slowdown, high memory utilization and finally huge reads and/or writes from RocksDB. Don't know how to deal with this at the moment... Thanks, Igor On 4/29/2020 5:33 PM, Francois Legrand wrote: Here are the logs of the newly crashed osd. F. Le 29/04/2020 à 16:21, Igor Fedotov a écrit : Sounds interesting - could you please share the crash log for these new OSDs? They presumably suffer from another issue. At least that first crash is caused by something else. "bluefs buffered io" can be injected on the fly but I expect it to help when OSD isn't starting up only. On 4/29/2020 5:17 PM, Francois Legrand wrote: Ok we will try that. Indeed, restarting osd.5 triggered the falling down of two other osds in the cluster. Thus we will set bluefs buffered io = false for all osds and force bluefs buffered io = true for one of the downs osds. Is that modification needs to use injectargs or changing it in the configuration is enougth to have it applied on the fly ? F. Le 29/04/2020 à 15:56, Igor Fedotov a écrit : That's bluefs buffered io = false which did the trick. It modified write path and this presumably has fixed large write(s). Trying to reproduce locally but please preserve at least one failing OSD (i.e. do not start it with the disabled buffered io) for future experiments/troubleshooting for a while if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with theses options. The osd is now up since 10mn without crashing (before it was rebooting after ~1mn). F. Le 29/04/2020 à 15:16, Igor Fedotov a écrit : Hi Francois, I'll write a more thorough response a bit later. Meanwhile could you please try OSD startup with the following settings now: debug-bluefs abd debug-bdev = 20 bluefs sync write = false bluefs buffered io = false Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did : First, as other osd were falling down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid other crashs ! 
Then we moved to your recommandations (still testing on osd 5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran : ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 it ended with fsck success It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to : [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the osd. It crashed ! We tryed to change explicitely bluefs sync write = false and restarted... same result ! The logs are here : https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem ? Do you thing that removing this pool can help ? Thanks again for your expertise. F. Le 28/04/2020 à 18:52, Igor Fedotov a écrit : Short update - please treat bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francious, here are some observations got from your log. 1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 986351839 0377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e8
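To answer the injectargs question concretely, both of the following work for bluefs_buffered_io, with Igor's caveat above that for an OSD which crashes at startup the value has to be in place before it starts (so injection into an already running OSD will not help there):

  # Persistent: ceph.conf under [osd] (or [osd.21] for a per-OSD override),
  # or the config database on Nautilus:
  ceph config set osd bluefs_buffered_io false
  ceph config set osd.21 bluefs_buffered_io true

  # On-the-fly injection into running OSDs:
  ceph tell osd.* injectargs '--bluefs_buffered_io=false'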
[ceph-users] Re: osd crashing and rocksdb corruption
Here are the logs of the newly crashed OSD. F. On 29/04/2020 at 16:21, Igor Fedotov wrote: Sounds interesting - could you please share the crash log for these new OSDs? They presumably suffer from another issue. At least that first crash is caused by something else. "bluefs buffered io" can be injected on the fly, but I expect it to help only when the OSD isn't starting up. On 4/29/2020 5:17 PM, Francois Legrand wrote: Ok, we will try that. Indeed, restarting osd.5 triggered two other OSDs in the cluster to go down. Thus we will set bluefs buffered io = false for all OSDs and force bluefs buffered io = true for one of the down OSDs. Does that modification need to use injectargs, or is changing it in the configuration enough to have it applied on the fly? F. On 29/04/2020 at 15:56, Igor Fedotov wrote: That's the bluefs buffered io = false setting which did the trick. It modified the write path and this presumably has fixed the large write(s). Trying to reproduce locally, but please preserve at least one failing OSD (i.e. do not start it with buffered io disabled) for future experiments/troubleshooting for a while, if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with these options. The OSD has now been up for 10 min without crashing (before, it was restarting after ~1 min). F. On 29/04/2020 at 15:16, Igor Fedotov wrote: Hi Francois, I'll write a more thorough response a bit later. Meanwhile, could you please try OSD startup with the following settings now: debug-bluefs and debug-bdev = 20, bluefs sync write = false, bluefs buffered io = false. Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did: First, as other OSDs were going down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid further crashes! Then we moved on to your recommendations (still testing on osd.5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran: ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 It ended with fsck success. It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to: [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the OSD. It crashed! We then tried to explicitly set bluefs sync write = false and restarted... same result! The logs are here: https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem? Do you think that removing this pool could help? Thanks again for your expertise. F. On 28/04/2020 at 18:52, Igor Fedotov wrote: Short update - please use the bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported, in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francois, here are some observations from your log.
1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) 3) and failed read happens to the end (0xc496f8e4~35, last 0x35 bytes) of this huge (3+GB) file: -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35 -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53 4) This .sst file was created from scratch shortly before with a single-shot 3+GB write: -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1 -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) 5) Presumably Rocks
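As a quick sanity check of the sizes quoted in this log (plain shell arithmetic, nothing Ceph-specific), the hex file size does put the .sst file just above the 2 GiB (2^31) boundary that large-I/O bugs typically trip over:

printf '%d\n' 0xc496f919                       # 3298228505 bytes
echo $(( 0xc496f919 / 1024 / 1024 / 1024 ))    # 3  -> roughly 3.07 GiB
echo $(( 0xc496f919 > 2**31 ))                 # 1  -> larger than 2 GiB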
[ceph-users] Re: osd crashing and rocksdb corruption
Ok, we will try that. Indeed, restarting osd.5 triggered two other OSDs in the cluster to go down. Thus we will set bluefs buffered io = false for all OSDs and force bluefs buffered io = true for one of the down OSDs. Does that modification need to use injectargs, or is changing it in the configuration enough to have it applied on the fly? F. On 29/04/2020 at 15:56, Igor Fedotov wrote: That's the bluefs buffered io = false setting which did the trick. It modified the write path and this presumably has fixed the large write(s). Trying to reproduce locally, but please preserve at least one failing OSD (i.e. do not start it with buffered io disabled) for future experiments/troubleshooting for a while, if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with these options. The OSD has now been up for 10 min without crashing (before, it was restarting after ~1 min). F. On 29/04/2020 at 15:16, Igor Fedotov wrote: Hi Francois, I'll write a more thorough response a bit later. Meanwhile, could you please try OSD startup with the following settings now: debug-bluefs and debug-bdev = 20, bluefs sync write = false, bluefs buffered io = false. Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did: First, as other OSDs were going down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid further crashes! Then we moved on to your recommendations (still testing on osd.5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran: ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 It ended with fsck success. It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to: [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the OSD. It crashed! We then tried to explicitly set bluefs sync write = false and restarted... same result! The logs are here: https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem? Do you think that removing this pool could help? Thanks again for your expertise. F. On 28/04/2020 at 18:52, Igor Fedotov wrote: Short update - please use the bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported, in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francois, here are some observations from your log.
1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) 3) and failed read happens to the end (0xc496f8e4~35, last 0x35 bytes) of this huge (3+GB) file: -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35 -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53 4) This .sst file was created from scratch shortly before with a single-shot 3+GB write: -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1 -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) 5) Presumably RocksDB creates this file in an attempt to recover/compact/process another existing file (ino 52405) which is pretty large as well. Please find multiple earlier reads, the last one: -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c679 extents [1:0x381c822~1,1: The rationale for binding these two
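Regarding the injectargs question quoted above, both paths exist; a hedged sketch only, with osd.5 as the example id (whether a runtime injection of bluefs_buffered_io is actually honoured without a restart is exactly the point Igor is cautious about):

# runtime injection into a running OSD (may have no effect for options
# that are only read at startup)
ceph tell osd.5 injectargs '--bluefs_buffered_io=false'

# persistent setting: either under [osd] / [osd.5] in /etc/ceph/ceph.conf,
# or via the centralized config store, followed by a daemon restart
ceph config set osd bluefs_buffered_io false
systemctl restart ceph-osd@5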
[ceph-users] Re: osd crashing and rocksdb corruption
Hi, It seems much better with these options. The OSD has now been up for 10 min without crashing (before, it was restarting after ~1 min). F. On 29/04/2020 at 15:16, Igor Fedotov wrote: Hi Francois, I'll write a more thorough response a bit later. Meanwhile, could you please try OSD startup with the following settings now: debug-bluefs and debug-bdev = 20, bluefs sync write = false, bluefs buffered io = false. Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did: First, as other OSDs were going down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid further crashes! Then we moved on to your recommendations (still testing on osd.5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran: ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 It ended with fsck success. It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to: [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the OSD. It crashed! We then tried to explicitly set bluefs sync write = false and restarted... same result! The logs are here: https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem? Do you think that removing this pool could help? Thanks again for your expertise. F. On 28/04/2020 at 18:52, Igor Fedotov wrote: Short update - please use the bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported, in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francois, here are some observations from your log.
1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) 3) and failed read happens to the end (0xc496f8e4~35, last 0x35 bytes) of this huge (3+GB) file: -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35 -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53 4) This .sst file was created from scratch shortly before with a single-shot 3+GB write: -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1 -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) 5) Presumably RocksDB creates this file in an attempt to recover/compact/process another existing file (ino 52405) which is pretty large as well. Please find multiple earlier reads, the last one: -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c679 extents [1:0x381c822~1,1: The rationale for binding these two files is their pretty uncommon file sizes. So you have a 3+GB single-shot BlueFS write and an immediate read from the end of the written extent which returns an unexpected magic. It's well known in the software world that large (2+GB) data processing implementations tend to be error-prone. And Ceph is not an exception. Here are a couple of recent examples which are pretty close to your case: https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88 https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527 Although they are already fixed in Nautilus 14.2.8, there might be others present along the write path (including H/W firmware). The good news is that the failure happens on a newly written file (remember the invalid magic is read at the end(
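Since the two commits above are said to be included in Nautilus 14.2.8, a quick way to confirm what the cluster is actually running (standard commands, shown only as a sanity check):

# per-component version summary for the whole cluster
ceph versions
# version of one specific daemon
ceph tell osd.5 version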
[ceph-users] Re: osd crashing and rocksdb corruption
re - this will go via a bit different write path and may provide a workaround. Also please collect debug logs for OSD startup (with both the current and the updated bdev-aio parameter) and --debug-bdev/debug-bluefs set to 20. You can omit the --debug-bluestore increase for now to reduce log size. Thanks, Igor On 4/28/2020 5:16 PM, Francois Legrand wrote: Here is the output of ceph-bluestore-tool bluefs-bdev-sizes: inferring bluefs devices from bluestore path slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17 1 : device size 0x746c000 : own 0x[37e1eb0~4a8290] = 0x4a8290 : using 0x5bc78(23 GiB) The result of the debug-bluestore (and debug-bluefs) set to 20 for osd.5 is at the following address (28MB): https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f20200428141355/5da2ebf0d33750a5fde85bf662cf0e6d20200428141415/55849f?utm_campaign=WT_email_tracking&utm_content=general&utm_medium=download_button&utm_source=notify_recipient_email Thanks for your help. F. On 28/04/2020 at 13:33, Igor Fedotov wrote: Hi Francois, Could you please share an OSD startup log with debug-bluestore (and debug-bluefs) set to 20. Also please run ceph-bluestore-tool's bluefs-bdev-sizes command and share the output. Thanks, Igor On 4/28/2020 12:55 AM, Francois Legrand wrote: Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday 25, when some OSDs in the 5 remaining servers started to crash with apparently no reason. We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100). Looking at the logs, we found for all the crashed OSDs: -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was present a few days before the crash. We also have some OSDs with this error that are still up. We tried to repair with: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea how to fix this, or at least know if it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors? Thanks for your help (we are desperate because we will lose data and are fighting to save something)!!! F.
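One possible way to collect such a startup log is to raise the debug levels for just that OSD and restart it, or to run the daemon once in the foreground. A sketch only, with osd.5 as the example and the command-line form of the debug options being an assumption about the deployment:

# persistent debug levels for this OSD only, then restart and read the log
ceph config set osd.5 debug_bluefs 20
ceph config set osd.5 debug_bdev 20
systemctl restart ceph-osd@5
less /var/log/ceph/ceph-osd.5.log

# alternatively, run it once in the foreground with the options inline
ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph \
    --debug-bluefs=20 --debug-bdev=20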
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osd crashing and rocksdb corruption
Here is the output of ceph-bluestore-tool bluefs-bdev-sizes: inferring bluefs devices from bluestore path slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17 1 : device size 0x746c000 : own 0x[37e1eb0~4a8290] = 0x4a8290 : using 0x5bc78(23 GiB) The result of the debug-bluestore (and debug-bluefs) set to 20 for osd.5 is at the following address (28MB): https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f20200428141355/5da2ebf0d33750a5fde85bf662cf0e6d20200428141415/55849f?utm_campaign=WT_email_tracking&utm_content=general&utm_medium=download_button&utm_source=notify_recipient_email Thanks for your help. F. On 28/04/2020 at 13:33, Igor Fedotov wrote: Hi Francois, Could you please share an OSD startup log with debug-bluestore (and debug-bluefs) set to 20. Also please run ceph-bluestore-tool's bluefs-bdev-sizes command and share the output. Thanks, Igor On 4/28/2020 12:55 AM, Francois Legrand wrote: Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday 25, when some OSDs in the 5 remaining servers started to crash with apparently no reason. We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100). Looking at the logs, we found for all the crashed OSDs: -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was present a few days before the crash. We also have some OSDs with this error that are still up. We tried to repair with: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea how to fix this, or at least know if it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors? Thanks for your help (we are desperate because we will lose data and are fighting to save something)!!! F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] osd crashing and rocksdb corruption
Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday 25, when some OSDs in the 5 remaining servers started to crash with apparently no reason. We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100). Looking at the logs, we found for all the crashed OSDs: -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was present a few days before the crash. We also have some OSDs with this error that are still up. We tried to repair with: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea how to fix this, or at least know if it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors? Thanks for your help (we are desperate because we will lose data and are fighting to save something)!!! F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
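For completeness, the BlueStore-level fsck/repair entry points that exist alongside ceph-kvstore-tool, run with the OSD stopped. They check and repair BlueStore/BlueFS metadata and will not necessarily recover a corrupted RocksDB .sst, so treat them as diagnostics rather than a guaranteed fix; the exact form of the --deep flag should be checked against the tool's help on your version:

systemctl stop ceph-osd@3
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3
ceph-bluestore-tool fsck --deep true --path /var/lib/ceph/osd/ceph-3
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3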
[ceph-users] Re: Changing failure domain
I don't want to remove the cephfs_meta pool, but the cephfs_datapool. To be clear: I currently have a cephfs consisting of a cephfs_metapool and a cephfs_datapool. I want to add a new data pool, cephfs_datapool2, migrate all data from cephfs_datapool to cephfs_datapool2, and then remove the original cephfs_datapool. My goal is to end up with a cephfs made of cephfs_meta and cephfs_datapool2 (i.e. replace the original cephfs_datapool by cephfs_datapool2). But from what I've seen, some "metadata" should also remain in the cephfs_datapool (it sounds weird to me), which would persist after moving the objects and prevent its deletion. F. On 14/01/2020 at 07:54, Konstantin Shalygin wrote: On 1/6/20 5:50 PM, Francois Legrand wrote: I still have a few questions before going on. It seems that some metadata would remain on the original data pool, preventing its deletion (http://ceph.com/geen-categorie/ceph-pool-migration/ and https://www.spinics.net/lists/ceph-users/msg41374.html). Thus, does doing a cp and then an rm of the original files (instead of mv) allow getting rid of the remaining metadata in the original data pool? Is it then possible to remove the original pool after migration (and how, because I guess that I first have to set the default data location to the new pool)? How are snapshots affected (do I have to remove all of them before the operation)? Why do you need to remove the cephfs_meta pool? k ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
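Assuming the copy to cephfs_datapool2 is complete, checking what still references the old pool and then detaching it could look roughly like this (the filesystem name is a placeholder; note that CephFS refuses to remove a filesystem's default/first data pool, which is exactly the limitation being discussed in this thread):

# how many objects are still left in the old data pool?
rados df | grep cephfs_datapool
# which pools does the filesystem currently use?
ceph fs ls
# detach the old data pool from the filesystem, then delete it
ceph fs rm_data_pool <fs_name> cephfs_datapool
ceph osd pool rm cephfs_datapool cephfs_datapool --yes-i-really-really-mean-it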
[ceph-users] Re: Changing failure domain
Thanks again for your answer. I still have a few questions before going on. It seems that some metadata would remain on the original data pool, preventing its deletion (http://ceph.com/geen-categorie/ceph-pool-migration/ and https://www.spinics.net/lists/ceph-users/msg41374.html). Thus, does doing a cp and then an rm of the original files (instead of mv) allow getting rid of the remaining metadata in the original data pool? Is it then possible to remove the original pool after migration (and how, because I guess that I first have to set the default data location to the new pool)? How are snapshots affected (do I have to remove all of them before the operation)? Happy new year. F. On 24/12/2019 at 03:53, Konstantin Shalygin wrote: On 12/19/19 10:22 PM, Francois Legrand wrote: Thus my question is *how can I migrate an EC data pool of a cephfs to another EC pool?* I suggest this: # create your new ec pool # `ceph osd pool application enable ec_new cephfs` # `ceph fs add_data_pool cephfs ec_new` # `setfattr -n ceph.dir.layout -v pool=ec_new /cephfs/ec_migration` And then copy your content via userland tools. k ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
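A rough sketch of the copy step itself, following the suggestion quoted above (directory and file names are examples; ec_new stands for the new data pool):

# new directory whose files will be written to the new pool
mkdir /cephfs/ec_migration
setfattr -n ceph.dir.layout.pool -v ec_new /cephfs/ec_migration
# copy the data, then check that a sample file really lives in the new pool
cp -a /cephfs/olddata/. /cephfs/ec_migration/
getfattr -n ceph.file.layout /cephfs/ec_migration/somefile
# remove the originals only after verification
rm -rf /cephfs/olddata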
[ceph-users] Re: Changing failure domain
Thanks for your advice. I thus created a new replicated rule: { "rule_id": 2, "rule_name": "replicated3over2rooms", "ruleset": 2, "type": 1, "min_size": 3, "max_size": 4, "steps": [ { "op": "take", "item": -1, "item_name": "default" }, { "op": "choose_firstn", "num": 0, "type": "room" }, { "op": "chooseleaf_firstn", "num": 2, "type": "host" }, { "op": "emit" } ] } It works well. Now I am concerned about the erasure-coded pool. The point is that it's the data pool for cephfs (the metadata is in replica 3 and now replicated over our two rooms). For now, the data pool for cephfs is in *erasure coding k=3, m=2* (at the creation of the cluster we had only 5 osd servers). As noted before by Paul Emmerich, this cannot be redundantly split over 2 rooms (as 3 chunks are required to reconstruct the data). Now we have 6 OSD servers, and soon it will be 7, thus I was thinking of creating a new pool (e.g. k=4, m=2 or k=3, m=3) and a rule to split the chunks over our 2 rooms, and of using this new pool as a cache tier to softly migrate all the data from the old pool to the new one. But according to https://documentation.suse.com/ses/6/html/ses-all/ceph-pools.html#pool-migrate-cache-tier "You can use the cache tier method to migrate from a replicated pool to either an erasure coded or another replicated pool. Migrating from an erasure coded pool is not supported." Warning: You Cannot Migrate RBD Images and CephFS Exports to an EC Pool You cannot migrate RBD images and CephFS exports from a replicated pool to an EC pool. EC pools can store data but not metadata. The header object of the RBD will fail to be flushed. The same applies for CephFS. Thus my question is *how can I migrate an EC data pool of a cephfs to another EC pool?* Thanks for your advice. F. On 03/12/2019 at 04:07, Konstantin Shalygin wrote: On 12/2/19 5:56 PM, Francois Legrand wrote: For replica, what is the best way to change the crush rule? Is it to create a new replicated rule, and set this rule as the crush ruleset for the pool (something like ceph osd pool set {pool-name} crush_ruleset my_new_rule)? Indeed. Then you can delete/do what you want with the old crush rule. k ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
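For the new EC pool, a k=3/m=3 layout with 3 shards per room is the variant that can actually survive the loss of a room (k=4/m=2 cannot, since 4 shards would be needed but only 3 would remain). A hedged sketch only, with made-up profile/rule/pool names; the room-aware rule still has to be added to the crushmap by hand, and with a room down only k shards remain, so min_size has to be considered before relying on this for availability:

ceph osd erasure-code-profile set ec33room k=3 m=3 crush-failure-domain=host crush-root=default
ceph osd pool create cephfs_datapool2 256 256 erasure ec33room
ceph osd pool application enable cephfs_datapool2 cephfs

# rule to add to the decompiled crushmap: 3 shards in each of the 2 rooms
# rule ec_3p3_2rooms {
#     id 3
#     type erasure
#     min_size 3
#     max_size 6
#     step set_chooseleaf_tries 5
#     step set_choose_tries 100
#     step take default
#     step choose indep 2 type room
#     step chooseleaf indep 3 type host
#     step emit
# }
ceph osd pool set cephfs_datapool2 crush_rule ec_3p3_2rooms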
[ceph-users] Re: Changing failure domain
Thanks. For replica, what is the best way to change the crush rule? Is it to create a new replicated rule, and set this rule as the crush ruleset for the pool (something like ceph osd pool set {pool-name} crush_ruleset my_new_rule)? For erasure coding, I would thus have to change the profile at least to k=3, m=3 (for now I only have 6 osd servers). But if I am correct, this cannot be changed for an existing pool, and I will have to create a new pool and migrate all data from the current one to the new one. Is that correct? F. On 28/11/2019 at 17:51, Paul Emmerich wrote: Use a crush rule like this for replica: 1) root default class XXX 2) choose 2 rooms 3) choose 2 disks That'll get you 4 OSDs in two rooms, and the first 3 of these get data; the fourth will be ignored. That guarantees that losing a room will lose you at most 2 out of 3 copies. This is for disaster recovery only: it'll guarantee durability if you lose a room, but not availability. 3+2 erasure coding cannot be split across two rooms in this way because, well, you need 3 out of 5 shards to survive, so you cannot lose half of them. Paul ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
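The two-level room/host rule Paul describes cannot be created with ceph osd crush rule create-replicated (which only takes a single failure-domain type), so the usual route is to edit the crushmap directly. A sketch of that workflow (file names are arbitrary; the rule body matches the replicated3over2rooms rule shown elsewhere in this thread):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add a rule along these lines to crushmap.txt:
#   rule replicated3over2rooms {
#       id 2
#       type replicated
#       min_size 3
#       max_size 4
#       step take default
#       step choose firstn 2 type room
#       step chooseleaf firstn 2 type host
#       step emit
#   }
crushtool -c crushmap.txt -o crushmap.new
# optional sanity check of the mappings before injecting the new map
crushtool -i crushmap.new --test --rule 2 --num-rep 3 --show-mappings | head
ceph osd setcrushmap -i crushmap.new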
[ceph-users] Changing failure domain
Hi, I have a cephfs in production based on 2 pools (data+metadata). Data is in erasure coding with the profile: crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=3 m=2 plugin=jerasure technique=reed_sol_van w=8 Metadata is in replicated mode with k=3. The crush rules are as follows: [ { "rule_id": 0, "rule_name": "replicated_rule", "ruleset": 0, "type": 1, "min_size": 1, "max_size": 10, "steps": [ { "op": "take", "item": -1, "item_name": "default" }, { "op": "chooseleaf_firstn", "num": 0, "type": "host" }, { "op": "emit" } ] }, { "rule_id": 1, "rule_name": "ec_data", "ruleset": 1, "type": 3, "min_size": 3, "max_size": 5, "steps": [ { "op": "set_chooseleaf_tries", "num": 5 }, { "op": "set_choose_tries", "num": 100 }, { "op": "take", "item": -1, "item_name": "default" }, { "op": "chooseleaf_indep", "num": 0, "type": "host" }, { "op": "emit" } ] } ] When we installed it, everything was in the same room, but now we have split our cluster (6 servers, soon 8) across 2 rooms. Thus we updated the crushmap by adding a room layer (with ceph osd crush add-bucket room1 room etc.) and moved all our servers in the tree to the correct place (ceph osd crush move server1 room=room1 etc.). Now we would like to change the rules to set the failure domain to room instead of host (to be sure that in case of a disaster in one of the rooms we will still have a copy in the other). What is the best strategy to do this? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
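Once a room-aware rule exists, pointing a pool at it and verifying the change is a one-liner each. A sketch with assumed names (cephfs_metadata as the metadata pool, and replicated3over2rooms matching the rule shown in the replies above):

# list the rules and check which one the pool currently uses
ceph osd crush rule ls
ceph osd pool get cephfs_metadata crush_rule
# switch the pool to the room-aware rule and watch the resulting data movement
ceph osd pool set cephfs_metadata crush_rule replicated3over2rooms
ceph -s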