[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)
Hi,
The last two OSDs I recreated were on December 30 and February 8. I totally agree that SSD caches are a terrible SPOF. I think it's an option if you use 1 SSD/NVMe for 1 or 2 OSDs, but the cost is then very high. Using 1 SSD for 10 OSDs increases the risk for almost no gain, because the SSD is 10 times faster but gets 10 times more accesses! Indeed, we did some benchmarks with NVMe for the WAL/DB (1 NVMe for ~10 OSDs), and the gain was not tremendous, so we decided not to use them!
F.

On 08/03/2022 at 11:57, Boris Behrens wrote:
Hi Francois, thanks for the reminder. We offline compacted all of the OSDs when we reinstalled the hosts with the new OS. But actually reinstalling them was never on my list. I could try that, and in the same go I can remove all the cache SSDs (when one SSD shares the cache for 10 OSDs it is a horrible SPOF) and reuse the SSDs as OSDs for the smaller pools in RGW (like log and meta).
How long ago did you recreate the earliest OSD?
Cheers
Boris

On Tue, 8 Mar 2022 at 10:03, Francois Legrand wrote:
Hi, We also had this kind of problem after upgrading to octopus. Maybe you can play with the heartbeat grace time ( https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/ ) to tell OSDs to wait a little longer before declaring another OSD down! We also tried to fix the problem by manually compacting the down OSD (something like: systemctl stop ceph-osd@74; sleep 10; ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact; systemctl start ceph-osd@74). This worked a few times, but some OSDs went down again, so we simply waited for the data to be reconstructed elsewhere and then reinstalled the dead OSD:
ceph osd destroy 74 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sde --destroy
ceph-volume lvm create --osd-id 74 --data /dev/sde
This seems to have fixed the issue for us (up to now).
F.

On 08/03/2022 at 09:35, Boris Behrens wrote:
Yes, this is something we know, and we disabled it because we ran into the problem that PGs went unavailable when two or more OSDs went offline.
I am searching for the reason WHY this happens. Currently we have set the service file to restart=always and removed the StartLimitBurst from the service file.
We just don't understand why the OSDs don't answer the heartbeat. The OSDs that are flapping are random in terms of host, disk size, and having an SSD block.db or not. Network connectivity issues are something that I would rule out, because the cluster went from "nothing ever happens except IOPS" to "random OSDs are marked DOWN until they kill themselves" with the update from nautilus to octopus.
I am out of ideas and hoped this was a bug in 15.2.15, but after the update things got worse (it happens more often). We tried to:
* disable swap
* more swap
* disable bluefs_buffered_io
* disable write cache for all disks
* disable scrubbing
* reinstall with new OS (from CentOS 7 to Ubuntu 20.04)
* disable cluster_network (so there is only one way to communicate)
* increase txqueuelen on the network interfaces
* everything together

What we try next: add more SATA controllers, so there are not 24 disks attached to a single controller, but I doubt this will help.
Cheers
Boris

On Tue, 8 Mar 2022 at 09:10, Dan van der Ster <dvand...@gmail.com> wrote:
Here's the reason they exit:
7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.00 seconds, shutting down
If an OSD flaps (marked down, then up) 6 times in 10 minutes, it exits. (This is a safety measure.)
It's normally caused by a network issue -- other OSDs are telling the mon that he is down, but then the OSD himself tells the mon that he's up!
Cheers, Dan

On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens wrote:
Hi,
we've had the problem with OSDs marked as offline since we updated to octopus and hoped the problem would be fixed with the latest patch. We have this kind of problem only with octopus, and there only with the big s3 cluster.
* Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
* Network interfaces are 20gbit
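For reference, the two mechanisms discussed in this thread (the heartbeat grace period and the markdown safety limit Dan quotes) are ordinary config options and can be adjusted at runtime with `ceph config set`. The values below are placeholders to show the knobs, not recommendations for any particular cluster:

# Give peers longer before they report an OSD down (default 20 seconds).
ceph config set osd osd_heartbeat_grace 30
# Allow more down/up flaps before an OSD shuts itself down
# (defaults: 5 markdowns within a 600 second window).
ceph config set osd osd_max_markdown_count 10
ceph config set osd osd_max_markdown_period 600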
[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)
Hi, We also had this kind of problem after upgrading to octopus. Maybe you can play with the heartbeat grace time ( https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/ ) to tell OSDs to wait a little longer before declaring another OSD down! We also tried to fix the problem by manually compacting the down OSD (something like: systemctl stop ceph-osd@74; sleep 10; ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact; systemctl start ceph-osd@74). This worked a few times, but some OSDs went down again, so we simply waited for the data to be reconstructed elsewhere and then reinstalled the dead OSD:
ceph osd destroy 74 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sde --destroy
ceph-volume lvm create --osd-id 74 --data /dev/sde
This seems to have fixed the issue for us (up to now).
F.

On 08/03/2022 at 09:35, Boris Behrens wrote:
Yes, this is something we know, and we disabled it because we ran into the problem that PGs went unavailable when two or more OSDs went offline.
I am searching for the reason WHY this happens. Currently we have set the service file to restart=always and removed the StartLimitBurst from the service file.
We just don't understand why the OSDs don't answer the heartbeat. The OSDs that are flapping are random in terms of host, disk size, and having an SSD block.db or not. Network connectivity issues are something that I would rule out, because the cluster went from "nothing ever happens except IOPS" to "random OSDs are marked DOWN until they kill themselves" with the update from nautilus to octopus.
I am out of ideas and hoped this was a bug in 15.2.15, but after the update things got worse (it happens more often). We tried to:
* disable swap
* more swap
* disable bluefs_buffered_io
* disable write cache for all disks
* disable scrubbing
* reinstall with new OS (from CentOS 7 to Ubuntu 20.04)
* disable cluster_network (so there is only one way to communicate)
* increase txqueuelen on the network interfaces
* everything together

What we try next: add more SATA controllers, so there are not 24 disks attached to a single controller, but I doubt this will help.
Cheers
Boris

On Tue, 8 Mar 2022 at 09:10, Dan van der Ster <dvand...@gmail.com> wrote:
Here's the reason they exit:
7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.00 seconds, shutting down
If an OSD flaps (marked down, then up) 6 times in 10 minutes, it exits. (This is a safety measure.)
It's normally caused by a network issue -- other OSDs are telling the mon that he is down, but then the OSD himself tells the mon that he's up!
Cheers, Dan

On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens wrote:
Hi,
we've had the problem with OSDs marked as offline since we updated to octopus and hoped the problem would be fixed with the latest patch. We have this kind of problem only with octopus, and there only with the big s3 cluster.
* Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
* Network interfaces are 20gbit (2x10 in an 802.3ad encap3+4 bond)
* We only use the frontend network.
* All disks are spinning, some have block.db devices.
* All disks are bluestore.
* Configs are mostly defaults.
* We've set the OSDs to restart=always without a limit, because we had the problem with unavailable PGs when two OSDs are marked as offline and they share PGs.
But since we installed the latest patch we are experiencing more OSD downs and even crashes. I tried to remove as many duplicated lines as possible.
Is the numa error a problem? Why do OSD daemons not respond to heartbeats? I mean, even when the disk is totally loaded with IO, the system itself should answer heartbeats, or am I missing something? I really hope some of you can point me in the right direction to solve this nasty problem.
This is what the latest crash looks like:
Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+ 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
...
Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+ 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
Mar 07 17:53:07 s3db18 ceph-osd[4530]: in thread 7f5ef1501700 thread_name:tp_osd_tp
Mar 07 17:53:07 s3db18 ceph-osd[4530]: ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 1: (()+0x143c0) [0x7f5f0d4623c0]
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2: (pthread_kill()+0x38) [0x7f5f0d45ef08]
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x471) [0x55a699a01201]
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned long, unsigned long
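One way to check whether the heartbeat failures are really network-related is to ask the flapping OSD itself, via its admin socket on the host that carries it (osd.161 below is taken from the log above; the 1000 is a reporting threshold in milliseconds):

# Ping times this OSD measured to its heartbeat peers, listing entries above the threshold.
ceph daemon osd.161 dump_osd_network 1000
# Slowest recent ops on the same OSD, to see whether the disk or the OSD threads are stalling.
ceph daemon osd.161 dump_historic_ops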
[ceph-users] Re: cephfs removing multiple snapshots
No, our OSDs are HDD (no SSD) and we have everything (data and metadata) on them (no NVMe).

On 17/11/2021 at 16:49, Arthur Outhenin-Chalandre wrote:
Hi,
On 11/17/21 16:09, Francois Legrand wrote:
Now we are investigating this snapshot issue and I noticed that as long as we remove one snapshot alone, things seem to go well (only some PGs in "unknown state", but no global warning nor slow ops, OSD down or crash). But if we remove several snapshots at the same time (I tried with 2 for the moment), then we start to have some slow ops. I guess that if I remove 4 or 5 snapshots at the same time I will end up with OSDs marked down and/or crashing, as we had just after the upgrade (I am not sure I want to try that with our production cluster).
Maybe you want to try to tweak `osd_snap_trim_sleep`. On Octopus/Pacific with hybrid OSDs the snapshot deletions seem pretty stable in our testing. Out of curiosity, are your OSDs on SSD? I suspect that the default setting of `osd_snap_trim_sleep` for SSD OSDs could affect performance [1].
Cheers,
[1]: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/FPRB2DW4N427U25LEHYICOKI4C37BKSO/
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
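The `osd_snap_trim_sleep` knob Arthur mentions can be inspected and changed with the usual config commands; something along these lines (the value of 2 seconds is only an example):

# What one running OSD currently uses.
ceph config show osd.0 osd_snap_trim_sleep
# Sleep between snap trim operations to throttle trimming (seconds; 0 means no throttling).
ceph config set osd osd_snap_trim_sleep 2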
[ceph-users] cephfs removing multiple snapshots
Hello,
We recently upgraded our ceph+cephfs cluster from nautilus to octopus. After the upgrade, we noticed that removal of snapshots was causing a lot of problems (lots of slow ops, OSDs marked down, crashes etc...), so we suspended the snapshots for a while; the cluster has been stable again for more than one week now. We did not have these problems under nautilus.
Now we are investigating this snapshot issue and I noticed that as long as we remove one snapshot alone, things seem to go well (only some PGs in "unknown state", but no global warning nor slow ops, OSD down or crash). But if we remove several snapshots at the same time (I tried with 2 for the moment), then we start to have some slow ops. I guess that if I remove 4 or 5 snapshots at the same time I will end up with OSDs marked down and/or crashing, as we had just after the upgrade (I am not sure I want to try that with our production cluster).
So my questions are: has someone noticed this kind of problem, has the snapshot management changed between nautilus and octopus, and is there a way to solve it (apart from removing one snap at a time and waiting for the snaptrim to end before removing the next one)?
We also changed bluefs_buffered_io from false to true (it was set to false a long time ago because of the bug https://tracker.ceph.com/issues/45337) because it seems that it can help (cf. https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/S4ZW7D5J5OAI76F44NNXMTKWNZYYYUJY/). Do the OSDs need to be restarted to make this change effective?
Thanks.
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
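On the question of whether a restart is needed for bluefs_buffered_io: one way to find out is to compare the value stored in the monitor config database with what a running OSD actually reports; if the running value does not change after the `config set`, the daemon still has to be restarted to pick it up. A sketch:

# Store the new value centrally.
ceph config set osd bluefs_buffered_io true
# Ask a running OSD which value it is actually using right now.
ceph config show osd.0 bluefs_buffered_io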
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
Hi Frank,
I totally agree with your point 3 (and also with 1 and 2, indeed). Generally speaking, the release cycle of many software products tends to become faster and faster (not only for ceph, but also openstack etc...), and it's really hard and tricky to keep an infrastructure up to date in such conditions, even more so when you deal with storage. As a result, as you perfectly explained, this gives the impression that the product is not that robust, contains a lot of bugs and needs a lot of patches, etc. A few times, upgrades have been released with obvious bugs or regressions (e.g. the DNS problem in 14.2.12, ...), and this gives the impression that there is a rush to release, even if the corrections are not fully tested... which leads to a loss of confidence from the users. And I am personally going through this process!! We wanted to upgrade our Nautilus cluster. First we decided to go directly to Pacific, but looking at the list it appeared to us that Pacific is absolutely not stable enough to be considered a production release. We thus decided to go to octopus... maybe we will go to pacific when v17 is out. I thus feel that the "last stable release" (currently pacific) is in fact a development release (and the community is the "testing pool" for that release), and the truly stable release is the n-1 one (octopus). Thus I fully support your request for an LTS release with stability as a main goal.
F.

On 08/11/2021 at 13:21, Frank Schilder wrote:
Hi all,
I followed this thread with great interest and would like to add my opinion/experience/wishes as well. I believe the question of packages versus containers needs a bit more context to be really meaningful. This was already mentioned several times with regard to documentation. I see the following three topics as tightly connected (my opinion/answers included):
1. Distribution: Packages are compulsory, containers are optional.
2. Deployment: Ceph adm (yet another deployment framework) and ceph (the actual storage system) should be strictly different projects.
3. Release cycles: The release cadence is way too fast; I very much miss a ceph LTS branch with at least 10 years of back-port support.
These are my short answers/wishes/expectations in this context. I will add below some more reasoning as optional reading (warning: wall of text ahead).
1. Distribution - I don't think the question is about packages versus containers, because even if a distribution should decide not to package ceph any more, other distributors certainly will, and the user community will just move away from distributions without ceph packages. In addition, unless Red Hat plans to move to a source-only container where I run the good old configure - make - make install, it will be package based anyway, so packages are here to stay. Therefore, the way I understand this question, it is about ceph-adm versus other deployment methods. Here, I think the push to a container-based, ceph-adm-only deployment is unlikely to become the no. 1 choice for everyone, for good reasons already mentioned in earlier messages. In addition, I also believe that development of a general deployment tool is currently not sustainable, as was mentioned by another user. My reasons for this are given in the next section.
2. Deployment - In my opinion, it is really important to distinguish three components of any open-source project: development (release cycles), distribution and deployment.
Following the good old philosophy that every tool does exactly one job and does it well, each of these components is a separate project, because they correspond to different tools. This implies immediately that ceph documentation should not contain documentation about packaging and deployment tools. Each of these ought to be strictly separate. If I have a low-level problem with ceph and go to the ceph documentation, I do not want to see ceph-adm commands. Ceph documentation should be about ceph (the storage system) only. Such a mix-up leads to problems, and there were already ceph-user cases where people could not use the documentation for troubleshooting, because it showed ceph-adm commands but their cluster was not ceph-adm deployed. In this context, I would prefer if there was a separate ceph-adm-users list so that ceph-users can focus on actual ceph problems again.
Now to the point that ceph-adm might be an unsustainable project. Although at first glance the idea of a generic deployment tool that solves all problems with a single command might look appealing, it is likely doomed to fail for a simple reason that was already indicated in an earlier message: ceph deployment is subject to a complexity paradox. Ceph has a very large configuration space, and implementing and using a generic tool that covers and understands this configuration space is more complex than deploying any specific ceph cluster, each of which uses only a tiny subset of the entire
[ceph-users] Re: snaptrim blocks IO on ceph nautilus
On 06/11/2021 at 16:57, Francois Legrand wrote:
Hi,
Can you confirm that changing bluefs_buffered_io to true solved your problem? Because I have a rather similar problem. My Nautilus cluster was running with bluefs_buffered_io = false. It was working (even with snaptrim lasting a long time, i.e. several hours). I upgraded to octopus, and it seems that creating/deleting snapshots now creates a lot of instabilities (leading to OSDs marked down or crashing, mgr and mds crashing, MON_DISK_BIG warnings, mons out of quorum and tons of slow ops and MOSDScrubReserve messages in the logs). Compaction of the failed OSDs seems to more or less solve the problem (the OSDs stop crashing). So I have disabled the snapshots for the moment.
F.

On 27/07/2020 at 15:59, Manuel Lausch wrote:
Hi,
For some days I have been trying to debug a problem with snaptrimming under nautilus. I have a cluster with Nautilus (v14.2.10), 44 nodes with 24 OSDs of 14 TB each. I create a snapshot every day and keep 7 days of them. Every time the old snapshot is deleted I have bad IO performance and blocked requests for several seconds until the snaptrim is done. Settings like snaptrim_sleep and osd_pg_max_concurrent_snap_trims don't affect this behavior.
In the debug_osd 10/10 log I see the following:
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edda20 prio 196 cost 0 latency 0.019545 osd_repop_reply(client.22731418.0:615257 3.636 e22457/22372) v2 pg pg[3.636( v 22457'100855 (21737'97756,22457'100855] local-lis/les=22372/22374 n=27762 ec=2842/2839 lis/c 22372/22372 les/c/f 22374/22374/0 22372/22372/22343) [411,36,956,763] r=0 lpr=22372 luod=22457'100854 crt=22457'100855 lcod 22457'100853 mlcod 22457'100853 active+clean+snaptrim_wait trimq=[1d~1]]
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edda20 finish
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edc2c0 prio 127 cost 0 latency 0.043165 MOSDScrubReserve(2.2645 RELEASE e22457) v1 pg pg[2.2645( empty local-lis/les=22359/22364 n=0 ec=2403/2403 lis/c 22359/22359 les/c/f 22364/22367/0 22359/22359/22359) [379,411,884,975] r=1 lpr=22359 crt=0'0 active mbc={}]
2020-07-27 11:45:49.976 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557886edc2c0 finish
2020-07-27 11:45:50.039 7fd8b8404700 10 osd.411 pg_epoch: 22457 pg[3.278e( v 22457'99491 (21594'96426,22457'99491] local-lis/les=22359/22362 n=27669 ec=2859/2839 lis/c 22359/22359 les/c/f 22362/22365/0 22359/22359/22343) [411,379,848,924] r=0 lpr=22359 crt=22457'99491 lcod 22457'99489 mlcod 22457'99489 active+clean+snaptrim trimq=[1d~1]] snap_trimmer posting
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 pg_epoch: 22457 pg[3.278e( v 22457'99493 (21594'96426,22457'99493] local-lis/les=22359/22362 n=27669 ec=2859/2839 lis/c 22359/22359 les/c/f 22362/22365/0 22359/22359/22343) [411,379,848,924] r=0 lpr=22359 luod=22457'99491 crt=22457'99493 lcod 22457'99489 mlcod 22457'99489 active+clean+snaptrim trimq=[1d~1]] snap_trimmer complete
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557880ac3760 prio 127 cost 663 latency 7.761823 osd_repop(osd.217.0:3025 3.1ca5 e22457/22378) v2 pg pg[3.1ca5( v 22457'100370 (21716'97357,22457'100370] local-lis/les=22378/22379 n=27532 ec=2855/2839 lis/c 22378/22378 les/c/f 22379/22379/0 22378/22378/22378) [217,411,551,1055] r=1 lpr=22378 luod=0'0 lua=22294'16 crt=22457'100370 lcod 22457'100369 active mbc={}]
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x557880ac3760 finish
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x5578813e1e40 prio 127 cost 0 latency 7.494296 MOSDScrubReserve(2.37e2 REQUEST e22457) v1 pg pg[2.37e2( empty local-lis/les=22355/22356 n=0 ec=2412/2412 lis/c 22355/22355 les/c/f 22356/22356/0 22355/22355/22355) [245,411,834,768] r=1 lpr=22355 crt=0'0 active mbc={}]
2020-07-27 11:45:57.801 7fd8b8404700 10 osd.411 22457 dequeue_op 0x5578813e1e40 finish
The dequeueing of ops works without pauses until the „snap_trimmer posting“ and „snap_trimmer complete“ log lines. In this example this task takes about 7 seconds. The operations dequeued after that now have a latency of about this time.
I tried to drill down into this in the code (developers are asked here). It seems that the PG is locked for every operation. The snap_trimmer posting and complete messages come from „osd/PrimaryLogPG.cc“ on line 4700. This indicates to me that the process of deleting a snapshot object will sometimes take some time.
After further poking around, I see in „osd/SnapMapper.cc“ the method „SnapMapper::get_next_objects_to_trim“, which takes several seconds to finish. I followed this further to „common/map_cacher.hpp“, line 94: „int r = driver->get_next(key, &store);“ From there I lost the pa
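For watching this behaviour on a live cluster, the PGs currently trimming (or waiting to trim) are visible in the PG dump, and the per-OSD trim concurrency mentioned above is a plain config option; the value of 1 below is just an example:

# PGs in snaptrim or snaptrim_wait state.
ceph pg dump pgs_brief | grep snaptrim
# Limit how many PGs a single OSD trims in parallel (default 2).
ceph config set osd osd_pg_max_concurrent_snap_trims 1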
[ceph-users] Re: Upgrade to 16.2.6 and osd+mds crash after bluestore_fsck_quick_fix_on_mount true
Hello,
Can you confirm that the bug only affects pacific and not octopus?
Thanks.
F.

On 29/10/2021 at 16:39, Neha Ojha wrote:
On Thu, Oct 28, 2021 at 8:11 AM Igor Fedotov wrote:
On 10/28/2021 12:36 AM, mgrzybowski wrote:
Hi Igor
I'm very happy that you were able to reproduce and find the bug. Nice one! In my opinion, at the moment the first priority should be to warn other users in the official upgrade docs: https://docs.ceph.com/en/latest/releases/pacific/#upgrading-from-octopus-or-nautilus .
This has been escalated to the Ceph dev community, hopefully to be done shortly.
We have added a warning in our docs: https://ceph--43706.org.readthedocs.build/en/43706/releases/pacific/#upgrading-from-octopus-or-nautilus.
Thanks, Neha
Please also note the tracker: https://tracker.ceph.com/issues/53062 and the fix: https://github.com/ceph/ceph/pull/43687
In my particular case (I have a home storage server based on cephfs and a bunch of random HDDs - SMRs too :( ) I restarted the OSDs one at a time after all RADOS objects were repaired. Unfortunately, four disks showed bad sectors due to the recovery strain, so I have a small number of unfound objects. Bad disks were removed one by one. Now I'm waiting for backfill, then scrubs. Making a crashed OSD work again would be nice but should not be necessary. What about some kind of export and import of PGs? Could this work on crashed OSDs with a failed omap format upgrade?
I can't say for sure what the results would be - export/import should probably work, but the omaps in the restored PGs would still be broken. Highly likely the OSDs (and other daemons) would get stuck on that invalid data... Converting ill-formatted omaps back to their regular form (either the new or the legacy one) looks like a more straightforward and predictable task...
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
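The warning added to the release notes boils down to not letting the automatic omap repair run when the OSDs first start on the new version. Expressed as a config command, that is roughly the following (to be re-enabled only once a release containing the fix is installed):

# Keep the quick-fix/repair from running automatically at OSD startup during the upgrade.
ceph config set osd bluestore_fsck_quick_fix_on_mount false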
[ceph-users] Re: when mds_all_down open "file system" page provoque dashboard crash
The crash report is : { "backtrace": [ "/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f86044313c0]", "gsignal()", "abort()", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911) [0x7f86042d2911]", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c) [0x7f86042de38c]", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7) [0x7f86042de3f7]", "/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9) [0x7f86042de6a9]", "(std::__throw_out_of_range(char const*)+0x41) [0x7f86042d537e]", "(Client::resolve_mds(std::__cxx11::basic_stringstd::char_traits, std::allocator > const&, std::vector >*)+0x1306) [0x563db199e076]", "(Client::mds_command(std::__cxx11::basic_stringstd::char_traits, std::allocator > const&, std::vector, std::allocator >, std::allocatorstd::char_traits, std::allocator > > > const&, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*, std::__cxx11::basic_string, std::allocator >*, Context*)+0x179) [0x563db19baa69]", "/usr/bin/ceph-mgr(+0x1d185d) [0x563db17db85d]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a170e) [0x7f860d5e770e]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "_PyEval_EvalCodeWithName()", "_PyFunction_Vectorcall()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8daa) [0x7f860d5eedaa]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "_PyEval_EvalCodeWithName()", "_PyFunction_Vectorcall()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "PyVectorcall_Call()", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7f860d3bad6d]", "_PyEval_EvalFrameDefault()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7f860d3c606b]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8e2b) [0x7f860d5eee2b]", "PyVectorcall_Call()", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x116c01) [0x7f860d45cc01]", "/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x17d51b) [0x7f860d4c351b]", "/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f8604425609]", "clone()" ], "ceph_version": "16.2.5", "os_id": "ubuntu", "os_name": "Ubuntu", "os_version": "20.04.3 LTS (Focal Fossa)", "os_version_id": "20.04", "process_name": "ceph-mgr", "stack_sig": "9a65d0019b8102fdaee8fd29c30e3aef3b86660d33fc6cd9bd51f57844872b2a", "timestamp": "2021-09-23T12:27:29.137868Z", "utsname_machine": "x86_64", "utsname_release": "5.4.0-86-generic", "utsname_sysname": "Linux", "utsname_version": "#97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021" } Le 23/09/2021 à 14:55, Francois Legrand a écrit : Hi, I am testing an upgrade (from 14.2.16 to 16.2.5) on my ceph test cluster (bar metal). I noticed (when reaching the mds upgrade) that after I stopped all the mds, opening the "file system" page on the dashboard result in a crash of the dashboard (and also of the mgr). Does someone had this issue ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] when mds_all_down open "file system" page provoque dashboard crash
Hi, I am testing an upgrade (from 14.2.16 to 16.2.5) on my ceph test cluster (bare metal). I noticed (when reaching the mds upgrade) that after I stopped all the mds, opening the "file system" page on the dashboard results in a crash of the dashboard (and also of the mgr). Has someone else had this issue? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Why set osd flag to noout during upgrade ?
Hello everybody, I have a "stupid" question. Why is it recommended in the docs to set the noout OSD flag during an upgrade/maintenance (and especially during an OSD upgrade/maintenance)? In my understanding, if an OSD goes down, after a while (600 s by default) it's marked out and the cluster will start to rebuild its content elsewhere in the cluster to maintain the redundancy of the data. This generates some transfer and load on other OSDs, but that's not a big deal! As soon as the OSD is back, it's marked in again and ceph is able to determine which data is back and stop the recovery, reusing the unchanged data that has returned. Generally, the recovery is as fast as with the noout flag (because with noout, the data modified during the down period still has to be copied to the returning OSD). So is there another reason, apart from limiting the data movement and the load on the other OSDs during the downtime? F ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
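The flag itself is just set and cleared around the maintenance window; the pattern the docs refer to is essentially:

# Before taking nodes down for upgrade/maintenance:
ceph osd set noout
# ... upgrade or reboot the hosts one by one ...
# Once everything is back up:
ceph osd unset noout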
[ceph-users] Re: usable size for replicated pool with custom rule in pacific dashboard
You are probably right! But this "verification" seems "stupid"! I created an additional room (with no OSDs), and then the dashboard doesn't complain anymore! Indeed, the rule does what we want, because "step choose firstn 0 type room" will select the different rooms (2 in our case), and for the first one it will put 2 copies on different hosts (step chooseleaf firstn 2 type host), and then it goes to the remaining room and puts the third copy there (and eventually a fourth if we choose replica 4). Forcing the first step (step choose firstn 0 type room) to offer as many choices (rooms) as replicas means that the second step is then rather useless! That's why it appears to me that this verification is somewhat "stupid"... The check should be that the number of replicas is not greater than the number of rooms times the number of leaves chosen in the second step (2 in my case)... but maybe I missed something!
F.

On 09/09/2021 at 13:23, Ernesto Puerta wrote:
Hi Francois,
I'm not an expert on CRUSH rule internals, but I checked the code and it assumes that the failure domain (first choose/chooseleaf step) there is "room": since there are just 2 rooms vs. 3 replicas, it doesn't allow you to create a pool with a rule that might not work optimally (keep in mind that the Dashboard tries to perform some extra validations compared to the Ceph CLI).
Kind Regards, Ernesto

On Thu, Sep 9, 2021 at 12:29 PM Francois Legrand <f...@lpnhe.in2p3.fr> wrote:
Hi all,
I have a test ceph cluster with 4 OSD servers, each containing 3 OSDs. The crushmap uses 2 rooms with 2 servers in each room. We use replica 3 for pools. I have the following custom crush rule to ensure that I have at least one copy of each piece of data in each room:
rule replicated3over2rooms {
    id 1
    type replicated
    min_size 3
    max_size 4
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 2 type host
    step emit
}
Everything was working well in nautilus/centos7 (I could create pools using the dashboard and my custom rule). I upgraded to pacific/ubuntu 20.04 in containers with cephadm. Now, I cannot create a new pool with replicated3over2rooms using the dashboard! If I choose Pool type = replicated, Replicated size = 3, Crush ruleset = replicated3over2rooms, the dashboard says:
Minimum: 3 Maximum: 2
The size specified is out of range. A value from 3 to 2 is usable.
And inspecting the replicated3over2rooms ruleset in the dashboard shows the parameters:
max_size 4
min_size 3
rule_id 1
usable_size 2
Where does that usable_size come from? How can I correct it? If I run the command line
ceph osd pool create test 16 replicated replicated3over2rooms 3
it works!!
Thanks.
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] usable size for replicated pool with custom rule in pacific dashboard
Hi all,
I have a test ceph cluster with 4 OSD servers, each containing 3 OSDs. The crushmap uses 2 rooms with 2 servers in each room. We use replica 3 for pools. I have the following custom crush rule to ensure that I have at least one copy of each piece of data in each room:
rule replicated3over2rooms {
    id 1
    type replicated
    min_size 3
    max_size 4
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 2 type host
    step emit
}
Everything was working well in nautilus/centos7 (I could create pools using the dashboard and my custom rule). I upgraded to pacific/ubuntu 20.04 in containers with cephadm. Now, I cannot create a new pool with replicated3over2rooms using the dashboard! If I choose Pool type = replicated, Replicated size = 3, Crush ruleset = replicated3over2rooms, the dashboard says:
Minimum: 3 Maximum: 2
The size specified is out of range. A value from 3 to 2 is usable.
And inspecting the replicated3over2rooms ruleset in the dashboard shows the parameters:
max_size 4
min_size 3
rule_id 1
usable_size 2
Where does that usable_size come from? How can I correct it? If I run the command line
ceph osd pool create test 16 replicated replicated3over2rooms 3
it works!!
Thanks.
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
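Independently of what the dashboard computes, the rule itself can be checked offline with crushtool against the live CRUSH map; a quick sketch (the file name is arbitrary, and rule id 1 matches the rule above):

# Export the current CRUSH map in binary form.
ceph osd getcrushmap -o crushmap.bin
# Show where 3 replicas would be placed for a sample of inputs using rule 1.
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings
# Summarize how often each OSD gets picked, which makes impossible mappings obvious.
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-utilization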
[ceph-users] Re: Howto upgrade AND change distro
Thanks,
My point is how to safely reattach the OSDs from the previous install to the newly installed distro! Is there a detailed howto to completely reinstall a server (or a cluster)?
F.

On 27/08/2021 at 19:47:
Message: 1
Date: Fri, 27 Aug 2021 16:43:12 +0100
From: Matthew Vernon
Subject: [ceph-users] Re: Howto upgrade AND change distro
To: ceph-users@ceph.io
Message-ID: <654262bf-b621-d534-7067-62a3a2abb...@wikimedia.org>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi,
On 27/08/2021 16:16, Francois Legrand wrote:
We are running a ceph nautilus cluster under centos 7. To upgrade to pacific we need to change to a more recent distro (probably debian or ubuntu because of the recent announcement about centos 8, but the distro doesn't matter very much). However, I couldn't find a clear procedure to upgrade ceph AND the distro! As we have more than 100 osds and ~600TB of data, we would like to avoid as far as possible wiping the disks and rebuilding/rebalancing. It seems to be possible to reinstall a server and reuse the osds, but the exact procedure remains quite unclear to me.
It's going to be least pain to do the operations separately, which means you may need to build a set of packages for one or other "end" of the operation, if you see what I mean? The Debian and Ubuntu installers both have an "expert mode" which gives you quite a lot of control, which should enable you to upgrade the OS without touching the OSD disks - but make sure you have backups of all your Ceph config! If you're confident (and have enough redundancy), you can set noout while you upgrade a machine, which will reduce the amount of rebalancing you have to do when it rejoins the cluster post upgrade.
Regards, Matthew
[one good thing about Ubuntu's cloud archive is that e.g. you can get the same version that's default in 20.04 available as packages for 18.04 via UCA, meaning you can upgrade Ceph first, and then do the distro upgrade, and it's pretty painless]
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
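Assuming the OSDs are BlueStore on LVM as created by ceph-volume, the usual way to reattach them after the new OS is installed (with the ceph packages, /etc/ceph/ceph.conf and the bootstrap-osd keyring restored) is to let ceph-volume rediscover and start the existing OSDs from their LVM metadata; a sketch:

# List the OSDs ceph-volume can see on the reinstalled host.
ceph-volume lvm list
# Activate (and enable at boot) every OSD found on the local disks.
ceph-volume lvm activate --all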
[ceph-users] Howto upgrade AND change distro
Hello,
We are running a ceph nautilus cluster under centos 7. To upgrade to pacific we need to change to a more recent distro (probably debian or ubuntu because of the recent announcement about centos 8, but the distro doesn't matter very much). However, I couldn't find a clear procedure to upgrade ceph AND the distro! As we have more than 100 osds and ~600TB of data, we would like to avoid as far as possible wiping the disks and rebuilding/rebalancing. It seems to be possible to reinstall a server and reuse the osds, but the exact procedure remains quite unclear to me. What is the best way to proceed? Has someone done that, and do they have a rather detailed doc on how to proceed?
Thanks for your help!
F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
1 | 34 osd.77 1 6 23 1 0 0 2 1 1 | 35 osd.78 1 6 24 1 0 0 1 1 1 | 35 osd.79 1 6 22 1 1 0 2 1 1 | 35 osd.80 1 6 22 1 1 0 1 1 1 | 34 osd.81 1 6 24 1 1 0 2 1 2 | 38 osd.82 1 6 23 1 0 1 1 1 0 | 34 osd.83 0 6 23 1 1 0 1 1 0 | 33 osd.84 1 6 25 1 1 1 2 1 2 | 40 osd.85 1 6 22 1 0 0 2 0 2 | 34 osd.86 0 6 22 1 0 0 1 0 1 | 31 osd.87 1 6 22 0 0 0 2 1 2 | 34 osd.88 1 8 34 1 1 0 2 1 3 | 51 osd.89 1 7 22 1 1 1 2 1 2 | 38 osd.90 1 6 25 0 1 1 2 1 2 | 39 osd.91 1 8 32 0 1 1 2 1 1 | 47 osd.92 1 6 22 0 1 2 1 1 2 | 36 osd.93 1 7 22 1 1 1 2 1 2 | 38 osd.94 1 6 27 0 1 1 1 1 1 | 39 osd.95 1 7 30 0 1 1 2 1 1 | 44 osd.96 1 10 35 1 1 1 3 1 3 | 56 osd.97 1 6 28 1 1 1 1 1 1 | 41 osd.98 1 6 22 0 1 1 2 0 1 | 34 osd.99 1 6 29 1 1 1 2 1 1 | 43 osd.100 1 6 26 1 1 0 2 0 2 | 39 osd.101 0 6 24 1 0 1 2 1 1 | 36 osd.102 0 6 22 1 0 1 2 0 2 | 34 osd.103 1 6 22 0 1 1 2 1 2 | 36 osd.104 0 6 30 1 1 1 2 1 2 | 44 osd.105 0 6 26 1 1 1 1 0 1 | 37 osd.106 1 11 34 1 1 1 1 1 2 | 53 osd.107 1 8 38 1 1 0 2 1 2 | 54 osd.108 1 8 34 1 1 2 2 1 3 | 53 osd.109 1 9 34 1 1 1 1 1 3 | 52 osd.110 1 8 37 1 1 0 3 1 3 | 55 osd.111 1 8 40 1 1 0 2 1 1 | 55 osd.112 1 8 37 1 1 2 3 1 1 | 55 osd.113 1 8 34 1 1 0 1 1 1 | 48 osd.114 1 11 34 1 1 1 1 1 2 | 53 osd.115 1 11 34 1 1 0 1 1 1 | 51 SUM : 96 768 3072 96 96 96 192 96 192 | F. Le 01/02/2021 à 10:26, Dan van der Ster a écrit : On Mon, Feb 1, 2021 at 10:03 AM Francois Legrand wrote: Hi, Actually we have no EC pools... all are replica 3. And we have only 9 pools. The average number og pg/osd is not very high (40.6). Here is the detail of the pools : pool 2 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 623105 lfor 0/608315/608313 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 31 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 0/0/171563 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 32 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 436085/436085/436085 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 33 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 621529 lfor 0/0/171554 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 34 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 623470 lfor 0/0/171558 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd pool 35 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 last_change 621529 lfor 0/598286/598284 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs pool 36 replicated size 3 min_size 1 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 624174 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs pool 43 replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 624174 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs pool 44 replicated size 3 min_size 3 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 622177 lfor 0/0/449412 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 400 target_size_bytes 17592186044416 application rbd Pools 35 (meta), 36 and 43 (datas) 
are for cephfs. How does the distribution for pool 36 look? This pool has the best chance to be balanced -- the others have too few PGs so you shouldn't even be wo
[ceph-users] Re: Balancing with upmap
retty non-uniform distribution, because this example pool id 38 has up to 4 PGs on some OSDs but 1 or 2 on most. (this is a cluster with the balancer disabled). The other explanation I can think of is that you have relatively wide EC pools and few hosts. In that case there would be very little that the balancer could do to flatten the distribution. If in doubt, please share your pool details and crush rules so we can investigate further. Cheers, Dan On Sun, Jan 31, 2021 at 5:10 PM Francois Legrand wrote: Hi, After 2 days, the recovery ended. The situation is clearly better (but still not perfect) with 339.8 Ti available in pools (for 575.8 Ti available in the whole cluster). The balancing remains not perfect (31 to 47 pgs on 8TB disks). And the ceph osd df tree returns : ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAPMETA AVAIL %USE VAR PGS STATUS TYPE NAME -1 1018.65833- 466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 251 TiB 00 -root default -15465.66577- 466 TiB 214 TiB 214 TiB 126 GiB 609 GiB 251 TiB 46.04 1.06 -room 1222-2-10 -3116.41678- 116 TiB 53 TiB 53 TiB 24 GiB 152 GiB 64 TiB 45.45 1.05 -host lpnceph01 0 hdd7.27599 1.0 7.3 TiB 3.7 TiB 3.7 TiB 2.5 GiB 16 GiB 3.5 TiB 51.34 1.18 38 up osd.0 4 hdd7.27599 1.0 7.3 TiB 3.2 TiB 3.2 TiB 2.4 GiB 8.7 GiB 4.1 TiB 44.12 1.01 36 up osd.4 8 hdd7.27699 1.0 7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.3 GiB 3.7 TiB 48.52 1.12 39 up osd.8 12 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.5 GiB 3.9 TiB 46.69 1.07 37 up osd.12 16 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.4 TiB 38 MiB 9.7 GiB 3.8 TiB 47.49 1.09 37 up osd.16 20 hdd7.27599 1.0 7.3 TiB 3.1 TiB 3.0 TiB 2.4 GiB 8.7 GiB 4.2 TiB 41.95 0.96 34 up osd.20 24 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.5 TiB 2.3 GiB 9.8 GiB 3.8 TiB 48.45 1.11 38 up osd.24 28 hdd7.27599 1.0 7.3 TiB 3.0 TiB 3.0 TiB 55 MiB 8.2 GiB 4.2 TiB 41.74 0.96 32 up osd.28 32 hdd7.27599 1.0 7.3 TiB 3.2 TiB 3.1 TiB 32 MiB 8.4 GiB 4.1 TiB 43.33 1.00 34 up osd.32 36 hdd7.27599 1.0 7.3 TiB 3.7 TiB 3.7 TiB 2.4 GiB 11 GiB 3.6 TiB 50.50 1.16 35 up osd.36 40 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.3 TiB 2.4 GiB 9.1 GiB 3.9 TiB 46.15 1.06 37 up osd.40 44 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.2 GiB 3.9 TiB 46.28 1.06 36 up osd.44 48 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 92 MiB 8.8 GiB 4.0 TiB 44.88 1.03 33 up osd.48 52 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 2.4 GiB 9.0 GiB 4.0 TiB 44.86 1.03 33 up osd.52 56 hdd7.27599 1.0 7.3 TiB 2.9 TiB 2.9 TiB 23 MiB 8.3 GiB 4.4 TiB 39.79 0.92 34 up osd.56 60 hdd7.27599 1.0 7.3 TiB 3.0 TiB 3.0 TiB 40 MiB 8.3 GiB 4.3 TiB 41.12 0.95 30 up osd.60 -5116.41600- 116 TiB 54 TiB 54 TiB 30 GiB 150 GiB 63 TiB 46.12 1.06 -host lpnceph02 1 hdd7.27599 1.0 7.3 TiB 3.2 TiB 3.2 TiB 2.2 GiB 8.9 GiB 4.0 TiB 44.53 1.02 37 up osd.1 5 hdd7.27599 1.0 7.3 TiB 3.1 TiB 3.1 TiB 24 MiB 8.3 GiB 4.2 TiB 42.56 0.98 34 up osd.5 9 hdd7.27599 1.0 7.3 TiB 3.8 TiB 3.8 TiB 42 MiB 11 GiB 3.4 TiB 52.61 1.21 38 up osd.9 13 hdd7.27599 1.0 7.3 TiB 3.1 TiB 3.1 TiB 2.3 GiB 9.7 GiB 4.2 TiB 42.89 0.99 36 up osd.13 17 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.1 GiB 3.9 TiB 46.80 1.08 36 up osd.17 21 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 41 MiB 9.2 GiB 4.0 TiB 44.90 1.03 33 up osd.21 25 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.5 TiB 2.4 GiB 9.4 GiB 3.7 TiB 48.75 1.12 38 up osd.25 29 hdd7.27599 1.0 7.3 TiB 3.0 TiB 3.0 TiB 2.3 GiB 8.7 GiB 4.2 TiB 41.91 0.96 34 up osd.29 33 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.3 GiB 9.4 GiB 3.9 TiB 46.60 1.07 36 up osd.33 37 hdd7.27599 1.0 7.3 TiB 3.5 TiB 3.5 TiB 4.6 GiB 10 GiB 3.8 TiB 47.90 1.10 34 up osd.37 41 hdd7.27599 1.0 7.3 TiB 3.3 
TiB 3.3 TiB 2.2 GiB 11 GiB 3.9 TiB 45.91 1.06 33 up osd.41 45 hdd7.27599 1.0 7.3 TiB 3.4 TiB 3.4 TiB 2.4 GiB 9.3 GiB 3.9 TiB 46.85 1.08 35 up osd.45 49 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 8.9 GiB 4.0 TiB 45.35 1.04 36 up osd.49 53 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 36 MiB 9.0 GiB 4.0 TiB 44.85 1.03 33 up osd.53 57 hdd7.27599 1.0 7.3 TiB 3.3 TiB 3.3 TiB 2.3 GiB 9.0 GiB 4.0 TiB 45.67 1.05 36 up osd.57 61 hdd7.27599 1.0 7.3 TiB 3.6 TiB 3.6 TiB 2.4 GiB 9.8 GiB 3.7 TiB 49.75 1.14 36 up osd.61 -9116.41600- 116 TiB 56 TiB 56 TiB 35 GiB 159 GiB 61 TiB 48.03 1.10 -host l
[ceph-users] Re: Balancing with upmap
d": "Sun Jan 31 17:07:47 2021" } Can the crush rules for placement be blamed for the inequal repartition ? F. Le 29/01/2021 à 23:44, Dan van der Ster a écrit : Thanks, and thanks for the log file OTR which simply showed: 2021-01-29 23:17:32.567 7f6155cae700 4 mgr[balancer] prepared 0/10 changes This indeed means that balancer believes those pools are all balanced according to the config (which you have set to the defaults). Could you please also share the output of `ceph osd df tree` so we can see the distribution and OSD weights? You might need simply to decrease the upmap_max_deviation from the default of 5. On our clusters we do: ceph config set mgr mgr/balancer/upmap_max_deviation 1 Cheers, Dan On Fri, Jan 29, 2021 at 11:25 PM Francois Legrand wrote: Hi Dan, Here is the output of ceph balancer status : /ceph balancer status// //{// //"last_optimize_duration": "0:00:00.074965", // //"plans": [], // //"mode": "upmap", // //"active": true, // //"optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", // //"last_optimize_started": "Fri Jan 29 23:13:31 2021"// //}/ F. Le 29/01/2021 à 10:57, Dan van der Ster a écrit : Hi Francois, What is the output of `ceph balancer status` ? Also, can you increase the debug_mgr to 4/5 then share the log file of the active mgr? Best, Dan On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand wrote: Thanks for your suggestion. I will have a look ! But I am a bit surprised that the "official" balancer seems so unefficient ! F. Le 28/01/2021 à 12:00, Jonas Jelten a écrit : Hi! We also suffer heavily from this so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer After you run it, it echoes the PG movements it suggests. You can then just run those commands the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :) -- Jonas On 27/01/2021 17.15, Francois Legrand wrote: Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in december and the rest of 8TB) running nautilus 14.2.16. I moved (8 month ago) from crush_compat to upmap balancing. But the cluster seems not well balanced, with a number of pgs on the 8TB disks varying from 26 to 52 ! And an occupation from 35 to 69%. The recent 16 TB disks are more homogeneous with 48 to 61 pgs and space between 30 and 43%. Last week, I realized that some osd were maybe not using upmap because I did a ceph osd crush weight-set ls and got (compat) as result. Thus I ran a ceph osd crush weight-set rm-compat which triggered some rebalancing. Now there is no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB and around 50% usage on all. Which is not the case (by far). The problem is that it impact the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on space available before the first osd will be full ! Is it normal ? Did I missed something ? What could I do ? F. 
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
32.567 7f6155cae700 4 mgr[balancer] prepared 0/10 changes This indeed means that balancer believes those pools are all balanced according to the config (which you have set to the defaults). Could you please also share the output of `ceph osd df tree` so we can see the distribution and OSD weights? You might need simply to decrease the upmap_max_deviation from the default of 5. On our clusters we do: ceph config set mgr mgr/balancer/upmap_max_deviation 1 Cheers, Dan On Fri, Jan 29, 2021 at 11:25 PM Francois Legrand wrote: Hi Dan, Here is the output of ceph balancer status : /ceph balancer status// //{// //"last_optimize_duration": "0:00:00.074965", // //"plans": [], // //"mode": "upmap", // //"active": true, // //"optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", // //"last_optimize_started": "Fri Jan 29 23:13:31 2021"// //}/ F. Le 29/01/2021 à 10:57, Dan van der Ster a écrit : Hi Francois, What is the output of `ceph balancer status` ? Also, can you increase the debug_mgr to 4/5 then share the log file of the active mgr? Best, Dan On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand wrote: Thanks for your suggestion. I will have a look ! But I am a bit surprised that the "official" balancer seems so unefficient ! F. Le 28/01/2021 à 12:00, Jonas Jelten a écrit : Hi! We also suffer heavily from this so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer After you run it, it echoes the PG movements it suggests. You can then just run those commands the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :) -- Jonas On 27/01/2021 17.15, Francois Legrand wrote: Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in december and the rest of 8TB) running nautilus 14.2.16. I moved (8 month ago) from crush_compat to upmap balancing. But the cluster seems not well balanced, with a number of pgs on the 8TB disks varying from 26 to 52 ! And an occupation from 35 to 69%. The recent 16 TB disks are more homogeneous with 48 to 61 pgs and space between 30 and 43%. Last week, I realized that some osd were maybe not using upmap because I did a ceph osd crush weight-set ls and got (compat) as result. Thus I ran a ceph osd crush weight-set rm-compat which triggered some rebalancing. Now there is no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB and around 50% usage on all. Which is not the case (by far). The problem is that it impact the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on space available before the first osd will be full ! Is it normal ? Did I missed something ? What could I do ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
Hi Dan,
Here is the output of ceph balancer status:
ceph balancer status
{
    "last_optimize_duration": "0:00:00.074965",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Fri Jan 29 23:13:31 2021"
}
F.

On 29/01/2021 at 10:57, Dan van der Ster wrote:
Hi Francois,
What is the output of `ceph balancer status` ? Also, can you increase the debug_mgr to 4/5 then share the log file of the active mgr?
Best, Dan

On Fri, Jan 29, 2021 at 10:54 AM Francois Legrand wrote:
Thanks for your suggestion. I will have a look! But I am a bit surprised that the "official" balancer seems so inefficient!
F.

On 28/01/2021 at 12:00, Jonas Jelten wrote:
Hi! We also suffer heavily from this, so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer
After you run it, it echoes the PG movements it suggests. You can then just run those commands and the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :)
-- Jonas

On 27/01/2021 17.15, Francois Legrand wrote:
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with the number of PGs on the 8TB disks varying from 26 to 52, and a usage from 35 to 69%! The recent 16TB disks are more homogeneous, with 48 to 61 PGs and usage between 30 and 43%. Last week, I realized that some OSDs were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as the result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of PGs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 PGs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264 TiB while there is more than 578 TiB free in the cluster), because free space seems to be based on the space available before the first OSD becomes full! Is this normal? Did I miss something? What could I do? F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
Thanks for your suggestion. I will have a look! But I am a bit surprised that the "official" balancer seems so inefficient!
F.

On 28/01/2021 at 12:00, Jonas Jelten wrote:
Hi! We also suffer heavily from this, so I wrote a custom balancer which yields much better results: https://github.com/TheJJ/ceph-balancer
After you run it, it echoes the PG movements it suggests. You can then just run those commands and the cluster will balance more. It's kinda work in progress, so I'm glad about your feedback. Maybe it helps you :)
-- Jonas

On 27/01/2021 17.15, Francois Legrand wrote:
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with the number of PGs on the 8TB disks varying from 26 to 52, and a usage from 35 to 69%! The recent 16TB disks are more homogeneous, with 48 to 61 PGs and usage between 30 and 43%. Last week, I realized that some OSDs were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as the result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of PGs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 PGs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264 TiB while there is more than 578 TiB free in the cluster), because free space seems to be based on the space available before the first OSD becomes full! Is this normal? Did I miss something? What could I do? F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Balancing with upmap
Nope!

On 27/01/2021 at 17:40, Anthony D'Atri wrote:
Do you have any override reweights set to values less than 1.0? The REWEIGHT column when you run `ceph osd df`.

On Jan 27, 2021, at 8:15 AM, Francois Legrand wrote:
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with the number of PGs on the 8TB disks varying from 26 to 52, and a usage from 35 to 69%! The recent 16TB disks are more homogeneous, with 48 to 61 PGs and usage between 30 and 43%. Last week, I realized that some OSDs were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as the result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of PGs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 PGs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264 TiB while there is more than 578 TiB free in the cluster), because free space seems to be based on the space available before the first OSD becomes full! Is this normal? Did I miss something? What could I do? F.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Balancing with upmap
Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in December and the rest of 8TB) running nautilus 14.2.16. I moved (8 months ago) from crush_compat to upmap balancing. But the cluster does not seem well balanced, with a number of pgs on the 8TB disks varying from 26 to 52 ! And an occupation from 35 to 69%. The recent 16TB disks are more homogeneous, with 48 to 61 pgs and usage between 30 and 43%. Last week, I realized that some osds were maybe not using upmap, because I did a ceph osd crush weight-set ls and got (compat) as result. Thus I ran a ceph osd crush weight-set rm-compat, which triggered some rebalancing. Now there has been no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (weighted by their capacity). Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB, and around 50% usage on all. Which is not the case (by far). The problem is that it impacts the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on the space available before the first osd becomes full ! Is this normal ? Did I miss something ? What could I do ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
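One way to see what upmap could still do, independently of the mgr balancer, is to run the optimizer offline against a copy of the osdmap; the pool name and the limit of 50 below are placeholders:
  ceph osd getmap -o om
  osdmaptool om --upmap out.sh --upmap-pool <pool> --upmap-max 50 --upmap-deviation 1
  # review out.sh, then apply it with: bash out.sh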
[ceph-users] Re: add server in crush map before osd
Thanks for your advice. It was exactly what I needed. Indeed, I did a : ceph osd crush add-bucket <host> host ; ceph osd crush move <host> room=<room> But also set the norecover, nobackfill and norebalance flags :-) It worked perfectly as expected. F. Le 03/12/2020 à 01:50, Reed Dier a écrit : Just to piggyback on this, the below are the correct answers. However, here is how I do it, which is admittedly not the best way, but it is the easy way. I set the norecover, nobackfill flags. I run my osd creation script against the first disk on the new host to make sure that everything is working correctly, and also so that I can then manually move my new host bucket where I need it in the crush map with ceph osd crush move {bucket-name} {bucket-type}={bucket-name} Then I proceed with my script for the rest of the OSDs on that host and know that they will fall into the correct crush location. And then of course I unset the norecover, nobackfill flags so that data starts moving. I only mention this because it ensures that you don't fat-finger the hostname on manual bucket creation, or that the hostname syntax doesn't match as expected, and it allows you to course correct after a single OSD is added, rather than all N OSDs. Hope that's also helpful. Reed On Dec 2, 2020, at 4:38 PM, Dan van der Ster wrote: Hi Francois! If I've understood your question, I think you have two options. 1. You should be able to create an empty host then move it into a room before creating any osd: ceph osd crush add-bucket <host> host ; ceph osd crush mv <host> room=<room> 2. Add a custom crush location to ceph.conf on the new server so that its osds are placed in the correct room/rack/host when they are first created, e.g. [osd] crush location = room=0513-S-0034 rack=SJ04 host=cephdata20b-b7e4a773b6 Does that help? Cheers, Dan On Wed, Dec 2, 2020 at 11:29 PM Francois Legrand wrote: Hello, I have a ceph nautilus cluster. The crushmap is organized with 2 rooms, servers in these rooms and osds in these servers. I have a crush rule to replicate data over the servers in different rooms. Now, I want to add a new server in one of the rooms. My point is that I would like to specify the room of this new server BEFORE creating osds in this server (so the data added to the osds will go directly to the right location). My problem is that servers seem to appear in the crushmap only when they have osds... and when you create a first osd, the server is inserted in the crushmap under the default bucket (so not in a room, and then the first data stored in this osd will not be at the correct location). I could move it afterwards (if I do it quickly, there will not be that much data to move), but I was wondering if there is a way to either define the position of a server in the crushmap hierarchy before creating osds, or to specify the room when creating the first osd ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
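Putting the above together, a minimal sketch of the whole procedure; the host and room names are placeholders, and the new OSDs are assumed to be created with your usual tooling (ceph-volume, a deployment script, etc.):
  ceph osd set norecover ; ceph osd set nobackfill ; ceph osd set norebalance
  ceph osd crush add-bucket cephserver07 host
  ceph osd crush move cephserver07 room=room1
  # create the OSDs on the new server; they land under the pre-created host bucket
  ceph osd unset norecover ; ceph osd unset nobackfill ; ceph osd unset norebalance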
[ceph-users] add server in crush map before osd
Hello, I have a ceph nautilus cluster. The crushmap is organized with 2 rooms, servers in these rooms and osds in these servers. I have a crush rule to replicate data over the servers in different rooms. Now, I want to add a new server in one of the rooms. My point is that I would like to specify the room of this new server BEFORE creating osds in this server (so the data added to the osds will go directly to the right location). My problem is that servers seem to appear in the crushmap only when they have osds... and when you create a first osd, the server is inserted in the crushmap under the default bucket (so not in a room, and then the first data stored in this osd will not be at the correct location). I could move it afterwards (if I do it quickly, there will not be that much data to move), but I was wondering if there is a way to either define the position of a server in the crushmap hierarchy before creating osds, or to specify the room when creating the first osd ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osd regularly wrongly marked down
Hello, During the night the osd.16 crashed after hitting a suicide timout. Thus this morning I did a ceph-kvstore-tool compact and restarted the osd. I thus compared the results of ceph daemon osd.16 perf dump I had before (i.e. yesterday) and now (after compaction). I noticed a interresting difference in msgr_active_connections. Before the compaction it was, for all AsyncMessenger::Worker-0, 1 and 2 at a crasy value (18446744073709550998) and get back to something comparable to what I have for other osds (72). Does this helps you to identify the problem ? F. Le 31/08/2020 à 15:59, Wido den Hollander a écrit : On 31/08/2020 15:44, Francois Legrand wrote: Thanks Igor for your answer, We could try do a compaction of RocksDB manually, but it's not clear to me if we have to compact on the mon with something like ceph-kvstore-tool rocksdb /var/lib/ceph/mon/mon01/store.db/ compact or on the concerned osd with ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-16/ compact (or for all osd with a script like in https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 ) You would compact the OSDs, not the MONs. So the last command or my script which you linked there. For my culture, how does compaction works ? Is it done automatically in background, regularly, at startup ? Usually it's done by the OSD in the background, but sometimes an offline compact works best. Because in the logs of the osd we have every 10mn some reports about compaction (which suggests that compaction occurs regularly), like : Yes, that is normal. But the offline compaction is sometimes more effective than the online ones are. 2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:777] --- DUMPING STATS --- 2020-08-31 15:06:55.448 7f03fb398700 4 rocksdb: [db/db_impl.cc:778] ** DB Stats ** Uptime(secs): 449404.8 total, 600.0 interval Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0 writes per commit group, ingest: 0.28 GB, 0.00 MB/s Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync, written: 0.28 GB, 0.00 MB/s Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes per commit group, ingest: 0.22 MB, 0.00 MB/s Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written: 0.00 MB, 0.00 MB/s Interval stall: 00:00:0.000 H:M:S, 0.0 percent ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop L0 1/0 60.48 MB 0.2 0.0 0.0 0.0 0.1 0.1 0.0 1.0 0.0 163.7 0.52 0.40 2 0.258 0 0 L1 0/0 0.00 KB 0.0 0.1 0.1 0.0 0.1 0.1 0.0 0.5 48.2 26.1 2.32 0.64 1 2.319 920K 197K L2 17/0 1.00 GB 0.8 1.1 0.1 1.1 1.1 0.0 0.0 18.3 69.8 67.5 16.38 4.97 1 16.380 4747K 82K L3 81/0 4.50 GB 0.9 0.6 0.1 0.5 0.3 -0.2 0.0 4.3 66.9 36.6 9.23 4.95 2 4.617 9544K 802K L4 285/0 16.64 GB 0.1 2.4 0.3 2.0 0.2 -1.8 0.0 0.8 110.3 11.7 21.92 4.37 5 4.384 12M 12M Sum 384/0 22.20 GB 0.0 4.2 0.6 3.6 1.8 -1.8 0.0 21.8 85.2 36.6 50.37 15.32 11 4.579 28M 13M Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 ** Compaction Stats [default] ** Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop --- Low 0/0 0.00 KB 0.0 4.2 0.6 3.6 1.7 -1.9 0.0 0.0 86.0 35.3 49.86 14.92 9 5.540 28M 13M High 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.1 0.1 0.0 0.0 0.0 150.2 0.40 0.40 1 0.403 0 0 User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 211.7 0.11 0.00 1 0.114 0 0 Uptime(secs): 449404.8 total, 600.0 interval Flush(GB): cumulative 0.083, interval 0.000 AddFile(Total Files): cumulative 0, interval 0 AddFile(L0 Files): cumulative 0, interval 0 AddFile(Keys): cumulative 0, interval 0 Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB r
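A quick way to read the counter mentioned above directly, assuming jq is installed; Worker-0 is just one of the three AsyncMessenger workers, so repeat for Worker-1 and Worker-2:
  ceph daemon osd.16 perf dump | jq '."AsyncMessenger::Worker-0".msgr_active_connections'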
[ceph-users] Re: osd regularly wrongly marked down
idays time (only a few KB/s of io and no recover). We have no standalone fast drive for DB/WAL and nothing in the osds (nor mons) logs suggesting any problem (apart the heartbeat_map is_healthy timeout). Thanks F. Le 31/08/2020 à 12:15, Igor Fedotov a écrit : Hi Francois, given that slow operations are observed for collection listings you might want to manually compact RocksDB using ceph-kvstore-tool. The observed slowdown tends to happen after massive data removals. I've seen multiple compains about this issue including some post in this mailing list. BTW I can see your post from Jun 24 about slow pool removal - couldn't this be a trigger? Also wondering whether you have standalone fast(SSD/NVMe) drive for DB/WAL? Aren't there any BlueFS spillovers which might be relevant? Thanks, Igor On 8/28/2020 11:33 AM, Francois Legrand wrote: Hi all, We have a ceph cluster in production with 6 osds servers (with 16x8TB disks), 3 mons/mgrs and 3 mdss. Both public and cluster networks are in 10GB and works well. After a major crash in april, we turned the option bluefs_buffered_io to false to workaround the large write bug when bluefs_buffered_io was true (we were in version 14.2.8 and the default value at this time was true). Since that time, we regularly have some osds wrongly marked down by the cluster after heartbeat timeout (heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15). Generally the osd restart and the cluster is back healthy, but several time, after many of these kick-off the osd reach the osd_op_thread_suicide_timeout and goes down definitely. We increased the osd_op_thread_timeout and osd_op_thread_suicide_timeout... The problems still occurs (but less frequently). Few days ago, we upgraded to 14.2.11 and revert the timeout to their default value, hoping that it will solve the problem (we thought that it should be related to this bug https://tracker.ceph.com/issues/45943), but it didn't. We still have some osds wrongly marked down. Can somebody help us to fix this problem ? Thanks. Here is an extract of an osd log at failure time: - 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub starts 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 this last line is repeated more than 1000 times ... 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:20:17.551 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 67.3532s, lat = 67s cid =44.7d_head start GHMAX end GHMAX max 25 ... 
2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.774 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 63.223s, lat = 63s cid =44.7d_head start #44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end #MAX# max 2147483647 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub ok 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.16 down, but it is still running 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG] : map e609411 wrongly marked me down at e609410 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411 start_waiting_for_healthy 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410 pg[36.3d0( v 609409'481293 (449368'478292,609409'481293] local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72] r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198 crt=609409'481293 lcod 609409'481292 active mbc={}] start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] -> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1, features acting 4611087854031667199 upacting 4611087854031667199 ... 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411 pg[36.56( v 609409'480511 (449368'477424,609409'480511] local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102] r=-1 lpr=609410 pi=[609403,609410)/1 crt=60940
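For reference, a rough sketch of offline compaction for every OSD on one host, one at a time; the osd ids and the sleep are only illustrative, and it is safer to wait for the cluster to return to HEALTH_OK before moving on to the next host:
  for id in 16 17 18 ; do          # the osd ids on this host
    systemctl stop ceph-osd@$id
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$id compact
    systemctl start ceph-osd@$id
    sleep 60
  done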
[ceph-users] Re: osd regularly wrongly marked down
We tried to rise the osd_memory_target from 4 to 8G but the problem still occurs (osd wrongly marked down few times a day). Does somebody have any clue ? F. On Fri, Aug 28, 2020 at 10:34 AM Francois Legrand mailto:f...@lpnhe.in2p3.fr>> wrote: Hi all, We have a ceph cluster in production with 6 osds servers (with 16x8TB disks), 3 mons/mgrs and 3 mdss. Both public and cluster networks are in 10GB and works well. After a major crash in april, we turned the option bluefs_buffered_io to false to workaround the large write bug when bluefs_buffered_io was true (we were in version 14.2.8 and the default value at this time was true). Since that time, we regularly have some osds wrongly marked down by the cluster after heartbeat timeout (heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15). Generally the osd restart and the cluster is back healthy, but several time, after many of these kick-off the osd reach the osd_op_thread_suicide_timeout and goes down definitely. We increased the osd_op_thread_timeout and osd_op_thread_suicide_timeout... The problems still occurs (but less frequently). Few days ago, we upgraded to 14.2.11 and revert the timeout to their default value, hoping that it will solve the problem (we thought that it should be related to this bug https://tracker.ceph.com/issues/45943), but it didn't. We still have some osds wrongly marked down. Can somebody help us to fix this problem ? Thanks. Here is an extract of an osd log at failure time: - 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub starts 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 this last line is repeated more than 1000 times ... 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:20:17.551 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 67.3532s, lat = 67s cid =44.7d_head start GHMAX end GHMAX max 25 ... 
2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.774 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 63.223s, lat = 63s cid =44.7d_head start #44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end #MAX# max 2147483647 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub ok 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.16 down, but it is still running 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG] : map e609411 wrongly marked me down at e609410 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411 start_waiting_for_healthy 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410 pg[36.3d0( v 609409'481293 (449368'478292,609409'481293] local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72] r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198 crt=609409'481293 lcod 609409'481292 active mbc={}] start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] -> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1, features acting 4611087854031667199 upacting 4611087854031667199 ... 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411 pg[36.56( v 609409'480511 (449368'477424,609409'480511] local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/60
[ceph-users] osd regularly wrongly marked down
Hi all, We have a ceph cluster in production with 6 osd servers (with 16x8TB disks), 3 mons/mgrs and 3 mdss. Both public and cluster networks are 10Gb and work well. After a major crash in april, we turned the option bluefs_buffered_io to false to work around the large-write bug present when bluefs_buffered_io was true (we were on version 14.2.8 and the default value at this time was true). Since that time, we regularly have some osds wrongly marked down by the cluster after a heartbeat timeout (heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15). Generally the osd restarts and the cluster is back healthy, but several times, after many of these kick-offs, the osd reaches the osd_op_thread_suicide_timeout and goes down for good. We increased the osd_op_thread_timeout and osd_op_thread_suicide_timeout... The problem still occurs (but less frequently). A few days ago, we upgraded to 14.2.11 and reverted the timeouts to their default values, hoping that it would solve the problem (we thought it might be related to this bug https://tracker.ceph.com/issues/45943), but it didn't. We still have some osds wrongly marked down. Can somebody help us to fix this problem ? Thanks. Here is an extract of an osd log at failure time: - 2020-08-28 02:19:05.019 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub starts 2020-08-28 02:19:25.755 7f040e43d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:19:25.755 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 this last line is repeated more than 1000 times ... 2020-08-28 02:20:17.484 7f040d43b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:20:17.551 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 67.3532s, lat = 67s cid =44.7d_head start GHMAX end GHMAX max 25 ...
2020-08-28 02:20:22.600 7f040dc3c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.774 7f03f1384700 0 bluestore(/var/lib/ceph/osd/ceph-16) log_latency_fn slow operation observed for _collection_list, latency = 63.223s, lat = 63s cid =44.7d_head start #44:beffc78d:::rbd_data.1e48e8ab988992.11bd:0# end #MAX# max 2147483647 2020-08-28 02:21:20.774 7f03f1384700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f03f1384700' had timed out after 15 2020-08-28 02:21:20.805 7f03f1384700 0 log_channel(cluster) log [DBG] : 44.7d scrub ok 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.16 down, but it is still running 2020-08-28 02:21:21.099 7f03fd997700 0 log_channel(cluster) log [DBG] : map e609411 wrongly marked me down at e609410 2020-08-28 02:21:21.099 7f03fd997700 1 osd.16 609411 start_waiting_for_healthy 2020-08-28 02:21:21.119 7f03fd997700 1 osd.16 609411 start_boot 2020-08-28 02:21:21.124 7f03f0b83700 1 osd.16 pg_epoch: 609410 pg[36.3d0( v 609409'481293 (449368'478292,609409'481293] local-lis/les=609403/609404 n=154651 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/608752) [25,72] r=-1 lpr=609410 pi=[609403,609410)/1 luod=0'0 lua=609392'481198 crt=609409'481293 lcod 609409'481292 active mbc={}] start_peering_interval up [25,72,16] -> [25,72], acting [25,72,16] -> [25,72], acting_primary 25 -> 25, up_primary 25 -> 25, role 2 -> -1, features acting 4611087854031667199 upacting 4611087854031667199 ... 2020-08-28 02:21:21.166 7f03f0b83700 1 osd.16 pg_epoch: 609411 pg[36.56( v 609409'480511 (449368'477424,609409'480511] local-lis/les=609403/609404 n=153854 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609410/609410/609410) [103,102] r=-1 lpr=609410 pi=[609403,609410)/1 crt=609409'480511 lcod 609409'480510 unknown NOTIFY mbc={}] state: transitioning to Stray 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity public network em1 numa node 0 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity cluster network em2 numa node 0 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity objectstore and network numa nodes do not match 2020-08-28 02:21:21.307 7f04073b0700 1 osd.16 609413 set_numa_affinity not setting numa affinity 2020-08-28 02:21:21.566 7f040a435700 1 osd.16 609413 tick checking mon for new map 2020-08-28 02:21:22.515 7f03fd997700 1 osd.16 609414 state: booting -> active 2020-08-28 02:21:22.515 7f03f0382700 1 osd.16 pg_epoch: 609414 pg[36.20( v 609409'483167 (449368'480117,609409'483167] local-lis/les=609403/609404 n=155171 ec=435353/435353 lis/c 609403/609403 les/c/f 609404/609404/0 609414/609414/609361) [97,16,72] r=1 lpr=609414 pi=[609403,609414)/1 crt=609409'483167 lcod 609409'483166 unknown NO
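A sketch of what can be inspected while such an osd is stalled, plus how the two timeouts discussed above could be raised centrally; the values are examples only, and these options may only take effect after a daemon restart:
  ceph daemon osd.16 dump_ops_in_flight
  ceph daemon osd.16 dump_historic_ops      # look for long _collection_list / scrub ops
  ceph config set osd osd_op_thread_timeout 60
  ceph config set osd osd_op_thread_suicide_timeout 300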
[ceph-users] osd crashing and rocksdb corruption
Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" ? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost osds. Operations went fine until Saturday 25, when some osds in the 5 remaining servers started to crash for no apparent reason. We tried to restart them, but they crashed again. We ended with 18 osds down (+ 16 in the dead server, so 34 osds down out of 100). Looking at the logs we found, for all the crashed osds : -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" message was present a few days before the crash. We also have some osds with this error but still up. We tried to repair with : ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea to fix this, or at least know whether it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors ? Thanks for your help (we are desperate because we will lose data and are fighting to save something) !!! F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
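For what it's worth, a sketch of the usual next diagnostics on an OSD in that state; these are standard tools (a deep fsck can also be requested with the --deep option), but there is no guarantee they can recover from checksum corruption caused by a power loss. The osd id 3 is taken from the log above:
  systemctl stop ceph-osd@3
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3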
[ceph-users] ceph nautilus repository index is incomplete
Hello, It seems that the index of https://download.ceph.com/rpm-nautilus/el7/x86_64/ repository is wrong. Only the 14.2.10-0.el7 version is available (all previous versions are missing despite the fact that the rpms are present in the repository). It thus seems that the index needs to be corrected. Who can I contact for that ? Thanks. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Thanks. I also added osd_op_queue_cut_off to high in global (as you mentioned in a previous thread that osd and mds should use it). F. Le 26/06/2020 à 16:35, Frank Schilder a écrit : I never tried "prio" out, but the reports I have seen claim that prio is inferior. However, as far as I know it is safe to change these settings. Unfortunately, you need to restart services to apply the changes. Before you do, check if *all* daemons are using the same setting. Contrary to the naming (osd_*), this setting applies to all daemons. I added it to the global options and, most notably, performance of the MDS was improved a lot. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 26 June 2020 15:03:23 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow I changed osd_op_queue_cut_off to high and rebooted all the osds. But the result is more or less the same (storage is still extremely slow, 2h30 to rdb extract a 64GB image !). The only improvement is that it seems that degraded pgs have disapeared (which is at least a good point). It seems that there is a problem in priority of operations. Thus do you think (and also others on the list) that changing the osd_op_queue setting could help (change to prio or mclock_client). What are the risks or secondary effects of trying mclock_client on a production cluster (is it safe) ? F. Le 26/06/2020 à 09:46, Frank Schilder a écrit : I'm using osd_op_queue = wpq osd_op_queue_cut_off = high and these settings are recommended. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 26 June 2020 09:44:00 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
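A small sketch for applying and then verifying the setting; note that the op queue options are only read at daemon startup, so `ceph config get` shows the stored value while the admin socket shows what a running daemon is actually using:
  ceph config set global osd_op_queue_cut_off high
  ceph config get osd osd_op_queue_cut_off
  ceph daemon osd.0 config get osd_op_queue_cut_off    # on an osd host, after restart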
[ceph-users] Re: ceph qos
Does somebody use mclock in a production cluster ? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
I changed osd_op_queue_cut_off to high and rebooted all the osds. But the result is more or less the same (storage is still extremely slow, 2h30 to rdb extract a 64GB image !). The only improvement is that it seems that degraded pgs have disapeared (which is at least a good point). It seems that there is a problem in priority of operations. Thus do you think (and also others on the list) that changing the osd_op_queue setting could help (change to prio or mclock_client). What are the risks or secondary effects of trying mclock_client on a production cluster (is it safe) ? F. Le 26/06/2020 à 09:46, Frank Schilder a écrit : I'm using osd_op_queue = wpq osd_op_queue_cut_off = high and these settings are recommended. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 26 June 2020 09:44:00 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Thanks. I will try to change osd_op_queue_cut_off to high and restart everything (and use this downtime to upgrade the servers). F. Le 26/06/2020 à 09:46, Frank Schilder a écrit : I'm using osd_op_queue = wpq osd_op_queue_cut_off = high and these settings are recommended. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 26 June 2020 09:44:00 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Removing pool in nautilus is incredibly slow We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
We are now using osd_op_queue = wpq. Maybe returning to prio should help ? What are you using on your mimic custer ? F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
I think he means that after a disk failure he waits for the cluster to get back to ok (so all data on the lost disk has been reconstructed elsewhere) and then the disk is changed. In that case it's normal to have misplaced objects (because with the new disk some pgs need to be migrated to populate this new space), but degraded pgs do not seem to be the right behaviour ! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
For sure, If I could downgrade to mimic I would probably do it !!! So I understand that you plan not to upgrade ! F. Le 25/06/2020 à 19:28, Frank Schilder a écrit : OK, this *does* sound bad. I would consider this a show stopper for upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 25 June 2020 19:25:14 To: ceph-users@ceph.io Subject: [ceph-users] Re: Removing pool in nautilus is incredibly slow I also had this kind of symptoms with nautilus. Replacing a failed disk (from cluster ok) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I had some operation on the ceph cluster like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remount system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access ceph storage using rbd (it fails with timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
I also had this kind of symptom with nautilus. Replacing a failed disk (from a healthy cluster) generates degraded objects. Also, we have a proxmox cluster accessing vm images stored in our ceph storage with rbd. Each time I did some operation on the ceph cluster, like adding or removing a pool, most of our proxmox vms lost contact with their system disk in ceph and crashed (or remounted their system storage in read-only mode). At first I thought it was a network problem, but now I am sure that it's related to ceph becoming unresponsive during background operations. For now, proxmox cannot even access the ceph storage using rbd (it fails with a timeout). ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Thanks for the hint. I tried it but it doesn't seem to change anything... Moreover, as the osds seem quite loaded, I regularly had some osds marked down, which triggered new peering and thus more load !!! I set the osd nodown flag, but I still have some osds reported (wrongly) as down (and back up within the minute), which generates peering and remapping. I don't really understand the effect of the nodown parameter ! Is there a way to tell ceph not to peer immediately after an osd is reported down (let's say, wait for 60s) ? I am thinking about restarting all osds (or maybe the whole cluster) to get osd_op_queue_cut_off changed to high and osd_op_thread_timeout to something higher than 15 (but I don't think it will really improve the situation). F. Le 25/06/2020 à 14:26, Wout van Heeswijk a écrit : Hi Francois, Have you already looked at the option "osd_delete_sleep"? It will not speed up the process but it will give you some control over your cluster performance. Something like: ceph tell osd.\* injectargs '--osd_delete_sleep 1' kind regards, Wout 42on On 25-06-2020 09:57, Francois Legrand wrote: Does someone have an idea ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
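As a sketch, the two settings touched on above, applied persistently rather than with injectargs; the value of 1 second is only an example, and nodown suppresses down markings entirely, so unset it as soon as possible:
  ceph config set osd osd_delete_sleep 1
  ceph osd set nodown
  ceph osd unset nodown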
[ceph-users] Re: Removing pool in nautilus is incredibly slow
Does someone have an idea ? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Removing pool in nautilus is incredibly slow
Hello, I am running ceph nautilus 14.2.8. I had to remove 2 pools (old cephfs data and metadata pools with 1024 pgs). The removal of the pools seems to take an incredible amount of time to free the space (the data pool I deleted was more than 100 TB and in 36h I got back only 10TB). In the meantime, the cluster is extremely slow (an rbd extract takes ~1h30 for a 32 GB image and writing 10MB in cephfs takes half a minute !!), which makes the cluster almost unusable. It seems that the removal of deleted pgs is done by deep-scrubs according to https://medium.com/opsops/a-very-slow-pool-removal-7089e4ac8301 Also it has been reported that this could be a regression in nautilus: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/W4M5XQRDBLXFGJGDYZALG6TQ4QBVGGAJ/#W4M5XQRDBLXFGJGDYZALG6TQ4QBVGGAJ But I couldn't find a fix or a way to speed up (or slow down) the process and bring the cluster back to decent responsiveness. Is there a way ? Thanks F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to remove one of two filesystems
Thanks a lot. It works. I could delete the filesystem and remove the pools (data and metadata). But now I am facing another problem, which is that the removal of the pools seems to take an incredible amount of time to free the space (the pool I deleted was about 100TB and in 36h I got back only 10TB). In the meantime, the cluster is extremely slow (an rbd extract takes ~30 mn for a 9 GB image and writing 10MB in cephfs takes half a minute !!), which makes the cluster almost unusable. It seems that the removal of deleted pgs is done by deep-scrubs according to https://medium.com/opsops/a-very-slow-pool-removal-7089e4ac8301 But I couldn't find a way to speed up the process or bring the cluster back to decent responsiveness. Do you have a suggestion ? F. Le 22/06/2020 à 16:40, Patrick Donnelly a écrit : On Mon, Jun 22, 2020 at 7:29 AM Frank Schilder wrote: Use ceph fs set <fs_name> down true ; after this all mdses of fs <fs_name> will become standbys. Now you can cleanly remove everything. Wait for the fs to be shown as down in ceph status, the command above is non-blocking but the shutdown takes a long time. Try to disconnect all clients first. If you're planning to delete the file system, it is faster to just do: ceph fs fail <fs_name> which will remove all the MDS and mark the cluster as not joinable. See also: https://docs.ceph.com/docs/master/cephfs/administration/#taking-the-cluster-down-rapidly-for-deletion-or-disaster-recovery ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] How to remove one of two filesystems
Hello, I have a ceph cluster (nautilus 14.2.8) with 2 filesystems and 3 mds. mds1 is managing fs1 mds2 manages fs2 mds3 is standby I want to completely remove fs1. It seems that the command to use is ceph fs rm fs1 --yes-i-really-mean-it and then delete the data and metadata pools with ceph osd pool delete but in many threads I noticed that you must shutdown the mds before running ceph fs rm. Is it still the case ? What happens in my configuration (I have 2 fs) ? If I stop mds1, the mds3 will take the management. If I stop mds3 what will mds2 do (try to manage the 2 fs or continue only with fs2) ? Thanks for your advices. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
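For reference, a minimal sketch of the sequence that ended up working in this thread; the filesystem and pool names are placeholders, and pool deletion also requires mon_allow_pool_delete to be enabled:
  ceph fs fail fs1
  ceph fs rm fs1 --yes-i-really-mean-it
  ceph config set mon mon_allow_pool_delete true
  ceph osd pool delete fs1_data fs1_data --yes-i-really-really-mean-it
  ceph osd pool delete fs1_metadata fs1_metadata --yes-i-really-really-mean-it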
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi, Actually I let the mds managing the damaged filesystem as it is because the files can be read (despite of the warning and errors). Thus I restarted the rsyncs to transfer everything to the new filesystem (thus on different PG because it's a different cephfs with different pools) but without deleting the olds files to avoid killing definitively the old mds and the old fs. The number of segment is then more or less stable (very high ~123611 but not increasing too much). I guess that we will have enought space to copy the remaining datas (it will be short but I think it will pass). Once everything will be transfered and checked, I will destroy the old FS and the damaged pool. F. Le 09/06/2020 à 19:50, Frank Schilder a écrit : Looks like an answer to your other thread takes its time. Is it a possible option for you to - copy all readable files using this PG to some other storage, - remove or clean up the broken PG and - copy the files back in? This might lead to a healthy cluster. I don't know a proper procedure though. Somehow the ceph fs must play along as files using this will also use other PGs and get partly broken. Have you found other options? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 08 June 2020 16:38:18 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted I already had some discussion on the list about this problem. But I should ask again. We really lost some objects and there are not enought shards to reconstruct them (it's an erasure coding data pool)... so it cannot be fixed anymore and we know we have data loss ! I did not marked the PG out because there are still some parts (objects) which are still present and we hope to be able to copy them and save a few bytes more ! It would be great to be able to flush only broken objects, but I don't know how to do that, even if it's possible ! I thus run some cephfs-data-scan pg_files to identify the files with data on this pg and the I run a grep -q -m 1 "." "/path_to_damaged_file" to identify the ones which are really empty (we tested different way to do this and it seems that's the fastest). F. Le 08/06/2020 à 16:07, Frank Schilder a écrit : OK, now we are talking. It is very well possible that trimming will not start until this operation is completed. If there are enough shards/copies to recover the lost objects, you should try a pg repair first. If you did loose too many replicas, there are ways to flush this PG out of the system. You will loose data this way. I don't know how to repair or flush only broken objects out of a PG, but would hope that this is possible. Before you do anything destructive, open a new thread in this list specifically for how to repair/remove this PG with the least possible damage. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 08 June 2020 16:00:28 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted There is no recovery going on, but indeed we have a pg damaged (with some lost objects due to a major crash few weeks ago)... and there are some shards of this pg on osd 27 ! That's also why we are migrating all the data out of this FS ! It's certainly related and I guess that it's trying to remove some datas that are already lost and it get stuck ! I don't know if there is a way to tell ceph to forget about these ops ! I guess no. 
I thus think that there is not that much to do apart from reading as much data as we can to save as much as possible. F. Le 08/06/2020 à 15:48, Frank Schilder a écrit : That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 08 June 2020 15:27:59 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients
[ceph-users] Broken PG in cephfs data_pool (lost objects)
Hi all, We have a cephfs with its data_pool in erasure coding (3+2) and 1024 pgs (nautilus 14.2.8). One of the pgs is partially destroyed (we lost 3 osds, thus 3 shards); it has 143 objects unfound and is stuck in state "active+recovery_unfound+undersized+degraded+remapped". We then lost some data (we are using cephfs-data-scan pg_files... to identify files with data on the bad pg). We thus created a new filesystem (this time with the data_pool in replica 3) and we are copying all the data from the broken FS to the new one. But we need to remove files from the broken FS after the copy to free space (because there will not be enough space on the cluster). To avoid problems with strays we removed the snapshots on the broken FS before deleting files. The point is that the mds managing the broken FS is now "Behind on trimming (123036/128) max_segments: 128, num_segments: 123036" and has "1 slow metadata IOs are blocked > 30 secs, oldest blocked for 83645 secs". The slow IO corresponds to osd 27, which is acting_primary for the broken PG, and the broken pg has a long "snap_trimq": "[1e0c~1,1e0e~1,1e12~1,1e16~1,1e18~1,1e1a~1," and "snap_trimq_len": 460. It then seems that cephfs is not able to trim ops corresponding to the deletion of objects and snaps which have data on the broken PG, probably because the pg is not healthy. Is there a way to tell ceph/cephfs to flush or forget about (only) the lost objects on the broken pg and get this pg healthy enough to perform trimming ? Thanks for your help F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
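For completeness, the standard (and destructive) way to tell ceph to give up on the unfound objects of one PG, once everything readable has been copied out; the pgid is a placeholder, and 'delete' can be used instead of 'revert' when no prior version should be restored:
  ceph pg <pgid> list_unfound          # list_missing on older releases
  ceph pg <pgid> mark_unfound_lost revert
  # or: ceph pg <pgid> mark_unfound_lost delete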
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
I already had some discussion on the list about this problem. But I should ask again. We really lost some objects and there are not enought shards to reconstruct them (it's an erasure coding data pool)... so it cannot be fixed anymore and we know we have data loss ! I did not marked the PG out because there are still some parts (objects) which are still present and we hope to be able to copy them and save a few bytes more ! It would be great to be able to flush only broken objects, but I don't know how to do that, even if it's possible ! I thus run some cephfs-data-scan pg_files to identify the files with data on this pg and the I run a grep -q -m 1 "." "/path_to_damaged_file" to identify the ones which are really empty (we tested different way to do this and it seems that's the fastest). F. Le 08/06/2020 à 16:07, Frank Schilder a écrit : OK, now we are talking. It is very well possible that trimming will not start until this operation is completed. If there are enough shards/copies to recover the lost objects, you should try a pg repair first. If you did loose too many replicas, there are ways to flush this PG out of the system. You will loose data this way. I don't know how to repair or flush only broken objects out of a PG, but would hope that this is possible. Before you do anything destructive, open a new thread in this list specifically for how to repair/remove this PG with the least possible damage. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Francois Legrand Sent: 08 June 2020 16:00:28 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted There is no recovery going on, but indeed we have a pg damaged (with some lost objects due to a major crash few weeks ago)... and there are some shards of this pg on osd 27 ! That's also why we are migrating all the data out of this FS ! It's certainly related and I guess that it's trying to remove some datas that are already lost and it get stuck ! I don't know if there is a way to tell ceph to forget about these ops ! I guess no. I thus think that there is not that much to do apart from reading as much data as we can to save as much as possible. F. Le 08/06/2020 à 15:48, Frank Schilder a écrit : That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 08 June 2020 15:27:59 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is, that the MDS is replaying some operations, which is very slow due to swap. 
While doing so, the segments to trim will probably keep increasing for a while until it can start trimming. The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSDs OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out. If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful, if in doubt ask the list first. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 08 June 2020 14:45:13 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
There is no recovery going on, but indeed we have a pg damaged (with some lost objects due to a major crash few weeks ago)... and there are some shards of this pg on osd 27 ! That's also why we are migrating all the data out of this FS ! It's certainly related and I guess that it's trying to remove some datas that are already lost and it get stuck ! I don't know if there is a way to tell ceph to forget about these ops ! I guess no. I thus think that there is not that much to do apart from reading as much data as we can to save as much as possible. F. Le 08/06/2020 à 15:48, Frank Schilder a écrit : That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Francois Legrand Sent: 08 June 2020 15:27:59 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is, that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming. The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSDs OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out. If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful, if in doubt ask the list first. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Francois Legrand Sent: 08 June 2020 14:45:13 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-07 04:10:36.802 7ff866e2e700 0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy! 
2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271) and also 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363 but at the end the mds went active ! :-) I let it at rest from sunday afternoon until this morning. Indeed I was able to connect clients (in read-only for now) and read the datas. I checked the clients connected with ceph tell mds.lpnceph-mds02.in2p3.fr client ls and disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command. But I still have the following warnings MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs MDS_TRIM 1 MDSs behind on trimming mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836 and the number of segments is still rising (slowly). F. Le 08/06/2020 à 12:00, Frank Schilder a écrit : Hi Francois, did you manage to get any further with this? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 15:21:59 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay unti
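For anyone hitting the same MDS_SLOW_METADATA_IO warning, the checks discussed in this exchange boil down to roughly the following (daemon names are the ones from this thread, the "ceph daemon" commands have to be run on the host where that daemon lives, and this is a sketch rather than a verified recipe):

  # Which OSD is the MDS waiting on? Look for requests with a large age.
  ceph health detail
  ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests

  # Inspect the suspect OSD's op queue (osd.27 in this thread).
  ceph daemon osd.27 ops
  ceph daemon osd.27 dump_ops_in_flight

  # Kick the OSD so it re-peers, or restart it if the stuck op is the only one left.
  ceph osd down 27
  systemctl restart ceph-osd@27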
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Thanks again for the hint ! Indeed, I did a ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests and it seems that osd 27 is more or less stuck with op of age 34987.5 (while others osd have ages < 1). I tryed a ceph osd down 27 which resulted in reseting the age but I can notice that age for osd.27 ops is rising again. I think I will restart it (btw our osd servers and mds are different machines). F. Le 08/06/2020 à 15:01, Frank Schilder a écrit : Hi Francois, this sounds great. At least its operational. I guess it is still using a lot of swap while trying to replay operations. I would disconnect cleanly all clients if you didn't do so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is, that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming. The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSDs OPS queue. I would expect this OSD to have a really long OPS queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out. If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful, if in doubt ask the list first. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 08 June 2020 14:45:13 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-07 04:10:36.802 7ff866e2e700 0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy! 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271) and also 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363 but at the end the mds went active ! :-) I let it at rest from sunday afternoon until this morning. Indeed I was able to connect clients (in read-only for now) and read the datas. I checked the clients connected with ceph tell mds.lpnceph-mds02.in2p3.fr client ls and disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command. But I still have the following warnings MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs MDS_TRIM 1 MDSs behind on trimming mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836 and the number of segments is still rising (slowly). F. Le 08/06/2020 à 12:00, Frank Schilder a écrit : Hi Francois, did you manage to get any further with this? 
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 15:21:59 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS. It stopped reporting to the MONs due to laggy connection. This laggyness is a result of swapping: 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 06 June 2020 11:11 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks for the tip, I w
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi Franck, Finally I dit : ceph config set global mds_beacon_grace 60 and create /etc/sysctl.d/sysctl-ceph.conf with vm.min_free_kbytes=4194303 and then sysctl --system After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like : 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-07 04:10:36.802 7ff866e2e700 0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy! 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271) and also 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363 but at the end the mds went active ! :-) I let it at rest from sunday afternoon until this morning. Indeed I was able to connect clients (in read-only for now) and read the datas. I checked the clients connected with ceph tell mds.lpnceph-mds02.in2p3.fr client ls and disconnected the few clients still there (with umount) and checked that they were not connected anymore with the same command. But I still have the following warnings MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs MDS_TRIM 1 MDSs behind on trimming mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836 and the number of segments is still rising (slowly). F. Le 08/06/2020 à 12:00, Frank Schilder a écrit : Hi Francois, did you manage to get any further with this? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 15:21:59 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS. It stopped reporting to the MONs due to laggy connection. This laggyness is a result of swapping: 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 06 June 2020 11:11 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Thanks for the tip, I will try that. For now vm.min_free_kbytes = 90112 Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0 but this didn't change anything... -27> 2020-06-06 06:15:07.373 7f83e3626700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon Which is the same time since last acked beacon I had before changing the parameter. As mds beacon interval is 4 s setting mds_beacon_grace to 240 should lead to 960 s (16mn). 
Thus I think that the bottleneck is elsewhere. F. Le 06/06/2020 à 09:47, Frank Schilder a écrit : Hi Francois, there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value for the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with sysctl vm.min_free_kbytes In your case with heavy swap usage, this value should probably be somewhere between 2-4GB. Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM kill your machine. Make sure that plenty of pages are unused. Drop page cache if necessary or reboot the machine before setting this value. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 00:36:13 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Francois, yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60s will do. For this particular recovery operation, you might want to go much higher (1h) and wat
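Condensed from the exchange above, the settings Francois ended up applying look roughly like this (values are the ones quoted in the thread; as Frank warns further down, only touch vm.min_free_kbytes on a machine that is not already starved for memory):

  # Give the MDS more slack before the monitors declare it laggy/down.
  ceph config set global mds_beacon_grace 60

  # Reserve more memory for the kernel on the MDS host.
  echo "vm.min_free_kbytes = 4194303" > /etc/sysctl.d/sysctl-ceph.conf
  sysctl --system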
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Thanks for the tip, I will try that. For now vm.min_free_kbytes = 90112 Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0 but this didn't change anything... -27> 2020-06-06 06:15:07.373 7f83e3626700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon Which is the same time since last acked beacon I had before changing the parameter. As mds beacon interval is 4 s setting mds_beacon_grace to 240 should lead to 960 s (16mn). Thus I think that the bottleneck is elsewhere. F. Le 06/06/2020 à 09:47, Frank Schilder a écrit : Hi Francois, there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value for the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with sysctl vm.min_free_kbytes In your case with heavy swap usage, this value should probably be somewhere between 2-4GB. Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM kill your machine. Make sure that plenty of pages are unused. Drop page cache if necessary or reboot the machine before setting this value. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 06 June 2020 00:36:13 To: ceph-users; f...@lpnhe.in2p3.fr Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted Hi Francois, yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60s will do. For this particular recovery operation, you might want to go much higher (1h) and watch the cluster health closely. Good luck and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 23:51:04 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted Hi, Unfortunately adding swap did not solve the problem ! I added 400 GB of swap. It used about 18GB of swap after consuming all the ram and stopped with the following logs : 2020-06-05 21:33:31.967 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1 2020-06-05 21:33:40.355 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1 2020-06-05 21:33:59.787 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:33:59.787 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy! 2020-06-05 21:34:00.287 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:34:00.287 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy! 
2020-06-05 21:39:05.991 7f251bfe6700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.838 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning 2020-06-05 21:39:19.870 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr respawn! --- begin dump of recent events --- -> 2020-06-05 19:28:07.982 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951 -9998> 2020-06-05 19:28:11.053 7f251b7e5700 5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132 -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0 -9996> 2020-06-05 19:28:12.176 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294 -9995> 2020-06-05 19:28:12.176 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1 -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995) ... 2020-06-05 21:39:31.092 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.
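Since the 240s grace apparently did not take effect, it may be worth double-checking what the running daemons actually use; assuming access to the admin sockets, something like:

  # Value stored in the cluster configuration database (Nautilus and later).
  ceph config get mds mds_beacon_grace

  # Values the running daemons actually use (run on the respective hosts).
  ceph daemon mds.lpnceph-mds04.in2p3.fr config get mds_beacon_grace
  ceph daemon mon.lpnceph-mon02 config get mds_beacon_grace

  # Current kernel reserve on the MDS host.
  sysctl vm.min_free_kbytes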
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi, Unfortunately adding swap did not solve the problem ! I added 400 GB of swap. It used about 18GB of swap after consuming all the ram and stopped with the following logs : 2020-06-05 21:33:31.967 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1 2020-06-05 21:33:40.355 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1 2020-06-05 21:33:59.787 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:33:59.787 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy! 2020-06-05 21:34:00.287 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2020-06-05 21:34:00.287 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy! 2020-06-05 21:39:05.991 7f251bfe6700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.838 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning 2020-06-05 21:39:19.870 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr respawn! --- begin dump of recent events --- -> 2020-06-05 19:28:07.982 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951 -9998> 2020-06-05 19:28:11.053 7f251b7e5700 5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132 -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0 -9996> 2020-06-05 19:28:12.176 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294 -9995> 2020-06-05 19:28:12.176 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1 -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995) ... 2020-06-05 21:39:31.092 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby However, the mons doesn't seems particularly loaded ! So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I did it both for mds and mons daemons because it's seems to be present in both conf). I will tells you if it works. Any other clue ? F. Le 05/06/2020 à 14:44, Frank Schilder a écrit : Hi Francois, thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It will write the cache after rejoin to a file, but not drop the cache. This will not help you. 
I think this was implemented recently to make it possible to send a cache dump file to developers after an MDS crash before the restarting MDS changes the cache. In your case, I would set osd_op_queue_cut_off during the next regular cluster service or upgrade. My best bet right now is to try to add swap. Maybe someone else reading this has a better idea or you find a hint in one of the other threads. Good luck! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Francois Legrand Sent: 05 June 2020 14:34:06 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted Le 05/06/2020 à 14:18, Frank Schilder a écrit : Hi Francois, I was also wondering if setting mds dump cache after rejoin could help ? Haven't heard of that option. Is there some documentation? I found it on : https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/ mds dump cache after rejoin Description Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery). Type Boolean Default false but I don't think it can help in my case, because rejoin occurs after replay and in my case replay never ends ! I have : osd_op_queue=wpq osd_op_queue_cut_off=low I can try to set osd_op_queue_cut_off to high, but
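Providing temporary swap, as suggested above, is just the standard procedure; a minimal sketch (file name and size are arbitrary examples, and this only makes sense on a reasonably fast SSD):

  # Create and enable a temporary swap file on the MDS host.
  dd if=/dev/zero of=/var/swapfile bs=1M count=102400   # 100 GB
  chmod 600 /var/swapfile
  mkswap /var/swapfile
  swapon /var/swapfile

  # Watch usage while the MDS replays.
  swapon --show; free -h

  # Remove it again once the MDS is active and stable.
  swapoff /var/swapfile && rm /var/swapfile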
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Le 05/06/2020 à 14:18, Frank Schilder a écrit : Hi Francois, I was also wondering if setting mds dump cache after rejoin could help ? Haven't heard of that option. Is there some documentation? I found it on : https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/ mds dump cache after rejoin Description Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery). Type Boolean Default false but I don't think it can help in my case, because rejoin occurs after replay and in my case replay never ends ! I have : osd_op_queue=wpq osd_op_queue_cut_off=low I can try to set osd_op_queue_cut_off to high, but it will be useful only if the mds get active, true ? I think so. If you have no clients connected, there should not be queue priority issues. Maybe it is best to wait until your cluster is healthy again as you will need to restart all daemons. Make sure you set this in [global]. When I applied that change and after re-starting all OSDs my MDSes had reconnect issues until I set it on them too. I think all daemons use that option (the prefix osd_ is misleading). For sure I would prefer not to restart all daemons because the second filesystem is up and running (with production clients). For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB which seems reasonable for a mds server with 32/48GB). This sounds bad. 8GB should not cause any issues. Maybe you are hitting a bug, I believe there is a regression in Nautilus. There were recent threads on absurdly high memory use by MDSes. Maybe its worth searching for these in the list. I will have a look. I already force the clients to unmount (and even rebooted the ones from which the rsync and the rmdir .snaps were launched). I don't know when the MDS acknowledges this. If is was a clean unmount (i.e. without -f or forced by reboot) the MDS should have dropped the clients already. If it was an unclean unmount it might not be that easy to get the stale client session out. However, I don't know about that. Moreover when I did that, the mds was already not active but in replay, so for sure the unmount was not acknowledged by any mds ! I think that providing more swap maybe the solution ! I will try that if I cannot find another way to fix it. If the memory overrun is somewhat limited, this should allow the MDS to trim the logs. Will take a while, but it will do eventually. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 13:46:03 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted I was also wondering if setting mds dump cache after rejoin could help ? Le 05/06/2020 à 12:49, Frank Schilder a écrit : Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories). How many rsync processes are you running in parallel? Do you have these settings enabled: osd_op_queue=wpq osd_op_queue_cut_off=high WPQ should be default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind trimming problems we have seen before. You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. 
If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism): - reduce the MDS cache memory limit to force recall of caps much earlier than now - reduce client cach size - set "osd_op_queue_cut_off=high" if not already done so, I think this requires restart of OSDs, so be careful At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required. If you have good SSDs, you could try to provide temporarily some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward. Harder measures: - stop all I/O from the FS clients, throw users out if necessary - ideally, try to cleanly (!) shut down clients or force trimming the cache by either * umount or * sync; echo 3 > /proc/sys/vm/drop_caches Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time. At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions. My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems
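The queue settings Frank recommends are applied roughly like this (either in ceph.conf under [global] or, on Nautilus, via the config database; OSD/MDS restarts are needed for the change to take effect, as noted above):

  # ceph.conf, [global] section:
  #   osd_op_queue = wpq
  #   osd_op_queue_cut_off = high

  # Or via the monitor config database:
  ceph config set global osd_op_queue wpq
  ceph config set global osd_op_queue_cut_off high

  # Verify what a running daemon uses (run on that daemon's host).
  ceph daemon osd.27 config get osd_op_queue_cut_off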
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
I was also wondering if setting mds dump cache after rejoin could help ? Le 05/06/2020 à 12:49, Frank Schilder a écrit : Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories). How many rsync processes are you running in parallel? Do you have these settings enabled: osd_op_queue=wpq osd_op_queue_cut_off=high WPQ should be default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind trimming problems we have seen before. You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism): - reduce the MDS cache memory limit to force recall of caps much earlier than now - reduce client cach size - set "osd_op_queue_cut_off=high" if not already done so, I think this requires restart of OSDs, so be careful At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required. If you have good SSDs, you could try to provide temporarily some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward. Harder measures: - stop all I/O from the FS clients, throw users out if necessary - ideally, try to cleanly (!) shut down clients or force trimming the cache by either * umount or * sync; echo 3 > /proc/sys/vm/drop_caches Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time. At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions. My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again. Hope that helps. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 11:42:42 To: ceph-users Subject: [ceph-users] mds behind on trimming - replay until memory exhausted Hi all, We have a ceph nautilus cluster (14.2.8) with two cephfs filesystem and 3 mds (1 active for each fs + one failover). We are transfering all the datas (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3). On the old FS we first removed the snapshots (to avoid strays problems when removing files) and the ran some rsync deleting the files after the transfer. The operation should last a few weeks more to complete. But few days ago, we started to have some warning mds behind on trimming from the mds managing the old FS. Yesterday, I restarted the active mds service to force the takeover by the standby mds (basically because the standby is more powerfull and have more memory, i.e 48GB over 32). The standby mds took the rank 0 and started to replay... the mds behind on trimming came back and the number of segments rised as well as the memory usage of the server. Finally, it exhausted the memory of the mds and the service stopped and the previous mds took rank 0 and started to replay... until memory exhaustion and a new switch of mds etc... It thus seems that we are in a never ending loop ! 
And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked. I stopped all the rsync and unmount the clients. My questions are : - Does the mds trim during the replay so we could hope that after a while it will purge everything and the mds will be able to become active at the end ? - Is there a way to accelerate the operation or to fix this situation ? Thanks for you help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
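The "harder measures" above translate to roughly the following on each client, one client at a time (the mount point is just an example; these commands can hang for a long while, as Frank warns):

  # Flush dirty data and ask the kernel to drop caches (and with them CephFS caps).
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # Or unmount cleanly (no -f, no lazy unmount) so the MDS can drop the session.
  umount /mnt/cephfs

  # On the MDS side, check which sessions are still present.
  ceph tell mds.lpnceph-mds02.in2p3.fr client ls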
[ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi, Thanks for your answer. I have : osd_op_queue=wpq osd_op_queue_cut_off=low I can try to set osd_op_queue_cut_off to high, but it will be useful only if the mds get active, true ? For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB which seems reasonable for a mds server with 32/48GB). I already force the clients to unmount (and even rebooted the ones from which the rsync and the rmdir .snaps were launched). I think that providing more swap maybe the solution ! I will try that if I cannot find another way to fix it. F. Le 05/06/2020 à 12:49, Frank Schilder a écrit : Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories). How many rsync processes are you running in parallel? Do you have these settings enabled: osd_op_queue=wpq osd_op_queue_cut_off=high WPQ should be default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind trimming problems we have seen before. You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism): - reduce the MDS cache memory limit to force recall of caps much earlier than now - reduce client cach size - set "osd_op_queue_cut_off=high" if not already done so, I think this requires restart of OSDs, so be careful At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required. If you have good SSDs, you could try to provide temporarily some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward. Harder measures: - stop all I/O from the FS clients, throw users out if necessary - ideally, try to cleanly (!) shut down clients or force trimming the cache by either * umount or * sync; echo 3 > /proc/sys/vm/drop_caches Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time. At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions. My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again. Hope that helps. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Francois Legrand Sent: 05 June 2020 11:42:42 To: ceph-users Subject: [ceph-users] mds behind on trimming - replay until memory exhausted Hi all, We have a ceph nautilus cluster (14.2.8) with two cephfs filesystem and 3 mds (1 active for each fs + one failover). We are transfering all the datas (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3). On the old FS we first removed the snapshots (to avoid strays problems when removing files) and the ran some rsync deleting the files after the transfer. The operation should last a few weeks more to complete. But few days ago, we started to have some warning mds behind on trimming from the mds managing the old FS. 
Yesterday, I restarted the active mds service to force the takeover by the standby mds (basically because the standby is more powerfull and have more memory, i.e 48GB over 32). The standby mds took the rank 0 and started to replay... the mds behind on trimming came back and the number of segments rised as well as the memory usage of the server. Finally, it exhausted the memory of the mds and the service stopped and the previous mds took rank 0 and started to replay... until memory exhaustion and a new switch of mds etc... It thus seems that we are in a never ending loop ! And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked. I stopped all the rsync and unmount the clients. My questions are : - Does the mds trim during the replay so we could hope that after a while it will purge everything and the mds will be able to become active at the end ? - Is there a way to accelerate the operation or to fix this situation ? Thanks for you help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
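For reference, the cache limit discussed above is set and inspected like this (8 GiB is the value from this thread; Frank's rule of thumb is to keep it below roughly half of the physical RAM):

  # Set the MDS cache target to 8 GiB.
  ceph config set mds mds_cache_memory_limit 8589934592

  # Check what the running MDS actually uses (run on the MDS host).
  ceph daemon mds.lpnceph-mds02.in2p3.fr cache status
  ceph daemon mds.lpnceph-mds02.in2p3.fr config get mds_cache_memory_limit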
[ceph-users] mds behind on trimming - replay until memory exhausted
Hi all, We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and 3 mds (1 active for each fs + one failover). We are transferring all the data (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3). On the old FS we first removed the snapshots (to avoid stray problems when removing files) and then ran some rsync jobs, deleting the files after the transfer. The operation should last a few weeks more to complete. But a few days ago, we started to get "mds behind on trimming" warnings from the mds managing the old FS. Yesterday, I restarted the active mds service to force a takeover by the standby mds (basically because the standby is more powerful and has more memory, i.e. 48GB vs 32). The standby mds took rank 0 and started to replay... the "mds behind on trimming" warning came back and the number of segments rose, as did the memory usage of the server. Finally, it exhausted the memory of the mds, the service stopped, the previous mds took rank 0 and started to replay... until memory exhaustion and another mds switch, etc. It thus seems that we are in a never-ending loop! And of course, as the mds is always in replay, the data are not accessible and the transfers are blocked. I stopped all the rsync jobs and unmounted the clients. My questions are : - Does the mds trim during replay, so we could hope that after a while it will purge everything and become active in the end ? - Is there a way to accelerate the operation or to fix this situation ? Thanks for your help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
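A way to keep an eye on the state of the MDS and the trimming backlog while it loops (daemon name as in the rest of the thread; to my knowledge the segment count shows up under the mds_log perf counters, but treat that as an assumption):

  # Which rank is in replay, which MDS is standby.
  ceph fs status
  ceph health detail | grep -A 2 TRIM

  # Segment backlog of the (re)playing MDS, via its admin socket.
  ceph daemon mds.lpnceph-mds04.in2p3.fr perf dump mds_log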
[ceph-users] Remove or recreate damaged PG in erasure coding pool
Hello, We run nautilus 14.2.8 ceph cluster. After a big crash in which we lost some disks we had a PG down (Erasure coding 3+2 pool) and trying to fix it we followed this https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1 As the PG was reported with 0 objects we first marked a shard as complete with ceph-objectstore-tool and restart the osd. The pg thus went active but reported lost objects ! As we consider the datas on this pg as lost, we try to get rid of this with ceph pg 30.3 mark_unfound_lost delete. This produced some logs like (~3 lines/hour): 2020-05-12 14:45:05.251830 osd.103 (osd.103) 886 : cluster [ERR] 30.3s0 Unexpected Error: recovery ending with 41: {30:c000e27d:::rbd_data.34.c963b6314efb84.0 100:head=435293'2 flags = delete,30:c01f1248:::rbd_data.34.7f0c0d1df22f45.0325:head=435293'3 flags = delete,30:c05e82b2:::rbd_data.34.674d063bdc66d2.0 015:head=435293'4 flags = delete,30:c0b2d8e7:::rbd_data.34.6bc88749c741cb.07d0:head=435293'5 flags = delete,30:c0c3e20e:::rbd_data.34.674d063b dc66d2.00fb:head=435293'6 flags = delete,30:c0c89740:::rbd_data.34.a7f2202210bb39.0bbc:head=435293'7 flags = delete,30:c0e59ffa:::rbd_data.34. 7f0c0d1df22f45.02fb:head=435293'8 flags = delete,30:c0e72bf4:::rbd_data.34.7f0c0d1df22f45.00fa:head=435293'9 flags = delete,30:c10ab507:::rbd_ data.34.80695c646d9535.0327:head=435293'10 flags = delete,30:c219e412:::rbd_data.34.a7f2202210bb39.0fa0:head=435293'11 flags = delete,30:c29ae ba3:::rbd_data.34.8038585a0eb9f6.0eb2:head=435293'12 flags = delete,30:c29fae09:::rbd_data.34.674d063bdc66d2.148a:head=435293'13 flags = delet e,30:c2b77a99:::rbd_data.34.7f0c0d1df22f45.031d:head=435293'14 flags = delete,30:c2c8598f:::rbd_data.34.674d063bdc66d2.02f5:head=435293'15 fla gs = delete,30:c2dd39fe:::rbd_data.34.6494fb1b0f88bf.030b:head=435293'16 flags = delete,30:c2f6ce39:::rbd_data.34.806ab864459ae5.0109:head=435 293'17 flags = delete,30:c2f8a62f:::rbd_data.34.ed0c58ebdc770f.002a:head=435293'18 flags = delete,30:c306cd86:::rbd_data.34.ed0c58ebdc770f.020 5:head=435293'19 flags = delete,30:c30f5230:::rbd_data.34.7f0c0d1df22f45.02f5:head=435293'20 flags = delete,30:c32b81df:::rbd_data.34.c79f6d1f78a707.0 100:head=435293'21 flags = delete,30:c3374080:::rbd_data.34.7f217e33dd742c.07d0:head=435293'22 flags = delete,30:c3cdbeb5:::rbd_data.34.674dcefe97 f606.0109:head=435293'23 flags = delete,30:c3cdd149:::rbd_data.34.674dcefe97f606.0019:head=435293'24 flags = delete,30:c40946c0:::rbd_data.34. 
ded8d21a9d3d8f.02a8:head=435293'25 flags = delete,30:c42ed4fd:::rbd_data.34.a6985314ad8dad.0200:head=435293'26 flags = delete,30:c483a99b:::rb d_data.34.ed0c58ebdc770f.0a00:head=435293'27 flags = delete,30:c49f09d6:::rbd_data.34.7e1c1abf436885.0bb8:head=435293'28 flags = delete,30:c51 5a4e8:::rbd_data.34.ed0c58ebdc770f.0106:head=435293'29 flags = delete,30:c5181a8e:::rbd_data.34.9385d45172fa0f.020c:head=435293'30 flags = del ete,30:c531de44:::rbd_data.34.6bc88749c741cb.0102:head=435293'31 flags = delete,30:c5427518:::rbd_data.34.806ab864459ae5.06db:head=435293'32 f lags = delete,30:c5693b53:::rbd_data.34.6494fb1b0f88bf.148a:head=435293'33 flags = delete,30:c5804bc9:::rbd_data.34.ed0cb8730e020c.0105:head=4 35293'34 flags = delete,30:c598117e:::rbd_data.34.7f0811fbac0b9d.0327:head=435293'35 flags = delete,30:c5a64fbd:::rbd_data.34.c963b6314efb84.0 010:head=435293'36 flags = delete,30:c5f9e0e5:::rbd_data.34.ed0c58ebdc770f.0f01:head=435293'37 flags = delete,30:c5ffe1d8:::rbd_data.34.6bc88749c741cb.000 00abe:head=435293'38 flags = delete,30:c6ecfaa1:::rbd_data.34.9385d45172fa0f.0002:head=435293'39 flags = delete,30:c70f:::rbd_data.34.6494fb1b 0f88bf.0106:head=435293'40 flags = delete,30:c7a730f4:::rbd_data.34.7f217e33dd742c.06e1:head=435293'41 flags = delete,30:c7aa79f7:::rbd_data.3 4.674dcefe97f606.0108:head=435293'42 flags = delete} But yesterday it started to flood the logs (~9 GB of logs/day !) with lines like : 2020-05-14 10:36:03.851258 osd.29 [ERR] Error -2 reading object 30:c24a0173:::rbd_data.34.806ab864459ae5.022d:head 2020-05-14 10:36:03.851333 osd.29 [ERR] Error -2 reading object 30:c4a41972:::rbd_data.34.6bc88749c741cb.0320:head 2020-05-14 10:36:03.851382 osd.29 [ERR] Error -2 reading object 30:c543da6f:::rbd_data.34.80695c646d9535.0dce:head 2020-05-14 10:36:03.859900 osd.29 [ERR] Error -2 reading object 30:c24a0173:::rbd_data.34.806ab864459ae5.
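For context, the commands involved in the sequence described above are roughly the following (pg id is the one from this mail; the ceph-objectstore-tool mark-complete step that preceded it is sketched under the "Recover datas from pg incomplete" message below):

  # See what the PG considers unfound/missing.
  ceph pg 30.3 query
  ceph pg 30.3 list_unfound

  # Give up on the unfound objects (this is what produced the 'delete' log lines above).
  ceph pg 30.3 mark_unfound_lost delete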
[ceph-users] Recover datas from pg incomplete
Hi, After a major crash in which we lost few osds, we are stucked with incomplete pgs. At first, peering was blocked with peering_blocked_by_history_les_bound. Thus we set osd_find_best_info_ignore_history_les true for all osds involved in the pg and set the primary osd down to force repeering. It worked for one pg which is in a replica 3 pool, but for the 2 others pgs which are in a erasurce coding (3+2) pool, it didn't worked... and the pgs are still incomplete. We know that we will have data lost, but we would like to minimize it and save as much as possible. Also because this pg is part of the data pool of a cephfs filesystem and it seems that files are spread among a lot of pgs and loosing objects in a pg of the datapool means the loss of a huge number of files ! According to https://www.spinics.net/lists/ceph-devel/msg41665.html a way would be to : - stop each osd involved in that pg - export the shards with ceph-objectstore-tool - compare the size of the shards and select the biggest one (alternatively maybe we can also look at the num_objects returned by ceph pg query ?) - Mark it as complete - restart the osd - Wait for recover and finally get rid of the missing objects with ceph pg 10.2 mark_unfound_lost delete But on this other source https://github.com/TheJJ/ceph-cheatsheet/blob/master/README.md or here https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1 it's suggested to remove the other parts (but I am not sure these threads are really related to EC pools). Could you confirm that we could follow this procedure (or correct it or suggests anything else) ? Thanks for your advices. F. PS: Here is a part of the ceph pg 10.2 query return : "state": "incomplete", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 434321, "up": [ 78, 105, 90, 4, 41 ], "acting": [ 78, 105, 90, 4, 41 ], "info": { "pgid": "10.2s0", "state": "incomplete", "last_peered": "2020-04-22 09:58:42.505638", "last_became_peered": "2020-04-20 11:06:07.701833", "num_objects": 161314, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 161314, "num_objects_recovered": 1290285, "peer_info": [ "peer": "4(3)", "pgid": "10.2s3", "state": "active+undersized+degraded+remapped+backfilling", "last_peered": "2020-04-25 13:25:12.860435", "last_became_peered": "2020-04-22 10:45:45.520125", "num_objects": 162869, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 85071, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 162869, "num_objects_recovered": 1368082, "peer": "9(2)", "pgid": "10.2s2", "state": "down", "last_peered": "2020-04-25 13:25:12.860435", "last_became_peered": "2020-04-22 10:45:45.520125", "num_objects": 162869, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 162869, "num_objects_recovered": 1368082, "peer": "41(4)", "pgid": "10.2s4", "state": "unknown", "last_peered": "0.00", "last_became_peered": "0.00", "num_objects": 0, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 0, "num_objects_recovered": 0, "peer": "46(4)", "pgid": "10.2s4", "state": "down", "last_peered": "2020-04-25 13:25:12.860435", "last_became_peered": "2020-04-22 10:45:45.520125", 
"num_objects": 162869, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 162869,
[ceph-users] pg incomplete blocked by destroyed osd
Hi all, During a crash disaster we destroyed a few osds and recreated them with different ids. As an example, osd 3 was destroyed and recreated with id 101 by running: ceph osd purge 3 --yes-i-really-mean-it + ceph osd create (to block id 3) + ceph-deploy osd create --data /dev/sdxx and finally ceph osd rm 3. Some of our pgs are now incomplete (which can be understood) but blocked by some of the removed osds. For example, here is a part of the ceph pg 30.3 query output: { "state": "incomplete", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 384075, "up": [ 103, 43, 29, 2, 66 ], "acting": [ 103, 43, 29, 2, 66 ], "peer_info": [ { "peer": "2(3)", "pgid": "30.3s3", "last_update": "373570'105925965", "last_complete": "373570'105925965", ... }, "up": [ 103, 43, 29, 2, 66 ], "acting": [ 103, 43, 29, 2, 66 ], "avail_no_missing": [], "object_location_counts": [], "blocked_by": [ 3, 49 ], "down_osds_we_would_probe": [ 3 ], "peering_blocked_by": [], "peering_blocked_by_detail": [ { "detail": "peering_blocked_by_history_les_bound" } ] I don't understand why the removed osds are still considered and present in the pg info. Is there a way to get rid of that ? Moreover, we have tons of slow ops (more than 15 000), but I guess the two problems are linked. Thanks for your help. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
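The workaround that got the replicated pg past peering_blocked_by_history_les_bound in the previous message would look roughly like this for pg 30.3 (OSD ids from the acting set above; ignoring the les/bound check can select an older copy of the data, so set the option back to false afterwards):

  # Show what is blocking peering.
  ceph pg 30.3 query | grep -A 4 blocked_by

  # Let the OSDs in the acting set ignore the last-epoch-started check ...
  ceph config set osd.103 osd_find_best_info_ignore_history_les true
  # (repeat for 43, 29, 2 and 66; if it does not seem to apply, restart those OSDs)

  # ... and force re-peering by bouncing the primary.
  ceph osd down 103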
[ceph-users] repairing osd rocksdb
Hi, We had a major crash which ended with ~1/3 of our osds down. Trying to fix it we reinstalled a few of the down osds (that was a mistake, I agree) and destroyed the data on them. Finally, we could fix the problem (thanks to Igor Fedotov) and restart almost all of our osds except one, for which the rocksdb seems corrupted (at least for one file). Unfortunately, we now have 4 pgs down (all involving the dead osd) and 8 pgs incomplete (some of them also involving the down osd). Before considering data loss, we would like to try to restart the down osd, hoping to recover the down pgs and maybe some of the incomplete ones. Does someone have an idea on how to do that (maybe by removing the file corrupting the rocksdb, or forcing it to ignore the data in error) ? If it's not possible, how can we fix (even with data loss) the down and incomplete pgs ? Thanks for your advice. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
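Before writing the OSD off completely, the usual things to try on a corrupted RocksDB are along these lines, with the OSD stopped (osd.74 is just a placeholder id here; fsck/repair are safe to attempt but may not recover anything, and destructive-repair, where available, can itself lose data):

  systemctl stop ceph-osd@74

  # Check and, if possible, repair the BlueStore metadata / RocksDB.
  ceph-bluestore-tool fsck   --path /var/lib/ceph/osd/ceph-74
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-74

  # RocksDB-level compaction, and (last resort) destructive repair.
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 destructive-repair

  systemctl start ceph-osd@74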
[ceph-users] Re: ceph crash hangs forever and recovery stop
Is there a way to purge the crashes ? For example, is it safe and sufficient to delete everything in /var/lib/ceph/crash on the nodes ? F. Le 30/04/2020 à 17:14, Paul Emmerich a écrit : Best guess: the recovery process doesn't really stop, but it's just that the mgr is dead and it no longer reports the progress. And yeah, I can confirm that having a huge number of crash reports is a problem (had a case where a monitoring script crashed due to a radosgw-admin bug... lots of crash reports) Paul -- Paul Emmerich ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
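When the crash module itself still responds, the supported way to clean up is through it rather than by deleting files by hand; on Nautilus that is roughly:

  ceph crash ls
  ceph crash archive-all      # hide current crashes from the health warning
  ceph crash prune 0          # remove reports older than 0 days, i.e. all of them

If the crash commands hang, as in this thread, the reports live under /var/lib/ceph/crash on each node, which is what the question above about deleting them directly refers to.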
[ceph-users] ceph crash hangs forever and recovery stop
Hi everybody (again), We recently had a lot of osd crashes (more than 30 osds crashed). This is now fixed, but it triggered a huge rebalancing+recovery. More or less at the same time, we noticed that ceph crash ls (or any other ceph crash command) hangs forever and never returns. And finally, the recovery process stops regularly (after ~1 hour) but can be restarted by restarting the mgr daemon (systemctl restart ceph-mgr.target on the active manager). There is nothing in the logs (the manager still works, the service is up, the dashboard is accessible, but the recovery simply stops). We also tried to reboot the managers, but it doesn't solve the problem. I guess these two problems are linked, but I'm not sure. Does anybody have a clue ? Thanks. F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osd crashing and rocksdb corruption
Thanks again for your reactivity and your advices. You saved our lives ! We reactivate recovery/backfilling/rebalancing and it starts the recovery. We now have to wait to see how it will evolve. Last question : We noticed (a few days ago and it still occurs) that after ~1h the recovery was stopping (Recovery Throughput drop to 0). We could restart it by restarting the ceph-mgr.target on the active manager... for another ~1 hour ! It's strange because I cannot see any crash or relevant info in the logs ! Moreover the ceph crash command hangs and no way to get output. Maybe it's because of the huge number of failures on the osds ! Do you think that this two problems could be related to the osd crashing ? I will continue to investigate and maybe open a new different thread on this topic. F. Le 30/04/2020 à 10:57, Igor Fedotov a écrit : I created the following ticket and PR to track/fix the issue with incomplete large writes when bluefs_buffereed_io=1. https://tracker.ceph.com/issues/45337 https://github.com/ceph/ceph/pull/34836 But In fact setting bluefs_buffered_io to false is the mainstream for now, see https://github.com/ceph/ceph/pull/34224 Francois, you can proceed with OSD.21 &.49 I can reproduce the issue locally hence no much need in them now. Still investigating what's happening with OSD.8... As for reactivating recovery/backfill/rebalancing - I can say for sure whether it's safe or not. Thanks, Igor On 4/30/2020 1:39 AM, Francois Legrand wrote: Hello, We set bluefs_buffered_io to false for the whole cluster except 2 osd (21 and 49) for which we decided to keep the value to true for future experiments/troubleshooting as you asked. We then restarted all the 25 downs osd and they started... except one (number 8) which still continue to crash with the same kind of errors. I tryed a fsck on this osd which ended by a success. I set the debug to 20 and recorded the logs. You will find the logs there if you want to have a look : https://we.tl/t-GDvvvi2Gmm Now we plan to reactivate the recovery, backfill and rebalancing if you think it's safe. F. Le 29/04/2020 à 16:45, Igor Fedotov a écrit : So the crash seems to be caused by the same issue - big (and presumably incomplete) write and subsequent read failure. I've managed to repro this locally. So bluefs_buffered_io seems to be a remedy for now. But additionally I can observe multiple slow ops indications in this new log and I think they cause those big writes. And I presume some RGW-triggered ops are in flight - bucket resharding or removal or something.. I've seen multiple reports about this stuff causing OSD slowdown, high memory utilization and finally huge reads and/or writes from RocksDB. Don't know how to deal with this at the moment... Thanks, Igor On 4/29/2020 5:33 PM, Francois Legrand wrote: Here are the logs of the newly crashed osd. F. Le 29/04/2020 à 16:21, Igor Fedotov a écrit : Sounds interesting - could you please share the crash log for these new OSDs? They presumably suffer from another issue. At least that first crash is caused by something else. "bluefs buffered io" can be injected on the fly but I expect it to help when OSD isn't starting up only. On 4/29/2020 5:17 PM, Francois Legrand wrote: Ok we will try that. Indeed, restarting osd.5 triggered the falling down of two other osds in the cluster. Thus we will set bluefs buffered io = false for all osds and force bluefs buffered io = true for one of the downs osds. 
Is that modification needs to use injectargs or changing it in the configuration is enougth to have it applied on the fly ? F. Le 29/04/2020 à 15:56, Igor Fedotov a écrit : That's bluefs buffered io = false which did the trick. It modified write path and this presumably has fixed large write(s). Trying to reproduce locally but please preserve at least one failing OSD (i.e. do not start it with the disabled buffered io) for future experiments/troubleshooting for a while if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with theses options. The osd is now up since 10mn without crashing (before it was rebooting after ~1mn). F. Le 29/04/2020 à 15:16, Igor Fedotov a écrit : Hi Francois, I'll write a more thorough response a bit later. Meanwhile could you please try OSD startup with the following settings now: debug-bluefs abd debug-bdev = 20 bluefs sync write = false bluefs buffered io = false Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did : First, as other osd were falling down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid other crashs ! Then we moved to your recommandations (still testing on osd 5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev
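For completeness, reactivating recovery/backfilling/rebalancing is just the reverse of the flags set earlier in this thread, followed by watching progress:

  ceph osd unset pause
  ceph osd unset nobackfill
  ceph osd unset norebalance
  ceph osd unset norecover

  ceph -s        # overall state
  ceph -w        # follow recovery progress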
[ceph-users] Re: osd crashing and rocksdb corruption
Hello, We set bluefs_buffered_io to false for the whole cluster except 2 osd (21 and 49) for which we decided to keep the value to true for future experiments/troubleshooting as you asked. We then restarted all the 25 downs osd and they started... except one (number 8) which still continue to crash with the same kind of errors. I tryed a fsck on this osd which ended by a success. I set the debug to 20 and recorded the logs. You will find the logs there if you want to have a look : https://we.tl/t-GDvvvi2Gmm Now we plan to reactivate the recovery, backfill and rebalancing if you think it's safe. F. Le 29/04/2020 à 16:45, Igor Fedotov a écrit : So the crash seems to be caused by the same issue - big (and presumably incomplete) write and subsequent read failure. I've managed to repro this locally. So bluefs_buffered_io seems to be a remedy for now. But additionally I can observe multiple slow ops indications in this new log and I think they cause those big writes. And I presume some RGW-triggered ops are in flight - bucket resharding or removal or something.. I've seen multiple reports about this stuff causing OSD slowdown, high memory utilization and finally huge reads and/or writes from RocksDB. Don't know how to deal with this at the moment... Thanks, Igor On 4/29/2020 5:33 PM, Francois Legrand wrote: Here are the logs of the newly crashed osd. F. Le 29/04/2020 à 16:21, Igor Fedotov a écrit : Sounds interesting - could you please share the crash log for these new OSDs? They presumably suffer from another issue. At least that first crash is caused by something else. "bluefs buffered io" can be injected on the fly but I expect it to help when OSD isn't starting up only. On 4/29/2020 5:17 PM, Francois Legrand wrote: Ok we will try that. Indeed, restarting osd.5 triggered the falling down of two other osds in the cluster. Thus we will set bluefs buffered io = false for all osds and force bluefs buffered io = true for one of the downs osds. Is that modification needs to use injectargs or changing it in the configuration is enougth to have it applied on the fly ? F. Le 29/04/2020 à 15:56, Igor Fedotov a écrit : That's bluefs buffered io = false which did the trick. It modified write path and this presumably has fixed large write(s). Trying to reproduce locally but please preserve at least one failing OSD (i.e. do not start it with the disabled buffered io) for future experiments/troubleshooting for a while if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with theses options. The osd is now up since 10mn without crashing (before it was rebooting after ~1mn). F. Le 29/04/2020 à 15:16, Igor Fedotov a écrit : Hi Francois, I'll write a more thorough response a bit later. Meanwhile could you please try OSD startup with the following settings now: debug-bluefs abd debug-bdev = 20 bluefs sync write = false bluefs buffered io = false Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did : First, as other osd were falling down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid other crashs ! 
Then we moved to your recommandations (still testing on osd 5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran : ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 it ended with fsck success It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to : [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the osd. It crashed ! We tryed to change explicitely bluefs sync write = false and restarted... same result ! The logs are here : https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem ? Do you thing that removing this pool can help ? Thanks again for your expertise. F. Le 28/04/2020 à 18:52, Igor Fedotov a écrit : Short update - please treat bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francious, here are some observations got from your log. 1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 986351839 0377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e8
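To answer the injectargs question concretely, both of the following work for bluefs_buffered_io, with Igor's caveat above that for an OSD which crashes at startup the value has to be in place before it starts (so injection into an already running OSD will not help there):

  # Persistent: ceph.conf under [osd] (or [osd.21] for a per-OSD override),
  # or the config database on Nautilus:
  ceph config set osd bluefs_buffered_io false
  ceph config set osd.21 bluefs_buffered_io true

  # On-the-fly injection into running OSDs:
  ceph tell osd.* injectargs '--bluefs_buffered_io=false'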
[ceph-users] Re: osd crashing and rocksdb corruption
Here are the logs of the newly crashed OSD. F. On 29/04/2020 at 16:21, Igor Fedotov wrote: Sounds interesting - could you please share the crash log for these new OSDs? They presumably suffer from another issue. At least that first crash is caused by something else. "bluefs buffered io" can be injected on the fly, but I expect it to help only when the OSD isn't starting up. On 4/29/2020 5:17 PM, Francois Legrand wrote: Ok, we will try that. Indeed, restarting osd.5 triggered two other OSDs in the cluster to go down. Thus we will set bluefs buffered io = false for all OSDs and force bluefs buffered io = true for one of the down OSDs. Does that modification need to use injectargs, or is changing it in the configuration enough to have it applied on the fly? F. On 29/04/2020 at 15:56, Igor Fedotov wrote: That's the bluefs buffered io = false setting which did the trick. It modified the write path and this presumably has fixed the large write(s). Trying to reproduce locally, but please preserve at least one failing OSD (i.e. do not start it with buffered io disabled) for future experiments/troubleshooting for a while, if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with these options. The OSD has now been up for 10 min without crashing (before, it was restarting after ~1 min). F. On 29/04/2020 at 15:16, Igor Fedotov wrote: Hi Francois, I'll write a more thorough response a bit later. Meanwhile, could you please try OSD startup with the following settings now: debug-bluefs and debug-bdev = 20, bluefs sync write = false, bluefs buffered io = false. Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did: First, as other OSDs were going down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid further crashes! Then we moved on to your recommendations (still testing on osd.5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran: ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 It ended with fsck success. It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to: [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the OSD. It crashed! We then tried to explicitly set bluefs sync write = false and restarted... same result! The logs are here: https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem? Do you think that removing this pool could help? Thanks again for your expertise. F. On 28/04/2020 at 18:52, Igor Fedotov wrote: Short update - please use the bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported, in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francois, here are some observations from your log.
1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) 3) and failed read happens to the end (0xc496f8e4~35, last 0x35 bytes) of this huge (3+GB) file: -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35 -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53 4) This .sst file was created from scratch shortly before with a single-shot 3+GB write: -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1 -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) 5) Presumably Rocks
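As a quick sanity check of the sizes quoted in this log (plain shell arithmetic, nothing Ceph-specific), the hex file size does put the .sst file just above the 2 GiB (2^31) boundary that large-I/O bugs typically trip over:

printf '%d\n' 0xc496f919                       # 3298228505 bytes
echo $(( 0xc496f919 / 1024 / 1024 / 1024 ))    # 3  -> roughly 3.07 GiB
echo $(( 0xc496f919 > 2**31 ))                 # 1  -> larger than 2 GiB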
[ceph-users] Re: osd crashing and rocksdb corruption
Ok, we will try that. Indeed, restarting osd.5 triggered two other OSDs in the cluster to go down. Thus we will set bluefs buffered io = false for all OSDs and force bluefs buffered io = true for one of the down OSDs. Does that modification need to use injectargs, or is changing it in the configuration enough to have it applied on the fly? F. On 29/04/2020 at 15:56, Igor Fedotov wrote: That's the bluefs buffered io = false setting which did the trick. It modified the write path and this presumably has fixed the large write(s). Trying to reproduce locally, but please preserve at least one failing OSD (i.e. do not start it with buffered io disabled) for future experiments/troubleshooting for a while, if possible. Thanks, Igor On 4/29/2020 4:50 PM, Francois Legrand wrote: Hi, It seems much better with these options. The OSD has now been up for 10 min without crashing (before, it was restarting after ~1 min). F. On 29/04/2020 at 15:16, Igor Fedotov wrote: Hi Francois, I'll write a more thorough response a bit later. Meanwhile, could you please try OSD startup with the following settings now: debug-bluefs and debug-bdev = 20, bluefs sync write = false, bluefs buffered io = false. Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did: First, as other OSDs were going down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid further crashes! Then we moved on to your recommendations (still testing on osd.5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran: ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 It ended with fsck success. It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to: [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the OSD. It crashed! We then tried to explicitly set bluefs sync write = false and restarted... same result! The logs are here: https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem? Do you think that removing this pool could help? Thanks again for your expertise. F. On 28/04/2020 at 18:52, Igor Fedotov wrote: Short update - please use the bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported, in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francois, here are some observations from your log.
1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) 3) and failed read happens to the end (0xc496f8e4~35, last 0x35 bytes) of this huge (3+GB) file: -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35 -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53 4) This .sst file was created from scratch shortly before with a single-shot 3+GB write: -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1 -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) 5) Presumably RocksDB creates this file in an attempt to recover/compact/process another existing file (ino 52405) which is pretty large as well. Please find multiple earlier reads, the last one: -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c679 extents [1:0x381c822~1,1: The rationale for binding these two
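Regarding the injectargs question quoted above, both paths exist; a hedged sketch only, with osd.5 as the example id (whether a runtime injection of bluefs_buffered_io is actually honoured without a restart is exactly the point Igor is cautious about):

# runtime injection into a running OSD (may have no effect for options
# that are only read at startup)
ceph tell osd.5 injectargs '--bluefs_buffered_io=false'

# persistent setting: either under [osd] / [osd.5] in /etc/ceph/ceph.conf,
# or via the centralized config store, followed by a daemon restart
ceph config set osd bluefs_buffered_io false
systemctl restart ceph-osd@5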
[ceph-users] Re: osd crashing and rocksdb corruption
Hi, It seems much better with these options. The OSD has now been up for 10 min without crashing (before, it was restarting after ~1 min). F. On 29/04/2020 at 15:16, Igor Fedotov wrote: Hi Francois, I'll write a more thorough response a bit later. Meanwhile, could you please try OSD startup with the following settings now: debug-bluefs and debug-bdev = 20, bluefs sync write = false, bluefs buffered io = false. Thanks, Igor On 4/29/2020 3:35 PM, Francois Legrand wrote: Hi Igor, Here is what we did: First, as other OSDs were going down, we stopped all operations with ceph osd set norecover ceph osd set norebalance ceph osd set nobackfill ceph osd set pause to avoid further crashes! Then we moved on to your recommendations (still testing on osd.5): in /etc/ceph/ceph.conf we added: [osd.5] debug bluefs = 20 debug bdev = 20 We ran: ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-5 -l /var/log/ceph/bluestore-tool-fsck-osd-5.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-5.out 2>&1 It ended with fsck success. It seems that the default value for bluefs sync write is false (https://github.com/ceph/ceph/blob/v14.2.8/src/common/options.cc), thus we changed /etc/ceph/ceph.conf to: [osd.5] debug bluefs = 20 debug bdev = 20 bluefs sync write = true and restarted the OSD. It crashed! We then tried to explicitly set bluefs sync write = false and restarted... same result! The logs are here: https://we.tl/t-HMiFDu22XH Moreover, we have a rados gateway pool with hundreds of 4GB files. Can this be the origin of the problem? Do you think that removing this pool could help? Thanks again for your expertise. F. On 28/04/2020 at 18:52, Igor Fedotov wrote: Short update - please use the bluefs_sync_write parameter instead of bdev-aio. Changing the latter isn't supported, in fact. On 4/28/2020 7:35 PM, Igor Fedotov wrote: Francois, here are some observations from your log.
1) Rocksdb reports error on the following .sst file: -35> 2020-04-28 15:23:47.612 7f4856e82a80 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 12950032858166034944 in db/068269.sst 2) which relates to BlueFS ino 53361: -50> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read db/068269.sst (random) -49> 2020-04-28 15:23:45.103 7f4856e82a80 10 bluefs open_for_read h 0x557914fb80b0 on file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) 3) and failed read happens to the end (0xc496f8e4~35, last 0x35 bytes) of this huge (3+GB) file: -44> 2020-04-28 15:23:47.514 7f4856e82a80 10 bluefs _read_random h 0x557914fb80b0 0xc496f8e4~35 from file(ino 53361 size 0xc496f919 mtime 2020-04-28 15:23:39.827515 bdev 1 allocated c497 extents [1:0x383db28~c497]) -43> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random left 0x71c 0xc496f8e4~35 -42> 2020-04-28 15:23:47.514 7f4856e82a80 20 bluefs _read_random got 53 4) This .sst file was created from scratch shortly before with a single-shot 3+GB write: -88> 2020-04-28 15:23:35.661 7f4856e82a80 10 bluefs open_for_write db/068269.sst -87> 2020-04-28 15:23:35.661 7f4856e82a80 20 bluefs open_for_write mapping db/068269.sst to bdev 1 -86> 2020-04-28 15:23:35.662 7f4856e82a80 10 bluefs open_for_write h 0x5579145e7a40 on file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) -85> 2020-04-28 15:23:39.826 7f4856e82a80 10 bluefs _flush 0x5579145e7a40 0x0~c496f919 to file(ino 53361 size 0x0 mtime 2020-04-28 15:23:35.663142 bdev 1 allocated 0 extents []) 5) Presumably RocksDB creates this file in an attempt to recover/compact/process another existing file (ino 52405) which is pretty large as well. Please find multiple earlier reads, the last one: -92> 2020-04-28 15:23:29.857 7f4856e82a80 10 bluefs _read h 0x5579147286e0 0xc6788000~8000 from file(ino 52405 size 0xc67888a0 mtime 2020-04-25 13:34:55.325699 bdev 0 allocated c679 extents [1:0x381c822~1,1: The rationale for binding these two files is their pretty uncommon file sizes. So you have a 3+GB single-shot BlueFS write and an immediate read from the end of the written extent which returns an unexpected magic. It's well known in the software world that large (2+GB) data processing implementations tend to be error-prone. And Ceph is not an exception. Here are a couple of recent examples which are pretty close to your case: https://github.com/ceph/ceph/commit/4d33114a40d5ae0d541c36175977ca22789a3b88 https://github.com/ceph/ceph/commit/207806abaa91259d9bb726c2555e7e21ac541527 Although they are already fixed in Nautilus 14.2.8, there might be others present along the write path (including H/W firmware). The good news is that the failure happens on a newly written file (remember the invalid magic is read at the end(
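Since the two commits above are said to be included in Nautilus 14.2.8, a quick way to confirm what the cluster is actually running (standard commands, shown only as a sanity check):

# per-component version summary for the whole cluster
ceph versions
# version of one specific daemon
ceph tell osd.5 version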
[ceph-users] Re: osd crashing and rocksdb corruption
re - this will go via a bit different write path and may provide a workaround. Also please collect debug logs for OSD startup (with both the current and the updated bdev-aio parameter) and --debug-bdev/debug-bluefs set to 20. You can omit the --debug-bluestore increase for now to reduce log size. Thanks, Igor On 4/28/2020 5:16 PM, Francois Legrand wrote: Here is the output of ceph-bluestore-tool bluefs-bdev-sizes: inferring bluefs devices from bluestore path slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17 1 : device size 0x746c000 : own 0x[37e1eb0~4a8290] = 0x4a8290 : using 0x5bc78(23 GiB) The result of the debug-bluestore (and debug-bluefs) set to 20 for osd.5 is at the following address (28MB): https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f20200428141355/5da2ebf0d33750a5fde85bf662cf0e6d20200428141415/55849f?utm_campaign=WT_email_tracking&utm_content=general&utm_medium=download_button&utm_source=notify_recipient_email Thanks for your help. F. On 28/04/2020 at 13:33, Igor Fedotov wrote: Hi Francois, Could you please share an OSD startup log with debug-bluestore (and debug-bluefs) set to 20. Also please run ceph-bluestore-tool's bluefs-bdev-sizes command and share the output. Thanks, Igor On 4/28/2020 12:55 AM, Francois Legrand wrote: Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday 25, when some OSDs in the 5 remaining servers started to crash with apparently no reason. We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100). Looking at the logs, we found for all the crashed OSDs: -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was present a few days before the crash. We also have some OSDs with this error that are still up. We tried to repair with: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea how to fix this, or at least know if it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors? Thanks for your help (we are desperate because we will lose data and are fighting to save something)!!! F.
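One possible way to collect such a startup log is to raise the debug levels for just that OSD and restart it, or to run the daemon once in the foreground. A sketch only, with osd.5 as the example and the command-line form of the debug options being an assumption about the deployment:

# persistent debug levels for this OSD only, then restart and read the log
ceph config set osd.5 debug_bluefs 20
ceph config set osd.5 debug_bdev 20
systemctl restart ceph-osd@5
less /var/log/ceph/ceph-osd.5.log

# alternatively, run it once in the foreground with the options inline
ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph \
    --debug-bluefs=20 --debug-bdev=20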
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osd crashing and rocksdb corruption
Here is the output of ceph-bluestore-tool bluefs-bdev-sizes: inferring bluefs devices from bluestore path slot 1 /var/lib/ceph/osd/ceph-5/block -> /dev/dm-17 1 : device size 0x746c000 : own 0x[37e1eb0~4a8290] = 0x4a8290 : using 0x5bc78(23 GiB) The result of the debug-bluestore (and debug-bluefs) set to 20 for osd.5 is at the following address (28MB): https://wetransfer.com/downloads/a193ab15ab5e2395fe2462c963507a7f20200428141355/5da2ebf0d33750a5fde85bf662cf0e6d20200428141415/55849f?utm_campaign=WT_email_tracking&utm_content=general&utm_medium=download_button&utm_source=notify_recipient_email Thanks for your help. F. On 28/04/2020 at 13:33, Igor Fedotov wrote: Hi Francois, Could you please share an OSD startup log with debug-bluestore (and debug-bluefs) set to 20. Also please run ceph-bluestore-tool's bluefs-bdev-sizes command and share the output. Thanks, Igor On 4/28/2020 12:55 AM, Francois Legrand wrote: Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday 25, when some OSDs in the 5 remaining servers started to crash with apparently no reason. We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100). Looking at the logs, we found for all the crashed OSDs: -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was present a few days before the crash. We also have some OSDs with this error that are still up. We tried to repair with: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea how to fix this, or at least know if it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors? Thanks for your help (we are desperate because we will lose data and are fighting to save something)!!! F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] osd crashing and rocksdb corruption
Hi all, *** Short version *** Is there a way to repair a rocksdb from the errors "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db"? *** Long version *** We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers + 4 mons/mgr + 3 mds). We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a rebalancing of some data. Two days later (Wednesday 22) we had a very short power outage. Only one of the osd servers went down (and unfortunately died). This triggered a reconstruction of the lost OSDs. Operations went fine until Saturday 25, when some OSDs in the 5 remaining servers started to crash with apparently no reason. We tried to restart them, but they crashed again. We ended up with 18 OSDs down (+ 16 in the dead server, so 34 OSDs down out of 100). Looking at the logs, we found for all the crashed OSDs: -237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb: [table/block_based_table_reader.cc:1117] Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch: expected 0, got 2729370997 in db/181355.sst offset 18446744073709551615 size 18446744073709551615 and 2020-04-25 16:05:47.251 7fcbd1e46a80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db: We also noticed that the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" error was present a few days before the crash. We also have some OSDs with this error that are still up. We tried to repair with: ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair But no success (it ends with _open_db erroring opening db). Thus, does somebody have an idea how to fix this, or at least know if it's possible to repair and correct the "Encountered error while reading data from compression dictionary block Corruption: block checksum mismatch" and "_open_db erroring opening db" errors? Thanks for your help (we are desperate because we will lose data and are fighting to save something)!!! F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
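For completeness, the BlueStore-level fsck/repair entry points that exist alongside ceph-kvstore-tool, run with the OSD stopped. They check and repair BlueStore/BlueFS metadata and will not necessarily recover a corrupted RocksDB .sst, so treat them as diagnostics rather than a guaranteed fix; the exact form of the --deep flag should be checked against the tool's help on your version:

systemctl stop ceph-osd@3
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3
ceph-bluestore-tool fsck --deep true --path /var/lib/ceph/osd/ceph-3
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3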
[ceph-users] Re: Changing failure domain
I don't want to remove the cephfs_meta pool, but the cephfs_datapool. To be clear: I currently have a cephfs consisting of a cephfs_metapool and a cephfs_datapool. I want to add a new data pool, cephfs_datapool2, migrate all data from cephfs_datapool to cephfs_datapool2, and then remove the original cephfs_datapool. My goal is to end up with a cephfs made of cephfs_meta and cephfs_datapool2 (i.e. replace the original cephfs_datapool by cephfs_datapool2). But from what I've seen, some "metadata" should also remain in the cephfs_datapool (it sounds weird to me), which would persist after moving the objects and prevent its deletion. F. On 14/01/2020 at 07:54, Konstantin Shalygin wrote: On 1/6/20 5:50 PM, Francois Legrand wrote: I still have a few questions before going on. It seems that some metadata would remain on the original data pool, preventing its deletion (http://ceph.com/geen-categorie/ceph-pool-migration/ and https://www.spinics.net/lists/ceph-users/msg41374.html). Thus, does doing a cp and then an rm of the original files (instead of mv) allow getting rid of the remaining metadata in the original data pool? Is it then possible to remove the original pool after migration (and how, because I guess that I first have to set the default data location to the new pool)? How are snapshots affected (do I have to remove all of them before the operation)? Why do you need to remove the cephfs_meta pool? k ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
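Assuming the copy to cephfs_datapool2 is complete, checking what still references the old pool and then detaching it could look roughly like this (the filesystem name is a placeholder; note that CephFS refuses to remove a filesystem's default/first data pool, which is exactly the limitation being discussed in this thread):

# how many objects are still left in the old data pool?
rados df | grep cephfs_datapool
# which pools does the filesystem currently use?
ceph fs ls
# detach the old data pool from the filesystem, then delete it
ceph fs rm_data_pool <fs_name> cephfs_datapool
ceph osd pool rm cephfs_datapool cephfs_datapool --yes-i-really-really-mean-it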
[ceph-users] Re: Changing failure domain
Thanks again for your answer. I still have a few questions before going on. It seems that some metadata would remain on the original data pool, preventing its deletion (http://ceph.com/geen-categorie/ceph-pool-migration/ and https://www.spinics.net/lists/ceph-users/msg41374.html). Thus, does doing a cp and then an rm of the original files (instead of mv) allow getting rid of the remaining metadata in the original data pool? Is it then possible to remove the original pool after migration (and how, because I guess that I first have to set the default data location to the new pool)? How are snapshots affected (do I have to remove all of them before the operation)? Happy new year. F. On 24/12/2019 at 03:53, Konstantin Shalygin wrote: On 12/19/19 10:22 PM, Francois Legrand wrote: Thus my question is *how can I migrate an EC data pool of a cephfs to another EC pool?* I suggest this: # create your new ec pool # `ceph osd pool application enable ec_new cephfs` # `ceph fs add_data_pool cephfs ec_new` # `setfattr -n ceph.dir.layout -v pool=ec_new /cephfs/ec_migration` And then copy your content via userland tools. k ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
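A rough sketch of the copy step itself, following the suggestion quoted above (directory and file names are examples; ec_new stands for the new data pool):

# new directory whose files will be written to the new pool
mkdir /cephfs/ec_migration
setfattr -n ceph.dir.layout.pool -v ec_new /cephfs/ec_migration
# copy the data, then check that a sample file really lives in the new pool
cp -a /cephfs/olddata/. /cephfs/ec_migration/
getfattr -n ceph.file.layout /cephfs/ec_migration/somefile
# remove the originals only after verification
rm -rf /cephfs/olddata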
[ceph-users] Re: Changing failure domain
Thanks for your advice. I thus created a new replicated rule: { "rule_id": 2, "rule_name": "replicated3over2rooms", "ruleset": 2, "type": 1, "min_size": 3, "max_size": 4, "steps": [ { "op": "take", "item": -1, "item_name": "default" }, { "op": "choose_firstn", "num": 0, "type": "room" }, { "op": "chooseleaf_firstn", "num": 2, "type": "host" }, { "op": "emit" } ] } It works well. Now I am concerned about the erasure-coded pool. The point is that it's the data pool for cephfs (the metadata is in replica 3 and now replicated over our two rooms). For now, the data pool for cephfs is in *erasure coding k=3, m=2* (at the creation of the cluster we had only 5 osd servers). As noted before by Paul Emmerich, this cannot be redundantly split over 2 rooms (as 3 chunks are required to reconstruct the data). Now we have 6 OSD servers, and soon it will be 7, thus I was thinking of creating a new pool (e.g. k=4, m=2 or k=3, m=3) and a rule to split the chunks over our 2 rooms, and of using this new pool as a cache tier to softly migrate all the data from the old pool to the new one. But according to https://documentation.suse.com/ses/6/html/ses-all/ceph-pools.html#pool-migrate-cache-tier "You can use the cache tier method to migrate from a replicated pool to either an erasure coded or another replicated pool. Migrating from an erasure coded pool is not supported." Warning: You Cannot Migrate RBD Images and CephFS Exports to an EC Pool You cannot migrate RBD images and CephFS exports from a replicated pool to an EC pool. EC pools can store data but not metadata. The header object of the RBD will fail to be flushed. The same applies for CephFS. Thus my question is *how can I migrate an EC data pool of a cephfs to another EC pool?* Thanks for your advice. F. On 03/12/2019 at 04:07, Konstantin Shalygin wrote: On 12/2/19 5:56 PM, Francois Legrand wrote: For replica, what is the best way to change the crush rule? Is it to create a new replicated rule, and set this rule as the crush ruleset for the pool (something like ceph osd pool set {pool-name} crush_ruleset my_new_rule)? Indeed. Then you can delete/do what you want with the old crush rule. k ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
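For the new EC pool, a k=3/m=3 layout with 3 shards per room is the variant that can actually survive the loss of a room (k=4/m=2 cannot, since 4 shards would be needed but only 3 would remain). A hedged sketch only, with made-up profile/rule/pool names; the room-aware rule still has to be added to the crushmap by hand, and with a room down only k shards remain, so min_size has to be considered before relying on this for availability:

ceph osd erasure-code-profile set ec33room k=3 m=3 crush-failure-domain=host crush-root=default
ceph osd pool create cephfs_datapool2 256 256 erasure ec33room
ceph osd pool application enable cephfs_datapool2 cephfs

# rule to add to the decompiled crushmap: 3 shards in each of the 2 rooms
# rule ec_3p3_2rooms {
#     id 3
#     type erasure
#     min_size 3
#     max_size 6
#     step set_chooseleaf_tries 5
#     step set_choose_tries 100
#     step take default
#     step choose indep 2 type room
#     step chooseleaf indep 3 type host
#     step emit
# }
ceph osd pool set cephfs_datapool2 crush_rule ec_3p3_2rooms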
[ceph-users] Re: Changing failure domain
Thanks. For replica, what is the best way to change the crush rule? Is it to create a new replicated rule, and set this rule as the crush ruleset for the pool (something like ceph osd pool set {pool-name} crush_ruleset my_new_rule)? For erasure coding, I would thus have to change the profile at least to k=3, m=3 (for now I only have 6 osd servers). But if I am correct, this cannot be changed for an existing pool, and I will have to create a new pool and migrate all data from the current one to the new one. Is that correct? F. On 28/11/2019 at 17:51, Paul Emmerich wrote: Use a crush rule like this for replica: 1) root default class XXX 2) choose 2 rooms 3) choose 2 disks That'll get you 4 OSDs in two rooms, and the first 3 of these get data; the fourth will be ignored. That guarantees that losing a room will lose you at most 2 out of 3 copies. This is for disaster recovery only: it'll guarantee durability if you lose a room, but not availability. 3+2 erasure coding cannot be split across two rooms in this way because, well, you need 3 out of 5 shards to survive, so you cannot lose half of them. Paul ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
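The two-level room/host rule Paul describes cannot be created with ceph osd crush rule create-replicated (which only takes a single failure-domain type), so the usual route is to edit the crushmap directly. A sketch of that workflow (file names are arbitrary; the rule body matches the replicated3over2rooms rule shown elsewhere in this thread):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add a rule along these lines to crushmap.txt:
#   rule replicated3over2rooms {
#       id 2
#       type replicated
#       min_size 3
#       max_size 4
#       step take default
#       step choose firstn 2 type room
#       step chooseleaf firstn 2 type host
#       step emit
#   }
crushtool -c crushmap.txt -o crushmap.new
# optional sanity check of the mappings before injecting the new map
crushtool -i crushmap.new --test --rule 2 --num-rep 3 --show-mappings | head
ceph osd setcrushmap -i crushmap.new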
[ceph-users] Changing failure domain
Hi, I have a cephfs in production based on 2 pools (data+metadata). Data is in erasure coding with the profile: crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=3 m=2 plugin=jerasure technique=reed_sol_van w=8 Metadata is in replicated mode with k=3. The crush rules are as follows: [ { "rule_id": 0, "rule_name": "replicated_rule", "ruleset": 0, "type": 1, "min_size": 1, "max_size": 10, "steps": [ { "op": "take", "item": -1, "item_name": "default" }, { "op": "chooseleaf_firstn", "num": 0, "type": "host" }, { "op": "emit" } ] }, { "rule_id": 1, "rule_name": "ec_data", "ruleset": 1, "type": 3, "min_size": 3, "max_size": 5, "steps": [ { "op": "set_chooseleaf_tries", "num": 5 }, { "op": "set_choose_tries", "num": 100 }, { "op": "take", "item": -1, "item_name": "default" }, { "op": "chooseleaf_indep", "num": 0, "type": "host" }, { "op": "emit" } ] } ] When we installed it, everything was in the same room, but now we have split our cluster (6 servers, soon 8) across 2 rooms. Thus we updated the crushmap by adding a room layer (with ceph osd crush add-bucket room1 room etc.) and moved all our servers in the tree to the correct place (ceph osd crush move server1 room=room1 etc.). Now we would like to change the rules to set the failure domain to room instead of host (to be sure that in case of a disaster in one of the rooms we will still have a copy in the other). What is the best strategy to do this? F. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
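Once a room-aware rule exists, pointing a pool at it and verifying the change is a one-liner each. A sketch with assumed names (cephfs_metadata as the metadata pool, and replicated3over2rooms matching the rule shown in the replies above):

# list the rules and check which one the pool currently uses
ceph osd crush rule ls
ceph osd pool get cephfs_metadata crush_rule
# switch the pool to the room-aware rule and watch the resulting data movement
ceph osd pool set cephfs_metadata crush_rule replicated3over2rooms
ceph -s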