[ceph-users] Re: cephadm and remoto package
Hi Shashi,

I just ran into this myself, and I thought I'd share the solution/workaround that I applied.

On 15/05/2023 22:08, Shashi Dahal wrote:
> Hi, I followed this documentation:
> https://docs.ceph.com/en/pacific/cephadm/adoption/
>
> This is the error I get when trying to enable cephadm:
>
> ceph mgr module enable cephadm
> Error ENOENT: module 'cephadm' reports that it cannot run on the active manager daemon: loading remoto library:No module named 'remoto' (pass --force to force enablement)
>
> When I import remoto, it imports just fine. OS is ubuntu 20.04 focal

As far as I can see, this issue applies to non-containerized Ceph Pacific deployments -- such as ones orchestrated with ceph-ansible -- running on Debian or Ubuntu. There is no python3-remoto package on those platforms, so you can't install remoto by "regular" installation means (that is, apt/apt-get).

It looks to me like this issue was introduced in Pacific, and then went away in Quincy because that release dropped remoto and replaced it with asyncssh (for which a Debian/Ubuntu package does exist). If you start out on Octopus with ceph-ansible and do the Cephadm migration *then*, you're apparently fine too, and you can subsequently use Cephadm to upgrade to Pacific and Quincy. I think it's just this particular combination -- (a) run on Debian/Ubuntu, (b) deploy non-containerized, *and* (c) start your deployment on Pacific -- where Cephadm adoption breaks.

The problem has apparently been known for a while (see https://tracker.ceph.com/issues/43415), but the recommendation appears to have been "just run mgr on a different OS then", which is frequently not a viable option.

I tried (like you did, I assume) to just pip-install remoto; if I opened a Python console and typed "import remoto" it imported just fine, but apparently the cephadm mgr module didn't like that.
I've now traced this down to the following line that shows up in the ceph-mgr log if you bump "debug mgr" to 10/10:

2023-06-26T10:01:34.799+ 7fb0979ba500 10 mgr[py] Computed sys.path '/usr/share/ceph/mgr:/local/lib/python3.8/dist-packages:/lib/python3/dist-packages:/lib/python3.8/dist-packages:lib/python38.zip:/lib/python3.8:/lib/python3.8/lib-dynload'

Note the /local/lib/python3.8/dist-packages path, which does not exist on Ubuntu Focal. The correct path would be /usr/local/lib/python3.8/dist-packages, which is where "pip install", when run as root outside a virtualenv, installs packages to. I think the incorrect sys.path may actually be a build or packaging bug in the community packages built for Debian/Ubuntu, but I'm not 100% certain.

At any rate, the combined workaround for this issue, for me, is:

(1) pip install remoto (this installs remoto into /usr/local/lib/python3.8/dist-packages)
(2) ln -s /usr/local/lib/python3.8/dist-packages /local/lib/python3.8/dist-packages (this makes pip-installed packages available to ceph-mgr)
(3) restart all ceph-mgr instances
(4) ceph mgr module enable cephadm

Cheers,
Florian

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
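The underlying mechanism is easy to demonstrate outside of Ceph: Python only finds a package if the directory it was installed into appears on the embedding interpreter's sys.path. Here's a generic sketch (plain Python, not ceph-mgr's actual embedding code; "remoto_demo" is a made-up package name) of why a pip-installed package can import fine in a console yet be invisible to a daemon computing a different sys.path:

```python
import importlib.util
import os
import sys
import tempfile

# Simulate a "pip install" target directory containing a package.
site_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(site_dir, "remoto_demo")  # hypothetical package name
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")

def can_import(name):
    # find_spec consults sys.path, just like a real import would.
    return importlib.util.find_spec(name) is not None

# Like ceph-mgr with the bogus /local/... entry: install dir not on sys.path.
print(can_import("remoto_demo"))   # False

# Like the symlink workaround: make the install dir visible via sys.path.
sys.path.append(site_dir)
importlib.invalidate_caches()
print(can_import("remoto_demo"))   # True
```

The symlink in step (2) above achieves the same effect without touching ceph-mgr: it makes the path ceph-mgr *does* scan point at the directory pip actually populated.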
[ceph-users] Re: A change in Ceph leadership...
On 15/10/2021 17:13, Josh Durgin wrote: Thanks so much Sage, it's difficult to put into words how much you've done over the years. You're always a beacon of the best aspects of open source - kindness, wisdom, transparency, and authenticity. So many folks have learned so much from you, and that's reflected in the vibrant Ceph community around the world. All the best in whatever you do in the future! Josh I wanted to write something very similar but Josh put it perfectly. Seconded. Thank you Sage! Cheers, Florian
[ceph-users] Re: Bogus Entries in RGW Usage Log / Large omap object in rgw.log pool
Hi David,

On 28/10/2019 20:44, David Monschein wrote:
> Hi All,
>
> Running an object storage cluster, originally deployed with Nautilus
> 14.2.1 and now running 14.2.4.
>
> Last week I was alerted to a new warning from my object storage cluster:
>
> [root@ceph1 ~]# ceph health detail
> HEALTH_WARN 1 large omap objects
> LARGE_OMAP_OBJECTS 1 large omap objects
> 1 large objects found in pool 'default.rgw.log'
> Search the cluster log for 'Large omap object found' for more details.
>
> I looked into this and found the object and pool in question
> (default.rgw.log):
>
> [root@ceph1 /var/log/ceph]# grep -R -i 'Large omap object found' .
> ./ceph.log:2019-10-24 12:21:26.984802 osd.194 (osd.194) 715 : cluster
> [WRN] Large omap object found. Object: 5:0fbdcb32:usage::usage.17:head
> Key count: 702330 Size (bytes): 92881228
>
> [root@ceph1 ~]# ceph --format=json pg ls-by-pool default.rgw.log | jq '.[]' |
> egrep '(pgid|num_large_omap_objects)' | grep -v '"num_large_omap_objects":
> 0,' | grep -B1 num_large_omap_objects
> "pgid": "5.70",
> "num_large_omap_objects": 1,
>
> While I was investigating, I noticed an enormous amount of entries in
> the RGW usage log:
>
> [root@ceph ~]# radosgw-admin usage show | grep -c bucket
> 223326
> [...]

I recently ran into a similar issue: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AQNGVY7VJ3K6ZGRSTX3E5XIY7DBNPDHW/

You have 702,330 keys on that omap object, so you would have been bitten by the default for osd_deep_scrub_large_omap_object_key_threshold having been revised down from 2,000,000 to 200,000 in 14.2.3:

https://github.com/ceph/ceph/commit/d8180c57ac9083f414a23fd393497b2784377735
https://tracker.ceph.com/issues/40583

That's why you didn't see this warning before your recent upgrade.

> There are entries for over 223k buckets! This was pretty scary to see,
> considering we only have maybe 500 legitimate buckets in this fairly new
> cluster.
> Almost all of the entries in the usage log are bogus entries
> from anonymous users. It looks like someone/something was scanning,
> looking for vulnerabilities, etc. Here are a few example entries, notice
> none of the operations were successful:

Caveat: whether or not you really *want* to trim the usage log is up to you to decide. If you suspect you are dealing with a security breach, you should definitely export and preserve the usage log before you trim it, or else delay trimming until you have properly investigated the problem.

*If* you decide you no longer need those usage log entries, you can use "radosgw-admin usage trim" with appropriate --start-date, --end-date, and/or --uid options to clean them up:

https://docs.ceph.com/docs/nautilus/radosgw/admin/#trim-usage

Please let me know if that information is helpful. Thank you!

Cheers,
Florian
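If you want to eyeball what you would be preserving or trimming before acting on it, the usage log can be dumped as JSON and filtered. A rough sketch (the JSON shape below is simplified and assumed for illustration, and suspicious_users is a hypothetical helper; check your actual `radosgw-admin usage show` output before relying on it):

```python
import json

# Simplified stand-in for `radosgw-admin usage show` output; the real
# structure carries more fields -- this shape is assumed for illustration.
usage_json = """
{
  "entries": [
    {"user": "anonymous", "buckets": [{"bucket": "admin.php"}, {"bucket": "wp-login"}]},
    {"user": "appuser",   "buckets": [{"bucket": "prod-data"}]}
  ]
}
"""

def suspicious_users(usage, legit_users):
    """Return usage-log users that are not in the known-legitimate set."""
    data = json.loads(usage)
    return [e["user"] for e in data["entries"] if e["user"] not in legit_users]

print(suspicious_users(usage_json, {"appuser"}))  # ['anonymous']
```

Exporting a dump like this (and archiving it) before trimming costs little and keeps your options open if the scanning traffic later turns out to matter.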
[ceph-users] Re: Static website hosting with RGW
On 25/10/2019 02:38, Oliver Freyermuth wrote:
> Also, if there's an expert on this: Exposing a bucket under a tenant as
> static website is not possible since the colon (:) can't be encoded in DNS,
> right?

There are certainly much better-qualified radosgw experts than I am, but as I understand it multi-tenanted radosgw is incompatible with bucket hostnames in general (whether static websites are involved or not), for the very reason you mention. It's documented here:

https://docs.ceph.com/docs/nautilus/radosgw/multitenancy/#accessing-buckets-with-explicit-tenants

(look for "Note that it's not possible to supply an explicit tenant using a hostname").

Cheers,
Florian
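For completeness, the reason the colon is the deal-breaker: DNS hostname labels only permit letters, digits, and interior hyphens (the RFC 1123 rules), so a tenant:bucket name can never be expressed as a hostname label. A quick illustrative check (generic validation sketch, not radosgw code):

```python
import re

# RFC 1123 label: alphanumeric, hyphens allowed in the interior, 1-63 chars.
LABEL_RE = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$", re.IGNORECASE)

def valid_dns_label(label):
    """True if the string is a syntactically valid DNS hostname label."""
    return bool(LABEL_RE.match(label))

print(valid_dns_label("mybucket"))         # True
print(valid_dns_label("tenant:mybucket"))  # False: ':' is not legal in DNS
```

This is why tenant-qualified buckets remain reachable only via the path-style URL form, where the colon can be URL-encoded.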
[ceph-users] Re: Recurring issue: PG is inconsistent, but lists no inconsistent objects
On 14/10/2019 22:57, Reed Dier wrote:
> I had something slightly similar to you.
>
> However, my issue was specific/limited to the device_health_metrics pool
> that is auto-created with 1 PG when you turn that mgr feature on.
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56315.html

Thank you — yes that does look superficially similar, though in my case it's an RGW pool. (Also, my sympathy on the OSD crashes; that must have been quite the jolt.)

However, the similarities unfortunately end where the pg repair fixes things for you. For me, the scrub error keeps coming back. It's quite odd.

Cheers,
Florian
[ceph-users] Re: Recurring issue: PG is inconsistent, but lists no inconsistent objects
On 14/10/2019 17:21, Dan van der Ster wrote:
>> I'd appreciate a link to more information if you have one, but a PG
>> autoscaling problem wouldn't really match with the issue already
>> appearing in pre-Nautilus releases. :)
>
> https://github.com/ceph/ceph/pull/30479

Thanks! But no, this doesn't look like a likely culprit, for the reason that we also saw this in Luminous and hence, *definitely* without splits or merges in play.

Has anyone else seen these scrub false positives — if that's what they are?

Cheers,
Florian
[ceph-users] Re: Recurring issue: PG is inconsistent, but lists no inconsistent objects
On 14/10/2019 13:29, Dan van der Ster wrote:
>> Hi Dan,
>>
>> what's in the log is (as far as I can see) consistent with the pg query
>> output:
>>
>> 2019-10-14 08:33:57.345 7f1808fb3700 0 log_channel(cluster) log [DBG] :
>> 10.10d scrub starts
>> 2019-10-14 08:33:57.345 7f1808fb3700 -1 log_channel(cluster) log [ERR] :
>> 10.10d scrub : stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty,
>> 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/11 bytes,
>> 0/0 manifest objects, 0/0 hit_set_archive bytes.
>> 2019-10-14 08:33:57.345 7f1808fb3700 -1 log_channel(cluster) log [ERR] :
>> 10.10d scrub 1 errors
>>
>> Have you seen this before?
>
> Yes occasionally we see stat mismatches -- repair always fixes
> definitively though.

Not here, sadly. That error keeps coming back, always in the same PG, and only in that PG.

> Are you using PG autoscaling? There's a known issue there which
> generates stat mismatches.

I'd appreciate a link to more information if you have one, but a PG autoscaling problem wouldn't really match with the issue already appearing in pre-Nautilus releases. :)

Cheers,
Florian
[ceph-users] Re: Recurring issue: PG is inconsistent, but lists no inconsistent objects
On 14/10/2019 13:20, Dan van der Ster wrote:
> Hey Florian,
>
> What does the ceph.log ERR or ceph-osd log show for this inconsistency?
>
> -- Dan

Hi Dan,

what's in the log is (as far as I can see) consistent with the pg query output:

2019-10-14 08:33:57.345 7f1808fb3700 0 log_channel(cluster) log [DBG] : 10.10d scrub starts
2019-10-14 08:33:57.345 7f1808fb3700 -1 log_channel(cluster) log [ERR] : 10.10d scrub : stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/11 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-14 08:33:57.345 7f1808fb3700 -1 log_channel(cluster) log [ERR] : 10.10d scrub 1 errors

Have you seen this before?

Cheers,
Florian
[ceph-users] Recurring issue: PG is inconsistent, but lists no inconsistent objects
Hello,

I am running into an "interesting" issue with a PG that is being flagged as inconsistent during scrub (causing the cluster to go to HEALTH_ERR), but doesn't actually appear to contain any inconsistent objects.

$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 10.10d is active+clean+inconsistent, acting [15,13]

$ rados list-inconsistent-obj 10.10d
{"epoch":12138,"inconsistents":[]}

"ceph pg query" (see below) on that PG does report num_scrub_errors=1, num_shallow_scrub_errors=1, and num_objects_dirty=1.

"osd scrub auto repair = true" is set on all OSDs, but the PG never auto-repairs. (This is a test cluster, the pool size is 2 — this may preclude auto repair from ever kicking in; I'm not sure on that one.) "ceph pg repair" does repair, but the issue reappears on the next scheduled scrub.

This issue was first discovered while the cluster was on Jewel/Filestore. In an event like this I would normally suspect either a problem with an individual OSD, or a bug in the FileStore code. But the cluster has had *all* of its OSDs replaced since, as part of a full Jewel→Luminous→Nautilus upgrade and a FileStore→BlueStore conversion. The issue still persists.

A full "ceph pg 10.10d query" result is below. If anyone has ideas on how to permanently fix this issue, I'd be most grateful. Thanks!
Cheers, Florian { "state": "active+clean+inconsistent", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 12143, "up": [ 15, 13 ], "acting": [ 15, 13 ], "acting_recovery_backfill": [ "13", "15" ], "info": { "pgid": "10.10d", "last_update": "100'11", "last_complete": "100'11", "log_tail": "0'0", "last_user_version": 11, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 45, "epoch_pool_created": 45, "last_epoch_started": 12139, "last_interval_started": 12138, "last_epoch_clean": 12139, "last_interval_clean": 12138, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 12138, "same_interval_since": 12138, "same_primary_since": 12114, "last_scrub": "100'11", "last_scrub_stamp": "2019-10-14 08:33:57.347097", "last_deep_scrub": "100'11", "last_deep_scrub_stamp": "2019-10-11 14:09:29.016946", "last_clean_scrub_stamp": "2019-10-11 14:09:29.016946" }, "stats": { "version": "100'11", "reported_seq": "4927", "reported_epoch": "12143", "state": "active+clean+inconsistent", "last_fresh": "2019-10-14 08:33:57.347147", "last_change": "2019-10-14 08:33:57.347147", "last_active": "2019-10-14 08:33:57.347147", "last_peered": "2019-10-14 08:33:57.347147", "last_clean": "2019-10-14 08:33:57.347147", "last_became_active": "2019-10-11 14:44:09.312226", "last_became_peered": "2019-10-11 14:44:09.312226", "last_unstale": "2019-10-14 08:33:57.347147", "last_undegraded": "2019-10-14 08:33:57.347147", "last_fullsized": "2019-10-14 08:33:57.347147", "mapping_epoch": 12138, "log_start": "0'0", "ondisk_log_start": "0'0", "created": 45, "last_epoch_clean": 12139, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "100'11", "last_scrub_stamp": "2019-10-14 08:33:57.347097", "last_deep_scrub": "100'11", "last_deep_scrub_stamp": "2019-10-11 14:09:29.016946", "last_clean_scrub_stamp": "2019-10-11 14:09:29.016946", "log_size": 11, "ondisk_log_size": 11, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": 
false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": true, "pin_stats_invalid": true, "manifest_stats_invalid": true, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 11, "num_objects": 1, "num_object_clones": 0, "num_object_copies": 2, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 1, "num_whiteouts": 0, "num_read": 33, "num_read_kb": 22, "num_write": 11,
[ceph-users] Re: Large omap objects in radosgw .usage pool: is there a way to reshard the rgw usage log?
On 09/10/2019 09:07, Florian Haas wrote:
> Also, is anyone aware of any adverse side effects of increasing these
> thresholds, and/or changing the usage log sharding settings, that I
> should keep in mind here?

Sorry, I should have checked the latest in the list archives; Paul Emmerich has just recently commented here on the threshold setting:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-October/037087.html

So that one looks OK to bump, but the question about resharding the usage log still stands. (The untrimmed usage log, in my case, would have blasted the old 2M keys threshold, too.)

Cheers,
Florian
[ceph-users] Large omap objects in radosgw .usage pool: is there a way to reshard the rgw usage log?
Hi,

I am currently dealing with a cluster that's been in use for 5 years and during that time, has never had its radosgw usage log trimmed. Now that the cluster has been upgraded to Nautilus (and has completed a full deep-scrub), it is in a permanent state of HEALTH_WARN because of one large omap object:

$ ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool '.usage'

As far as I can tell, there are two thresholds that can trigger that warning:

* The default omap object size warning threshold, osd_deep_scrub_large_omap_object_value_sum_threshold, is 1G.
* The default omap object key count warning threshold, osd_deep_scrub_large_omap_object_key_threshold, is 200,000.

In this case, this was the original situation:

osd.6 [WRN] : Large omap object found. Object: 15:169282cd:::usage.20:head Key count: 5834118 Size (bytes): 917351868

So that's 5.8M keys (way above threshold) and 875 MiB total object size (below threshold, but not by much). The usage log in this case was no longer needed that far back, so I trimmed it to keep only the entries from this year (radosgw-admin usage trim --end-date 2018-12-31), a process that took upward of an hour.

After the trim (and a deep-scrub of the PG in question¹), my situation looks like this:

osd.6 [WRN] Large omap object found. Object: 15:169282cd:::usage.20:head Key count: 1185694 Size (bytes): 187061564

So both the key count and the total object size have diminished by about 80%, which is about what you expect when you trim 5 years of usage log down to 1 year of usage log. However, my key count is still almost 6 times the threshold.

I am aware that I can silence the warning by increasing osd_deep_scrub_large_omap_object_key_threshold by a factor of 10, but that's not my question. My question is what I can do to prevent the usage log from creating such large omap objects in the first place.
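In other words, the deep-scrub check boils down to two independent comparisons per omap object, and the numbers above trip the key-count leg both before and after trimming. A sketch using the default thresholds discussed here (illustrative only, not the actual OSD code):

```python
# Default thresholds of the deep-scrub large-omap check, as discussed
# above (illustrative; not the actual OSD implementation).
KEY_THRESHOLD = 200_000        # osd_deep_scrub_large_omap_object_key_threshold
SIZE_THRESHOLD = 1 * 1024**3   # osd_deep_scrub_large_omap_object_value_sum_threshold (1 GiB)

def is_large_omap(key_count, size_bytes):
    """An object is flagged if EITHER threshold is exceeded."""
    return key_count > KEY_THRESHOLD or size_bytes > SIZE_THRESHOLD

# Before trimming: 5.8M keys, ~875 MiB -> flagged on key count alone.
print(is_large_omap(5_834_118, 917_351_868))   # True
# After trimming: still ~1.19M keys -> still flagged.
print(is_large_omap(1_185_694, 187_061_564))   # True
```

This is also why trimming 80% of the entries wasn't enough to clear the warning: the key count would have to drop below 200,000, roughly another factor of six.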
Now, there's something else that you should know about this radosgw, which is that it is configured with the defaults for usage log sharding:

rgw_usage_max_shards = 32
rgw_usage_max_user_shards = 1

... and this cluster's radosgw is pretty much being used by a single application user. So the fact that it's happy to shard the usage log 32 ways is irrelevant as long as it puts the usage log for one user all into one shard.

So, I am assuming that if I bump rgw_usage_max_user_shards up to, say, 16 or 32, all *new* usage log entries will be sharded. But I am not aware of any way to reshard the *existing* usage log. Is there such a thing?

Otherwise, it seems like the only option in this situation would be to clear the usage log altogether, and tweak the sharding knobs, which should at least make the problem not reappear. Or, else, bump osd_deep_scrub_large_omap_object_key_threshold and just live with the large object.

Also, is anyone aware of any adverse side effects of increasing these thresholds, and/or changing the usage log sharding settings, that I should keep in mind here?

Thanks in advance for your thoughts.

Cheers,
Florian

¹For anyone reading this in the archives because they've run into the same problem, and wondering how you find out which PGs in a pool have too-large objects, here's a jq one-liner (substitute your pool name):

ceph --format=json pg ls-by-pool <pool> \
 | jq '.pg_stats[]|select(.stat_sum.num_large_omap_objects>0)'
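To illustrate the single-user problem with a toy model (RGW's real shard hashing differs, so treat usage_shard below as entirely hypothetical): with rgw_usage_max_user_shards = 1, every entry for a given user lands on one shard object, no matter how many total shards exist.

```python
import hashlib

def usage_shard(uid, bucket, user_shards, max_shards=32):
    """Toy shard selector: with user_shards == 1, every entry for a given
    user hashes to the same shard; raising user_shards spreads a single
    user's entries out. (Illustrative only; RGW's real hashing differs.)"""
    base = int(hashlib.md5(uid.encode()).hexdigest(), 16) % max_shards
    if user_shards <= 1:
        return base
    sub = int(hashlib.md5(f"{uid}/{bucket}".encode()).hexdigest(), 16) % user_shards
    return (base + sub) % max_shards

buckets = [f"bucket-{i}" for i in range(1000)]
one = {usage_shard("appuser", b, user_shards=1) for b in buckets}
many = {usage_shard("appuser", b, user_shards=16) for b in buckets}
print(len(one))        # 1 -> everything for this user piles onto one shard
print(len(many) > 1)   # True -> entries spread across multiple shards
```

This is exactly why a 32-way-sharded usage log still grows one huge omap object when a single application user generates nearly all the traffic.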
[ceph-users] Re: Heavily-linked lists.ceph.com pipermail archive now appears to lead to 404s
On 03/09/2019 18:42, Ilya Dryomov wrote:
> On Tue, Sep 3, 2019 at 6:29 PM Florian Haas wrote:
>>
>> Hi,
>>
>> replying to my own message here in a shameless attempt to re-up this. I
>> really hope that the list archive can be resurrected in one way or
>> another...
>
> Adding David, who managed the transition.
>
> Thanks,
>
> Ilya

It looks like the archives are available again at the original location. Thank you, this will help a lot of people!

Cheers,
Florian
[ceph-users] Re: Heavily-linked lists.ceph.com pipermail archive now appears to lead to 404s
Hi,

replying to my own message here in a shameless attempt to re-up this. I really hope that the list archive can be resurrected in one way or another...

Cheers,
Florian

On 29/08/2019 15:00, Florian Haas wrote:
> Hi,
>
> is there any chance the list admins could copy the pipermail archive
> from lists.ceph.com over to lists.ceph.io? It seems to contain an awful
> lot of messages referred elsewhere by their archive URL, many (all?) of
> which appear to now lead to 404s.
>
> Example: google "Set existing pools to use hdd device class only". The
> top hit is a link to
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029078.html:
>
> $ curl -IL
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029078.html
> HTTP/1.1 301 Moved Permanently
> Server: nginx/1.10.3 (Ubuntu)
> Date: Thu, 29 Aug 2019 12:48:13 GMT
> Content-Type: text/html
> Content-Length: 194
> Connection: keep-alive
> Location:
> https://lists.ceph.io/pipermail/ceph-users-ceph.com/2018-August/029078.html
> Strict-Transport-Security: max-age=31536000
>
> HTTP/1.1 404 Not Found
> Server: nginx
> Date: Thu, 29 Aug 2019 12:48:14 GMT
> Content-Type: text/html; charset=utf-8
> Content-Length: 3774
> Connection: keep-alive
> X-Frame-Options: SAMEORIGIN
> Vary: Accept-Language, Cookie
> Content-Language: en
>
> Or maybe this is just a redirect rule that needs to be cleverer or more
> specific, rather than the apparent catch-all .com/.io redirect?
>
> Cheers,
> Florian
[ceph-users] Heavily-linked lists.ceph.com pipermail archive now appears to lead to 404s
Hi,

is there any chance the list admins could copy the pipermail archive from lists.ceph.com over to lists.ceph.io? It seems to contain an awful lot of messages referred to elsewhere by their archive URL, many (all?) of which appear to now lead to 404s.

Example: google "Set existing pools to use hdd device class only". The top hit is a link to http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029078.html:

$ curl -IL http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029078.html
HTTP/1.1 301 Moved Permanently
Server: nginx/1.10.3 (Ubuntu)
Date: Thu, 29 Aug 2019 12:48:13 GMT
Content-Type: text/html
Content-Length: 194
Connection: keep-alive
Location: https://lists.ceph.io/pipermail/ceph-users-ceph.com/2018-August/029078.html
Strict-Transport-Security: max-age=31536000

HTTP/1.1 404 Not Found
Server: nginx
Date: Thu, 29 Aug 2019 12:48:14 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 3774
Connection: keep-alive
X-Frame-Options: SAMEORIGIN
Vary: Accept-Language, Cookie
Content-Language: en

Or maybe this is just a redirect rule that needs to be cleverer or more specific, rather than the apparent catch-all .com/.io redirect?

Cheers,
Florian
[ceph-users] Re: Luminous and mimic: adding OSD can crash mon(s) and lead to loss of quorum
On 23/08/2019 22:14, Paul Emmerich wrote:
> On Fri, Aug 23, 2019 at 3:54 PM Florian Haas wrote:
>>
>> On 23/08/2019 13:34, Paul Emmerich wrote:
>>> Is this reproducible with crushtool?
>>
>> Not for me.
>>
>>> ceph osd getcrushmap -o crushmap
>>> crushtool -i crushmap --update-item XX 1.0 osd.XX --loc host
>>> hostname-that-doesnt-exist-yet -o crushmap.modified
>>> Replacing XX with the osd ID you tried to add.
>>
>> Just checking whether this was intentional. As the issue pops up when
>> adding a new OSD *on* a new host, not moving an existing OSD *to* a new
>> host, I would have used --add-item here. Is there a specific reason why
>> you're suggesting to test with --update-item?
>
> yes, update should map to create or move which it should use internally
>
>>
>> At any rate, I tried with multiple different combinations (this is on a
>> 12.2.12 test cluster; I can't test this in production):
>
> which also ran into this bug? The idea of using crushtool is to not
> crash your production cluster but just the local tool.

Ah, gotcha. I thought you wanted me to be able to at least do "ceph osd setcrushmap" with the resulting crushmap, which would require a running cluster.

So yes, doing this completely offline shows that you're definitely on to something. I am able to crash crushtool with the original crushmap, and what it appears to be falling over on is a choose_args map in there. I've updated the bug report with this comment: https://tracker.ceph.com/issues/40029#note-11

It would seem that there are two workarounds at this stage for pre-Nautilus users with a choose_args map in their crushmap, and who for some reason are unable to upgrade to Nautilus yet:

1. Add host buckets manually before adding new OSDs.
2. Drop any choose_args map from their crushmap.
As it happens I am not aware of any way to do #2 other than:

- using getcrushmap,
- decompiling the crushmap,
- dropping the choose_args map from the textual representation of the crushmap,
- recompiling, and then
- using setcrushmap.

Are you, by any chance?

Thanks again for your help!

Cheers,
Florian
[ceph-users] Re: Luminous and mimic: adding OSD can crash mon(s) and lead to loss of quorum
On 23/08/2019 13:34, Paul Emmerich wrote:
> Is this reproducible with crushtool?

Not for me.

> ceph osd getcrushmap -o crushmap
> crushtool -i crushmap --update-item XX 1.0 osd.XX --loc host
> hostname-that-doesnt-exist-yet -o crushmap.modified
> Replacing XX with the osd ID you tried to add.

Just checking whether this was intentional. As the issue pops up when adding a new OSD *on* a new host, not moving an existing OSD *to* a new host, I would have used --add-item here. Is there a specific reason why you're suggesting to test with --update-item?

At any rate, I tried with multiple different combinations (this is on a 12.2.12 test cluster; I can't test this in production):

0. Get the current reference crushmap:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.05846 root default
-5       0.01949     host daisy
 0   hdd 0.01949         osd.0      up  1.0     1.0
-7       0.01949     host eric
 1   hdd 0.01949         osd.1      up  1.0     1.0
-3       0.01949     host frank
 2   hdd 0.01949         osd.2      up  1.0     1.0

# ceph osd getcrushmap -o crushmap
11

1. "Update" a nonexistent OSD belonging to a nonexistent host (your suggestion):

# crushtool -i crushmap --update-item 59 0.01949 osd.59 --loc host nonexistent -o crushmap-update-nonexistent-to-nonexistent
# ceph osd setcrushmap -i crushmap-update-nonexistent-to-nonexistent
12
# ceph osd tree
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-9       0.01949 host nonexistent
59       0.01949     osd.59              DNE        0
-1       0.05846 root default
-5       0.01949     host daisy
 0   hdd 0.01949         osd.0            up  1.0     1.0
-7       0.01949     host eric
 1   hdd 0.01949         osd.1            up  1.0     1.0
-3       0.01949     host frank
 2   hdd 0.01949         osd.2            up  1.0     1.0

# ceph osd setcrushmap -i crushmap
13

2.
Add a nonexistent OSD belonging to a nonexistent host (I think this is functionally identical):

# crushtool -i crushmap --add-item 59 0.01949 osd.59 --loc host nonexistent -o crushmap-add-nonexistent-to-nonexistent
# ceph osd setcrushmap -i crushmap-add-nonexistent-to-nonexistent
14
# ceph osd tree
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-9       0.01949 host nonexistent
59       0.01949     osd.59              DNE        0
-1       0.05846 root default
-5       0.01949     host daisy
 0   hdd 0.01949         osd.0            up  1.0     1.0
-7       0.01949     host eric
 1   hdd 0.01949         osd.1            up  1.0     1.0
-3       0.01949     host frank
 2   hdd 0.01949         osd.2            up  1.0     1.0

# ceph osd setcrushmap -i crushmap
15

3. Move an existing OSD to a nonexistent host:

# crushtool -i crushmap --update-item 0 0.01949 osd.0 --loc host nonexistent -o crushmap-update-existing-to-nonexistent
# ceph osd setcrushmap -i crushmap-update-existing-to-nonexistent
16
# ceph osd tree
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-9       0.01949 host nonexistent
 0   hdd 0.01949     osd.0               up  1.0     1.0
-1       0.03897 root default
-5             0     host daisy
-7       0.01949     host eric
 1   hdd 0.01949         osd.1            up  1.0     1.0
-3       0.01949     host frank
 2   hdd 0.01949         osd.2            up  1.0     1.0

# ceph osd setcrushmap -i crushmap
17

None of these crashed any mon. However, there's this line in the bug report:

-19> 2019-08-22 10:08:11.897364 7f93797ab700 0 mon.cc-ceph-osd11-fra1@0(leader).osd e302401 create-or-move crush item name 'osd.59' initial_weight 1.6374 at location {host=cc-ceph-osd26-fra1,root=default}

So it's not trying to move the item to just a nonexistent host, but to a nonexistent host *in the default root*. So I retried the above commands with "--loc host nonexistent --loc root default". No change other than everything showing up under default; no mon crash.

And then I tried one more, which was to *first* add just a new OSD under the default root, and *then* move that OSD to a new, nonexistent host, also under the default root. Again, no mon crash.

So I'm afraid I am unable to reproduce this with crushtool and setcrushmap.
And I can't get my mons to crash with "ceph osd crush move", either:

# ceph osd crush move osd.59 host=nonexistent root=default
moved item id 59 name 'osd.59' to location {host=nonexistent,root=default} in crush map

Cheers,
Florian
[ceph-users] Luminous and mimic: adding OSD can crash mon(s) and lead to loss of quorum
Hi everyone,

there are a couple of bug reports about this in Redmine but only one (unanswered) mailing list message[1] that I could find. So I figured I'd raise the issue here again and copy the original reporters of the bugs (they are BCC'd, because in case they are no longer subscribed it wouldn't be appropriate to share their email addresses with the list).

This is about https://tracker.ceph.com/issues/40029, and https://tracker.ceph.com/issues/39978 (the latter of which was recently closed as a duplicate of the former).

In short, it appears that at least in luminous and mimic (I haven't tried nautilus yet), it's possible to crash a mon when attempting to add a new OSD as it's trying to inject itself into the crush map under its host bucket, when that host bucket does not exist yet.

What's worse is that when the OSD's "ceph osd new" process has thus crashed the leader mon, a new leader is elected, and in case the "ceph osd new" process is still running on the OSD node, it will promptly connect to that mon and kill it too. This then continues until sufficiently many mons have died for quorum to be lost.

The recovery steps appear to involve:

- killing the "ceph osd new" process,
- restarting mons until you regain quorum,
- and then running "ceph osd purge" to drop the problematic OSD entry from the crushmap and osdmap.

The issue can apparently be worked around by adding the host buckets to the crushmap manually before adding the new OSDs, but surely this isn't intended to be a prerequisite, at least not to the point of mons crashing otherwise? Also, I am guessing that this is some weird corner case rooted in an unusual combination of contributing factors, because otherwise more people would presumably have been bitten by this problem.

Anyone able to share their thoughts on this one? Have more people run into this?
Cheers,
Florian

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034880.html — interestingly I could find this message in the pipermail archive but not in the one that my MUA keeps for me. So perhaps that message wasn't delivered to all subscribers, which might be why it has gone unanswered.
[ceph-users] Re: RBD, OpenStack Nova, libvirt, qemu-guest-agent, and FIFREEZE: is this working as intended?
Just following up here to report back and close the loop:

On 21/08/2019 16:51, Jason Dillaman wrote:
> It just looks like this was an oversight from the OpenStack developers
> when Nova RBD "direct" ephemeral image snapshot support was added [1].
> I would open a bug ticket against Nova for the issue.

Done: https://bugs.launchpad.net/nova/+bug/1841160

Thanks again for your help!

Cheers, Florian
[ceph-users] Re: RBD, OpenStack Nova, libvirt, qemu-guest-agent, and FIFREEZE: is this working as intended?
On 21/08/2019 18:05, dhils...@performair.com wrote:
> Florian;
>
> Forgive my lack of knowledge of OpenStack, and your environment / use case.
>
> Why would you need / want to snapshot an ephemeral disk? Isn't the point of
> ephemeral storage to not be persistent?

Fair point, but please consider that if you use an ephemeral VM as a template for other VMs (a common motivation for snapshotting), you might not care about the consistency of the VMs themselves, but you probably do care about the consistency of the template. For that use case, though, you could argue that you should just shut down the VM and take a clean snapshot then.

However, in OpenStack Nova you may also use boot-from-volume, meaning you're running a VM that is expected to be *wholly* persistent, rather than ephemeral, and in that case the consistency of a snapshot taken while the instance is running is rather important.

So, just to be sure, I took your cue and retested to see whether the same issue also applied to an instance using boot-from-volume. And lo and behold, the problem does not apply — if I configure an instance to boot from a volume, I get fsfreeze just as intended. (I have yet to dig up the code path for this.)

So, evidently the situation can be summarized as:

- Ephemeral boot, *without* RBD, with or without attached volumes: freeze/thaw if hw_qemu_guest_agent=yes, resulting in consistent snapshots.
- Ephemeral boot *from* RBD, also with or without attached volumes: no freeze/thaw, resulting in potentially inconsistent snapshots even with hw_qemu_guest_agent=yes.
- Boot-from-volume from RBD: freeze/thaw if hw_qemu_guest_agent=yes, resulting in consistent snapshots.

Bit odd, that. :) But at least there's another available workaround: if you need to ensure snapshot consistency, use boot-from-volume. Thanks for the nudge in that direction, Dominic!

Cheers, Florian
[ceph-users] RBD, OpenStack Nova, libvirt, qemu-guest-agent, and FIFREEZE: is this working as intended?
Hi everyone,

apologies in advance; this will be long. It's also been through a bunch of edits and rewrites, so I don't know how well I'm expressing myself at this stage — please holler if anything is unclear and I'll be happy to clarify.

I am currently investigating the behavior of OpenStack Nova instances when being snapshotted and suspended, in conjunction with qemu-guest-agent (qemu-ga). I realize that RBD-backed Nova/libvirt instances are expected to behave differently from file-backed ones, but I have reason to believe that the RBD-backed ones are behaving incorrectly, and I'd like to verify that.

First up, for comparison, let's recap how a Nova/libvirt/KVM instance behaves when it is *not* backed by RBD (that is, it's using a qcow2 file that is on a Nova compute node in /var/lib/nova/instances), is booted from an image with the hw_qemu_guest_agent=yes meta property set, and runs qemu-guest-agent within the guest:

- User issues "nova suspend" or "openstack server suspend".
- If nova-compute on the compute node decides that the instance has qemu-guest-agent running (which is the case if it's qemu or kvm, and its image has hw_qemu_guest_agent=yes), it sends a guest-sync command over the guest agent VirtIO serial port. This command registers in the qemu-ga log file in the guest.
- nova-compute on the compute node sends a libvirt managed-save command.
- Nova reports the instance as suspended.
- User issues "nova resume" or "openstack server resume".
- nova-compute on the compute node sends a libvirt start command.
- Again, if nova-compute knows that the instance has qemu-guest-agent running, it sends another command over the serial port, namely guest-set-time. This, too, registers in the guest's qemu-ga log.
- Nova reports the instance as active (running normally) again.

Now, when I instead use a Nova environment that is fully RBD-backed, I see exactly the same behavior as described above.
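To illustrate the wire format involved: qemu-ga commands like guest-sync and guest-set-time are single-line JSON objects written to the guest agent's VirtIO serial channel. Below is a small sketch of that interaction (this is not Nova's actual code; the helper names are mine, and the socket path passed in would be the per-instance org.qemu.guest_agent.0 channel socket that libvirt creates, which varies by deployment):

```python
import json
import socket


def build_ga_command(command, arguments=None):
    """Build a single-line JSON qemu-guest-agent command (QMP-style)."""
    msg = {"execute": command}
    if arguments is not None:
        msg["arguments"] = arguments
    return json.dumps(msg).encode() + b"\n"


def send_ga_command(sock_path, command, arguments=None):
    """Send one command over the agent's UNIX socket, return the raw reply.

    sock_path is an assumption here; on a compute node it is typically the
    channel socket libvirt creates for the guest agent VirtIO serial port.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(build_ga_command(command, arguments))
        return s.recv(4096)
```

For instance, the guest-sync handshake mentioned above would go out as `build_ga_command("guest-sync", {"id": 1234})`, i.e. `{"execute": "guest-sync", "arguments": {"id": 1234}}` followed by a newline.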
So I know that, in principle, nova-compute/qemu-ga communication works in both an RBD-backed and a non-RBD-backed environment. However, things appear to get very different when it comes to snapshots.

Again, starting with a file-backed environment:

- User issues "nova image-create" or "openstack server image create".
- If nova-compute on the compute node decides that the instance can be quiesced (which is the case if it's qemu or kvm, and its image has hw_qemu_guest_agent=yes), then it sends a "guest-fsfreeze-freeze" command over the guest agent VirtIO serial port.
- The guest agent inside the guest loops over all mounted filesystems and issues the FIFREEZE ioctl (which maps to the kernel freeze_super() function). This can be seen in the qemu-ga log file in the guest, and it is also verifiable by using ftrace on the qemu-ga PID and checking for the freeze_super() function call.
- nova-compute then takes a live snapshot of the instance.
- Once complete, the guest gets a "guest-fsfreeze-thaw" command, and again I can see this in the qemu-ga log, and with ftrace.

And now with RBD:

- User issues "nova image-create" or "openstack server image create".
- The guest-fsfreeze-freeze agent command never happens.

I can see the info message from https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac92bfe75dc59/nova/virt/libvirt/driver.py#L2048 in my nova-compute log, which confirms that we're attempting a live snapshot. I also do *not* see the warning from https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac92bfe75dc59/nova/virt/libvirt/driver.py#L2068, so it looks like the direct_snapshot() call from https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac92bfe75dc59/nova/virt/libvirt/driver.py#L2058 succeeds. This is defined in https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac92bfe75dc59/nova/virt/libvirt/imagebackend.py#L1055 and it uses RBD functionality only.
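The quiesce decision described above (qemu or kvm hypervisor, plus hw_qemu_guest_agent=yes on the image) can be sketched as a tiny predicate. This is my simplified restatement of the behavior, not Nova's actual function; the name and the dict shape are mine:

```python
def can_quiesce(virt_type, image_properties):
    """Mirror the decision described above: Nova attempts a
    guest-fsfreeze-freeze/thaw cycle around a snapshot only if the
    hypervisor is qemu or kvm AND the image carries the
    hw_qemu_guest_agent=yes metadata property."""
    return (virt_type in ("qemu", "kvm")
            and image_properties.get("hw_qemu_guest_agent") == "yes")
```

The puzzle in the RBD case is that this predicate would evaluate to true for my instances, yet the freeze command never goes out, because the direct_snapshot() path bypasses the agent entirely.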
Importantly, it never interacts with qemu-ga, so it appears to not worry at all about freezing the filesystem. (Which does seem to contradict https://docs.ceph.com/docs/master/rbd/rbd-openstack/?highlight=uuid#image-properties, by the way, so that may be a documentation bug.)

Now here's another interesting part. Were the direct snapshot to fail, if I read https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac92bfe75dc59/nova/virt/libvirt/driver.py#L2081 and https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac92bfe75dc59/nova/virt/libvirt/driver.py#L2144 correctly, the fallback behavior would be as follows: the domain would next be "suspended" (note, again, this is Nova suspend, which maps to libvirt managed-save per https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac92bfe75dc59/nova/virt/libvirt/guest.py#L504), then snapshotted using a libvirt call, and resumed again post-snapshot. In which case there would be a
[ceph-users] Re: BlueStore _txc_add_transaction errors (possibly related to bug #38724)
On 12/08/2019 21:07, Alexandre Marangone wrote:
>> rados -p volumes stat 'obj-vS6RN9\uQwvXU9DP'
>> error stat-ing volumes/obj-vS6RN9\uQwvXU9DP: (2) No such file or directory
> I believe you need to substitute \u with _

Yes indeed, thank you!

Cheers, Florian
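For anyone else who hits this: FileStore escapes certain characters in its on-disk filenames, and (as Alexandre points out) the \u sequence stands for an underscore in the rados object name. A small sketch of the reverse mapping, so a filename found via "find" can be fed back to the rados CLI (the function name is mine, and this only handles the \u case and the trailing hash/pool suffix seen in this thread, not FileStore's full escaping scheme):

```python
def filestore_to_rados_name(on_disk_name):
    """Recover the rados object name from a FileStore on-disk filename.

    Drops the trailing __<snap>_<hash>__<pool> suffix and replaces the
    FileStore escape sequence \\u with an underscore. Covers only the
    case discussed in this thread, not the complete escaping scheme.
    """
    # Everything from the first "__" onwards is the snap/hash/pool suffix
    base = on_disk_name.split("__", 1)[0]
    return base.replace(r"\u", "_")
```

Applied to the file from earlier in the thread, `filestore_to_rados_name(r'obj-vS6RN9\uQwvXU9DP__head_DBECC693__2')` yields `'obj-vS6RN9_QwvXU9DP'`, which is the name that "rados stat" actually accepts.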
[ceph-users] Re: BlueStore _txc_add_transaction errors (possibly related to bug #38724)
Hi Tom,

responding back on this briefly so that people are in the loop; I'll have more details in a blog post that I hope to get around to writing.

On 12/08/2019 11:34, Thomas Byrne - UKRI STFC wrote:
>> And bluestore should refuse to start if the configured limit is > 4GB. Or
>> something along those lines...
>
> Just on this point - Bluestore OSDs will fail to start with an
> osd_max_object_size >= 4GB, with a helpful error message about the Bluestore
> hard limit. I was mildly amused when I discovered that luminous OSDs can
> start with osd_max_object_size = 4GB - 1 byte, but mimic OSDs require it to
> be <= 4GB - 2 bytes to start without an error. I haven't checked to see if
> nautilus OSDs require <= 4GB - 3 bytes yet.

Yes, but that doesn't help users much for clusters where very large objects already exist. Even in Luminous, osd_max_object_size defaults to 128M, but if an OSD already has objects larger than that, it will still happily start up and serve data with FileStore — and crash any newly added BlueStore OSDs unfortunate enough to be mapped to a PG with one or more objects that are 4GiB or larger.

The pending PR to make this a scrub error even on FileStore OSDs mitigates the issue (https://github.com/ceph/ceph/pull/29579), but it'll still be an unexpected surprise for people who have just updated to a version including that fix and suddenly see tons of scrub errors — they could easily be forgiven for assuming they've run into a regression that produces false positives on scrub. "Hey, none of these errors were here before the upgrade, surely there's a problem with the software rather than my data!"

We've progressed further in the interim, and it appears I can give the all-clear on a couple of concerns that we had:

1. It looks like these objects were not created by an RBD going haywire, but by something actually using librados to create them, presumably long before the cluster ever went into production.

2.
I am not changing the subject line so I don't mess up people's list archives if their MUA doesn't correctly thread based on In-Reply-To or References, but it's now evident that this is *not* related to bug #38724 but instead really just due to objects being too large for BlueStore, like Sage said in his first reply.

Thanks for the answer — by the way, I have been imploring all my colleagues to watch your Cephalocon talk,[1] which was excellent.

Cheers, Florian

[1] https://youtu.be/niFNZN5EKvE
[ceph-users] Re: BlueStore _txc_add_transaction errors (possibly related to bug #38724)
Hi Sage! Whoa, that was quick. :)

On 09/08/2019 16:27, Sage Weil wrote:
>> https://tracker.ceph.com/issues/38724#note-26
>
> {
>     "op_num": 2,
>     "op_name": "truncate",
>     "collection": "2.293_head",
>     "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
>     "offset": 4457615932
> },
>
> That offset (size) is > 4 GB. BlueStore has a hard limit of 2^32-1 for
> object sizes (because it uses a uint32_t). This cluster appears to have
> some ginormous rados objects. Until those are removed, you
> can't/shouldn't use bluestore.

OK, this is interesting. This is an OpenStack Cinder volumes pool, so all the objects in there belong to RBDs, and I couldn't think of any situation in which RBD would create a huge object like that. But, as it happens, that PG is currently mapped to a primary OSD that is still on FileStore, so I can do a "find -size +1G" on that mount point, and here's what I get:

-rw-r--r-- 1 ceph ceph 4457615932 Mar 29 2018 DIR_3/DIR_9/DIR_6/DIR_C/obj-vS6RN9\uQwvXU9DP__head_DBECC693__2

So, bingo: that's a 4.2GB file whose size matches that offset exactly. But I'm not familiar with that object name format. How did that object get here? And how do I remove it, considering I seem to be unable to access it?

rados -p volumes stat 'obj-vS6RN9\uQwvXU9DP'
error stat-ing volumes/obj-vS6RN9\uQwvXU9DP: (2) No such file or directory

Or is that file just an artifact that doesn't even map to an object? This is turning out to be a learning experience. :)

Thanks again for your help!

Cheers, Florian
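For anyone wanting to audit their own FileStore OSDs for objects that would break BlueStore, here's a sketch of the "find -size +1G" idea, but checking against the exact 2^32-1 hard limit Sage describes. The function names are mine; point it at an OSD data directory of your choosing:

```python
import os

# BlueStore's hard limit on object size: offsets fit in a uint32_t
BLUESTORE_MAX_OBJECT_SIZE = 2**32 - 1


def exceeds_bluestore_limit(size):
    """True if an object of this size is too large for BlueStore."""
    return size > BLUESTORE_MAX_OBJECT_SIZE


def find_oversized_files(mount_point):
    """Walk a FileStore OSD mount point and yield (path, size) for
    every file exceeding BlueStore's hard object-size limit."""
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or unreadable; skip it
            if exceeds_bluestore_limit(size):
                yield path, size
```

The 4457615932-byte file above trips this check (4457615932 > 4294967295), while anything at or under the default 128M osd_max_object_size obviously does not.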
[ceph-users] BlueStore _txc_add_transaction errors (possibly related to bug #38724)
Hi everyone,

it seems there have been several reports in the past related to BlueStore OSDs crashing from unhandled errors in _txc_add_transaction:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/03.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032172.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/031960.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/031964.html

Bug #38724 tracks this; it has been fixed in master with https://github.com/ceph/ceph/pull/27929 and is pending backports (and, I dare say, is *probably* misclassified as only minor, as this does cause potential data loss as soon as it affects enough OSDs simultaneously): https://tracker.ceph.com/issues/38724

We just ran into a similar issue with a couple of BlueStore OSDs that we recently added to a Luminous (12.2.12) cluster that was upgraded from Jewel and hence still largely runs on FileStore. I say similar because evidently other people reporting this problem have been running into ENOENT (No such file or directory) or ENOTEMPTY (Directory not empty); for us, interestingly, it's E2BIG (Argument list too long): https://tracker.ceph.com/issues/38724#note-26

So I'm wondering if someone could shed light on these questions:

* Is this the same issue as the one that https://github.com/ceph/ceph/pull/27929 fixes?
* Since https://github.com/ceph/ceph/pull/29115 (the Nautilus backport of that fix) has been merged but is not yet included in a release, do *Nautilus* users get the fix in the upcoming 14.2.3 release, and once they update, would this bug go away with no further intervention required?
* For users on *Luminous*, since https://tracker.ceph.com/issues/39694 (the Luminous version of 38724) says "non-trivial backport", is it fair to say that a fix might still take a while for that release?
* Finally, are Luminous users safe from this bug if they keep using, or revert to, FileStore?
Thanks in advance for your thoughts! Please keep Erik CC'd on your reply.

Cheers, Florian