[ceph-users] mds crash loop - Server.cc: 7503: FAILED ceph_assert(in->first <= straydn->first)
Hi all,

We have a cephfs cluster in production for about 2 months and, for the past 2-3 weeks, we are regularly experiencing MDS crash loops (every 3-4 hours if we have some user activity). A temporary fix is to remove the MDSs in error (or unknown) state, stop the samba & nfs-ganesha gateways, then wipe all sessions. Sometimes we have to repeat this procedure 2 or 3 times to get our cephfs back and working...

When looking in the MDS log files, I noticed that all crashes have the following stack trace:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Server.cc: 7503: FAILED ceph_assert(in->first <= straydn->first)

 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7eff2644bcce]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276ee8) [0x7eff2644bee8]
 3: (Server::_unlink_local(boost::intrusive_ptr&, CDentry*, CDentry*)+0x106a) [0x559c8f83331a]
 4: (Server::handle_client_unlink(boost::intrusive_ptr&)+0x4d9) [0x559c8f837fe9]
 5: (Server::dispatch_client_request(boost::intrusive_ptr&)+0xefb) [0x559c8f84e82b]
 6: (Server::handle_client_request(boost::intrusive_ptr const&)+0x3fc) [0x559c8f859aac]
 7: (Server::dispatch(boost::intrusive_ptr const&)+0x12b) [0x559c8f86258b]
 8: (MDSRank::handle_message(boost::intrusive_ptr const&)+0xbb4) [0x559c8f7bf374]
 9: (MDSRank::_dispatch(boost::intrusive_ptr const&, bool)+0x7bb) [0x559c8f7c19eb]
 10: (MDSRank::retry_dispatch(boost::intrusive_ptr const&)+0x16) [0x559c8f7c1f86]
 11: (MDSContext::complete(int)+0x56) [0x559c8fac0906]
 12: (MDSRank::_advance_queues()+0x84) [0x559c8f7c0a54]
 13: (MDSRank::_dispatch(boost::intrusive_ptr const&, bool)+0x204) [0x559c8f7c1434]
 14: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr const&)+0x55) [0x559c8f7c1fe5]
 15: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr const&)+0x128) [0x559c8f7b1f28]
 16: (DispatchQueue::entry()+0x126a) [0x7eff266894da]
 17: (DispatchQueue::DispatchThread::entry()+0x11) [0x7eff26739e21]
 18: /lib64/libpthread.so.0(+0x814a) [0x7eff2543214a]
 19: clone()

I found a similar case on the ceph tracker ( https://tracker.ceph.com/issues/41147 ), so I suspected an inode corruption and started a cephfs scrub (ceph tell mds.cephfsvol:0 scrub start / recursive,repair).
As we have a lot of files (about 200 million entries for 200 TB), I don't know how long it will take, nor:
- if this will correct the situation
- what to do to avoid the same situation in the future

Some information about our ceph cluster (pacific 16.2.6 with containers):

** ceph -s **

  cluster:
    id:     2943b4fe-2063-11ec-a560-e43d1a1bc30f
    health: HEALTH_WARN
            1 MDSs report oversized cache

  services:
    mon:        5 daemons, quorum cephp03,cephp06,cephp05,cephp01,cephp02 (age 12d)
    mgr:        cephp01.smfvfd(active, since 12d), standbys: cephp02.equfuj
    mds:        2/2 daemons up, 4 standby
    osd:        264 osds: 264 up (since 12d), 264 in (since 9w)
    rbd-mirror: 1 daemon active (1 hosts)

  task status:
    scrub status:
        mds.cephfsvol.cephp02.wsokro: idle+waiting paths [/]
        mds.cephfsvol.cephp05.qneike: active paths [/]

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 2176 pgs
    objects: 595.12M objects, 200 TiB
    usage:   308 TiB used, 3.3 PiB / 3.6 PiB avail
    pgs:     2167 active+clean
             7    active+clean+scrubbing+deep
             2    active+clean+scrubbing

  io:
    client:   39 KiB/s rd, 152 KiB/s wr, 27 op/s rd, 27 op/s wr

** # ceph fs get cephfsvol **

Filesystem 'cephfsvol' (1)
fs_name cephfsvol
epoch   106554
flags   12
created 2021-09-28T14:19:54.399567+
modified        2022-02-08T12:57:00.653514+
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   5497558138880
required_client_features       {}
last_failure    0
last_failure_osd_epoch  41205
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in      0,1
up      {0=4044909,1=3325354}
failed
damaged
stopped
data_pools      [3,4]
metadata_pool   2
inline_data     disabled
balancer
standby_count_wanted    1
[mds.cephfsvol.cephp05.qneike{0:4044909} state up:active seq 1789 export targets 1 join_fscid=1 addr [v2:10.2.100.5:6800/2702983829,v1:10.2.100.5:6801/2702983829] compat {c=[1],r=[1],i=[7ff]}]
[mds.cephfsvol.cephp02.wsokro{1:32bdaa} state up:active seq 18a02 export targets 0 join_fscid=1 addr [v2:10.2.100.2:1a90/aa660301,v1:10.2.100.2:1a91/aa66
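To keep an eye on the scrub while it runs (and see whether the repair pass actually changed anything), something like the commands below should work. This is only a sketch, assuming rank 0 of cephfsvol is the rank executing the scrub:

  # progress / state of the running scrub
  ceph tell mds.cephfsvol:0 scrub status

  # pause/resume or abort it if the crash loop needs to be debugged first
  ceph tell mds.cephfsvol:0 scrub pause
  ceph tell mds.cephfsvol:0 scrub resume
  ceph tell mds.cephfsvol:0 scrub abort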
[ceph-users] Re: ceph_assert(start >= coll_range_start && start < coll_range_end)
Okay, I definitely need some help here.

The crash moved with the PG, so the PG itself seems to have the issue. I moved (via upmaps) all 4 replicas to filestore OSDs. After this the error seemed to be solved: no OSD crashed anymore, and a deep-scrub of the PG didn't throw any error.

So I moved the first shard back to a bluestore OSD. This worked flawlessly as well. A deep-scrub after this showed one object missing. The same object which was obviously the cause of the prior crashes. A repair seemed to fix the object, but a further deep-scrub brings back the same error. Even putting the object again with rados put didn't help; now I have two "missing" objects (the head and the snapshot from overwriting it).

Here are the scrub error and repair entries from the OSD log:

2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1::::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_0040:head : missing
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 missing, 0 inconsistent objects
2022-02-08 14:04:43.751 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 1 errors
2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1::::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_0040:head : missing
2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 missing, 0 inconsistent objects
2022-02-08 13:52:09.111 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff repair 1 errors, 1 fixed

And here the new scrub error with the two missing objects:

2022-02-08 14:19:10.990 7f600dfec700  0 log_channel(cluster) log [DBG] : 1.7fff deep-scrub starts
2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1::::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_0040:974 : missing
2022-02-08 14:25:17.749 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff shard 3 1::::c76c7ac2014adb9f0f0837ac1e85fd1e241af225908b6a0c3d3a44d6b866e732_0040:head : missing
2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 missing, 0 inconsistent objects
2022-02-08 14:25:17.750 7f600dfec700 -1 log_channel(cluster) log [ERR] : 1.7fff deep-scrub 2 errors

Can someone help me here? I don't have any clue.

Regards
Manuel

On Mon, 7 Feb 2022 16:51:16 +0100 Manuel Lausch wrote:
> Hi,
>
> I am migrating from filestore to bluestore (the workflow is: drain the osd
> and reformat it with bluestore).
>
> Now I have two OSDs which crash at the same time with the following
> error. Restarting the OSD works for some time until they crash
> again.
> -40> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600) r 0 v.len 512
> -39> 2022-02-07 16:28:20.489 7f550723a700 15 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffeb:::9b6886fa3639e64c892813ba7c9da9f4411f0a5fb73c89517b5f3f68acdaa388_0040:head#
> -38> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffeb:::9b6886fa3639e64c892813ba7c9da9f4411f0a5fb73c89517b5f3f68acdaa388_0040:head# = 0
> -37> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) stat 1.7fff_head #1:ffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head#
> -36> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600) get_onode oid #1:ffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head# key 0x7f8001ffef216264'a22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d!='0xfffe'o'
> -35> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600) r 0 v.len 843
> -34> 2022-02-07 16:28:20.489 7f550723a700 15 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head#
> -33> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) getattrs 1.7fff_head #1:ffef:::bda22ca861e6999694841deb44bce5d37d7c35d0ffc9387d649d80acf818c341_0014f39d:head# = 0
> -32> 2022-02-07 16:28:20.489 7f550723a700 10 bluestore(/var/lib/ceph/osd/ceph-410) stat 1.7fff_head #1:fffb:::98c8a3708cceb042f5ec0d5dd49416968adc95cf6019796fdf6ae1a1f7fd929d_0040:head#
> -31> 2022-02-07 16:28:20.489 7f550723a700 20 bluestore(/var/lib/ceph/osd/ceph-410).collection(1.7fff_head 0x564161314600) get_onode oid #1:fffb:::98c8a3708cceb042f5ec0d5dd49416968adc95cf6019796fdf6ae1a1f7fd929d_0040:head# key 0x7f8001ff
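Before the next repair attempt it might also help to pin down exactly which shard is missing which object/clone, and whether the object is physically present on the bluestore OSD. A rough sketch, assuming the affected PG is 1.7fff, that osd.410 holds a copy, and using the object hash from the scrub errors above:

  # details of the inconsistencies recorded by the last deep-scrub
  rados list-inconsistent-obj 1.7fff --format=json-pretty

  # with osd.410 stopped: check whether the object exists in that OSD's objectstore
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-410 --pgid 1.7fff --op list | grep c76c7ac2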
[ceph-users] Re: cephfs: [ERR] loaded dup inode
On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder wrote:
> The reason for this seemingly strange behaviour was an old static snapshot
> taken in an entirely different directory. Apparently, ceph fs snapshots are
> not local to an FS directory sub-tree but always global on the entire FS
> despite the fact that you can only access the sub-tree in the snapshot, which
> easily leads to the wrong conclusion that only data below the directory is in
> the snapshot. As a consequence, the static snapshot was accumulating the
> garbage from the rotating snapshots even though these sub-trees were
> completely disjoint.

So are you saying that if I do this I'll have 1M files in stray?

mkdir /a
cd /a
for i in {1..1000000}; do touch $i; done   # create 1M files in /a
cd ..
mkdir /b
mkdir /b/.snap/testsnap                    # create a snap in the empty dir /b
rm -rf /a/

Cheers, Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
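As an aside, if someone wants to verify this on a test filesystem: the stray count is visible in the MDS perf counters. A sketch, with the MDS daemon name as a placeholder; num_strays should jump after the rm if the deleted files really land in stray:

  # on the host running the active MDS, via the admin socket
  ceph daemon mds.<name> perf dump mds_cache | grep num_strays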
[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)
Hey there again,

There is now a question from Neha Ojha in https://tracker.ceph.com/issues/53663 about providing OSD debug logs for a manual deep-scrub on (inconsistent) PGs. I already provided the logs of two of those deep-scrubs via ceph-post-file. But since data inconsistencies are the worst kind of bug, and their occurrence is somewhat unpredictable, we likely need more evidence to have a chance of narrowing this down.

Since you seem to observe something similar, could you gather debug info about them and post it to the ticket as well?

Regards
Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
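For anyone who wants to add similar evidence to the ticket, gathering that kind of log could look roughly like this. It is only a sketch: OSD id, PG id and the log path are placeholders (containerized deployments keep logs elsewhere), and debug 20 is very verbose, so reset it right after the scrub:

  # raise debug logging on the primary OSD of the inconsistent PG
  ceph config set osd.<id> debug_osd 20/20

  # trigger a deep-scrub on the PG and wait for it to complete
  ceph pg deep-scrub <pgid>

  # reset logging and upload the OSD log for the tracker ticket
  ceph config set osd.<id> debug_osd 1/5
  ceph-post-file /var/log/ceph/ceph-osd.<id>.log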
[ceph-users] RGW automation encryption - still testing only?
Is RGW encryption for all objects at rest still testing only, and if not, which version is it considered stable in?: https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only David ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW automation encryption - still testing only?
hi David,

that method of encryption based on rgw_crypt_default_encryption_key will never be officially supported. however, support for SSE-S3 encryption [1] is nearly complete in [2] (cc Marcus), and we hope to include that in the quincy release - and if not, we'll backport it to quincy in an early point release.

can SSE-S3 with PutBucketEncryption satisfy your use case?

[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html
[2] https://github.com/ceph/ceph/pull/44494

On Tue, Feb 8, 2022 at 10:44 AM David Orman wrote:
>
> Is RGW encryption for all objects at rest still testing only, and if not,
> which version is it considered stable in?:
>
> https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only
>
> David
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
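In case it helps to picture the workflow: once PutBucketEncryption is available in RGW, setting SSE-S3 as a bucket default from the client side would look roughly like the AWS CLI calls below. A sketch only; the endpoint URL and bucket name are placeholders, and the RGW side of course depends on the PR above landing:

  aws --endpoint-url https://rgw.example.com s3api put-bucket-encryption \
      --bucket mybucket \
      --server-side-encryption-configuration \
      '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'

  # confirm the default encryption configuration on the bucket
  aws --endpoint-url https://rgw.example.com s3api get-bucket-encryption --bucket mybucket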
[ceph-users] Re: RGW automation encryption - still testing only?
On Tue, Feb 8, 2022 at 11:11 AM Casey Bodley wrote:
>
> hi David,
>
> that method of encryption based on rgw_crypt_default_encryption_key
> will never be officially supported.

to expand on why: rgw_crypt_default_encryption_key requires the key material to be stored insecurely in ceph's config, and cannot support key rotation

> however, support for SSE-S3
> encryption [1] is nearly complete in [2] (cc Marcus), and we hope to
> include that in the quincy release - and if not, we'll backport it to
> quincy in an early point release
>
> can SSE-S3 with PutBucketEncryption satisfy your use case?
>
> [1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html
> [2] https://github.com/ceph/ceph/pull/44494
>
> On Tue, Feb 8, 2022 at 10:44 AM David Orman wrote:
> >
> > Is RGW encryption for all objects at rest still testing only, and if not,
> > which version is it considered stable in?:
> >
> > https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only
> >
> > David
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
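To make the "stored insecurely" point concrete: the testing-only setup simply drops a base64 AES key into the RGW configuration in plain text, roughly as below. A sketch only, with a made-up section name; anyone who can read the config can read the key, and there is no way to rotate it afterwards without re-encrypting the objects:

  # generate a random 256-bit key, base64-encoded
  openssl rand -base64 32

  # ceph.conf on the RGW host (testing only!)
  [client.rgw.myhost]
      rgw crypt default encryption key = <base64 key from above>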
[ceph-users] Re: RGW automation encryption - still testing only?
Totally understand, I'm not really a fan of service-managed encryption keys as a general rule vs. client-managed. I just thought I'd probe about capabilities considered stable before embarking on our own work. SSE-S3 would be a reasonable middle-ground. I appreciate the PR link, that's very helpful.

On Tue, Feb 8, 2022 at 10:29 AM Casey Bodley wrote:

> On Tue, Feb 8, 2022 at 11:11 AM Casey Bodley wrote:
> >
> > hi David,
> >
> > that method of encryption based on rgw_crypt_default_encryption_key
> > will never be officially supported.
>
> to expand on why: rgw_crypt_default_encryption_key requires the key
> material to be stored insecurely in ceph's config, and cannot support
> key rotation
>
> > however, support for SSE-S3
> > encryption [1] is nearly complete in [2] (cc Marcus), and we hope to
> > include that in the quincy release - and if not, we'll backport it to
> > quincy in an early point release
> >
> > can SSE-S3 with PutBucketEncryption satisfy your use case?
> >
> > [1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html
> > [2] https://github.com/ceph/ceph/pull/44494
> >
> > On Tue, Feb 8, 2022 at 10:44 AM David Orman wrote:
> > >
> > > Is RGW encryption for all objects at rest still testing only, and if not,
> > > which version is it considered stable in?:
> > >
> > > https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only
> > >
> > > David
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs: [ERR] loaded dup inode
On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster wrote:
>
> On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder wrote:
> > The reason for this seemingly strange behaviour was an old static snapshot
> > taken in an entirely different directory. Apparently, ceph fs snapshots are
> > not local to an FS directory sub-tree but always global on the entire FS
> > despite the fact that you can only access the sub-tree in the snapshot,
> > which easily leads to the wrong conclusion that only data below the
> > directory is in the snapshot. As a consequence, the static snapshot was
> > accumulating the garbage from the rotating snapshots even though these
> > sub-trees were completely disjoint.
>
> So are you saying that if I do this I'll have 1M files in stray?

No, happily. The thing that's happening here post-dates my main previous stretch on CephFS and I had forgotten it, but there's a note in the developer docs:
https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
(I fortuitously stumbled across this from an entirely different direction/discussion just after seeing this thread and put the pieces together!)

Basically, hard links are *the worst*. For everything in filesystems. I spent a lot of time trying to figure out how to handle hard links being renamed across snapshots[1] and never managed it, and the eventual "solution" was to give up and do the degenerate thing: if there's a file with multiple hard links, that file is a member of *every* snapshot.

Doing anything about this will take a lot of time. There's probably an opportunity to improve it for users of the subvolumes library, as those subvolumes do get tagged a bit, so I'll see if we can look into that. But for generic CephFS, I'm not sure what the solution will look like at all.

Sorry folks. :/
-Greg

[1]: The issue is that, if you have a hard-linked file in two places, you would expect it to be snapshotted whenever a snapshot covering either location occurs. But in CephFS the file can only live in one location, and the other location has to just hold a reference to it instead. So say you have inode Y at path A, and then hard link it in at path B. Given how snapshots work, when you open up Y from A, you would need to check all the snapshots that apply from both A's and B's trees. But 1) opening up other paths is a challenge all on its own, and 2) without an inode and its backtrace to provide a lookup resolve point, it's impossible to maintain a lookup that scales and is possible to keep consistent. (Oh, I did just have one idea, but I'm not sure if it would fix every issue or just that scalable backtrace lookup: https://tracker.ceph.com/issues/54205)

>
> mkdir /a
> cd /a
> for i in {1..1000000}; do touch $i; done   # create 1M files in /a
> cd ..
> mkdir /b
> mkdir /b/.snap/testsnap                    # create a snap in the empty dir /b
> rm -rf /a/
>
>
> Cheers, Dan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW automation encryption - still testing only?
On Tue, Feb 8, 2022 at 11:55 AM Stefan Schueffler wrote:
>
> Hi Casey,
>
> great news to hear about "SSE-S3 almost implemented" :-)
>
> One question about that - will the implementation have one key per bucket, or
> one key per individual object?
>
> Amazon (as per the public available docs) is using one unique key per object
> - and encrypts the key on top of this with a per bucket or master key that
> regularly rotates.

my understanding is that there are per-object keys, and key-encryption-keys that can either be per-bucket, per-user, or global depending on ceph config

>
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html
>
> Best
> Stefan
>
>
> On 08.02.2022 at 17:11, Casey Bodley wrote:
>
> hi David,
>
> that method of encryption based on rgw_crypt_default_encryption_key
> will never be officially supported. however, support for SSE-S3
> encryption [1] is nearly complete in [2] (cc Marcus), and we hope to
> include that in the quincy release - and if not, we'll backport it to
> quincy in an early point release
>
> can SSE-S3 with PutBucketEncryption satisfy your use case?
>
> [1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html
> [2] https://github.com/ceph/ceph/pull/44494
>
> On Tue, Feb 8, 2022 at 10:44 AM David Orman wrote:
>
> > Is RGW encryption for all objects at rest still testing only, and if not,
> > which version is it considered stable in?:
> >
> > https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only
> >
> > David
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] R release naming
Hi folks, As we near the end of the Quincy cycle, it's time to choose a name for the next release. This etherpad began a while ago, so there are some votes already, however we wanted to open it up for anyone who hasn't voted yet. Add your +1 to the name you prefer here, or add a new option: https://pad.ceph.com/p/r Josh ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Monitoring slow ops
Hi all, We have found that RGW access problems on our clusters almost always coincide with slow ops in "ceph -s". Is there any good way to monitor and alert on slow ops from prometheus metrics? We are running Nautilus but I'd be interested in any changes that might help in newer versions, as well. Thanks, Trey Palmer ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Monitoring slow ops
Hi,

> On 9 Feb 2022, at 09:03, Benoît Knecht wrote:
>
> I don't remember in which Ceph release it was introduced, but on Pacific
> there's a metric called `ceph_healthcheck_slow_ops`.

At least in Nautilus this metric exists.

k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
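If you want to alert on it, the expression can be as simple as checking that the metric is non-zero. A sketch below, with the mgr host as a placeholder; it assumes the mgr prometheus module is enabled (default port 9283) and that ceph_healthcheck_slow_ops is non-zero while the SLOW_OPS health check is raised, which is worth verifying on your release first:

  # PromQL expression to alert on, e.g. in a rule with "for: 5m":
  #   ceph_healthcheck_slow_ops > 0
  # quick sanity check that the active mgr actually exposes the metric:
  curl -s http://<mgr-host>:9283/metrics | grep ceph_healthcheck_slow_ops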