[ceph-users] Re: [EXT] mclock scheduler kills clients IOs
On Tue, Sep 17, 2024 at 08:48:11PM -0400, Anthony D'Atri wrote:
Were all three in the same failure domain?

No, they were all in different failure domains.

--
Kai Stian Olstad
[ceph-users] Re: [EXT] mclock scheduler kills clients IOs
On Tue, Sep 17, 2024 at 04:22:40PM +0200, Denis Polom wrote:
Hi, yes the mclock scheduler doesn't look stable and ready for a production Ceph cluster. I just switched back to wpq and everything goes smoothly.

In our cluster all IO stopped when I set 3 OSDs to out while running mclock.
After switching to wpq and running deep-scrub on all PGs, the result was 698 corrupted objects that Ceph could not fix.

So no, I would not say mclock is production ready. We have set all our clusters to wpq.

--
Kai Stian Olstad
[ceph-users] Re: How to specify id on newly created OSD with Ceph Orchestrator
On Fri, Jul 26, 2024 at 04:18:05PM +0200, Iztok Gregori wrote:
On 26/07/24 12:35, Kai Stian Olstad wrote:
On Tue, Jul 23, 2024 at 08:24:21AM +0200, Iztok Gregori wrote:
Am I missing something obvious, or is there no way to specify an id during OSD creation with the Ceph orchestrator?

You can use osd_id_claims.

I tried osd_id_claims in a yaml file like this:

  service_type: osd
  placement:
    hosts:
      - <host>
  data_devices:
    paths:
      - /dev/<device>
  osd_id_claims:
    <host>: ['<id>']

And then applied it, but the created OSD didn't have the id I specified. It could be that the syntax of my yaml is wrong, but it gave me no errors when I applied it.

I didn't try to specify osd_id_claims directly on the command line. The command should be something like this:

  # ceph orch daemon add osd <host>:<device>,osd_id_claims=<id>

According to the documentation[1] you can use osd_id_claims. I use:

  ceph orch daemon add osd <host>:data_devices=<device>,osd_id_claims=<id>

The difference is "data_devices="; whether you need it or not I don't know.

I don't know if it matters, but I've deleted/removed (not replaced) the OSD (the OSD id wasn't present in the crush map anymore, not even as "destroyed").

It might. I don't think I have tried without --replace, since I use a script to replace devices in Ceph so I never forget to add --replace.

[1] https://docs.ceph.com/en/reef/cephadm/services/osd/?highlight=osd_id_claims#ceph.deployment.drive_group.DriveGroupSpec.osd_id_claims

--
Kai Stian
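A rough sketch of the full replace flow where the OSD id stays claimable, assuming cephadm; osd.344, the host name and /dev/sdX are only example names:

  # remove the OSD but keep its id reserved ("destroyed" in the CRUSH map)
  ceph orch osd rm 344 --replace --zap
  # after swapping the disk, recreate the OSD and claim the old id
  ceph orch daemon add osd ceph-hd-001:data_devices=/dev/sdX,osd_id_claims=344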
[ceph-users] Re: How to specify id on newly created OSD with Ceph Orchestrator
On Tue, Jul 23, 2024 at 08:24:21AM +0200, Iztok Gregori wrote:
Am I missing something obvious, or is there no way to specify an id during OSD creation with the Ceph orchestrator?

You can use osd_id_claims.

This command is for replacing an HDD in the hybrid osd.344 and reusing the block.db device on the SSD:

  ceph orch daemon add osd <host>:data_devices=/dev/sdX,db_devices=/dev/ceph-<vg>/osd-block-<lv>,osd_id_claims=344

--
Kai Stian
[ceph-users] Re: cephadm rgw ssl certificate config
On Thu, Jul 18, 2024 at 10:49:02AM +0000, Eugen Block wrote:
And after restarting the daemon, it seems to work. So my question is, how do you deal with per-host certificates and rgw? Any comments are appreciated.

By not dealing with it, sort of. Since we run our own CA, I create one certificate with the names of all the rgw hosts, including their IP addresses, in the certificate Subject Alternative Names (SAN).

--
Kai Stian
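For illustration, a self-signed certificate with such SANs could be generated like this; host names and IPs are placeholders, -addext needs OpenSSL 1.1.1 or newer, and with an internal CA you would sign a CSR instead:

  openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
    -keyout rgw.key -out rgw.crt \
    -subj "/CN=rgw.example.com" \
    -addext "subjectAltName=DNS:rgw1.example.com,DNS:rgw2.example.com,IP:192.0.2.11,IP:192.0.2.12"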
[ceph-users] Re: Lot of spams on the list
On 24.06.2024 19:15, Anthony D'Atri wrote:
* Subscription is now moderated
* The three worst spammers (you know who they are) have been removed
* I’ve deleted tens of thousands of crufty mail messages from the queue

The list should work normally now. Working on the backlog of held messages. 99% are bogus, but I want to be careful wrt baby and bathwater.

Will the archive[1] also be cleaned up?

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/

--
Kai Stian Olstad
[ceph-users] Re: Lousy recovery for mclock and reef
On 24.05.2024 21:07, Mazzystr wrote:
I did the obnoxious task of updating ceph.conf and restarting all my osds.

  ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get osd_op_queue
  {
      "osd_op_queue": "wpq"
  }

I have some spare memory on my target host/osd and increased the target memory of that OSD to 10 GB and restarted. No effect observed. In fact mem usage on the host is stable, so I don't think the change took effect even with updating ceph.conf, a restart and a direct asok config set. The target memory value is confirmed to be set via asok config get.

Nothing has helped. I still cannot break the 21 MiB/s barrier. Does anyone have any more ideas?

For recovery you can adjust the following.

osd_max_backfills defaults to 1; on my system I get the best performance with 3 and wpq.

The following I have not adjusted myself, but you can try:
osd_recovery_max_active defaults to 3.
osd_recovery_op_priority defaults to 3; a lower number increases the priority for recovery.

All of them can be adjusted at runtime.

--
Kai Stian Olstad
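A sketch of how these can be changed at runtime; the values are only examples:

  ceph config set osd osd_max_backfills 3
  ceph config set osd osd_recovery_max_active 5
  # or push to the running daemons without persisting:
  ceph tell 'osd.*' injectargs '--osd-max-backfills 3 --osd-recovery-max-active 5'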
[ceph-users] Re: Setting S3 bucket policies with multi-tenants
On 12.04.2024 20:54, Wesley Dillingham wrote:
Did you actually get this working? I am trying to replicate your steps but am not being successful doing this with multi-tenant.

This is what we are using. The second statement is there so that the bucket owner will have access to the objects that the user is uploading.

s3-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::<tenant>:user/<user>"]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket>/*",
        "arn:aws:s3:::<bucket>"
      ]
    },
    {
      "Sid": "owner_full_access",
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::<tenant>:user/<bucket-owner>"]
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::*"
    }
  ]
}

And then run

  s3cmd setpolicy s3-policy.json s3://<bucket>

--
Kai Stian Olstad
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:
On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list, and that is about every OSD with data prior to adding ~100 empty OSDs.

How 400 read targets and 100 write targets can only equal ~60 backfills with osd_max_backfill set at 3 just makes no sense to me, but alas. It seems I can just increase osd_max_backfill even further to get the numbers I want, so that will do. Thank you all for taking the time to look at this.

It's a huge change, and 42% of your data needs to be moved. And this move is not only to the new OSDs but also between the existing OSDs, but they are busy with backfilling so they have no free backfill reservation.

I do recommend this document by Joshua Baergen at DigitalOcean that explains backfilling and the problem with it, and their solution, a tool called pgremapper.

Forgot the link
https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf

--
Kai Stian Olstad
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list, and that is about every OSD with data prior to adding ~100 empty OSDs.

How 400 read targets and 100 write targets can only equal ~60 backfills with osd_max_backfill set at 3 just makes no sense to me, but alas. It seems I can just increase osd_max_backfill even further to get the numbers I want, so that will do. Thank you all for taking the time to look at this.

It's a huge change, and 42% of your data needs to be moved. And this move is not only to the new OSDs but also between the existing OSDs, but they are busy with backfilling so they have no free backfill reservation.

I do recommend this document by Joshua Baergen at DigitalOcean that explains backfilling and the problem with it, and their solution, a tool called pgremapper.

--
Kai Stian Olstad
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
The other output is too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output?

You can attach files to the mail here on the list.

--
Kai Stian Olstad
[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month
On Fri, Mar 22, 2024 at 06:51:44PM +0100, Frédéric Nass wrote:
The OSD runs a bench and updates osd_mclock_max_capacity_iops_{hdd,ssd} every time the OSD is started. If you check the OSD log you'll see it does the bench.

Are you sure about the update on every start? Does the update happen only if the benchmark result is < 500 iops? Looks like the OSD does not remove any set configuration when the benchmark result is > 500 iops. Otherwise, the extremely low value that Michel reported earlier (less than 1 iops) would have been updated over time. I guess.

I'm not completely sure, it's a couple of months since I used mclock; I have switched back to wpq because of a nasty bug in mclock that can freeze cluster I/O.

It could be because I was testing osd_mclock_force_run_benchmark_on_init. The OSD had DB on SSD and data on HDD, so it measured about 1700 IOPS and that was ignored because of the 500 limit. So only the SSD got osd_mclock_max_capacity_iops_ssd set.

--
Kai Stian Olstad
[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month
On Fri, Mar 22, 2024 at 04:29:21PM +0100, Frédéric Nass wrote:
A/ these incredibly low values were calculated a while back with an unmature version of the code or under some specific hardware conditions and you can hope this won't happen again

The OSD runs a bench and updates osd_mclock_max_capacity_iops_{hdd,ssd} every time the OSD is started. If you check the OSD log you'll see it does the bench.

--
Kai Stian Olstad
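To inspect what an OSD ended up with, or to clear or override a bad value, something like this should do; osd.0 and the value 350 are just examples:

  ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
  ceph config dump | grep osd_mclock_max_capacity_iops
  # clear a bogus value so the next restart measures again
  ceph config rm osd.0 osd_mclock_max_capacity_iops_hdd
  # or pin it manually
  ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350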
[ceph-users] Re: pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"
Hi Eugen, thank you for the reply.

The OSD was drained over the weekend, so OSD 223 and 269 have only the problematic PG 404.bc.

I don't think moving the PG would help, since I don't have any empty OSD to move it to, and a move would not fix the hash mismatch. The reason I just want to have the problematic PG on the OSDs is to reduce recovery time.

I would need to set min_size to 4 in an EC 4+2, and stop them both at the same time to force a rebuild of the corrupted part of the PG that is on osd.223 and osd.269, since repair doesn't fix it.

I'm debating with myself whether I should
1. Stop both OSD 223 and 269, or
2. Just one of them.

Stopping them both, I'm guaranteed that the part of the PG on 223 and 269 is rebuilt from the 4 others, 297, 276, 136 and 197, that don't have any errors.

OSD 223 is the primary in the EC, pg 404.bc acting [223,297,269,276,136,197], so maybe just stop that one, wait for recovery and then run deep-scrub to check if things look better. But would it then use corrupted data on osd.269 to rebuild?

- Kai Stian Olstad

On 26.02.2024 10:19, Eugen Block wrote:

Hi,

I think your approach makes sense. But I'm wondering if moving only the problematic PGs to different OSDs could have an effect as well. I assume that moving the 2 PGs is much quicker than moving all BUT those 2 PGs. If that doesn't work you could still fall back to draining the entire OSDs (except for the problematic PG).

Regards,
Eugen

Zitat von Kai Stian Olstad:

Hi,

No one have any comment at all?

I'm not picky, so any speculation, guessing, I would, I wouldn't, should work and so on would be highly appreciated.

Since 4 out of 6 in EC 4+2 are OK and ceph pg repair doesn't solve it, I think the following might work.

pg 404.bc acting [223,297,269,276,136,197]

- Use pgremapper to move all PGs on OSD 223 and 269 except 404.bc to other OSDs.
- Set min_size to 4: ceph osd pool set default.rgw.buckets.data min_size 4
- Stop osd.223 and osd.269

What I hope will happen is that Ceph then recreates the 404.bc shards s0(osd.223) and s2(osd.269), since they are now down, from the remaining shards s1(osd.297), s3(osd.276), s4(osd.136) and s5(osd.197).

_Any_ comment is highly appreciated.

- Kai Stian Olstad

On 21.02.2024 13:27, Kai Stian Olstad wrote:

Hi,

Short summary

PG 404.bc is an EC 4+2 where s0 and s2 report hash mismatch for 698 objects. Ceph pg repair doesn't fix it; if you run deep-scrub on the PG after the repair is finished, it still reports scrub errors.

Why can't ceph pg repair repair this? It has 4 out of 6, so it should be able to reconstruct the corrupted shards. Is there a way to fix this? Like deleting the shard objects s0 and s2 so they are forced to be recreated?

Long detailed summary

A short backstory.

* This is the aftermath of problems with mclock, post "17.2.7: Backfilling deadlock / stall / stuck / standstill" [1].
  - 4 OSDs had a few bad sectors, I set all 4 out and the cluster stopped.
  - The solution was to swap from mclock to wpq and restart all OSDs.
  - When all backfilling was finished, all 4 OSDs were replaced.
  - osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.

PG / pool 404 is EC 4+2 default.rgw.buckets.data.

9 days after osd.223 and osd.269 were replaced, deep-scrub was run and reported errors.

ceph status
---
HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 404.bc is active+clean+inconsistent, acting [223,297,269,276,136,197]

I then ran repair
ceph pg repair 404.bc

And ceph status showed this

ceph status
---
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
    osd.223 had 698 reads repaired
    osd.269 had 698 reads repaired

But osd.223 and osd.269 are new disks, and the disks have no SMART errors or any I/O errors in the OS logs.

So I tried to run deep-scrub again on the PG.
ceph pg deep-scrub 404.bc

And got this result.

ceph status
---
HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
    osd.223 had 698 reads repaired
    osd.269 had 698 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair, acting [223,297,269,276,136,197]

698 + 698 = 1396, so the same amount of errors.

Ran repair again on 404.bc and ceph status is
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
    osd.223 had 1396 reads repaired
    osd.269 had 1396 reads repaired

So even when the repair finishes it doesn't fix the problem, since the errors reappear again after a deep-scrub.
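Expressed as commands, the plan being weighed above could look roughly like this; it is an untested sketch, the daemon handling assumes a cephadm-managed cluster, and whether to stop one OSD or both is exactly the open question:

  ceph osd pool set default.rgw.buckets.data min_size 4
  ceph orch daemon stop osd.223        # and possibly osd.269 as well
  # wait for pg 404.bc to finish recovering onto other OSDs, then
  ceph orch daemon start osd.223
  ceph pg deep-scrub 404.bc
  ceph osd pool set default.rgw.buckets.data min_size 5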
[ceph-users] Re: pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"
Hi,

No one have any comment at all?

I'm not picky, so any speculation, guessing, I would, I wouldn't, should work and so on would be highly appreciated.

Since 4 out of 6 in EC 4+2 are OK and ceph pg repair doesn't solve it, I think the following might work.

pg 404.bc acting [223,297,269,276,136,197]

- Use pgremapper to move all PGs on OSD 223 and 269 except 404.bc to other OSDs.
- Set min_size to 4: ceph osd pool set default.rgw.buckets.data min_size 4
- Stop osd.223 and osd.269

What I hope will happen is that Ceph then recreates the 404.bc shards s0(osd.223) and s2(osd.269), since they are now down, from the remaining shards s1(osd.297), s3(osd.276), s4(osd.136) and s5(osd.197).

_Any_ comment is highly appreciated.

- Kai Stian Olstad

On 21.02.2024 13:27, Kai Stian Olstad wrote:

Hi,

Short summary

PG 404.bc is an EC 4+2 where s0 and s2 report hash mismatch for 698 objects. Ceph pg repair doesn't fix it; if you run deep-scrub on the PG after the repair is finished, it still reports scrub errors.

Why can't ceph pg repair repair this? It has 4 out of 6, so it should be able to reconstruct the corrupted shards. Is there a way to fix this? Like deleting the shard objects s0 and s2 so they are forced to be recreated?

Long detailed summary

A short backstory.

* This is the aftermath of problems with mclock, post "17.2.7: Backfilling deadlock / stall / stuck / standstill" [1].
  - 4 OSDs had a few bad sectors, I set all 4 out and the cluster stopped.
  - The solution was to swap from mclock to wpq and restart all OSDs.
  - When all backfilling was finished, all 4 OSDs were replaced.
  - osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.

PG / pool 404 is EC 4+2 default.rgw.buckets.data.

9 days after osd.223 and osd.269 were replaced, deep-scrub was run and reported errors.

ceph status
---
HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 404.bc is active+clean+inconsistent, acting [223,297,269,276,136,197]

I then ran repair
ceph pg repair 404.bc

And ceph status showed this

ceph status
---
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
    osd.223 had 698 reads repaired
    osd.269 had 698 reads repaired

But osd.223 and osd.269 are new disks, and the disks have no SMART errors or any I/O errors in the OS logs.

So I tried to run deep-scrub again on the PG.
ceph pg deep-scrub 404.bc

And got this result.

ceph status
---
HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
    osd.223 had 698 reads repaired
    osd.269 had 698 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair, acting [223,297,269,276,136,197]

698 + 698 = 1396, so the same amount of errors.

Ran repair again on 404.bc and ceph status is
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
    osd.223 had 1396 reads repaired
    osd.269 had 1396 reads repaired

So even when the repair finishes it doesn't fix the problem, since the errors reappear again after a deep-scrub.

The logs for osd.223 and osd.269 contain "got incorrect hash on read" and "candidate had an ec hash mismatch" for 698 unique objects. I only show the logs for 1 of the 698 objects; the log is the same for the other 697 objects.
osd.223 log (only showing 1 of 698 object named 2021-11-08T19%3a43%3a50,145489260+00%3a00) --- Feb 20 10:31:00 ceph-hd-003 ceph-osd[3665432]: osd.223 pg_epoch: 231235 pg[404.bcs0( v 231235'1636919 (231078'1632435,231235'1636919] local-lis/les=226263/226264 n=296580 ec=36041/27862 lis/c=226263/226263 les/c/f=226264/230954/0 sis=226263) [223,297,269,276,136,197]p223(0) r=0 lpr=226263 crt=231235'1636919 lcod 231235'1636918 mlcod 231235'1636918 active+clean+scrubbing+deep+inconsistent+repair [ 404.bcs0: REQ_SCRUB ] MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] _scan_list 404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head got incorrect hash on read 0xc5d1dd1b != expected 0x7c2f86d7 Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]: log_channel(cluster) log [ERR] : 404.bc shard 223(0) soid 404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head : candidate had an ec hash mismatch Feb 20 10:31:01 ceph-hd-003 ceph-osd[366
[ceph-users] Re: Some questions about cephadm
On 21.02.2024 17:07, wodel youchi wrote:
- The documentation of ceph does not indicate what versions of grafana, prometheus, ...etc should be used with a certain version.
- I am trying to deploy Quincy, I did a bootstrap to see what containers were downloaded and their version.
- I am asking because I need to use a local registry to deploy those images.

You need to check the cephadm source for the version you would like to use
https://github.com/ceph/ceph/blob/v17.2.7/src/cephadm/cephadm#L46

--
Kai Stian Olstad
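Once the images are mirrored, cephadm can be pointed at them; a sketch assuming a hypothetical registry.local:5000, with the exact tags to be taken from the cephadm source linked above:

  ceph config set mgr mgr/cephadm/container_image_prometheus    registry.local:5000/prometheus/prometheus:v2.33.4
  ceph config set mgr mgr/cephadm/container_image_node_exporter registry.local:5000/prometheus/node-exporter:v1.3.1
  ceph config set mgr mgr/cephadm/container_image_grafana       registry.local:5000/ceph/ceph-grafana:8.3.5
  ceph config set mgr mgr/cephadm/container_image_alertmanager  registry.local:5000/prometheus/alertmanager:v0.23.0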
[ceph-users] pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"
r) log [ERR] : 404.bc shard 269(2) soid 404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head : candidate had an ec hash mismatch

osd.269 log (only showing 1 of 698 objects, named 2021-11-08T19%3a43%3a50,145489260+00%3a00)
---
Feb 20 10:31:00 ceph-hd-001 ceph-osd[3656897]: osd.269 pg_epoch: 231235 pg[404.bcs2( v 231235'1636919 (231078'1632435,231235'1636919] local-lis/les=226263/226264 n=296580 ec=36041/27862 lis/c=226263/226263 les/c/f=226264/230954/0 sis=226263) [223,297,269,276,136,197]p223(0) r=2 lpr=226263 luod=0'0 crt=231235'1636919 mlcod 231235'1636919 active mbc={}] _scan_list 404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head got incorrect hash on read 0x7c0871dc != expected 0xcf6f4c58

The logs for the other OSDs in the PG, osd.297, osd.276, osd.136 and osd.197, don't show any error.

If I try to get the object it fails

$ s3cmd s3://benchfiles/2021-11-08T19:43:50,145489260+00:00
download: 's3://benchfiles/2021-11-08T19:43:50,145489260+00:00' -> './2021-11-08T19:43:50,145489260+00:00'  [1 of 1]
ERROR: Download of './2021-11-08T19:43:50,145489260+00:00' failed (Reason: 500 (UnknownError))
ERROR: S3 error: 500 (UnknownError)

And the RGW log shows this

Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: == starting new request req=0x7f94b744d660 =
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: WARNING: set_req_state_err err_no=5 resorting to 500
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: == starting new request req=0x7f94b6e41660 =
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: == req done req=0x7f94b744d660 op status=-5 http_status=500 latency=0.02568s ==
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: beast: 0x7f94b744d660: 110.2.0.46 - test1 [21/Feb/2024:08:27:06.021 +] "GET /benchfiles/2021-11-08T19%3A43%3A50%2C145489260%2B00%3A00 HTTP/1.1" 500 226 - - - latency=0.020000568s

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG

--
Kai Stian Olstad
[ceph-users] Re: PG stuck at recovery
On 19.02.2024 23:23, Anthony D'Atri wrote:
After wrangling with this myself, both with 17.2.7 and to an extent with 17.2.5, I'd like to follow up here and ask:

Those who have experienced this, were the affected PGs
* Part of an EC pool?
* Part of an HDD pool?
* Both?

Both in my case. The EC is 4+2 jerasure blaum_roth, and the HDDs are hybrid where the DB is on SSD, shared by 5 HDDs.

And in your cases?

--
Kai Stian Olstad
[ceph-users] Re: Installing ceph s3.
On 12.02.2024 18:15, Albert Shih wrote:
I couldn't find a documentation about how to install a S3/Swift API (as I understand it's RadosGW) on quincy.

It depends on how you have installed Ceph. If you are using cephadm the docs are here
https://docs.ceph.com/en/reef/cephadm/services/rgw/

I can find some documentation on octopus
(https://docs.ceph.com/en/octopus/install/ceph-deploy/install-ceph-gateway/)

ceph-deploy is deprecated
https://docs.ceph.com/en/reef/install/

--
Kai Stian Olstad
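As a minimal cephadm sketch, where the service name, host names and port are only examples:

  ceph orch apply rgw myrgw --placement="2 host1 host2" --port=8080
  # or the spec-file equivalent:
  # ceph orch apply -i rgw.yml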
[ceph-users] Re: PG stuck at recovery
You don't say anything about the Ceph version you are running. I had an similar issue with 17.2.7, and is seams to be an issue with mclock, when I switch to wpq everything worked again. You can read more about it here https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG - Kai Stian Olstad On Tue, Feb 06, 2024 at 06:35:26AM -, LeonGao wrote: Hi community We have a new Ceph cluster deployment with 100 nodes. When we are draining an OSD host from the cluster, we see a small amount of PGs that cannot make any progress to the end. From the logs and metrics, it seems like the recovery progress is stuck (0 recovery ops for several days). Would like to get some ideas on this. Re-peering and OSD restart do resolve to mitigate the issue but we want to get to the root cause of it as draining and recovery happen frequently. I have put some debugging information below. Any help is appreciated, thanks! ceph -s pgs: 4210926/7380034104 objects misplaced (0.057%) 41198 active+clean 71active+remapped+backfilling 12active+recovering One of the stuck PG: 6.38f1 active+remapped+backfilling [313,643,727] 313 [313,643,717] 313 PG query result: ceph pg 6.38f1 query { "snap_trimq": "[]", "snap_trimq_len": 0, "state": "active+remapped+backfilling", "epoch": 246856, "up": [ 313, 643, 727 ], "acting": [ 313, 643, 717 ], "backfill_targets": [ "727" ], "acting_recovery_backfill": [ "313", "643", "717", "727" ], "info": { "pgid": "6.38f1", "last_update": "212333'38916", "last_complete": "212333'38916", "log_tail": "80608'37589", "last_user_version": 38833, "last_backfill": "MAX", "purged_snaps": [], "history": { "epoch_created": 3726, "epoch_pool_created": 3279, "last_epoch_started": 243987, "last_interval_started": 243986, "last_epoch_clean": 220174, "last_interval_clean": 220173, "last_epoch_split": 3726, "last_epoch_marked_full": 0, "same_up_since": 238347, "same_interval_since": 243986, "same_primary_since": 3728, "last_scrub": "212333'38916", "last_scrub_stamp": "2024-01-29T13:43:10.654709+", "last_deep_scrub": "212333'38916", "last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+", "last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+", "prior_readable_until_ub": 0 }, "stats": { "version": "212333'38916", "reported_seq": 413425, "reported_epoch": 246856, "state": "active+remapped+backfilling", "last_fresh": "2024-02-05T21:14:40.838785+", "last_change": "2024-02-03T22:33:43.052272+", "last_active": "2024-02-05T21:14:40.838785+", "last_peered": "2024-02-05T21:14:40.838785+", "last_clean": "2024-02-03T04:26:35.168232+", "last_became_active": "2024-02-03T22:31:16.037823+", "last_became_peered": "2024-02-03T22:31:16.037823+", "last_unstale": "2024-02-05T21:14:40.838785+", "last_undegraded": "2024-02-05T21:14:40.838785+", "last_fullsized": "2024-02-05T21:14:40.838785+", "mapping_epoch": 243986, "log_start": "80608'37589", "ondisk_log_start": "80608'37589", "created": 3726, "last_epoch_clean": 220174, "parent": "0.0", "parent_split_bits": 14, "last_scrub": "212333'38916", "last_scrub_stamp": "2024-01-29T13:43:10.654709+", "last_deep_scrub": "212333'38916"
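For reference, the switch itself is only a config change plus OSD restarts; a minimal sketch, where osd.8 is just an example daemon name:

  ceph config set osd osd_op_queue wpq
  # the new queue only takes effect after each OSD has been restarted, e.g.
  ceph orch daemon restart osd.8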
[ceph-users] Re: how can install latest dev release?
On 31.01.2024 09:38, garcetto wrote:
good morning, how can i install latest dev release using cephadm?

Have you looked at this page?
https://docs.ceph.com/en/latest/install/containers/#development-builds

--
Kai Stian Olstad
[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill
On 26.01.2024 23:09, Mark Nelson wrote:
For what it's worth, we saw this last week at Clyso on two separate customer clusters on 17.2.7 and also solved it by moving back to wpq. We've been traveling this week so haven't created an upstream tracker for it yet, but we're back to recommending wpq to our customers for all production cluster deployments until we figure out what's going on.

Thank you for confirming. Switching to wpq solved my problem too, and I have switched all production clusters to wpq.

I guess all my logs are gone by now, but I'll try to recreate the situation in the test cluster.

--
Kai Stian Olstad
[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill
On 26.01.2024 22:08, Wesley Dillingham wrote:
I faced a similar issue. The PG just would never finish recovery. Changing all OSDs in the PG to "osd_op_queue wpq" and then restarting them serially ultimately allowed the PG to recover. Seemed to be some issue with mclock.

Thank you Wes, switching to wpq and restarting the OSDs fixed it for me too.

--
Kai Stian Olstad
[ceph-users] 17.2.7: Backfilling deadlock / stall / stuck / standstill
Hi, This is a cluster running 17.2.7 upgraded from 16.2.6 on the 15 January 2024. On Monday 22 January we had 4 HDD all on different server with I/O-error because of some damage sectors, the OSD is hybrid so the DB is on SSD, 5 HDD share 1 SSD. I set the OSD out, ceph osd out 223 269 290 318 and all hell broke loose. I took only minutes before the users complained about Ceph not working. Ceph status reportet slow OPS on the OSDs that was set to out, and “ceph tell osd. dump_ops_in_flight” against the out OSDs it just hang, after 30 minutes I stopped the dump command. Long story short I ended up running “ceph osd set nobackfill” to slow ops was gone and then unset it when the slow ops message disappeared. I needed to run that all the time so the cluster didn’t come to a holt so this oneliner loop was used “while true; do ceph -s | grep -qE "oldest one blocked for [0-9]{2,}" && (date; ceph osd set nobackfill; sleep 15; ceph osd unset nobackfill); sleep 10; done” But now 4 days later the backfilling has stopped progressing completely and the number of misplaced object is increasing. Some PG has 0 misplaced object but sill have backfilling state, and been in this state for over 24 hours now. I have a hunch that it’s because of PG 404.6e7 is in state “active+recovering+degraded+remapped” it’s been in this state for over 48 hours. It’s has possible 2 missing object, but since they are not unfound I can’t delete them with “ceph pg 404.6e7 mark_unfound_lost delete” Could someone please help to solve this? Down below is some output of ceph commands, I’ll also attache them. ceph status (only removed information about no running scrub and deep_scrub) --- cluster: id: b321e76e-da3a-11eb-b75c-4f948441dcd0 health: HEALTH_WARN Degraded data redundancy: 2/6294904971 objects degraded (0.000%), 1 pg degraded services: mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 11d) mgr: ceph-mon-1.ptrsea(active, since 11d), standbys: ceph-mon-2.mfdanx mds: 1/1 daemons up, 1 standby osd: 355 osds: 355 up (since 22h), 351 in (since 4d); 18 remapped pgs rgw: 7 daemons active (7 hosts, 1 zones) data: volumes: 1/1 healthy pools: 14 pools, 3945 pgs objects: 1.14G objects, 1.1 PiB usage: 1.8 PiB used, 1.2 PiB / 3.0 PiB avail pgs: 2/6294904971 objects degraded (0.000%) 2980455/6294904971 objects misplaced (0.047%) 3901 active+clean 22 active+clean+scrubbing+deep 17 active+remapped+backfilling 4active+clean+scrubbing 1active+recovering+degraded+remapped io: client: 167 MiB/s rd, 13 MiB/s wr, 6.02k op/s rd, 2.35k op/s wr ceph health detail (only removed information about no running scrub and deep_scrub) --- HEALTH_WARN Degraded data redundancy: 2/6294902067 objects degraded (0.000%), 1 pg degraded [WRN] PG_DEGRADED: Degraded data redundancy: 2/6294902067 objects degraded (0.000%), 1 pg degraded pg 404.6e7 is active+recovering+degraded+remapped, acting [223,274,243,290,286,283] ceph pg 202.6e7 list_unfound --- { "num_missing": 2, "num_unfound": 0, "objects": [], "state": "Active", "available_might_have_unfound": true, "might_have_unfound": [], "more": false } ceph pg 404.6e7 query | jq .recovery_state --- [ { "name": "Started/Primary/Active", "enter_time": "2024-01-26T09:08:41.918637+", "might_have_unfound": [ { "osd": "243(2)", "status": "already probed" }, { "osd": "274(1)", "status": "already probed" }, { "osd": "275(0)", "status": "already probed" }, { "osd": "283(5)", "status": "already probed" }, { "osd": "286(4)", "status": "already probed" }, { "osd": "290(3)", "status": "already probed" }, { "osd": "335(3)", 
"status": "already probed" } ], "recovery_progress": { "backfill_targets": [ "275(0)", "335(3)" ], "waiting_on_backfill": [], "last_backfill_started": "404:e76011a9:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.18_56463c71-286c-4399-8d5d-0c278b7c97fd:head", "backfill_info": { "begin": "MIN", "end": "MIN", "objects": [] }, "peer_backfill_info": [], "backfills_in_flight": [], "recovering": [], "pg_backend": { "recovery_ops": [], "read_ops": [] } } }, { "name": "Started", "enter_time": "2024-01-26T09:08:40.909151+" } ] ceph pg ls recovering backfilling --- PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOGLOG_DUPS STATE SINCE VERSION REPORTED UP ACTING 404.bc287986 0
[ceph-users] Re: podman / docker issues
On 25.01.2024 18:19, Marc wrote:
More and more I am annoyed with the 'dumb' design decisions of redhat. Just now I have an issue on an 'air gapped' vm that I am unable to start a docker/podman container because it tries to contact the repository to update the image and instead of using the on disk image it just fails. (Not to mention the %$#$%#$ that design containers to download stuff from the internet on startup)

I was wondering if this is also an issue with ceph-admin. Is there an issue with starting containers when container image repositories are not available or when there is no internet connection.

Of course cephadm will fail if the container registry is not available and the image isn't pulled locally. But you don't need to use the official registry, so using it air-gapped is not a problem. Just download the images you need to your local registry and specify it; some details are here
https://docs.ceph.com/en/reef/cephadm/install/#deployment-in-an-isolated-environment

The containers themselves don't need to download anything at start.

--
Kai Stian Olstad
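A rough sketch of a bootstrap against such a mirror; the registry name, image tag, credentials and IP are placeholders:

  cephadm --image registry.local:5000/ceph/ceph:v18.2.1 bootstrap --mon-ip 192.0.2.10
  # if the local registry requires authentication:
  ceph cephadm registry-login registry.local:5000 <user> <password>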
[ceph-users] Re: Cephadm orchestrator and special label _admin in 17.2.7
On 23.01.2024 18:19, Albert Shih wrote:
Just like to know if it's a very bad idea to do a rsync of /etc/ceph from the «_admin» server to the other ceph cluster servers.

I in fact add something like

  for host in `cat /usr/local/etc/ceph_list_noeuds.txt`
  do
    /usr/bin/rsync -av /etc/ceph/ceph* $host:/etc/ceph/
  done

in a cronjob

Why not just add the _admin label to the host and let Ceph do the job?

You can also run this to get ceph.conf copied to all hosts

  ceph config set mgr mgr/cephadm/manage_etc_ceph_ceph_conf true

Anyway, I don't see any problem with rsyncing it, it's just ceph.conf and the admin key.

--
Kai Stian Olstad
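Adding the label is a one-liner; the host name is just an example:

  ceph orch host label add ceph-node-5 _admin
  ceph orch host ls    # verify the label is set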
[ceph-users] Re: About lost disk with erasure code
On 27.12.2023 04:54, Phong Tran Thanh wrote:
Thank you for your knowledge. I have a question. Which pool is affected when the PG is down, and how can I show it? When a PG is down, is only one pool affected or are multiple pools affected?

If only 1 PG is down, only 1 pool is affected.

The name of a PG is {pool-num}.{pg-id}, and the pool numbers you find with "ceph osd lspools".

ceph health detail
will show which PG is down and all other issues.

ceph pg ls
will show you all PGs, their status and the OSDs they are running on.

Some useful links
https://docs.ceph.com/en/quincy/rados/operations/monitoring-osd-pg/#monitoring-pg-states
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/
https://docs.ceph.com/en/latest/dev/placement-group/#user-visible-pg-states

--
Kai Stian Olstad
[ceph-users] Re: Ceph 16.2.14: osd crash, bdev() _aio_thread got r=-1 ((1) Operation not permitted)
On Sun, Dec 03, 2023 at 06:53:08AM +0200, Zakhar Kirpichenko wrote:
One of our 16.2.14 cluster OSDs crashed again because of the dreaded https://tracker.ceph.com/issues/53906 bug.

It would be good to understand what has triggered this condition and how it can be resolved without rebooting the whole host. I would very much appreciate any suggestions.

If you look closely at 53906 you'll see it's a duplicate of
https://tracker.ceph.com/issues/53907

In there you have the fix and a workaround until the next minor is released.

--
Kai Stian Olstad
[ceph-users] Re: ceph osd dump_historic_ops
On Fri, Dec 01, 2023 at 04:33:20PM +0700, Phong Tran Thanh wrote:
I have a problem with my osd, I want to show dump_historic_ops of the osd.
I followed the guide:
https://www.ibm.com/docs/en/storage-fusion/2.6?topic=alerts-cephosdslowops
But when I run the command ceph daemon osd.8 dump_historic_ops it shows this error, and the command is run on the node with osd.8:

  Can't get admin socket path: unable to get conf option admin_socket for osd: b"error parsing 'osd': expected string of the form TYPE.ID, valid types are: auth, mon, osd, mds, mgr, client\n"

I am running a reef cluster installed with cephadm.
What should I do?

The easiest is to use tell, then you can run it on any node that has access to Ceph.

  ceph tell osd.8 dump_historic_ops

  ceph tell osd.8 help

will give you all you can do with tell.

--
Kai Stian Olstad
[ceph-users] Re: How to speed up rgw lifecycle
On Tue, Nov 28, 2023 at 02:55:56PM +0700, VÔ VI wrote:
My ceph cluster is using s3 with three pools, approximately 4.5k obj/s, and the rgw lifecycle delete per pool is only 60-70 objects/s.

How can I speed up the lc rgw process? 60-70 objects/s is too slow.

It's explained in the documentation, have you tried that?
https://docs.ceph.com/en/reef/radosgw/config-ref/#lifecycle-settings

--
Kai Stian Olstad
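The settings on that page boil down to something like this; the values are examples only, and the RGW daemons need a restart afterwards:

  ceph config set client.rgw rgw_lc_max_worker 5
  ceph config set client.rgw rgw_lc_max_wp_worker 6
  ceph config set client.rgw rgw_lifecycle_work_time "00:00-23:59"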
[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?
On 09.10.2023 10:05, Zakhar Kirpichenko wrote:
I did try to play with various debug settings. The issue is that mons produce logs of all commands issued by clients, not just mgr. For example, an Openstack Cinder node asking for space it can use:

  Oct 9 07:59:01 ceph03 bash[4019]: debug 2023-10-09T07:59:01.303+

This log says that it's bash with PID 4019 that is creating the log entry. Maybe start there; check what other things you are running on the server that create these messages.

--
Kai Stian Olstad
[ceph-users] Re: cannot repair a handful of damaged pg's
On 06.10.2023 17:48, Wesley Dillingham wrote:
A repair is just a type of scrub and it is also limited by osd_max_scrubs, which in pacific is 1. If another scrub is occurring on any OSD in the PG it won't start.

Do "ceph osd set noscrub" and "ceph osd set nodeep-scrub", wait for all scrubs to stop (a few seconds probably), then issue the pg repair command again. It may start.

You also have pgs in backfilling state. Note that by default OSDs in backfill or backfill_wait also won't perform scrubs. You can modify this behavior with `ceph config set osd osd_scrub_during_recovery true`.

I would suggest only setting that after the noscrub flags are set and the only scrub you want to get processed is your manual repair. Then rm the scrub_during_recovery config item before unsetting the noscrub flags.

Hi Simon

Just to add to Wes's answer, CERN have made a nice script that does the steps Wes explained above
https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh
that you might want to take a look at.

--
Kai Stian Olstad
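Spelled out as commands, the sequence Wes describes looks roughly like this, with <pgid> as a placeholder for the inconsistent PG:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # wait for running scrubs to finish, then allow scrubbing during recovery
  ceph config set osd osd_scrub_during_recovery true
  ceph pg repair <pgid>
  # once the repair has run:
  ceph config rm osd osd_scrub_during_recovery
  ceph osd unset nodeep-scrub
  ceph osd unset noscrub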
[ceph-users] Re: Questions about PG auto-scaling and node addition
On Wed, Sep 13, 2023 at 04:33:32PM +0200, Christophe BAILLON wrote:
We have a cluster with 21 nodes, each having 12 x 18TB, and 2 NVMe for db/wal. We need to add more nodes. The last time we did this, the PGs remained at 1024, so the number of PGs per OSD decreased. Currently, we are at 43 PGs per OSD.

Does auto-scaling work correctly in Ceph version 17.2.5?

I would believe so; it's working as designed. By default the auto-scaler increases the number of PGs based on how much data is stored. So when you add OSDs, data usage stays the same and therefore no scaling is done.

Should we increase the number of PGs before adding nodes?

Adding nodes/OSDs and changing the number of PGs both involve a lot of data being copied around. So if those two could be combined, you only need to copy the data once instead of twice. But whether that is smart or possible I'm not sure.

Should we keep PG auto-scaling active? If we disable auto-scaling, should we increase the number of PGs to reach 100 PGs per OSD?

If you know how much data is going to be stored in a pool, the best way is to set the number of PGs up front, because every time the auto-scaler changes the number of PGs you will have a huge amount of data being copied around to other OSDs.

You can set the target size or target ratio[1] and the auto-scaler will set the appropriate number of PGs on the pool. But if you know how much data is going to be stored in a pool you can turn it off and just set it manually.

100 is a rule of thumb, but with such large disks you could, or maybe should, consider having a higher number of PGs per OSD.

[1] https://docs.ceph.com/en/quincy/rados/operations/placement-groups/#viewing-pg-scaling-recommendations

--
Kai Stian Olstad
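For reference, the two approaches as commands, with <pool> and the values as placeholders:

  # tell the autoscaler how much of the cluster the pool is expected to use
  ceph osd pool set <pool> target_size_ratio 0.8
  # or disable autoscaling for the pool and set the PG count yourself
  ceph osd pool set <pool> pg_autoscale_mode off
  ceph osd pool set <pool> pg_num 2048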
[ceph-users] Re: precise/best way to check ssd usage
On Fri, Jul 28, 2023 at 07:13:33PM +0000, Marc wrote:
I have a use % between 48% and 57%, and assume that with a node failure 1/3 (only using 3x repl.) of this 57% needs to be able to migrate and be added to a different node.

If you by this mean you have 3 nodes with 3x replica and failure domain set to host, it's my understanding that no data will be migrated/backfilled when a node fails. The reason is that there is nowhere to copy the data to, to fulfil the crush rule of one copy on 3 different hosts.

--
Kai Stian Olstad
[ceph-users] Re: [EXTERNAL] How to change RGW certificate in Cephadm?
On Thu, Jun 15, 2023 at 03:58:40PM +0000, Beaman, Joshua wrote:
We resolved our HAProxy woes by creating a custom jinja2 template and deploying as:

  ceph config-key set mgr/cephadm/services/ingress/haproxy.cfg -i /tmp/haproxy.cfg.j2

Thanks, I wish I had known that a few months ago, before I threw out ingress.

But we redeploy new certs the same way you described, and then:

  ceph orch reconfig ingress.rgw.default.default
  ceph orch restart rgw.default.default

This is all done in the same ansible playbook we use to do initial deployment, but I don't see anything else in there that looks like it would be needed to update the certs.

After testing this I will claim this is a bug.

The first time "ceph orch apply -i /etc/ceph/rgw.yml" is run, it creates two keys, mgr/cephadm/spec.rgw.pech and rgw/cert/rgw.pech.

But later, when the spec file is updated and apply is run again, only mgr/cephadm/spec.rgw.pech is updated. When the RGW starts, the log says it is using the certificate in rgw/cert/rgw.pech.

So, if I read out the certificate from mgr/cephadm/spec.rgw.pech and add that to rgw/cert/rgw.pech and then restart the RGWs, they pick up the new certificate.

The commands to do this

  ceph config-key get mgr/cephadm/spec.rgw.pech | jq -r .spec.spec.rgw_frontend_ssl_certificate | ceph config-key set rgw/cert/rgw.pech -
  ceph orch restart rgw.pech

My claim is that Ceph should update "rgw/cert/rgw.pech" when "mgr/cephadm/spec.rgw.pech" is updated.

--
Kai Stian Olstad
[ceph-users] Re: Bottleneck between loadbalancer and rgws
On Wed, Jun 14, 2023 at 02:19:14PM +0000, Szabo, Istvan (Agoda) wrote:
I'll try to increase it in my small cluster, let's see if there is any improvement there, thank you.
Any reason, if it has enough memory, not to increase it?

I tried to find where I read it, but with no luck. I think it said it's more beneficial to run more RGWs on the same host than to increase rgw_max_concurrent_requests, without any explanation.

In my search for where I read it I did find this
https://ceph.io/en/news/blog/2022/three-large-scale-clusters/
which links to
https://tracker.ceph.com/issues/54124

And here they set rgw_max_concurrent_requests to 10240
https://www.seagate.com/content/dam/seagate/migrated-assets/www-content/solutions/partners/red-hat/_shared/files/st-seagate-rhcs5-detail-f29951wg-202110-en.pdf

So I think the only way to find out is to increase it and see what happens.

--
Kai Stian Olstad
[ceph-users] Re: [EXTERNAL] How to change RGW certificate in Cephadm?
On Wed, Jun 14, 2023 at 03:43:17PM +0000, Beaman, Joshua wrote:
Do you have an ingress service for HAProxy/keepalived? If so, that's the service that you will need to have orch redeploy/restart. If not, maybe try `ceph orch redeploy pech` ?

No ingress, but we did have it running at one time with the spec file

  service_type: ingress
  service_id: rgw.pech

This was removed a while ago with

  ceph orch rm ingress.rgw.pech

because haproxy did not have sane values for our environment; the timeout was too low and it was hard coded. We then applied the spec file in my previous mail. So we are only running multiple RGWs with SSL. Load balancing and HA is done with PowerDNS with LUA records.

ceph orch redeploy pech
only gives me an error
  pech is not a valid daemon name

We have a service named rgw.pech

  ceph orch ls --service_name=rgw.pech
  NAME      PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
  rgw.pech  ?:443      7/7  4m ago     22h  label:cog

But running ceph orch redeploy rgw.pech will redeploy all 7 RGWs; ceph orch daemon redeploy rgw.pech.pech-mon-3.upnvrd does the same but for only one of them.

From: Kai Stian Olstad
The certificate is about to expire so I would like to update it. I updated the rgw.yml spec with the new certificate and ran

  ceph orch apply -i /etc/ceph/rgw.yml

But nothing happened, so I tried to redeploy one of them with

  ceph orch daemon redeploy rgw.pech.pech-mon-3.upnvrd

It redeployed the RGW, but it still uses the old certificate.

ceph config-key list | grep rgw
gives me two keys of interest
  mgr/cephadm/spec.rgw.pech and rgw/cert/rgw.pech

The content of mgr/cephadm/spec.rgw.pech is the new spec file with the updated certificates, but rgw/cert/rgw.pech only contains a certificate and private key, and the certificate is the old one about to expire.

When I run
  ceph orch daemon redeploy rgw.pech.pech-mon-3.upnvrd
the log says it is using rgw/cert/rgw.pech, which contains the old certificate.

  0 framework: beast
  0 framework conf key: ssl_port, val: 443
  0 framework conf key: ssl_certificate, val: config://rgw/cert/rgw.pech

--
Kai Stian Olstad
[ceph-users] Re: Bottleneck between loadbalancer and rgws
On Wed, Jun 14, 2023 at 01:44:40PM +0000, Szabo, Istvan (Agoda) wrote:
I have a dedicated loadbalancer pair separated on 2x baremetal servers, and behind the haproxy balancers I have 3 mon/mgr/rgw nodes. Each rgw node has 2 rgws on it, so in the cluster altogether 6 (now I just added one more, so currently 9).

Today I see pretty high GET latency in the cluster (3-4s) and it seems like the limitation is the gateways: https://i.ibb.co/ypXFL34/1.png

In this netstat it seems like the established connections max out around 2-3k. When I added one more gateway it increased. Seems like the gateway node or the gateway instance has some limitation.

What is the value which is around 1000? I haven't really found it; does something affect GET and limit the connections on linux?

It could be rgw_max_concurrent_requests[1], which defaults to 1024.

I read somewhere that it should not be increased, but it could be increased to 2048. The recommended action was to add more gateways instead.

[1] https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_max_concurrent_requests

--
Kai Stian Olstad
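If you want to experiment with it anyway, it can be raised in the config database; the value is only an example, and restart the RGW daemons afterwards to be safe:

  ceph config set client.rgw rgw_max_concurrent_requests 2048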
[ceph-users] How to change RGW certificate in Cephadm?
When I enabled RGW in cephadm I used this spec file

rgw.yml
service_type: rgw
service_id: pech
placement:
  label: cog
spec:
  ssl: true
  rgw_frontend_ssl_certificate: |
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
    -----BEGIN RSA PRIVATE KEY-----
    -----END RSA PRIVATE KEY-----

And enabled it with

  ceph orch apply -i /etc/ceph/rgw.yml

The certificate is about to expire so I would like to update it.

I updated the rgw.yml spec with the new certificate and ran

  ceph orch apply -i /etc/ceph/rgw.yml

But nothing happened, so I tried to redeploy one of them with

  ceph orch daemon redeploy rgw.pech.pech-mon-3.upnvrd

It redeployed the RGW, but it still uses the old certificate.

ceph config-key list | grep rgw
gives me two keys of interest
  mgr/cephadm/spec.rgw.pech and rgw/cert/rgw.pech

The content of mgr/cephadm/spec.rgw.pech is the new spec file with the updated certificates, but rgw/cert/rgw.pech only contains a certificate and private key, and the certificate is the old one about to expire.

I have looked in the documentation and can't find how to update the certificate for RGW.

Can anyone shed some light on how to replace the certificate?

--
Kai Stian Olstad
[ceph-users] Re: s3 compatible interface
On Wed, Mar 01, 2023 at 08:39:56AM -0500, Daniel Gryniewicz wrote:
We're actually writing this for RGW right now. It'll be a bit before it's productized, but it's in the works.

Just curious, what are the use cases for this feature? S3 against CephFS?

--
Kai Stian Olstad
[ceph-users] Re: 1 pg recovery_unfound after multiple crash of an OSD
Hi Just a follow up, the issue was solved by running command ceph pg 404.1ff mark_unfound_lost delete - Kai Stian Olstad On 04.01.2023 13:00, Kai Stian Olstad wrote: Hi We are running Ceph 16.2.6 deployed with Cephadm. Around Christmas OSD 245 and 327 had about 20 read error so I set them to out. Around new year another OSD 313 more or less died since is become so slow that it triggered Linux default I/O-timeout of 30 seconds. In this period the OSD crashed 8 times and was restartet by Systemd and we ended up with [WRN] OBJECT_UNFOUND: 1/416287126 objects unfound (0.000%) pg 404.1ff has 1 unfound objects [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound [WRN] PG_DEGRADED: Degraded data redundancy: 5/2364745884 objects degraded (0.000%), 1 pg degraded pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound The pool 404 is "default.rgw.buckets.data" and pool 404 is erasure encoding 4+2. I have search for a solution but with no luck, what I have tried is - Restarted all 6 OSD for the PG one by one - Running repair of 404.1ff Output of following command - ceph -s - ceph health detail - ceph pg ls | grep -e PG -e ^404.1ff - ceph osd pool ls detail | grep 404 - ceph osd tree out - ceph crash ls | grep -e ID -e osd.313 - ceph pg 404.1ff list_unfound - ceph pg 404.1ff Is appended below, can also be read here https://gitlab.com/-/snippets/2479624 or cloned with "git clone https://gitlab.com/-/snippets/2479624"; Does anyone have any idea on how to resolv the problem? Any help is much appreciated. - Kai Stian Olstad :: ceph-s.txt :: ceph -s --- cluster: id: d13c6b81-51ee-4d22-84e9-456f9307296c health: HEALTH_ERR 1/416287125 objects unfound (0.000%) Possible data damage: 1 pg recovery_unfound Degraded data redundancy: 5/2364745860 objects degraded (0.000%), 1 pg degraded services: mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 2M) mgr: ceph-mon-2.mfdanx(active, since 3w), standbys: ceph-mon-1.ptrsea mds: 1/1 daemons up, 1 standby osd: 355 osds: 355 up (since 20h), 352 in (since 2d); 1 remapped pgs rgw: 4 daemons active (4 hosts, 1 zones) data: volumes: 1/1 healthy pools: 14 pools, 2505 pgs objects: 416.29M objects, 540 TiB usage: 939 TiB used, 2.1 PiB / 3.0 PiB avail pgs: 5/2364745860 objects degraded (0.000%) 137931/2364745860 objects misplaced (0.006%) 1/416287125 objects unfound (0.000%) 2489 active+clean 14 active+clean+scrubbing+deep 1active+recovery_unfound+degraded+remapped 1active+clean+scrubbing io: client: 38 MiB/s rd, 23 MiB/s wr, 2.58k op/s rd, 326 op/s wr progress: Global Recovery Event (6d) [===.] 
(remaining: 3m) :: ceph_health_detail.txt :: ceph health detail -- HEALTH_ERR 1/416287126 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 5/2364745884 objects degraded (0.000%), 1 pg degraded [WRN] OBJECT_UNFOUND: 1/416287126 objects unfound (0.000%) pg 404.1ff has 1 unfound objects [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound [WRN] PG_DEGRADED: Degraded data redundancy: 5/2364745884 objects degraded (0.000%), 1 pg degraded pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound :: ceph_pg_ls.txt :: ceph pg ls | grep -e PG -e ^404.1ff --- PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOGSTATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP 404.1ff 137912 5 1379081 282417561722 0 0 5528 active+recovery_unfound+degraded+remapped 19h141748'724163 141748:3558203 [208,220,269,175,343,329]p208 [208,220,269,175,313,329]p208 2022-12-31T19:27:10.993286+ 2022-12-31T19:27:10.993286+ :: ceph_osd_pool_ls_detail.txt :: ceph osd pool ls detail | grep 404 -- pool 404 'default.rgw.buckets.data' erasure profile ec42-jerasure-blaum_roth-hdd size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode on last_change 124077 lfor 0/52091/108555 flags hashpspool stripe_width 229376 target_size_bytes 1099511627776000 application rgw :: ceph_osd_tree_out.txt :: ceph osd tree out
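For reference, the commands relevant to this situation, with the PG id taken from the thread; per the Ceph documentation the revert option is not available for erasure coded pools, which is presumably why delete was used here:

  ceph pg 404.1ff list_unfound
  ceph pg 404.1ff mark_unfound_lost revert   # roll objects back to a prior version (replicated pools only)
  ceph pg 404.1ff mark_unfound_lost delete   # give up on the unfound objects entirely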
[ceph-users] 1 pg recovery_unfound after multiple crash of an OSD
Hi We are running Ceph 16.2.6 deployed with Cephadm. Around Christmas OSD 245 and 327 had about 20 read error so I set them to out. Around new year another OSD 313 more or less died since is become so slow that it triggered Linux default I/O-timeout of 30 seconds. In this period the OSD crashed 8 times and was restartet by Systemd and we ended up with [WRN] OBJECT_UNFOUND: 1/416287126 objects unfound (0.000%) pg 404.1ff has 1 unfound objects [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound [WRN] PG_DEGRADED: Degraded data redundancy: 5/2364745884 objects degraded (0.000%), 1 pg degraded pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound The pool 404 is "default.rgw.buckets.data" and pool 404 is erasure encoding 4+2. I have search for a solution but with no luck, what I have tried is - Restarted all 6 OSD for the PG one by one - Running repair of 404.1ff Output of following command - ceph -s - ceph health detail - ceph pg ls | grep -e PG -e ^404.1ff - ceph osd pool ls detail | grep 404 - ceph osd tree out - ceph crash ls | grep -e ID -e osd.313 - ceph pg 404.1ff list_unfound - ceph pg 404.1ff Is appended below, can also be read here https://gitlab.com/-/snippets/2479624 or cloned with "git clone https://gitlab.com/-/snippets/2479624"; Does anyone have any idea on how to resolv the problem? Any help is much appreciated. - Kai Stian Olstad :: ceph-s.txt :: ceph -s --- cluster: id: d13c6b81-51ee-4d22-84e9-456f9307296c health: HEALTH_ERR 1/416287125 objects unfound (0.000%) Possible data damage: 1 pg recovery_unfound Degraded data redundancy: 5/2364745860 objects degraded (0.000%), 1 pg degraded services: mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 2M) mgr: ceph-mon-2.mfdanx(active, since 3w), standbys: ceph-mon-1.ptrsea mds: 1/1 daemons up, 1 standby osd: 355 osds: 355 up (since 20h), 352 in (since 2d); 1 remapped pgs rgw: 4 daemons active (4 hosts, 1 zones) data: volumes: 1/1 healthy pools: 14 pools, 2505 pgs objects: 416.29M objects, 540 TiB usage: 939 TiB used, 2.1 PiB / 3.0 PiB avail pgs: 5/2364745860 objects degraded (0.000%) 137931/2364745860 objects misplaced (0.006%) 1/416287125 objects unfound (0.000%) 2489 active+clean 14 active+clean+scrubbing+deep 1active+recovery_unfound+degraded+remapped 1active+clean+scrubbing io: client: 38 MiB/s rd, 23 MiB/s wr, 2.58k op/s rd, 326 op/s wr progress: Global Recovery Event (6d) [===.] 
(remaining: 3m) :: ceph_health_detail.txt :: ceph health detail -- HEALTH_ERR 1/416287126 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 5/2364745884 objects degraded (0.000%), 1 pg degraded [WRN] OBJECT_UNFOUND: 1/416287126 objects unfound (0.000%) pg 404.1ff has 1 unfound objects [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound [WRN] PG_DEGRADED: Degraded data redundancy: 5/2364745884 objects degraded (0.000%), 1 pg degraded pg 404.1ff is active+recovery_unfound+degraded+remapped, acting [208,220,269,175,313,329], 1 unfound :: ceph_pg_ls.txt :: ceph pg ls | grep -e PG -e ^404.1ff --- PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOGSTATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP 404.1ff 137912 5 1379081 282417561722 0 0 5528 active+recovery_unfound+degraded+remapped19h 141748'724163 141748:3558203 [208,220,269,175,343,329]p208 [208,220,269,175,313,329]p208 2022-12-31T19:27:10.993286+ 2022-12-31T19:27:10.993286+ :: ceph_osd_pool_ls_detail.txt :: ceph osd pool ls detail | grep 404 -- pool 404 'default.rgw.buckets.data' erasure profile ec42-jerasure-blaum_roth-hdd size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode on last_change 124077 lfor 0/52091/108555 flags hashpspool stripe_width 229376 target_size_bytes 1099511627776000 application rgw :: ceph_osd_tree_out.txt :: ceph osd tree out - ID CLASS WEIGHT TYPE NAME STATUS REWEI
[ceph-users] Re: CephFS: Isolating folders for different users
On 22.12.2022 15:47, Jonas Schwab wrote: Now the question: Since I established this setup more or less through trial and error, I was wondering if there is a more elegant/better approach than what is outlined above? You can use namespace so you don't need separate pools. Unfortunately the documentation is sparse on the subject, I use it with subvolume like this # Create a subvolume ceph fs subvolume create --pool_layout --namespace-isolated The subvolume is created with namespace fsvolume_ You can also find the name with ceph fs subvolume info | jq -r .pool_namespace # Create a user with access to the subvolume and the namespace ## First find the path to the subvolume ceph fs subvolume getpath ## Create the user ceph auth get-or-create client. mon 'allow r' mds 'allow rw path=' osd 'allow rw pool= namespace=fsvolumens_' I have found this by looking at how Openstack does it and some trial and error. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
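A fleshed-out sketch of the same recipe, with placeholder names (the volume "cephfs", subvolume "projectA" and data pool "cephfs_data" are assumptions, not from the post):

  # create a namespace-isolated subvolume
  ceph fs subvolume create cephfs projectA --namespace-isolated

  # find the namespace and path it was given
  ceph fs subvolume info cephfs projectA | jq -r .pool_namespace
  SUBVOL_PATH="$(ceph fs subvolume getpath cephfs projectA)"

  # create a client restricted to that path and namespace
  ceph auth get-or-create client.projectA \
      mon 'allow r' \
      mds "allow rw path=${SUBVOL_PATH}" \
      osd 'allow rw pool=cephfs_data namespace=fsvolumens_projectA'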
[ceph-users] Re: Mails not getting through?
On 16.11.2022 13:21, E Taka wrote: gmail marks too many messages on this mailing list as spam. You can fix that by creating a filter in Gmail for ceph-users@ceph.io and check the "Never send it to Spam". -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Mails not getting through?
On 16.11.2022 00:25, Daniel Brunner wrote: are my mails not getting through? is anyone receiving my emails? You can check this yourself by checking the archives https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/ If you see your mail there, they are getting through. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: monitoring drives
On 17.10.2022 12:52, Ernesto Puerta wrote: - Ceph already exposes SMART-based health-checks, metrics and alerts from the devicehealth/diskprediction modules <https://docs.ceph.com/en/latest/rados/operations/devices/#enabling-monitoring>. I find this kind of high-level monitoring more digestible to operators than low-level SMART metrics. Marc that started this thread was asking about SAS disk. smartctl doesn't show much SMART Attributes on SAS disk, but some drive only have error log like this Error counter log: Errors Corrected by Total Correction GigabytesTotal ECC rereads/errors algorithm processeduncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 00 0 0 376907 93335.728 0 write: 02 0 22113307 17978.600 0 verify:00 0 0848 0.002 0 But for the drive I have is look like they all have SMART Health Status. "SMART Health Status: OK" Ceph doesn't support SMART or any status on SAS disk today, I only get the message "No SMART data available". I have gathered "smartctl -x --json=vo" log for the 6 types of SAS this I have in my possession. You can find them here if interested [1] [1] https://gitlab.com/-/snippets/2431089 -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
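Even without ATA-style attributes, most SAS drives report an overall health status that can be collected per host; a rough sketch (the device glob is an assumption, adjust it to your controller layout):

  for dev in /dev/sd?; do
      echo "$dev: $(smartctl -H --json "$dev" | jq -r '.smart_status.passed // "unknown"')"
  done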
[ceph-users] Re: Can't setup Basic Ceph Client
On 08.07.2022 16:18, Jean-Marc FONTANA wrote: We're planning to use rbd too and get block device for a linux server. In order to do that, we installed ceph-common packages and created ceph.conf and ceph.keyring as explained at Basic Ceph Client Setup — Ceph Documentation <https://docs.ceph.com/en/pacific/cephadm/client-setup/> (https://docs.ceph.com/en/pacific/cephadm/client-setup/) This does not work. Ceph seems to be installed $ dpkg -l | grep ceph-common ii ceph-common 16.2.9-1~bpo11+1 amd64 common utilities to mount and interact with a ceph storage cluster ii python3-ceph-common 16.2.9-1~bpo11+1 all Python 3 utility libraries for Ceph $ ceph -v ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) But, when using commands that interact with the cluster, we get this message $ ceph -s 2022-07-08T15:51:24.965+0200 7f773b7fe700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1] [errno 13] RADOS permission denied (error connecting to the cluster) The default user for ceph is the admin/client.admin do you have that key in your keyring? And is the keyring file readable for the user running the ceph commands? -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
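A couple of quick checks that usually narrow this down (client.admin is the default user, the alternative user name below is just an example):

  ls -l /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
  ceph auth get client.admin        # run this on a node where the CLI already works

  # if the client machine only got a non-admin key, name it explicitly
  ceph -s --id myuser --keyring /etc/ceph/ceph.client.myuser.keyring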
[ceph-users] Re: cephadm host maintenance
On 14.07.2022 11:01, Steven Goodliff wrote: If i get anywhere with detecting the instance is the active manager handling that in Ansible i will reply back here. I use this

- command: ceph mgr stat
  register: r

- debug: msg={{ (r.stdout | from_json).active_name.split(".")[0] }}

This works because the first part of the instance name is the hostname. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
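The same lookup as a plain shell one-liner, in case it is easier to drop into a script than the Ansible tasks (ceph mgr stat prints JSON with an active_name field):

  ceph mgr stat | jq -r '.active_name' | cut -d. -f1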
[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?
On 18.04.2022 21:35, Wesley Dillingham wrote: If you mark an osd "out" but not down / you dont stop the daemon do the PGs go remapped or do they go degraded then as well? First I made sure the balancer was active, then I marked one osd "out" with "ceph osd out 34" and checked the status every 2 seconds for 2 minutes; no degraded messages. The only new messages in ceph -s were 12 remapped pgs, "11 active+remapped+backfilling" and "1 active+remapped+backfill_wait". Previously I had to set all OSDs (15 disks) on a host to out and there was no issue with PGs in degraded state. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?
On 29.03.2022 14:56, Sandor Zeestraten wrote: I was wondering if you ever found out anything more about this issue. Unfortunately no, so I turned it off. I am running into similar degradation issues while running rados bench on a new 16.2.6 cluster. In our case it's with a replicated pool, but the degradation problems also go away when we turn off the balancer. So this goes a long way towards confirming there is something wrong with the balancer, since we now see it on two different installations. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph namespace access control
On Wed, Mar 23, 2022 at 07:14:22AM +0200, Budai Laszlo wrote: > Hello all, > > what capabilities a ceph user should have in order to be able to create rbd > images in one namespace only? > > I have tried the following: > > [root@ceph1 ~]# rbd namespace ls --format=json > [{"name":"user1"},{"name":"user2"}] > > [root@ceph1 ~]# ceph auth get-or-create client.user2 mon 'profile rbd' osd > 'allow rwx pool=rbd namespace=user2' -o /etc/ceph/client.user2.keyring Instead of using allow use profile on the osd too and it will set the correct permissions. # ceph auth get-or-create client.user2 mon 'profile rbd' osd 'profile rbd pool=rbd namespace=user2' -o /etc/ceph/client.user2.keyring -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
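A quick way to verify the resulting key, using the pool and namespace names from the question (the image name is an example); creating and listing inside the namespace should work, while the pool's default namespace should be denied:

  rbd --id user2 create --size 1G rbd/user2/image1
  rbd --id user2 ls rbd/user2
  rbd --id user2 ls rbd        # expected to fail with a permission error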
[ceph-users] Re: RadosGW S3 range on a 0 byte object gives 416 Range Not Satisfiable
On 22.03.2022 09:40, Ulrich Klein wrote: Yup, completely agree. I find the 416 also a bit surprising, whether in Ceph/RGW or plain HTTP. Consistency between other highly used software would be nice. Just to make sure: I am not at all involved in Ceph development, so don’t send a feature request to me :) Of course, I would never refer someone to send a feature request to a person even if you were a Ceph developer; I would consider that rude. The tracker exists for that :-) -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RadosGW S3 range on a 0 byte object gives 416 Range Not Satisfiable
On 21.03.2022 15:35, Ulrich Klein wrote: RFC 7233 4.4 <https://datatracker.ietf.org/doc/html/rfc7233#section-4.4>. 416 Range Not Satisfiable The 416 (Range Not Satisfiable) status code indicates that none of the ranges in the request's Range header field (Section 3.1 <https://datatracker.ietf.org/doc/html/rfc7233#section-3.1>) overlap The section 3.1 say "A server MAY ignore the Range header field." For example: HTTP/1.1 416 Range Not Satisfiable Date: Fri, 20 Jan 2012 15:41:54 GMT Content-Range: bytes */47022 Note: Because servers are free to ignore Range, many implementations will simply respond with the entire selected representation in a 200 (OK) response. That is partly because This is what Nginx and Apache do, if you specify range when the file has 0 bytes they will return 200. So they are ignore range with 0 bytes files but not when the bytes is grater than 0. On 21. 03 2022, at 15:11, Ulrich Klein wrote: With a bit of HTTP background I’d say: bytes=0-100 means: First byte to to 100nd byte. First byte is byte #0 On an empty object there is no first byte, i.e. not satisfiable ==> 416 Should be the same as on a single byte object and bytes=1-100 200 OK should only be correct, if the server or a proxy in between doesn’t support range requests. After reading your text and links I do concur that returning 416 with 0 bytes with range bytes=0-100 is not wrong, but I also believe that it would be correct to return 200 OK as Nginx and Apache do, since range can be ignored. I think our user of Ceph is used to how Nginx and Apache works and that is the reason they wondered if it was something wrong with Ceph. So I think the answer to them will be, It's according to spec but you can always put in a feature request. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RadosGW S3 range on a 0 byte object gives 416 Range Not Satisfiable
Hi Ceph v16.2.6. A GET with Range: bytes=0-100 fails with 416 if the object is 0 bytes. I tried reading the HTTP specification[1] on the subject but unfortunately did not get any wiser. I did a test with curl and a range request against a 0 byte file on Nginx and it returned 200 OK. Does anyone know if it's correct to return 416 on a 0 byte object with a range request, or should this be considered a bug in Ceph? [1] https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.1 -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
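For anyone wanting to reproduce this, a minimal curl test along these lines should do (the endpoint and object are placeholders, and the object needs to be publicly readable or the URL presigned):

  curl -s -o /dev/null -w '%{http_code}\n' \
      -H 'Range: bytes=0-100' \
      https://rgw.example.com/bucket/empty-object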
[ceph-users] Re: Replace HDD with cephadm
On 15.03.2022 10:10, Jimmy Spets wrote: Thanks for your reply. I have two things that I am unsure of: - Is the OSD UUID the same for all OSD:s or should it be unique for each? It's unique and generated when you run ceph-volume lvm prepare or add an OSD. You can find the OSD UUID/FSID for an existing OSD in /var/lib/ceph/<cluster FSID>/osd.<id>/fsid - Have I understood correctly that in your example the OSD is not encrypted? Yes, it's not encrypted. -- Kai Stian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rbd namespace create - operation not supported
On 11.03.2022 14:04, Ilya Dryomov wrote: On Fri, Mar 11, 2022 at 8:04 AM Kai Stian Olstad wrote: Isn't namespace supported with erasure encoded pools? RBD images can't be created in EC pools, so attempting to create RBD namespaces there is pointless. The way to store RBD image data in an EC pool is to create an image in a replicated pool (possibly in a custom namespace) and specify --data-pool: $ rbd namespace create --pool rep3 --namespace testspace $ rbd create --size 10G --pool rep3 --namespace testspace --data-pool ec42 --image testimage This worked like a charm. The image metadata (header object, etc) would be stored in rep3 (replicated pool), while the data objects would go to ec42 (EC pool). I see the meta pool is using OMAP so I guess that's the reason it need to be a replicated pool, makes sense. Thank you for the help Ilya. -- Kai Stian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
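A sketch of how to confirm where the data ends up, using the names from this thread; rbd info on the image should show the replicated pool as its home and the EC pool under data_pool:

  rbd info rep3/testspace/testimage
  rbd info rep3/testspace/testimage | grep data_pool   # expected: data_pool: ec42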
[ceph-users] Re: Replace HDD with cephadm
On 10.03.2022 14:48, Jimmy Spets wrote: I have a Ceph Pacific cluster managed by cephadm. The nodes have six HDD:s and one NVME that is shared between the six HDD:s. The OSD spec file looks like this: service_type: osd service_id: osd_spec_default placement: host_pattern: '*' data_devices: rotational: 1 db_devices: rotational: 0 size: '800G:1200G' db_slots: 6 encrypted: true I need to replace one of the HDD:s that is broken. How do I replace the HDD in the OSD connecting it to the old HDD:s db_slot? Last time I tried, cephadm could not replace a disk where the db was on a separate drive. It would just add is as a new OSD without the db on a separate disk. So to avoid this, remove all the active OSD spec so the disk wont be added automatically by cephadm. Then you need to manual add the disk. This is unfortunately not described anywhere, but the procedure I follow is this and the osd is osd.152 Find the VG og LV of the block db for the OSD. root@osd-host:~# ls -l /var/lib/ceph/*/osd.152/block.db lrwxrwxrwx 1 167 167 90 Dec 1 12:58 /var/lib/ceph/b321e76e-da3a-11eb-b75c-4f948441dcd0/osd.152/block.db -> /dev/ceph-10215920-77ea-4d50-b153-162477116b4c/osd-db-25762869-20d5-49b1-9ff4-378af8f679c4 VG = ceph-10215920-77ea-4d50-b153-162477116b4c LV = osd-db-25762869-20d5-49b1-9ff4-378af8f679c4 If you have already removed it, you'll find it in /var/lib/ceph/*/removed/ Then you remove the OSD. root@admin:~# ceph orch osd rm 152 --replace Scheduled OSD(s) for removal When the disk is removed from Ceph you can replace it with a new one. Look in dmesg what the new disk is named, in my case it's /dev/sdt Prepare the new disk root@osd-host:~# cephadm shell root@osd-host:/# ceph auth get client.bootstrap-osd >/var/lib/ceph/bootstrap-osd/ceph.keyring exported keyring for client.bootstrap-osd # Here you need to use the VG/LV you found above so you can reuse the db volume. root@osd-host:~# ceph-volume lvm prepare --bluestore --no-systemd --osd-id 152 --data /dev/sdt --block.db ceph-10215920-77ea-4d50-b153-162477116b4c/osd-db-25762869-20d5-49b1-9ff4-378af8f679c4 < removed some output > Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 152 --monmap /var/lib/ceph/osd/ceph-152/activate.monmap --keyfile - --bluestore-block-db-path /dev/ceph-10215920-77ea-4d50-b153-162477116b4c/osd-db-25762869-20d5-49b1-9ff4-378af8f679c4 --osd-data /var/lib/ceph/osd/ceph-152/ --osd-uuid 517213f3-0715-4d23-8103-6a34b1f8ef08 --setuser ceph --setgroup ceph stderr: 2021-12-01T11:50:33.613+ 7ff013614080 -1 bluestore(/var/lib/ceph/osd/ceph-152/) _read_fsid unparsable uuid --> ceph-volume lvm prepare successful for: /dev/sdt Here you need the --osd-uuid which is 517213f3-0715-4d23-8103-6a34b1f8ef08 Then you need a json file containing ceph info and osd authentication, this file can be created like this root@admin:~# printf '{\n"config": "%s",\n"keyring": "%s"\n}\n' "$(ceph config generate-minimal-conf | sed -e ':a;N;$!ba;s/\n/\\n/g' -e 's/\t/\\t/g' -e 's/$/\\n/')" "$(ceph auth get osd.152 | head -n 2 | sed -e ':a;N;$!ba;s/\n/\\n/g' -e 's/\t/\\t/g' -e 's/$/\\n/')" >config-osd.152.json You might need to copy the json file to the OSD-host depending on where you run the command. The --osd-uuid above is the same at --osd-fsid in this command, thank you for consistent naming. root@osd-host:~# cephadm deploy --fsid --name osd.152 --config-json config-osd.152.json --osd-fsid 517213f3-0715-4d23-8103-6a34b1f8ef08 And then the OSD should be back up and running. 
This is the way I have found to do OSD replacement, it might be an easier way of doing it but I have not found that. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
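A few sanity checks after such a manual replacement, using the OSD id from this procedure (the exact metadata field names can vary between releases):

  ceph orch ps --daemon-type osd | grep 'osd.152'
  ceph osd tree | grep -w 152
  ceph osd metadata 152 | grep -E 'bluefs_db|devices'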
[ceph-users] rbd namespace create - operation not supported
Hi I'm trying to create namespace in an rbd pool, but get operation not supported. This is on a 16.2.6 Cephadm installed on Ubuntu 20.04.3. The pool is erasure encoded and the commands I run was the following. cephadm shell ceph osd pool create rbd 32 32 erasure ec42-jerasure-blaum_roth-hdd --autoscale-mode=warn ceph osd pool set rbd allow_ec_overwrites true rbd pool init --pool rbd rbd namespace create --pool rbd --namespace testspace rbd: failed to created namespace: (95) Operation not supported 2022-03-11T06:13:30.570+ 7f4a9426e2c0 -1 librbd::api::Namespace: create: failed to add namespace: (95) Operation not supported Isn't namespace supported with erasure encoded pools? -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Unclear on metadata config for new Pacific cluster
On Wed, Feb 23, 2022 at 12:02:53PM +, Adam Huffman wrote: > On Wed, 23 Feb 2022 at 11:25, Eugen Block wrote: > > > How exactly did you determine that there was actual WAL data on the HDDs? > > > I couldn't say exactly what it was, but 7 or so TBs was in use, even with > no user data at all. When you have the DB on a separate disk, the DB size counts towards the total size of the OSD. But this DB space is considered used, so you will see a lot of used space. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: The Return of Ceph Planet
On 04.02.2022 00:00, Mike Perez wrote: If you have a Ceph category feed you would like added; please email me your RSS feed URL. While you are mentioning RSS, is there any reason the RSS feed on the ceph.com blog/news was removed? It used to be https://ceph.com/community/blog/feed/ but after the change I can't find the feed URL. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: airgap install
On 21.12.2021 09:41, Marc wrote: I have also an 'airgapped install' but with rpm's, simply cloning the necessary repositories. Why go through all these efforts trying to get this to work via containers? For me, being completely new to Ceph, I started with the documentation[1], where the recommended method is Cephadm or Rook, so I chose Cephadm. Unfortunately I do regret it, not because of the container mirroring, since that is the easy part, but because of lacking documentation, lacking features like replacing a disk (where the DB is on a shared SSD), bugs and other quirks. Cephadm is not what I would consider stable and ready for production, so if I had to choose today it would not be Cephadm, but more likely a manual install from deb packages with my own Ansible code or ceph-ansible, because then I would have a lot of documentation on how to solve things. [1] https://docs.ceph.com/en/pacific/install/index.html -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: airgap install
On 17.12.2021 11:06, Zoran Bošnjak wrote: Kai, thank you for your answer. It looks like the "ceph config set mgr..." commands are the key part, to specify my local registry. However, I haven't got that far with the installation. I have tried various options, but I have problems already with the bootstrap step. I have documented the procedure (and the errors) here: https://github.com/zoranbosnjak/ceph-install#readme Would you please have a look and suggest corrections. I have looked it over and checked the cephadm source code. Ideally, I would like to run administrative commands from a dedicated (admin) node... or alternatively to setup mon nodes to be able to run administrative commands... The bootstrap command you need to run on one of the nodes and/or the node you want the monitor to run on. After that you can install cephadm or ceph-common to use you admin node for the rest. So the error you get is this Non-zero exit code 22 from /usr/bin/docker run --rm --ipc=host --net=host --entrypoint /usr/bin/ceph -e CONTAINER_IMAGE=admin:5000/ceph/ceph:v16 -e NODE_NAME=node01 -v /var/log/ceph/da017daa-5f18-11ec-a05c-37b574681fc7:/var/log/ceph:z -v /tmp/ceph-tmph9jxliaz:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpsecch5kc:/etc/ceph/ceph.conf:z admin:5000/ceph/ceph:v16 orch host add node01 /usr/bin/ceph: stderr Error EINVAL: Can not automatically resolve ip address of host where active mgr is running. Please explicitly provide the address. It is trying to find the IP address for the node01 but fails to do so. So you need to look into you DNS settings so it possible to determine the IP for a hostname. Checking for the IP is a reason change(16.2.6 or .7) https://github.com/ceph/ceph/pull/42772 to close this issue https://tracker.ceph.com/issues/51667 -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
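Since the bootstrap failure above comes down to name resolution, a minimal pre-flight check looks something like this (the hostnames and address are examples; on an isolated network static /etc/hosts entries on every node are often the simplest fix):

  getent hosts node01 node02 node03
  echo '192.0.2.11 node01' >> /etc/hosts   # repeat for each node, on each node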
[ceph-users] Re: airgap install
On Mon, Dec 13, 2021 at 06:18:55PM +, Zoran Bošnjak wrote: > I am using "ubuntu 20.04" and I am trying to install "ceph pacific" version > with "cephadm". > > Are there any instructions available about using "cephadm bootstrap" and > other related commands in an airgap environment (that is: on the local > network, without internet access)? Unfortunately they say cephadm is stable but I would call it beta because of lacking feature, bugs and missing documentation. I can give you some pointers. The best source to find the images you need is in cephadm code and for 16.2.7 you find it here [1]. cephadm bootstrap has the --image option to specify what image to use. I also run the bootstrap with --skip-monitoring-stack, if not it fails since it can't find the images. After that you can update the monitor containers to you registry. cephadm shell ceph config set mgr mgr/cephadm/container_image_prometheus ceph config set mgr mgr/cephadm/container_image_node_exporter ceph config set mgr mgr/cephadm/container_image_grafana ceph config set mgr mgr/cephadm/container_image_alertmanager Check the result with ceph config get mgr To deploy the monitoring ceph mgr module enable prometheus ceph orch apply node-exporter '*' ceph orch apply alertmanager --placement ... ceph orch apply prometheus --placement ... ceph orch apply grafana --placement ... This should be what you need to get Ceph running in an isolated network. [1] https://github.com/ceph/ceph/blob/v16.2.7/src/cephadm/cephadm#L50-L61 -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
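A concrete sketch of the same monitoring-image overrides, reusing the "admin:5000" registry name that appears elsewhere in this thread; the image names and tags are examples only and should match the tags listed in the cephadm source for your release:

  ceph config set mgr mgr/cephadm/container_image_prometheus    admin:5000/prometheus/prometheus:v2.33.4
  ceph config set mgr mgr/cephadm/container_image_node_exporter admin:5000/prometheus/node-exporter:v1.3.1
  ceph config set mgr mgr/cephadm/container_image_alertmanager  admin:5000/prometheus/alertmanager:v0.23.0
  ceph config set mgr mgr/cephadm/container_image_grafana       admin:5000/ceph/ceph-grafana:8.3.5

  ceph config get mgr mgr/cephadm/container_image_prometheus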
[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?
On 21.09.2021 09:11, Kobi Ginon wrote: for sure the balancer affects the status Of course, but setting several PG to degraded is something else. i doubt that your customers will be writing so many objects in the same rate of the Test. I only need 2 host running rados bench to get several PG in degrade state. maybe you need to play with the balancer configuration a bit. Maybe, but a balancer should not set the cluster health to warning with several PG in degraded state. It should be possible to do this cleanly, copy data and delete the source when copy is OK. Could start with this The balancer mode can be changed to crush-compat mode, which is backward compatible with older clients, and will make small changes to the data distribution over time to ensure that OSDs are equally utilized. https://docs.ceph.com/en/latest/rados/operations/balancer/ I will probably just turn it off before I set the cluster in production. side note: i m using indeed an old version of ceph ( nautilus)+ blancer configured and runs rado benchmarks , but did not saw such a problem. on the other hand i m not using pg_autoscaler i set the pools PG number in advanced according to assumption of the percentage each pool will be using Could be that you do use this Mode and the combination of auto scaler and balancer is what reveals this issue If you look at my initial post you will se that the pool is created with --autoscale-mode=off The cluster is running 16.2.5 and is empty except for one pool with one PG created by Cephadm. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
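For completeness, the balancer commands being discussed here are all standard ceph CLI:

  ceph balancer status
  ceph balancer off
  ceph balancer mode crush-compat   # the backward compatible mode quoted from the docs above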
[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?
On 17.09.2021 16:10, Eugen Block wrote: Since I'm trying to test different erasure encoding plugin and technique I don't want the balancer active. So I tried setting it to none as Eguene suggested, and to my surprise I did not get any degraded messages at all, and the cluster was in HEALTH_OK the whole time. Interesting, maybe the balancer works differently now? Or it works differently under heavy load? It would be strange that the balancer normal operation is to put the cluster in degraded mode. The only suspicious lines I see are these: Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.402+ 7f66b0329700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f66b0329700' had timed out after 0.0s But I'm not sure if this is related. The out OSDs shouldn't have any impact on this test. Did you monitor the network saturation during these tests with iftop or something similar? I did not, so I rerun the test this morning. All the servers have 2x25Gbit/s NIC in bonding with LACP 802.3ad layer3+4. The peak on the active monitor was 27 Mbit/s and less on the other 2 monitors. I also checked the CPU(Xeon 5222 3.8 GHz) and non of the cores was saturated, and network statistics show no errors or drops. So perhaps there is a bug in the balancer code? -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?
On 16.09.2021 15:51, Josh Baergen wrote: I assume it's the balancer module. If you write lots of data quickly into the cluster the distribution can vary and the balancer will try to even out the placement. The balancer won't cause degradation, only misplaced objects. Since I'm trying to test different erasure encoding plugin and technique I don't want the balancer active. So I tried setting it to none as Eguene suggested, and to my surprise I did not get any degraded messages at all, and the cluster was in HEALTH_OK the whole time. Degraded data redundancy: 260/11856050 objects degraded (0.014%), 1 pg degraded That status definitely indicates that something is wrong. Check your cluster logs on your mons (/var/log/ceph/ceph.log) for the cause; my guess is that you have OSDs flapping (rapidly going down and up again) due to either overload (disk or network) or some sort of misconfiguration. So I enabled the balancer and run the rados bench again and the degraded messages is back. I guess the equivalent log to /var/log/ceph/ceph.log in Cephadm is journalctl -u ceph-b321e76e-da3a-11eb-b75c-4f948441...@mon.pech-mon-1.service There are no messages about osd being marked down, so I don't understand why this is happening. I probably need to raise some verbose value. I have attach the log from journalctl, it start at 06:30:00 when I started the rados bench and included a few lines after the first degrade message at 06:31.06. Just be aware that 15 OSD is set to out, since I have some problem with the a HBA on one host, all test has been done with those 15 OSD in status out. -- Kai Stian OlstadSep 17 06:30:00 pech-mon-1 conmon[1337]: debug 2021-09-17T06:29:59.994+ 7f66b232d700 0 log_channel(cluster) log [INF] : overall HEALTH_OK Sep 17 06:30:00 pech-mon-1 conmon[1337]: cluster 2021-09-17T06:29:59.317530+ mgr.pech-mon-1.ptrsea Sep 17 06:30:00 pech-mon-1 conmon[1337]: (mgr.245802) 345745 : cluster [DBG] pgmap v347889: 1025 pgs: 1025 active+clean; 0 B data, 73 TiB used, 2.8 PiB / 2.9 PiB avail Sep 17 06:30:00 pech-mon-1 conmon[1337]: cluster 2021-09-17T06:30:00.000143+ mon.pech-mon-1 (mon.0) 1166236 : Sep 17 06:30:00 pech-mon-1 conmon[1337]: cluster [INF] overall HEALTH_OK Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.318+ 7f66afb28700 0 mon.pech-mon-1@0(leader) e7 handle_command mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.6d", "id": [293, 327]} v 0) v1 Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.318+ 7f66afb28700 0 log_channel(audit) log [INF] : from='mgr.245802 10.0.1.10:0/136830414' entity='mgr.pech-mon-1.ptrsea' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.6d", "id": [293, 327]}]: dispatch Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.318+ 7f66afb28700 0 mon.pech-mon-1@0(leader) e7 handle_command mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.144", "id": [307, 351]} v 0) v1 Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.318+ 7f66afb28700 0 log_channel(audit) log [INF] : from='mgr.245802 10.0.1.10:0/136830414' entity='mgr.pech-mon-1.ptrsea' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.144", "id": [307, 351]}]: dispatch Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.322+ 7f66afb28700 0 mon.pech-mon-1@0(leader) e7 handle_command mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.17d", "id": [144, 136]} v 0) v1 Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 
2021-09-17T06:30:01.322+ 7f66afb28700 0 log_channel(audit) log [INF] : from='mgr.245802 10.0.1.10:0/136830414' entity='mgr.pech-mon-1.ptrsea' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.17d", "id": [144, 136]}]: dispatch Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.322+ 7f66afb28700 0 mon.pech-mon-1@0(leader) e7 handle_command mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.1a2", "id": [199, 189]} v 0) v1 Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.322+ 7f66afb28700 0 log_channel(audit) log [INF] : from='mgr.245802 10.0.1.10:0/136830414' entity='mgr.pech-mon-1.ptrsea' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.1a2", "id": [199, 189]}]: dispatch Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.322+ 7f66afb28700 0 mon.pech-mon-1@0(leader) e7 handle_command mon_command({"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.1e1", "id": [289, 344]} v 0) v1 Sep 17 06:30:01 pech-mon-1 conmon[1337]: debug 2021-09-17T06:30:01.322+ 7f66afb28700 0 log_channel(audit) log [INF] : from='mgr.245802 10.0.1.10:0/136830414' entity='mgr.pech-mon-1.ptrsea' cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "12.1e1", "id": [289, 344]}]: dispatch Sep 17 06:30:01
[ceph-users] Is it normal Ceph reports "Degraded data redundancy" in normal use?
Hi I'm testing a Ceph cluster with "rados bench", it's an empty Cephadm install that only has one pool, device_health_metrics. I created a pool with 1024 PGs on the hdd devices (15 servers have HDDs and 13 have SSDs)

  ceph osd pool create pool-ec32-isa-reed_sol_van-hdd 1024 1024 erasure ec32-isa-reed_sol_van-hdd --autoscale-mode=off

I then run "rados bench" from the 13 SSD hosts at the same time.

  rados bench -p pool-ec32-isa-reed_sol_van-hdd 600 write --no-cleanup

After just a few seconds "ceph -s" starts to report degraded data redundancy. Here are some examples from the 10 minute testing period

  Degraded data redundancy: 260/11856050 objects degraded (0.014%), 1 pg degraded
  Degraded data redundancy: 260/1856050 objects degraded (0.014%), 1 pg degraded
  Degraded data redundancy: 1 pg undersized
  Degraded data redundancy: 1688/3316225 objects degraded (0.051%), 3 pgs degraded
  Degraded data redundancy: 5457/7005845 objects degraded (0.078%), 3 pgs degraded, 9 pgs undersized
  Degraded data redundancy: 1 pg undersized
  Degraded data redundancy: 4161/7005845 objects degraded (0.059%), 3 pgs degraded
  Degraded data redundancy: 4315/7005845 objects degraded (0.062%), 2 pgs degraded, 4 pgs undersized

So my question is, is it normal that Ceph reports degraded under normal use? Or do I have a problem somewhere that I need to investigate? -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
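As a side note for anyone repeating this test: with --no-cleanup the benchmark objects stay in the pool, so a read pass and a cleanup afterwards would look roughly like this (same pool name as above):

  rados bench -p pool-ec32-isa-reed_sol_van-hdd 60 seq
  rados -p pool-ec32-isa-reed_sol_van-hdd cleanup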
[ceph-users] Re: MTU mismatch error in Ceph dashboard
On 04.08.2021 20:31, Ernesto Puerta wrote: Could you please go to the Prometheus UI and share the output of the following query "node_network_mtu_bytes"? That'd be useful to understand the issue. If you can open a tracker issue here: https://tracker.ceph.com/projects/dashboard/issues/new ? Found an issue reported under MGR: https://tracker.ceph.com/issues/52028 - mgr/dashboard: Incorrect MTU mismatch warning -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MTU mismatch error in Ceph dashboard
On 04.08.2021 22:06, Paul Giralt (pgiralt) wrote: I did notice that docker0 has an MTU of 1500 as do the eno1 and eno2 interfaces which I’m not using. I’m not sure if that’s related to the error. I’ve been meaning to try changing the MTU on the eno interfaces just to see if that makes a difference but haven’t gotten around to it. If you look at the message it says which interface it is. It does check and report on all the interfaces, even those that are in a DOWN state, which it shouldn't. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
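A quick way to see the MTU of every interface the node exporter will report on, including the DOWN ones that trigger the spurious warning (plain sysfs, no assumptions beyond a Linux host):

  for i in /sys/class/net/*; do printf '%-12s %s\n' "$(basename "$i")" "$(cat "$i/mtu")"; done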
[ceph-users] Re: Cephadm and multipath.
Hi Peter Please remember to include the list address in your reply. I will not trim so people on the list can read you answer. On 29.07.2021 12:43, Peter Childs wrote: On Thu, 29 Jul 2021 at 10:37, Kai Stian Olstad wrote: A little disclaimer, I have never used multipath with Ceph. On 28.07.2021 20:19, Peter Childs wrote: > I have a number of disk trays, with 25 ssd's in them, these are > attached to > my servers via a pair of sas cables, so that multipath is used to join > the > together again and maximize speed etc. > > Using cephadm how can I create the osd's? You can use the commands in the documentation [1] "ceph orch daemon add osd :" But you need to configure the LVM correctly to make this work. That was my thought, but it was not working, but now it is vgcreate test /dev/mapper/mpatha lvcreate -l 190776 -n testlv test ceph orch daemon add osd dampwood18:test/testlv Created osd(s) 1361 on host 'dampwood18' I think I can live with that I think there is room for improvement here, but I'm happy with creating the vgs and lvs before I use the disks. If you could not run cephadm shell ceph orch daemon add osd dampwood18:/dev/mapper/mpatha I would consider that a bug. > It looks like it should be possible to use ceph-volume but I've not > really > worked out yet how to access ceph-volume within cephadm. Even if I've > got > to format them with lvm first. (The docs are slightly confusing here) > > It looks like the ceph disk inventory system can't cope with multipath? If by "ceph disk inventory system" you mean OSD service specification[2] then yes, I don't think it's possible to use it with multipath. When you add a disk to Ceph with cephadm it will use LVM to create a Physical Volume(PV) of that device and create Volume Group(VG) on the disk and then create a Logical Volume(LV) that use the whole VG. And the configuration in Ceph reference the VG/LV so Ceph should not have a problem with multipath. But since you have multipath, LVM might have a problem with that if not configured correctly. LVM will scan disk for LVM signature and try to create the devices for the LV it finds. So you need to make sure that the LVM only scan the multipath device paths and not the individual disk the OS sees. Hmm I think we might have "room for improvement" in this area, Either the osd spec needs to include all the options for weird disks that people might come up with, and allocating them to classes as well, There are lot of limitation in the OSD service spec and handling drives in Cephadm. Just try to replace a HDD disk with the DB on a SSD, that is a pain at the moment. or all the options available to ceph-volume need to be exposed to orchestration which would also working, currently it feels like some of the complex options in ceph are not available to cephadm yet and you need to work out how to do it. You have "cephadm ceph-volume" or you could run "cephadm shell" and then run all the ceph commands. I'm new to ceph and I like the theory having come from a Spectrum Scale background, and I'm still trying to get to grips with how things work. My Ceph cluster has got 3 types of drive, these multipathed 800G ssds, Disks on nodes with lots of memory (256G between 30Disks) and Disks on nodes with very little memory (48G between 60Disks) hence why I was trying to get disk specs to work. I've actually got it working with a little kernel tuning and must get around to writing it up so I can share where I've got to.. As mention the OSD service spec has a lot of limitation. 
The default memory size for an OSD is 4GB, so your 48 GB/60 disks would need some configuration and I'm not sure if it's feasible to run them with so little memory. Thanks Peter -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cephadm and multipath.
A little disclaimer, I have never used multipath with Ceph. On 28.07.2021 20:19, Peter Childs wrote: I have a number of disk trays, with 25 ssd's in them, these are attached to my servers via a pair of sas cables, so that multipath is used to join the together again and maximize speed etc. Using cephadm how can I create the osd's? You can use the commands in the documentation [1] "ceph orch daemon add osd :" But you need to configure the LVM correctly to make this work. It looks like it should be possible to use ceph-volume but I've not really worked out yet how to access ceph-volume within cephadm. Even if I've got to format them with lvm first. (The docs are slightly confusing here) It looks like the ceph disk inventory system can't cope with multipath? If by "ceph disk inventory system" you mean OSD service specification[2] then yes, I don't think it's possible to use it with multipath. When you add a disk to Ceph with cephadm it will use LVM to create a Physical Volume(PV) of that device and create Volume Group(VG) on the disk and then create a Logical Volume(LV) that use the whole VG. And the configuration in Ceph reference the VG/LV so Ceph should not have a problem with multipath. But since you have multipath, LVM might have a problem with that if not configured correctly. LVM will scan disk for LVM signature and try to create the devices for the LV it finds. So you need to make sure that the LVM only scan the multipath device paths and not the individual disk the OS sees. [1] https://docs.ceph.com/en/latest/cephadm/osd/#creating-new-osds [2] https://docs.ceph.com/en/latest/cephadm/osd/#advanced-osd-service-specifications -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
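A sketch of the LVM side of this, for /etc/lvm/lvm.conf; multipath_component_detection is the relevant knob, and the commented global_filter is only an example pattern that would need adjusting to the actual device names on the host:

  devices {
      # make LVM ignore the individual SAS paths behind each multipath map
      multipath_component_detection = 1
      # or be explicit about what may be scanned:
      # global_filter = [ "a|^/dev/mapper/mpath.*|", "a|^/dev/sda.*|", "r|.*|" ]
  }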
[ceph-users] Re: Cephadm: How to remove a stray daemon ghost
On 22.07.2021 13:56, Kai Stian Olstad wrote: Hi I have a warning that says "1 stray daemon(s) not managed by cephadm" What i did is the following. I have 3 nodes that the mon should run on, but because of a bug in 16.2.4 I couldn't run on then since they are in different subnet. But this was fixed in 16.2.5 so i upgraded without issues. but i got a health warning root@pech-mon-1:~# ceph health detail HEALTH_WARN 1 stray daemon(s) not managed by cephadm [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm stray daemon mon.pech-mds-1 on host pech-cog-1 not managed by cephadm I think this relates to this issue: https://tracker.ceph.com/issues/50272 I restarted the active mgr, the other mgr became active, and the stray message went away. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
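For anyone seeing the same ghost entry, the failover can also be done without restarting the container; a hedged sketch (the jq lookup mirrors the mgr stat trick used elsewhere in this archive):

  ceph mgr stat
  ceph mgr fail "$(ceph mgr stat | jq -r .active_name)"
  ceph health detail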
[ceph-users] Cephadm: How to remove a stray daemon ghost
Hi I have a warning that says "1 stray daemon(s) not managed by cephadm" What i did is the following. I have 3 nodes that the mon should run on, but because of a bug in 16.2.4 I couldn't run on then since they are in different subnet. But this was fixed in 16.2.5 so i upgraded without issues. Before I started it looked like this root@pech-mon-1:~# ceph orch ps | grep ^mon NAME HOSTPORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID mon.pech-cog-1 pech-cog-1 running (23h) 9m ago 3w 1182M2048M 16.2.5 6933c2a0b7dd b226c1714777 mon.pech-mds-1 pech-mds-1 running (23h) 7m ago 3w 1147M2048M 16.2.5 6933c2a0b7dd 40f8e268afca mon.pech-mon-1 pech-mon-1 running (23h) 2m ago 3w 1161M2048M 16.2.5 6933c2a0b7dd b358057dcb3a To place the daemon on correct hosts I run this root@pech-mon-1:~# ceph orch apply mon pech-mon-1,pech-mon-2,pech-mon-3 Scheduled mon update... And that worked fine. root@pech-mon-1:~# ceph orch ps |grep ^mon NAME HOSTPORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID mon.pech-mon-1 pech-mon-1 running (23h) 6s ago 3w 1360M2048M 16.2.5 6933c2a0b7dd b358057dcb3a mon.pech-mon-2 pech-mon-2 running (13s) 6s ago 13s 287M2048M 16.2.5 6933c2a0b7dd 25a68933c119 mon.pech-mon-3 pech-mon-3 running (11s) 6s ago 11s 241M2048M 16.2.5 6933c2a0b7dd be0c6e5a5fdf but i got a health warning root@pech-mon-1:~# ceph health detail HEALTH_WARN 1 stray daemon(s) not managed by cephadm [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm stray daemon mon.pech-mds-1 on host pech-cog-1 not managed by cephadm The strange thing is daemon mon.pech-mds-1 has never run on pech-cog-1. And the problem is that I can not find this supposedly stray damon. With ansible I run "podman ps" on all nodes and removed the osd, node and crash damone from the output $ ansible pech -u root -m shell -a "podman ps" | grep ceph | awk '{ print $NF }' | egrep -v "osd|node|crash" | sort ceph--alertmanager.pech-mds-1 ceph--grafana.pech-cog-2 ceph--mgr.pech-mon-1.ptrsea ceph--mgr.pech-mon-2.mfdanx ceph--mon.pech-mon-1 ceph--mon.pech-mon-2 ceph--mon.pech-mon-3 ceph--prometheus.pech-mds-1 No stray daemon here also with ansible I run "cephadm ls" on all of them and removed the osd, node and crash damone from the output $ ansible pech -u root -m shell -a "cephadm ls | jq .[].name" | grep '^"' | egrep -v "osd|node|crash" | sort "alertmanager.pech-mds-1" "grafana.pech-cog-2" "mgr.pech-mon-1.ptrsea" "mgr.pech-mon-2.mfdanx" "mon.pech-mon-1" "mon.pech-mon-2" "mon.pech-mon-3" "prometheus.pech-mds-1" No stray daemon here either. Does anyone know how to find this supposedly stray daemon? -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Having issues to start more than 24 OSDs per host
On 22.06.2021 17:27, David Orman wrote: https://tracker.ceph.com/issues/50526 https://github.com/alfredodeza/remoto/issues/62 If you're brave (YMMV, test first non-prod), we pushed an image with the issue we encountered fixed as per above here: https://hub.docker.com/repository/docker/ormandj/ceph/tags?page=1 that you can use to install with. Thank you David. I could not add 1 host with 15 HDDs and 3 SSDs without it hanging forever. I used your patch and created a new container, and could then add 15 hosts with 15 HDDs and 3 SSDs each without any issue. (I'm a little confused why a breaking install/upgrade issue like this has been allowed to sit) You and me both. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD
On 27.05.2021 11:53, Eugen Block wrote: This test was on ceph version 15.2.8. On Pacific (ceph version 16.2.4) this also works for me for initial deployment of an entire host: +-+-+--+--+--+-+ |SERVICE |NAME |HOST |DATA |DB|WAL | +-+-+--+--+--+-+ |osd |ssd-hdd-mix |pacific1 |/dev/vdb |/dev/vdd |-| |osd |ssd-hdd-mix |pacific1 |/dev/vdc |/dev/vdd |-| +-+-+--+--+--+-+ But it doesn't work if I remove one OSD, just like you describe. This is what ceph-volume reports: ---snip--- [ceph: root@pacific1 /]# ceph-volume lvm batch --report /dev/vdc --db-devices /dev/vdd --block-db-size 3G --> passed data devices: 1 physical, 0 LVM --> relative data size: 1.0 --> passed block_db devices: 1 physical, 0 LVM --> 1 fast devices were passed, but none are available Total OSDs: 0 TypePath LV Size % of device ---snip--- I know that this has already worked in Octopus, I did test it successfully not long ago. Thank you for trying, so it looks like a bug. Searching through the issue tracker I find few issues related to replacing OSD, but it doesn't look like they get much attention. I tried to find a way to add the disk manually, did not find any documentation about it, but looking at the source code, some issues with some trial and error I ended up with this. Since the LV is deleted I recreated it with the same name. # lvcreate -l 91570 -n osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69 ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b In "cephadm shell" # cephadm shell # ceph auth get client.bootstrap-osd >/var/lib/ceph/bootstrap-osd/ceph.keyring # ceph-volume lvm prepare --bluestore --no-systemd --data /dev/sdt --block.db ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69 Need to have a json file for the "cephadm deploy" # printf '{\n"config": "%s",\n"keyring": "%s"\n}\n' "$(ceph config generate-minimal-conf | sed -e ':a;N;$!ba;s/\n/\\n/g' -e 's/\t/\\t/g' -e 's/$/\\n/')" "$(ceph auth get osd.178 | head -n 2 | sed -e ':a;N;$!ba;s/\n/\\n/g' -e 's/\t/\\t/g' -e 's/$/\\n/')" >config-osd.178.json Exit cephadm shell and run # cephadm --image ceph:v15.2.9 deploy --fsid 3614abcc-201c-11eb-995a-2794bcc75ae0 --config-json /var/lib/ceph/3614abcc-201c-11eb-995a-2794bcc75ae0/home/config-osd.178.json --osd-fsid 9227e8ae-92eb-429e-9c7f-d4a2b75afb8e And the OSD is back, but the VG name on the HDD is missing block in it's name, just a cosmetic thing so I leave it as is. LVVG Attr LSize osd-block-9227e8ae-92eb-429e-9c7f-d4a2b75afb8e ceph-46f42262-d3dc-4dc3-8952-eec3e4a2c178 -wi-ao 12.47t osd-block-2da790bc-a74c-41da-8772-3b8aac77001c ceph-block-1b5ad7e7-2e24-4315-8a05-7439ab782b45 -wi-ao 12.47t The fist one is the new OSD and the second one is one that cephadm itself created. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD
On 27.05.2021 11:17, Eugen Block wrote: That's not how it's supposed to work. I tried the same on an Octopus cluster and removed all filters except:

data_devices:
  rotational: 1
db_devices:
  rotational: 0

My Octopus test osd nodes have two HDDs and one SSD, I removed all OSDs and redeployed on one node. This spec file results in three standalone OSDs! Without the other filters this won't work as expected, it seems. I'll try again on Pacific with the same test and see where that goes. This spec did work for me when I initially deployed with Octopus 15.2.5. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD
On 27.05.2021 10:46, Eugen Block wrote: Hi, The VG has 357.74GB of free space of total 5.24TB so I did actually tried different values like "30G:", "30G", "300G:", "300G", "357G". I also tied some crazy high numbers and some ranges, but don't remember the values. But none of them worked. the size parameter is filtering the disk size, not the size you want the db to have (that's block_db_size). Your SSD disk size is 1.8 TB so your specs could look something like this: block_db_size: 360G data_devices: size: "12T:" rotational: 1 db_devices: size: ":2T" rotational: 0 filter_logic: AND ... But I was under the impression that this all should of course work with just the rotational flags, I'm confused that it doesn't. Can you try with these specs to see if you get the OSD deployed? I tried this one hdd-test-from-eugen.yml --- service_type: osd service_id: hdd placement: host_pattern: 'pech-hd-*' block_db_size: 360G data_devices: size: "12T:" rotational: 1 db_devices: size: ":2T" rotational: 0 filter_logic: AND But it doesn't find any disk. I also tried this, but with the same result. service_type: osd service_id: hdd placement: host_pattern: 'pech-hd-*' block_db_size: 360G data_devices: rotational: 1 db_devices: rotational: 0 filter_logic: AND I'll try again with Octopus to see if I see similar behaviour. Very much appreciated, thanks. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD
On 26.05.2021 22:14, David Orman wrote: We've found that after doing the osd rm, you can use: "ceph-volume lvm zap --osd-id 178 --destroy" on the server with that OSD as per: https://docs.ceph.com/en/latest/ceph-volume/lvm/zap/#removing-devices and it will clean things up so they work as expected. With the help of Eugen I did run "cephadm ceph-volume lvm zap --destroy " and the LV is gone. I think that is the same result as "ceph-volume lvm zap --osd-id 178 --destroy" would give me? I now have 357GB free space on the VG, but Cephadm doesn't find and use this space. Above it the result of the zap command and it show the LV is deleted. $ sudo cephadm ceph-volume lvm zap --destroy /dev/ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69 INFO:cephadm:Inferring fsid 3614abcc-201c-11eb-995a-2794bcc75ae0 INFO:cephadm:Using recent ceph image ceph:v15.2.9 INFO:cephadm:/usr/bin/podman:stderr --> Zapping: /dev/ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69 INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69 bs=1M count=10 conv=fsync INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in INFO:cephadm:/usr/bin/podman:stderr 10+0 records out INFO:cephadm:/usr/bin/podman:stderr stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0195532 s, 536 MB/s INFO:cephadm:/usr/bin/podman:stderr --> More than 1 LV left in VG, will proceed to destroy LV only INFO:cephadm:/usr/bin/podman:stderr --> Removing LV because --destroy was given: /dev/ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69 INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/sbin/lvremove -v -f /dev/ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69 INFO:cephadm:/usr/bin/podman:stderr stdout: Logical volume "osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69" successfully removed INFO:cephadm:/usr/bin/podman:stderr stderr: Removing ceph--block--dbs--563432b7--f52d--4cfe--b952--11542594843b-osd--block--db--449bd001--eb32--46de--ab80--a1cbcd293d69 (253:3) INFO:cephadm:/usr/bin/podman:stderr stderr: Archiving volume group "ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b" metadata (seqno 61). INFO:cephadm:/usr/bin/podman:stderr stderr: Releasing logical volume "osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69" INFO:cephadm:/usr/bin/podman:stderr stderr: Creating volume group backup "/etc/lvm/backup/ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b" (seqno 62). INFO:cephadm:/usr/bin/podman:stderr --> Zapping successful for: /dev/ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-449bd001-eb32-46de-ab80-a1cbcd293d69> -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD
, but I'm not sure if that's already the default, I remember that there were issues in Nautilus because the default was OR. I tried this just recently with a version similar to this, I believe it was 15.2.8 and it worked for me, but again, it's just a tiny virtual lab cluster.

Yes, AND is the default. I tried adding 'filter_logic: AND' but with the same result.

In your virtual lab cluster, do you have multiple HDDs sharing the same SSD as I do?

To me it looks like Cephadm can't find or use the 357.71GB free space on the VG, it can only find devices that are available.

Here is how my "orch device ls" looks for that host:

$ ceph orch device ls --wide | egrep "Hostname|hd-7"
Hostname   Path      Type  Vendor   Model            Size   Available  Reject Reasons
pech-hd-7  /dev/sdt  hdd   WDC      WUH721414AL5200  13.7T  Yes
pech-hd-7  /dev/sdb  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdc  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdd  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sde  ssd   SAMSUNG  MZILT1T9HAJQ0D3  1920G  No         LVM detected, locked
pech-hd-7  /dev/sdf  ssd   SAMSUNG  MZILT1T9HAJQ0D3  1920G  No         LVM detected, locked
pech-hd-7  /dev/sdg  ssd   SAMSUNG  MZILT1T9HAJQ0D3  1920G  No         LVM detected, locked
pech-hd-7  /dev/sdi  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdj  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdk  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdl  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdm  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdn  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdo  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdp  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdq  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sdr  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked
pech-hd-7  /dev/sds  hdd   SEAGATE  ST14000NM0168    13.7T  No         Insufficient space (<10 extents) on vgs, LVM detected, locked

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
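The reject reasons in that listing ("Insufficient space (<10 extents) on vgs", "LVM detected, locked") come from the ceph-volume inventory on the host, so that is the place to look when a disk you expect to be usable is marked unavailable. A sketch, with one of the device paths from the table above:

cephadm ceph-volume inventory /dev/sde --format json

The JSON should contain the rejected_reasons plus the LV/VG details the decision was based on (field names as ceph-volume reports them in this release).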
[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD
On 26.05.2021 11:16, Eugen Block wrote:

Yes, the LVs are not removed automatically, you need to free up the VG. There are a couple of ways to do so, for example remotely:

pacific1:~ # ceph orch device zap pacific4 /dev/vdb --force

or directly on the host with:

pacific1:~ # cephadm ceph-volume lvm zap --destroy /dev//

Thanks, I used the cephadm command and deleted the LV, and the VG now has free space:

# vgs | egrep "VG|dbs"
  VG                                                    #PV #LV #SN Attr   VSize  VFree
  ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b     3  14   0 wz--n- <5.24t 357.74g

But it doesn't seem to be able to use it, because it can't find anything:

# ceph orch apply osd -i hdd.yml --dry-run
OSDSPEC PREVIEWS
+---------+------+-----------+----------+----+-----+
|SERVICE  |NAME  |HOST       |DATA      |DB  |WAL  |
+---------+------+-----------+----------+----+-----+
+---------+------+-----------+----------+----+-----+

I tried adding size as you have in your configuration:

db_devices:
  rotational: 0
  size: '30G:'

Still it was unable to create the OSD. If I removed the ":" so it is an exact size of 30GB, it did find the disk, but the DB is not placed on an SSD since I do not have one of exactly 30 GB:

OSDSPEC PREVIEWS
+---------+------+-----------+----------+----+-----+
|SERVICE  |NAME  |HOST       |DATA      |DB  |WAL  |
+---------+------+-----------+----------+----+-----+
|osd      |hdd   |pech-hd-7  |/dev/sdt  |-   |-    |
+---------+------+-----------+----------+----+-----+

To me it looks like Cephadm can't find or use the free space on the VG and use that as a new LV for the OSD.

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
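If the drive group filter keeps refusing to use the free space, one manual escape hatch is to create the DB LV yourself in the existing VG and hand both devices to the orchestrator explicitly. This is only a sketch with names from this thread and a made-up LV name, and the key=value form of "orch daemon add osd" may require a newer cephadm than 15.2.9, so check the documentation for the release in use:

# on pech-hd-7: carve a new DB LV out of the free space in the block-dbs VG
lvcreate -L 357G -n osd-block-db-replacement ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b

# add the OSD explicitly, reusing the destroyed id 178
ceph orch daemon add osd pech-hd-7:data_devices=/dev/sdt,db_devices=ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b/osd-block-db-replacement,osd_id_claims=178

An OSD created this way may not be picked up by the hdd spec afterwards, so treat it as a workaround rather than the clean fix.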
[ceph-users] Re: cephadm: How to replace failed HDD where DB is on SSD
On 26.05.2021 08:22, Eugen Block wrote:

Hi,

did you wipe the LV on the SSD that was assigned to the failed HDD? I just did that on a fresh Pacific install successfully, a couple of weeks ago it also worked on an Octopus cluster.

No, I did not wipe the LV. Not sure what you mean by wipe, so I tried overwriting the LV with /dev/zero, but that did not solve it.

So by wipe, do you mean deleting the LV with lvremove?

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
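For anyone landing here later: "wipe" in this context ends up meaning removing the LVM volume, not just overwriting its contents, because cephadm decides based on the LVM metadata. Roughly the difference, with placeholders for the VG/LV names:

# overwriting data only; the LV and its tags stay, so cephadm still sees the space as occupied
dd if=/dev/zero of=/dev/<vg>/<lv> bs=1M count=10 conv=fsync

# what ceph-volume's zap --destroy does: overwrite and then lvremove the LV
cephadm ceph-volume lvm zap --destroy /dev/<vg>/<lv>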
[ceph-users] cephadm: How to replace failed HDD where DB is on SSD
Hi

The server runs 15.2.9 and has 15 HDDs and 3 SSDs. The OSDs were created with this YAML file, hdd.yml:

service_type: osd
service_id: hdd
placement:
  host_pattern: 'pech-hd-*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0

The result was that the 3 SSDs were added to 1 VG with 15 LVs on it.

# vgs | egrep "VG|dbs"
  VG                                                    #PV #LV #SN Attr   VSize  VFree
  ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b     3  15   0 wz--n- <5.24t 48.00m

One of the OSDs failed and I ran rm with replace:

# ceph orch osd rm 178 --replace

and the result is:

# ceph osd tree | grep "ID|destroyed"
ID   CLASS  WEIGHT    TYPE NAME  STATUS     REWEIGHT  PRI-AFF
178  hdd    12.82390  osd.178    destroyed         0  1.0

But I'm not able to replace the disk with the same YAML file as shown above.

# ceph orch apply osd -i hdd.yml --dry-run
OSDSPEC PREVIEWS
+---------+------+-----------+----------+----+-----+
|SERVICE  |NAME  |HOST       |DATA      |DB  |WAL  |
+---------+------+-----------+----------+----+-----+
+---------+------+-----------+----------+----+-----+

I guess this is the wrong way to do it, but I can't find the answer in the documentation. So how can I replace this failed disk in Cephadm?

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
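Pulling the pieces of this thread together, the intended replacement flow looks roughly like this (ids and file names are the ones from this cluster, the LV path is a placeholder); note that at this point in the thread the last step still comes back empty, which is the open question:

# 1. retire the failed OSD but keep its id reserved as "destroyed"
ceph orch osd rm 178 --replace

# 2. free the DB LV that belonged to it, so the VG has room for a new one
cephadm ceph-volume lvm zap --destroy /dev/<vg>/<db-lv-of-osd-178>

# 3. after swapping the physical disk, preview what the existing spec would create
ceph orch apply osd -i hdd.yml --dry-run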
[ceph-users] Re: Cephadm: Upgrade 15.2.5 -> 15.2.9 stops on non existing OSD
On 11.03.2021 15:47, Sebastian Wagner wrote:

Yes.

On 11.03.21 at 15:46, Kai Stian Olstad wrote:

To resolve it, could I just remove it with "cephadm rm-daemon"?

That worked like a charm, and the upgrade has resumed. Thank you Sebastian.

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
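For the archives, the command in question is run on the host that still lists the ghost daemon, roughly as follows (fsid taken from the cephadm ls output quoted below in this thread; rm-daemon may insist on --force for OSD daemons):

cephadm rm-daemon --name osd.355 --fsid 3614abcc-201c-11eb-995a-2794bcc75ae0 --force

# then let the orchestrator pick the upgrade back up
ceph orch upgrade resume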
[ceph-users] Re: Cephadm: Upgrade 15.2.5 -> 15.2.9 stops on non existing OSD
Hi Sebastian

On 11.03.2021 13:13, Sebastian Wagner wrote:

Looks like

$ ssh pech-hd-009
# cephadm ls

is returning this non-existent OSD. Can you verify that `cephadm ls` on that host doesn't print osd.355?

"cephadm ls" on the node does list this daemon:

{
    "style": "cephadm:v1",
    "name": "osd.355",
    "fsid": "3614abcc-201c-11eb-995a-2794bcc75ae0",
    "systemd_unit": "ceph-3614abcc-201c-11eb-995a-2794bcc75ae0@osd.355",
    "enabled": true,
    "state": "stopped",
    "container_id": null,
    "container_image_name": "goharbor.example.com/library/ceph/ceph:v15.2.5",
    "container_image_id": null,
    "version": null,
    "started": null,
    "created": "2021-01-20T09:53:22.229080",
    "deployed": "2021-02-09T09:24:02.855576",
    "configured": "2021-02-09T09:24:04.211587"
}

To resolve it, could I just remove it with "cephadm rm-daemon"?

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
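A quick way to hunt for leftover entries like this across a host, sketched with jq against the JSON that cephadm ls prints (field names as in the snippet above):

cephadm ls | jq -r '.[] | select(.state == "stopped") | [.name, .container_image_name] | @tsv'

Anything that shows up there but is gone from "ceph osd tree" is a candidate for cephadm rm-daemon.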
[ceph-users] Cephadm: Upgrade 15.2.5 -> 15.2.9 stops on non existing OSD
Before I started the upgrade the cluster was healthy, but one OSD (osd.355) was down; I can't remember if it was in or out.

The upgrade was started with:

ceph orch upgrade start --image goharbor.example.com/library/ceph/ceph:v15.2.9

The upgrade started, but when Ceph tried to upgrade osd.355 it paused with the following messages:

2021-03-11T09:15:35.638104+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: Target is goharbor.example.com/library/ceph/ceph:v15.2.9 with id dfc48307963697ff48acd9dd6fda4a7a24017b9d8124f86c2a542b0802fe77ba
2021-03-11T09:15:35.639882+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: Checking mgr daemons...
2021-03-11T09:15:35.644170+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: All mgr daemons are up to date.
2021-03-11T09:15:35.644376+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: Checking mon daemons...
2021-03-11T09:15:35.647669+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: All mon daemons are up to date.
2021-03-11T09:15:35.647866+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: Checking crash daemons...
2021-03-11T09:15:35.652035+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: Setting container_image for all crash...
2021-03-11T09:15:35.653683+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: All crash daemons are up to date.
2021-03-11T09:15:35.653896+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: Checking osd daemons...
2021-03-11T09:15:36.273345+ mgr.pech-mon-2.cjeiyc [INF] It is presumed safe to stop ['osd.355']
2021-03-11T09:15:36.273504+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: It is presumed safe to stop ['osd.355']
2021-03-11T09:15:36.273887+ mgr.pech-mon-2.cjeiyc [INF] Upgrade: Redeploying osd.355
2021-03-11T09:15:36.276673+ mgr.pech-mon-2.cjeiyc [ERR] Upgrade: Paused due to UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.355 on host pech-hd-009 failed.

One of the first things the upgrade did was to upgrade the mons, so they were restarted, and now osd.355 no longer exists:

$ ceph osd info osd.355
Error EINVAL: osd.355 does not exist

But if I run a resume

ceph orch upgrade resume

it still tries to upgrade osd.355, same message as above.

I tried to stop and start the upgrade again with

ceph orch upgrade stop
ceph orch upgrade start --image goharbor.example.com/library/ceph/ceph:v15.2.9

and it still tries to upgrade osd.355, with the same message as above.

Looking at the source code, it looks like it gets the daemons to upgrade from the mgr cache, so I restarted both mgrs, but it still tries to upgrade osd.355.

Does anyone know how I can get the upgrade to continue?

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
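A few places that help narrow down this kind of stall, as a sketch (host name taken from the log above):

# state of the running upgrade and the error it is paused on
ceph orch upgrade status

# what the orchestrator believes is deployed on the affected host
ceph orch ps pech-hd-009

# what cephadm itself finds, run on pech-hd-009
cephadm ls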