[ceph-users] Re: pg repair doesn't start
Hi Eugen, thanks for your answer. I gave the search another try and did indeed find something:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TN6WJVCHTVJ4YIA4JH2D2WYYZFZRMSXI/

Quote: " ... And I've also observed that the repair req isn't queued up -- if the OSDs are busy with other scrubs, the repair req is forgotten. ..."

I'm biting my tongue really really hard right now. @Dan (if you read this), thanks for the script:
https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh

New status:

# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
    mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1086 osds: 1071 up (since 14h), 1070 in (since 4d); 542 remapped pgs

  task status:

  data:
    pools:   14 pools, 17185 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
    pgs:     301878494/11947144857 objects misplaced (2.527%)
             16634 active+clean
             513   active+remapped+backfill_wait
             19    active+remapped+backfilling
             10    active+remapped+backfill_wait+forced_backfill
             6     active+clean+scrubbing+deep
             2     active+clean+scrubbing
             1     active+clean+scrubbing+deep+inconsistent+repair

  io:
    client:   444 MiB/s rd, 446 MiB/s wr, 2.19k op/s rd, 2.34k op/s wr
    recovery: 0 B/s, 223 objects/s

Yay!

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: 13 October 2022 23:23:10
To: ceph-users@ceph.io
Subject: [ceph-users] Re: pg repair doesn't start

Hi,

I’m not sure if I remember correctly, but I believe the backfill is preventing the repair from happening. I think it has been discussed a couple of times on this list, but I don’t know right now whether you can tweak anything to prioritize the repair; I believe there is something, but I'm not sure. It looks like your backfill could take quite some time…

Zitat von Frank Schilder :

> Hi all,
>
> we have an inconsistent PG for a couple of days now (octopus latest):
>
> # ceph status
>   cluster:
>     id:
>     health: HEALTH_ERR
>             1 scrub errors
>             Possible data damage: 1 pg inconsistent
>
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
>     mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs
>
>   task status:
>
>   data:
>     pools:   14 pools, 17185 pgs
>     objects: 1.39G objects, 2.5 PiB
>     usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
>     pgs:     305530535/11943726075 objects misplaced (2.558%)
>              16614 active+clean
>              516   active+remapped+backfill_wait
>              23    active+clean+scrubbing+deep
>              21    active+remapped+backfilling
>              10    active+remapped+backfill_wait+forced_backfill
>              1     active+clean+inconsistent
>
>   io:
>     client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
>     recovery: 0 B/s, 224 objects/s
>
> I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it
> never got executed (checked the logs for repair state). The usual
> wait time we had on our cluster so far was 2-6 hours. 36 hours is
> unusually long. The pool in question is moderately busy and has no
> misplaced objects. Its only unhealthy PG is the inconsistent one.
>
> Are there situations in which ceph cancels/ignores a pg repair?
> Is there any way to check if it is actually still scheduled to happen?
> Is there a way to force it a bit more urgently?
>
> The error was caused by a read error, the drive is healthy:
>
> 2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR] 11.1ba shard 294(6) soid 11:5df75341:::rbd_data.1.b688997dc79def.0005d530:head : candidate had a read error
> 2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR] 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
> 2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR] 11.1ba deep-scrub 1 errors
> 2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 : cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent, 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13 active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273 active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail; 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects misplaced (0.136%); 0 B/s, 513 objects/s recovering
>
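For anyone landing on this thread later, a minimal sketch of the two steps that ended up resolving it, i.e. inspecting the inconsistent PG and nudging the repair past busy scrubs. This only illustrates the idea behind the linked autorepair script, it is not the script itself; the temporary osd_max_scrubs bump is an assumption for this cluster, not a general recommendation.

# Show which object(s) in the PG are inconsistent and why
rados list-inconsistent-obj 11.1ba --format=json-pretty

# Give the acting OSDs headroom so the repair (a deep-scrub variant)
# is not starved by other running scrubs, then request the repair again
ceph config set osd osd_max_scrubs 2
ceph pg repair 11.1ba

# Optionally let OSDs fix simple scrub errors on their own during deep scrubs
ceph config set osd osd_scrub_auto_repair true

# Revert the temporary setting once the PG is active+clean again
ceph config rm osd osd_max_scrubs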
[ceph-users] Re: pg repair doesn't start
Hi,

I’m not sure if I remember correctly, but I believe the backfill is preventing the repair from happening. I think it has been discussed a couple of times on this list, but I don’t know right now whether you can tweak anything to prioritize the repair; I believe there is something, but I'm not sure. It looks like your backfill could take quite some time…

Zitat von Frank Schilder :

Hi all,

we have an inconsistent PG for a couple of days now (octopus latest):

# ceph status
  cluster:
    id:
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
    mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs

  task status:

  data:
    pools:   14 pools, 17185 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
    pgs:     305530535/11943726075 objects misplaced (2.558%)
             16614 active+clean
             516   active+remapped+backfill_wait
             23    active+clean+scrubbing+deep
             21    active+remapped+backfilling
             10    active+remapped+backfill_wait+forced_backfill
             1     active+clean+inconsistent

  io:
    client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
    recovery: 0 B/s, 224 objects/s

I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it never got executed (checked the logs for repair state). The usual wait time we had on our cluster so far was 2-6 hours. 36 hours is unusually long. The pool in question is moderately busy and has no misplaced objects. Its only unhealthy PG is the inconsistent one.

Are there situations in which ceph cancels/ignores a pg repair?
Is there any way to check if it is actually still scheduled to happen?
Is there a way to force it a bit more urgently?

The error was caused by a read error, the drive is healthy:

2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR] 11.1ba shard 294(6) soid 11:5df75341:::rbd_data.1.b688997dc79def.0005d530:head : candidate had a read error
2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR] 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR] 11.1ba deep-scrub 1 errors
2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 : cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent, 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13 active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273 active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail; 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects misplaced (0.136%); 0 B/s, 513 objects/s recovering
2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] pg repair doesn't start
Hi all,

we have an inconsistent PG for a couple of days now (octopus latest):

# ceph status
  cluster:
    id:
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
    mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs

  task status:

  data:
    pools:   14 pools, 17185 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
    pgs:     305530535/11943726075 objects misplaced (2.558%)
             16614 active+clean
             516   active+remapped+backfill_wait
             23    active+clean+scrubbing+deep
             21    active+remapped+backfilling
             10    active+remapped+backfill_wait+forced_backfill
             1     active+clean+inconsistent

  io:
    client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
    recovery: 0 B/s, 224 objects/s

I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it never got executed (checked the logs for repair state). The usual wait time we had on our cluster so far was 2-6 hours. 36 hours is unusually long. The pool in question is moderately busy and has no misplaced objects. Its only unhealthy PG is the inconsistent one.

Are there situations in which ceph cancels/ignores a pg repair?
Is there any way to check if it is actually still scheduled to happen?
Is there a way to force it a bit more urgently?

The error was caused by a read error, the drive is healthy:

2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR] 11.1ba shard 294(6) soid 11:5df75341:::rbd_data.1.b688997dc79def.0005d530:head : candidate had a read error
2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR] 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR] 11.1ba deep-scrub 1 errors
2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 : cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent, 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13 active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273 active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail; 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects misplaced (0.136%); 0 B/s, 513 objects/s recovering
2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] monitoring drives
I was wondering what the best practice is for monitoring drives. I am transitioning from SATA to SAS drives, which expose less smartctl information, not even power-on hours. For example, does Ceph record anywhere when an OSD has been created? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
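Not a full answer to the creation-date question, but a hedged sketch of the built-in device tracking that may help with drive monitoring; the OSD id and device id below are made-up examples.

# Turn on the mgr devicehealth module's monitoring (it polls SMART data periodically)
ceph device monitoring on

# List the devices ceph knows about, with the daemons that use them
ceph device ls

# Health metrics ceph has collected for one device (SAS drives typically report fewer fields than SATA)
ceph device get-health-metrics SEAGATE_ST16000NM002G_ZL2KH8D6

# Per-OSD metadata (model, serial, rotational, ...) as registered by the OSD itself
ceph osd metadata 12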
[ceph-users] Re: Cluster crashing when stopping some host
Unfortunately I can't verify whether ceph reports any inactive PGs. As soon as the second host disconnects practically everything locks up; nothing appears even when using "ceph -w". The OSDs only show up as offline once dcs2 returns.

Note: Apparently there was a new update recently. When I was in the test environment this behavior did not happen: dcs1 stayed UP with all services without crashing even with dcs2 DOWN, serving reads and writes, even without dcs3 added.

### COMMANDS ###

[ceph: root@dcs1 /]# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1         65.49570  root default
-3         32.74785      host dcs1
 0    hdd   2.72899          osd.0   up  1.0  1.0
 1    hdd   2.72899          osd.1   up  1.0  1.0
 2    hdd   2.72899          osd.2   up  1.0  1.0
 3    hdd   2.72899          osd.3   up  1.0  1.0
 4    hdd   2.72899          osd.4   up  1.0  1.0
 5    hdd   2.72899          osd.5   up  1.0  1.0
 6    hdd   2.72899          osd.6   up  1.0  1.0
 7    hdd   2.72899          osd.7   up  1.0  1.0
 8    hdd   2.72899          osd.8   up  1.0  1.0
 9    hdd   2.72899          osd.9   up  1.0  1.0
10    hdd   2.72899          osd.10  up  1.0  1.0
11    hdd   2.72899          osd.11  up  1.0  1.0
-5         32.74785      host dcs2
12    hdd   2.72899          osd.12  up  1.0  1.0
13    hdd   2.72899          osd.13  up  1.0  1.0
14    hdd   2.72899          osd.14  up  1.0  1.0
15    hdd   2.72899          osd.15  up  1.0  1.0
16    hdd   2.72899          osd.16  up  1.0  1.0
17    hdd   2.72899          osd.17  up  1.0  1.0
18    hdd   2.72899          osd.18  up  1.0  1.0
19    hdd   2.72899          osd.19  up  1.0  1.0
20    hdd   2.72899          osd.20  up  1.0  1.0
21    hdd   2.72899          osd.21  up  1.0  1.0
22    hdd   2.72899          osd.22  up  1.0  1.0
23    hdd   2.72899          osd.23  up  1.0  1.0

[ceph: root@dcs1 /]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.ovirt_hosted_engine.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 77 lfor 0/0/47 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.ovirt_hosted_engine.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 179 lfor 0/0/47 flags hashpspool max_bytes 107374182400 stripe_width 0 application cephfs
pool 6 '.nfs' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 254 lfor 0/0/252 flags hashpspool stripe_width 0 application nfs
pool 7 'cephfs.ovirt_storage_sas.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 322 lfor 0/0/287 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 8 'cephfs.ovirt_storage_sas.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 291 lfor 0/0/289 flags hashpspool stripe_width 0 application cephfs
pool 9 'cephfs.ovirt_storage_iso.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 356 lfor 0/0/325 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 10 'cephfs.ovirt_storage_iso.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 329 lfor 0/0/327 flags hashpspool stripe_width 0 application cephfs

[ceph: root@dcs1 /]# ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "type": 1,
    "steps": [
        { "op": "take", "item": -1, "item_name": "default" },
        { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
        { "op": "emit" }
    ]
}

[ceph: root@dcs1 /]# ceph pg ls-by-pool cephfs.ovirt_hosted_engine.data
PG  OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE  SINCE  VERSION  REPORTED  UP  ACTING  SCRUB_STAMP  DEEP_SCRUB_STAMP  LAST_SCRUB_DURATION  SCRUB_SCHEDULING
3.069 0 00 2852130950 0 10057 active+clean
[ceph-users] Re: crush hierarchy backwards and upmaps ...
Dan,

Again, I am using 16.2.10 on Rocky 8. I decided to take a step back and check a variety of options before I do anything. Here are my results.

If I use this rule:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step choose indep 2 type chassis
    step chooseleaf indep 1 type host
    step emit
}

(this is with the pod definitions all changed to type chassis), I get NO moves when running osdmaptool --test-pg-upmap-items and comparing to the current map. But --upmap-cleanup gives:

check_pg_upmaps verify upmap of pool.pgid returning -22
verify_upmap number of buckets 8 exceeds desired 2

for each of my existing upmaps, and it wants to remove them all.

If I use the rule:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type chassis
    step emit
}

I get almost 1/2 my data moving as per osdmaptool --test-pg-upmap-items. With --upmap-cleanup I get:

verify_upmap multiple osds N,M come from the same failure domain -382
check_pg_upmap verify upmap of pg poolid.pgid returning -22

for about 1/8 of my upmaps, and it wants to remove these and add about 100 more. Although I suspect that this will be rectified after things are moved and such. Am I correct?

If I use the rule (after changing my rack definition to only contain hosts that were previously part of the pods or chassis):

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type host
    step emit
}

I get almost all my data moving as per osdmaptool --test-pg-upmap-items. With --upmap-cleanup I get only 10 of these:

verify_upmap multiple osds N,M come from the same failure domain -382
check_pg_upmap verify upmap of pg poolid.pgid returning -22

But upmap-cleanup wants to remove all my upmaps, which may actually make sense if we redo the entire map this way.

For the first rule, where I am getting the "number of buckets 8 exceeds desired 2" error, I am curious whether I am hitting this bug, which seems to suggest that I am having a problem because I have a multi-level (>2 level) rule for an EC pool:

https://tracker.ceph.com/issues/51729

This bug appears to be on 14.x, but perhaps it exists on pacific as well. It would be great if I could use the first rule, except for this bug. Perhaps the second rule is best at this point. Any other thoughts would be appreciated.

-Chris

-Original Message-
From: Dan van der Ster
To: Christopher Durham
Cc: Ceph Users
Sent: Tue, Oct 11, 2022 11:39 am
Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...

Hi Chris,

Just curious, does this rule make sense and help with the multi level crush map issue? (Maybe it also results in zero movement, or at least less than the alternative you proposed?)

step choose indep 4 type rack
step chooseleaf indep 2 type chassis

Cheers, Dan

On Tue, Oct 11, 2022, 19:29 Christopher Durham wrote:
> Dan,
>
> Thank you.
>
> I did what you said regarding --test-map-pgs-dump and it wants to move 3
> OSDs in every PG. Yuk.
>
> So before i do that, I tried this rule, after changing all my 'pod' bucket
> definitions to 'chassis', and compiling and
> injecting the new crushmap to an osdmap:
>
>
> rule mypoolname {
> id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step choose indep 2 type chassis
> step chooseleaf indep 1 type host
> step emit
>
> }
>
> --test-pg-upmap-entries shows there were NO changes to be done after
> comparing it with the original!!!
> > However, --upmap-cleanup says: > > verify_upmap number of buckets 8 exceeds desired number of 2 > check_pg_upmaps verify_upmap of poolid.pgid returning -22 > > This is output for every current upmap, but I really do want 8 total > buckets per PG, as my pool is a 6+2. > > The upmap-cleanup output wants me to remove all of my upmaps. > > This seems consistent with a bug report that says that there is a problem > with the balancer on a > multi-level rule such as the above, albeit on 14.2.x. Any thoughts? > > https://tracker.ceph.com/issues/51729 > > I am leaning towards just eliminating the middle rule and go directly from > rack to host, even though > it wants to move a LARGE amount of data according to a diff before and > after of --test-pg-upmap-entries. > In this scenario, I dont see any unexpected errors with --upmap-cleanup > and I do not want to get stuck > > rule mypoolname { > id -5 > type erasure > step take myroot > step choose indep 4 type rack > step chooseleaf indep 2 type host > step emit > } > > -Chris > > > -Original Message- > From: Dan van der Ster > To: Christopher Durham > Cc: Ceph Users > Sent: Mon, Oct 10, 2022 12:22 pm > Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ... > > Hi, > > Here's a similar bug: https://tracker.ceph.com/issues/47361 > > Back then, upmap would generate mappings that invalidate the crush
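For reference, a sketch of the offline test loop used in this thread, i.e. trying a candidate crush map against a copy of the osdmap before touching the cluster. File names and the pool id are placeholders, and the exact osdmaptool test flags you prefer (the thread uses --test-pg-upmap-items/--test-pg-upmap-entries as well) can be swapped in.

# Grab the current maps
ceph osd getmap -o osdmap.bin
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt        # edit the rule in crush.txt

# Build a candidate crush map and load it into a copy of the osdmap
crushtool -c crush.txt -o crush.new
cp osdmap.bin osdmap.new
osdmaptool osdmap.new --import-crush crush.new

# Compare PG mappings before/after, and see what upmap validation would do
osdmaptool osdmap.bin --test-map-pgs-dump --pool 5 > before.txt
osdmaptool osdmap.new --test-map-pgs-dump --pool 5 > after.txt
diff before.txt after.txt
osdmaptool osdmap.new --upmap-cleanup cleanup.txt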
[ceph-users] Re: Cluster crashing when stopping some host
If you do not mind data loss, why do you care about needing to have 2x? Alternative would be to change the replication so it is not over hosts but just on osd's that can reside on one host. > Marc, but there is no mechanism to prevent IO pause? At the moment I > don't worry about data loss. > I understand that putting it as replica x1 can work, but I need it to be > x2. > > > > > > I'm having strange behavior on a new cluster. > > Not strange, by design > > > I have 3 machines, two of them have the disks. We can name them > like > > this: > > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. > > > > I started bootstrapping through dcs1, added the other hosts and > left mgr > > on > > dcs3 only. > > > > What is happening is that if I take down dcs2 everything hangs > and > > becomes > > irresponsible, including the mount points that were pointed to > dcs1. > > You have to have disks in 3 machines. (Or set the replication to > 1x) > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster crashing when stopping some host
Could you share more details? Does ceph report inactive PGs when one node is down? Please share: ceph osd tree ceph osd pool ls detail ceph osd crush rule dump ceph pg ls-by-pool ceph -s Zitat von Murilo Morais : Thanks for answering. Marc, but there is no mechanism to prevent IO pause? At the moment I don't worry about data loss. I understand that putting it as replica x1 can work, but I need it to be x2. Em qui., 13 de out. de 2022 às 12:26, Marc escreveu: > > I'm having strange behavior on a new cluster. Not strange, by design > I have 3 machines, two of them have the disks. We can name them like > this: > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. > > I started bootstrapping through dcs1, added the other hosts and left mgr > on > dcs3 only. > > What is happening is that if I take down dcs2 everything hangs and > becomes > irresponsible, including the mount points that were pointed to dcs1. You have to have disks in 3 machines. (Or set the replication to 1x) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster crashing when stopping some host
Thanks for answering. Marc, but there is no mechanism to prevent IO pause? At the moment I don't worry about data loss. I understand that putting it as replica x1 can work, but I need it to be x2. Em qui., 13 de out. de 2022 às 12:26, Marc escreveu: > > > > > I'm having strange behavior on a new cluster. > > Not strange, by design > > > I have 3 machines, two of them have the disks. We can name them like > > this: > > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. > > > > I started bootstrapping through dcs1, added the other hosts and left mgr > > on > > dcs3 only. > > > > What is happening is that if I take down dcs2 everything hangs and > > becomes > > irresponsible, including the mount points that were pointed to dcs1. > > You have to have disks in 3 machines. (Or set the replication to 1x) > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: why rgw generates large quantities orphan objects?
Hi Liang, My guess would be this bug: https://tracker.ceph.com/issues/44660 https://www.spinics.net/lists/ceph-users/msg30151.html It's actually existed for at least 6 years: https://tracker.ceph.com/issues/16767 Which occurs any time you reupload the same *part* in a single Multipart Upload multiple times. For example, if my Multipart upload consists of 3 parts, if I upload part #2 twice, then the first upload of part #2 becomes orphaned. If this was indeed the cause, you should have multiple "_multipart_" rados objects for the same part in "rados ls". For example, here's all the rados objects associated with a bugged bucket before I deleted it: cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.4vkWzU4C5XLd2R6unFgbQ6aZM26vPuq8.1 cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.2~4zogSe4Ep0xvSC8j6aX71x_96cOgvQN.1 cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__shadow_file.txt.4vkWzU4C5XLd2R6unFgbQ6aZM26vPuq8.1_1 cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__shadow_file.txt.2~4zogSe4Ep0xvSC8j6aX71x_96cOgvQN.1_1 If we look at just these two: cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.4vkWzU4C5XLd2R6unFgbQ6aZM26vPuq8.1 cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.2~4zogSe4Ep0xvSC8j6aX71x_96cOgvQN.1 They are in the format: $BUCKETID__multipart_$S3KEY.$PARTUID.$PARTNUM Because everything matches ($BUCKETID, $S3KEY, $PARTNUM) except for $PARTUID, this S3 object has been affected by the bug. If you find instances of rados keys that match on everything except $PARTUID, then this bug is probably the cause. Josh From: 郑亮 Sent: Wednesday, October 12, 2022 1:34:31 AM To: ceph-users@ceph.io Subject: [ceph-users] why rgw generates large quantities orphan objects? Hi all, Description of problem: [RGW] Buckets/objects deletion is causing large quantities orphan raods objects The cluster was running a cosbench workload, then remove the partial data by deleting objects from the cosbench client, then we have deleted all the buckets with the help of `s3cmd rb --recursive --force` command that removed all the buckets, but that did not help in the space reclaimation. ``` [root@node01 /]# rgw-orphan-list Available pools: device_health_metrics .rgw.root os-test.rgw.buckets.non-ec os-test.rgw.log os-test.rgw.control os-test.rgw.buckets.index os-test.rgw.meta os-test.rgw.buckets.data deeproute-replica-hdd-pool deeproute-replica-ssd-pool cephfs-metadata cephfs-replicated-pool .nfs Which pool do you want to search for orphans (for multiple, use space-separated list)? os-test.rgw.buckets.data Pool is "os-test.rgw.buckets.data". Note: output files produced will be tagged with the current timestamp -- 20221008062356. running 'rados ls' at Sat Oct 8 06:24:05 UTC 2022 running 'rados ls' on pool os-test.rgw.buckets.data. running 'radosgw-admin bucket radoslist' at Sat Oct 8 06:43:21 UTC 2022 computing delta at Sat Oct 8 06:47:17 UTC 2022 39662551 potential orphans found out of a possible 39844453 (99%). The results can be found in './orphan-list-20221008062356.out'. Intermediate files are './rados-20221008062356.intermediate' and './radosgw-admin-20221008062356.intermediate'. *** *** WARNING: This is EXPERIMENTAL code and the results should be used *** only with CAUTION! *** Done at Sat Oct 8 06:48:07 UTC 2022. 
[root@node01 /]# radosgw-admin gc list [] [root@node01 /]# cat orphan-list-20221008062356.out | wc -l 39662551 [root@node01 /]# rados df POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR .nfs 4.3 MiB 4 0 12 00 0 77398 76 MiB146 79 KiB 0 B 0 B .rgw.root180 KiB16 0 48 00 0 28749 28 MiB 0 0 B 0 B 0 B cephfs-metadata 932 MiB 14772 0 44316 00 01569690 3.8 GiB1258651 3.4 GiB 0 B 0 B cephfs-replicated-pool 738 GiB300962 0 902886 00 0 794612 470 GiB 770689 245 GiB 0 B 0 B deeproute-replica-hdd-pool 1016 GiB104276 0 312828 00 0 18176216 298 GiB 441783780 6.7 TiB 0 B 0 B deeproute-replica-ssd-pool30 GiB 3691 0 11073 00 02466079 2.1 GiB8416232 221 GiB 0 B 0 B device_health_metrics 50 MiB 108 0324 00 0 1836 1.8 MiB 1944 18 MiB 0 B 0 B os-test.rgw.buckets.data 5.6 TiB 39844453 0 239066718 00
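If it helps anyone hunting the same issue, a rough shell sketch of Josh's matching rule (same bucket id, key and part number, but different part upload uid). It is naive text processing over the orphan list file produced above, assumes the usual "__multipart_" naming where the last dot-separated field is the part number and the second-to-last is the upload uid, and its output should be treated as a hint, not a verdict.

# Collapse each multipart rados object name to "prefix + part number" by dropping
# the upload uid; prefixes that now appear more than once were uploaded under
# several uids, i.e. candidates for the part-reupload bug.
grep '__multipart_' orphan-list-20221008062356.out | awk -F'.' '
  {
    part = $NF
    prefix = $1
    for (i = 2; i <= NF - 2; i++) prefix = prefix "." $i
    print prefix " part=" part
  }' | sort | uniq -c | awk '$1 > 1'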
[ceph-users] Re: Cluster crashing when stopping some host
> > I'm having strange behavior on a new cluster. Not strange, by design > I have 3 machines, two of them have the disks. We can name them like > this: > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. > > I started bootstrapping through dcs1, added the other hosts and left mgr > on > dcs3 only. > > What is happening is that if I take down dcs2 everything hangs and > becomes > irresponsible, including the mount points that were pointed to dcs1. You have to have disks in 3 machines. (Or set the replication to 1x) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster crashing when stopping some host
I'm using Host as Failure Domain. Em qui., 13 de out. de 2022 às 11:41, Eugen Block escreveu: > What is your failure domain? If it's osd you'd have both PGs on the > same host and then no replica is available. > > Zitat von Murilo Morais : > > > Eugen, thanks for responding. > > > > In the current scenario there is no way to insert disks into dcs3. > > > > My pools are size 2, at the moment we can't add more machines with disks, > > so it was sized in this proportion. > > > > Even with min_size=1, if dcs2 stops the IO also stops. > > > > Em qui., 13 de out. de 2022 às 11:19, Eugen Block > escreveu: > > > >> Hi, > >> > >> if your pools have a size 2 (don't do that except in test > >> environments) and host is your failure domain then all IO is paused if > >> one osd host goes down, depending on your min_size. Can you move some > >> disks to dcs3 so you can have size 3 pools with min_size 2? > >> > >> Zitat von Murilo Morais : > >> > >> > Good morning everyone. > >> > > >> > I'm having strange behavior on a new cluster. > >> > > >> > I have 3 machines, two of them have the disks. We can name them like > >> this: > >> > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. > >> > > >> > I started bootstrapping through dcs1, added the other hosts and left > mgr > >> on > >> > dcs3 only. > >> > > >> > What is happening is that if I take down dcs2 everything hangs and > >> becomes > >> > irresponsible, including the mount points that were pointed to dcs1. > >> > ___ > >> > ceph-users mailing list -- ceph-users@ceph.io > >> > To unsubscribe send an email to ceph-users-le...@ceph.io > >> > >> > >> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > > > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster crashing when stopping some host
What is your failure domain? If it's osd you'd have both PGs on the same host and then no replica is available. Zitat von Murilo Morais : Eugen, thanks for responding. In the current scenario there is no way to insert disks into dcs3. My pools are size 2, at the moment we can't add more machines with disks, so it was sized in this proportion. Even with min_size=1, if dcs2 stops the IO also stops. Em qui., 13 de out. de 2022 às 11:19, Eugen Block escreveu: Hi, if your pools have a size 2 (don't do that except in test environments) and host is your failure domain then all IO is paused if one osd host goes down, depending on your min_size. Can you move some disks to dcs3 so you can have size 3 pools with min_size 2? Zitat von Murilo Morais : > Good morning everyone. > > I'm having strange behavior on a new cluster. > > I have 3 machines, two of them have the disks. We can name them like this: > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. > > I started bootstrapping through dcs1, added the other hosts and left mgr on > dcs3 only. > > What is happening is that if I take down dcs2 everything hangs and becomes > irresponsible, including the mount points that were pointed to dcs1. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster crashing when stopping some host
Eugen, thanks for responding. In the current scenario there is no way to insert disks into dcs3. My pools are size 2, at the moment we can't add more machines with disks, so it was sized in this proportion. Even with min_size=1, if dcs2 stops the IO also stops. Em qui., 13 de out. de 2022 às 11:19, Eugen Block escreveu: > Hi, > > if your pools have a size 2 (don't do that except in test > environments) and host is your failure domain then all IO is paused if > one osd host goes down, depending on your min_size. Can you move some > disks to dcs3 so you can have size 3 pools with min_size 2? > > Zitat von Murilo Morais : > > > Good morning everyone. > > > > I'm having strange behavior on a new cluster. > > > > I have 3 machines, two of them have the disks. We can name them like > this: > > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. > > > > I started bootstrapping through dcs1, added the other hosts and left mgr > on > > dcs3 only. > > > > What is happening is that if I take down dcs2 everything hangs and > becomes > > irresponsible, including the mount points that were pointed to dcs1. > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cluster crashing when stopping some host
Hi, if your pools have a size 2 (don't do that except in test environments) and host is your failure domain then all IO is paused if one osd host goes down, depending on your min_size. Can you move some disks to dcs3 so you can have size 3 pools with min_size 2? Zitat von Murilo Morais : Good morning everyone. I'm having strange behavior on a new cluster. I have 3 machines, two of them have the disks. We can name them like this: dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. I started bootstrapping through dcs1, added the other hosts and left mgr on dcs3 only. What is happening is that if I take down dcs2 everything hangs and becomes irresponsible, including the mount points that were pointed to dcs1. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
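A quick way to confirm the failure-domain point made here, a small sketch assuming the default replicated rule and a pool/PG/OSD taken from this thread:

# Which bucket type the rule spreads replicas over ("host" vs "osd")
ceph osd crush rule dump replicated_rule | grep -E '"op"|"type"'

# Where the copies of a given PG actually live right now
ceph pg map 3.0

# Map an OSD id back to its host
ceph osd find 12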
[ceph-users] Cluster crashing when stopping some host
Good morning everyone. I'm seeing strange behavior on a new cluster. I have 3 machines, two of them have the disks. We can name them like this: dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks. I started bootstrapping through dcs1, added the other hosts and left mgr on dcs3 only. What is happening is that if I take down dcs2, everything hangs and becomes unresponsive, including the mount points that were pointed at dcs1. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS Performance and PG/PGP value
Hi Yoann, I'm not using pacific yet, but this here looks very strange to me: cephfs_data data 243T 19.7T usage: 245 TiB used, 89 TiB / 334 TiB avail I'm not sure if there is a mix of raw vs. stored here. Assuming the cephfs_data allocation is right, I'm wondering what your osd [near] full ratios are. The PG counts look very good. The slow ops can have 2 reasons: a bad disk or full OSDs. Looking at 19.7/(243+16.7)=6.4% free I wonder why there are no osd [near] full warnings all over the place. Even if its still 20% free performance can degrade dramatically according to benchmarks we made on octopus. I think you need to provide a lot more details here. Of interest are: ceph df detail ceph osd df tree and possibly a few others. I don't think the multi-MDS mode is bugging you, but you should check. We have seen degraded performance on mimic caused by excessive export_dir operations between the MDSes. However, I can't see such operations reported as stuck. You might want to check on your MDSes with ceph daemon mds.xzy ops | grep -e dirfrag -e export and/or similar commands. You should report a bit what kind of operations tend to be stuck longest. I also remember that there used to be problems having a kclient ceph fs mount on OSD nodes. Not sure if this could play a role here. You have basically zero IO going on: client: 6.2 MiB/s rd, 12 MiB/s wr, 10 op/s rd, 366 op/s wr yet, PGs are laggy. The problem could sit on a non-ceph component. With the hardware you have, there is something very weird going on. You might also want to check that you have the correct MTU on all devices on every single host and that the speed negotiated is the same. Problems like these I have seen with a single host having a wrong MTU and with LACP bonds with a broken transceiver. Something else to check is flaky controller/PCIe connections. We had a case where a controller was behaving odd and we had a huge amount of device resets in the logs. On the host with the broken controller, IO wait was way above average (shown by top). Something similar might happen with NVMes. A painful procedure to locate a bad host could be to out OSDs manually on a single host and wait for PGs to peer and become active. If you have a bad host, in this moment IO should recover to good levels. Do this host by host. I know, it will be a day or two but, well, it might locate something. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Stefan Kooman Sent: 13 October 2022 13:56:45 To: Yoann Moulin; Patrick Donnelly Cc: ceph-users@ceph.io Subject: [ceph-users] Re: MDS Performance and PG/PGP value On 10/13/22 13:47, Yoann Moulin wrote: >> Also, you mentioned you're using 7 active MDS. How's that working out >> for you? Do you use pinning? > > I don't really know how to do that, I have 55 worker nodes in my K8s > cluster, each one can run pods that have access to a cephfs pvc. we have > 28 cephfs persistent volumes. Pods are ML/DL/AI workload, each can be > start and stop whenever our researchers need it. The workloads are > unpredictable. See [1] and [2]. Gr. Stefan [1]: https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank [2]: https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
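A small sketch of the MTU/link-speed check suggested above, looping over the OSD/MDS hosts; the host names are placeholders and reading /sys avoids needing ethtool privileges. Mismatches (one host with a different MTU, or a NIC negotiated down) show up immediately.

for h in ceph-node-{01..10}; do
  echo "== $h"
  # print MTU and negotiated speed of every interface on the host
  ssh "$h" 'grep -H . /sys/class/net/*/mtu /sys/class/net/*/speed 2>/dev/null'
done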
[ceph-users] Re: MDS Performance and PG/PGP value
On 10/13/22 13:47, Yoann Moulin wrote: Also, you mentioned you're using 7 active MDS. How's that working out for you? Do you use pinning? I don't really know how to do that, I have 55 worker nodes in my K8s cluster, each one can run pods that have access to a cephfs pvc. we have 28 cephfs persistent volumes. Pods are ML/DL/AI workload, each can be start and stop whenever our researchers need it. The workloads are unpredictable. See [1] and [2]. Gr. Stefan [1]: https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank [2]: https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
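For the archives, the manual pinning from [1] boils down to an extended attribute on the directory; a minimal sketch with made-up paths, assuming the client mount supports setfattr. The distributed/ephemeral variant from [2] needs a release that supports it.

# Pin everything under this volume's directory to MDS rank 1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/volumes/csi/pvc-1234

# Or let ceph spread the immediate children of a parent directory across ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes/csi

# Unpin again
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/volumes/csi/pvc-1234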
[ceph-users] Re: MDS Performance and PG/PGP value
Hello Patrick, Unfortunately, increasing the number of PG did not help a lot in the end, my cluster is still in trouble... Here the current state of my cluster : https://pastebin.com/Avw5ybgd Is 256 good value in our case ? We have 80TB of data with more than 300M files. You want at least as many PGs that each of the OSDs host a portion of the OMAP data. You want to spread out OMAP to as many _fast_ OSDs as possible. I have tried to find an answer to your question: are more metadata PGs better? I haven't found a definitive answer. This would ideally be tested in a non-prod / pre-prod environment and tuned to individual requirements (type of workload). For now, I would not blindly trust the PG autoscaler. I have seen it advise settings that would definately not be OK. You can skew things in the autoscaler with the "bias" parameter, to compensate for this. But as far as I know the current heuristics to determine a good value do not take into account the importance of OMAP (RocksDB) spread accross OSDs. See a blog post about autoscaler tuning [1]. It would be great if tuning metadata PGs for CephFS / RGW could be performed during the "large scale tests" the devs are planning to perform in the future. With use cases that take into consideration "a lot of small files / objects" versus "loads of large files / objects" to get a feeling how tuning this impacts performance for different work loads. Gr. Stefan [1]: https://ceph.io/en/news/blog/2022/autoscaler_tuning/ Thanks for the information, I agree that autoscaler seem to not be able to handle my use case. (thanks to icepic...@gmail.com too) By the way, since I have set PG=256, I have much less SLOW requests than before, even I still have, the impact on my users has been reduced a lot. # zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log floki.log.4.gz:6883 floki.log.3.gz:11794 floki.log.2.gz:3391 floki.log.1.gz:1180 floki.log:122 If I have the opportunity, I will try to run some benchmark with multiple value of the PG on cephfs_metadata pool. 256 sounds like a good number to me. Maybe even 128. If you do some experiments, please do share the results. Yes, of course. Also, you mentioned you're using 7 active MDS. How's that working out for you? Do you use pinning? I don't really know how to do that, I have 55 worker nodes in my K8s cluster, each one can run pods that have access to a cephfs pvc. we have 28 cephfs persistent volumes. Pods are ML/DL/AI workload, each can be start and stop whenever our researchers need it. The workloads are unpredictable. Thanks for your help. Best regards, -- Yoann Moulin EPFL IC-IT ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process
Hi Christian, resharding is not an issue, because we only sync the metadata. Like aws s3. But this looks very broken to me, does anyone got an idea how to fix that? > Am 13.10.2022 um 11:58 schrieb Christian Rohmann > : > > Hey Boris, > >> On 07/10/2022 11:30, Boris Behrens wrote: >> I just wanted to reshard a bucket but mistyped the amount of shards. In a >> reflex I hit ctrl-c and waited. It looked like the resharding did not >> finish so I canceled it, and now the bucket is in this state. >> How can I fix it. It does not show up in the stale-instace list. It's also >> a multisite environment (we only sync metadata). > I believe resharding is not supported with rgw multisite > (https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#multisite) > but is being worked on / implemented fpr the Quincy release, see > https://tracker.ceph.com/projects/rgw/issues?query_id=247 > > But you are not syncing the data in your deployment? Maybe that's a different > case then? > > > > Regards > > Christian > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
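In case it helps with the debugging, the reshard bookkeeping can usually be inspected and cleared with the commands below; the bucket name is a placeholder, and this is only where I would start looking, not a known fix for the broken state described above.

radosgw-admin reshard list
radosgw-admin reshard status --bucket=mybucket
radosgw-admin reshard cancel --bucket=mybucket
radosgw-admin reshard stale-instances list
# radosgw-admin reshard stale-instances rm   # only after reviewing the list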
[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process
Hey Boris,

On 07/10/2022 11:30, Boris Behrens wrote:

I just wanted to reshard a bucket but mistyped the number of shards. In a reflex I hit ctrl-c and waited. It looked like the resharding did not finish so I canceled it, and now the bucket is in this state. How can I fix it? It does not show up in the stale-instances list. It's also a multisite environment (we only sync metadata).

I believe resharding is not supported with rgw multisite (https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#multisite) but is being worked on / implemented for the Quincy release, see https://tracker.ceph.com/projects/rgw/issues?query_id=247

But you are not syncing the data in your deployment? Maybe that's a different case then?

Regards

Christian

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Understanding the total space in CephFS
Hi Stefan, the cluster is built of several old machines, with different numbers of disks (from 8 to 16) and disk sizes (from 500 GB to 4 TB). After the PG increase it is still recovering: pgp_num is at 213 and still has to grow to 256. The balancer status gives:

{
    "active": true,
    "last_optimize_duration": "0:00:00.000347",
    "last_optimize_started": "Thu Oct 13 08:59:22 2022",
    "mode": "upmap",
    "optimize_result": "Too many objects (0.051218 > 0.05) are misplaced; try again later",
    "plans": []
}

and I guess that this means that optimization is ongoing, right? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
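"Too many objects are misplaced; try again later" is not an error: the balancer simply waits until the misplaced fraction drops below its threshold (5% by default). A hedged sketch of how to watch it and, assuming a release where this is a mgr-level option, raise the threshold slightly if you want it to keep optimizing while pgp_num ramps up:

ceph balancer status
ceph balancer eval                                   # score of the current distribution
ceph config set mgr target_max_misplaced_ratio 0.07  # assumed option name; default 0.05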
[ceph-users] CephFS constant high write I/O to the metadata pool
Hi, I'm seeing constant 25-50MB/s writes to the metadata pool even when all clients and the cluster is idling and in clean state. This surely can't be normal? There's no apparent issues with the performance of the cluster but this write rate seems excessive and I don't know where to look for the culprit. The setup is Ceph 16.2.9 running in hyperconverged 3 node core cluster and 6 hdd osd nodes. Here's typical status when pretty much all clients are idling. Most of that write bandwidth and maybe fifth of the write iops is hitting the metadata pool. --- root@pve-core-1:~# ceph -s cluster: id: 2088b4b1-8de1-44d4-956e-aa3d3afff77f health: HEALTH_OK services: mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w) mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3 mds: 1/1 daemons up, 2 standby osd: 48 osds: 48 up (since 5h), 48 in (since 4M) data: volumes: 1/1 healthy pools: 10 pools, 625 pgs objects: 70.06M objects, 46 TiB usage: 95 TiB used, 182 TiB / 278 TiB avail pgs: 625 active+clean io: client: 45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr --- Here's some daemonperf dump: --- root@pve-core-1:~# ceph daemonperf mds.`hostname -s` mds- --mds_cache--- --mds_log-- -mds_mem- ---mds_server--- mds_ -objecter-- purg req rlat fwd inos caps exi imi hifc crev cgra ctru cfsa cfa hcc hccd hccr prcr|stry recy recd|subm evts segs repl|ino dn |hcr hcs hsr cre cat |sess|actv rd wr rdwr|purg| 4000 767k 78k 0001610055 37 |1.1k 00 | 17 3.7k 1340 |767k 767k| 40500 0 |110 | 42 210 | 2 5720 767k 78k 0003 16300 11 11 0 17 |1.1k 00 | 45 3.7k 1370 |767k 767k| 57800 0 |110 | 02 280 | 4 5740 767k 78k 0004 34400 34 33 2 26 |1.0k 00 |134 3.9k 1390 |767k 767k| 57 1300 0 |110 | 02 1120 | 19 6730 767k 78k 0006 32600 22 22 0 32 |1.1k 00 | 78 3.9k 1410 |767k 768k| 67400 0 |110 | 02 560 | 2 --- Any ideas where to look at? Tnx! o. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
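A few hedged starting points for locating where that write stream comes from; nothing below changes anything, and the mds name is assumed to be the local daemon as in the daemonperf call above.

# Per-pool client IO: confirms whether the writes really land in the metadata pool
ceph osd pool stats | grep -A 2 metadata

# MDS journal counters: a busy mds_log section with idle clients often points at
# some client still doing metadata-only work or holding lots of caps
ceph daemon mds.$(hostname -s) perf dump mds_log
ceph daemon mds.$(hostname -s) session ls     # look for sessions with huge num_caps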
[ceph-users] Re: Understanding the total space in CephFS
On 10/13/22 09:32, Nicola Mori wrote: Dear Ceph users, I'd need some help in understanding the total space in a CephFS. My cluster is currently built of 8 machines, the one with the smallest capacity has 8 TB of total disk space, and the total available raw space is 153 TB. I set up a 3x replicated metadata pool and a 6+2 erasure coded data pool with host failure domain for my CephFS. In this configuration every host holds a data chunk, so I would expect a total of about 48 TB of total storage space. I computed this value by noting that (roughly speaking and neglecting the metadata) 48 TB of data will need 48 TB of data chunks and 16 TB of coding chunks, for a total of 64 TB that evenly divided into my 8 machines gives an occupancy of 8 TB per host, which exactly saturates the smallest one. Assuming that the above is correct then I would expect that a df -h on a machine mounting the CephFS would report 48 TB of total space. Instead it started with something around 75 TB at the beginning, and it's slowly decreasing while I'm transferring data to the CephFS, being now at 62 TB. I cannot understand this behavior, nor if my assumptions about the total space are correct, so I'd need some help with this. The amount of space available depends on how well the cluster is balanced. And the fullest OSD is used to calculate amount of space available. IIRC you have recently increased PGs. Do you use the Ceph balancer to achieve optimal data placement (ceph balancer status)? ceph osd df will show in what shape your cluster is with respect to balancing. Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to remove remaining bucket index shard objects
Hi, Unfortunately, the "large omap objects" message recurred last weekend. So I ran the script you showed to check the situation. `used_.*` is small, but `omap_.*` is large, which is strange. Do you have any idea what it is? id used_mbytes used_objects omap_used_mbytes omap_used_keys -- --- -- 6.0 0 0 0 0 6.1 0 0 0 0 6.2 0 0 86.14682674407959 298586 6.3 0 0 93.08089542388916 323902 6.4 0 1 0 0 6.5 0 1 2.124929428100586 7744 6.6 0 0 0 0 6.7 0 0 2.2477407455444336 8192 6.8 0 0 0 0 6.9 0 0 439.5090618133545 1524746 6.a 0 0 0 0 6.b 0 0 3.4069366455078125 12416 6.c 0 0 0 0 6.d 0 0 0 0 6.e 0 0 0 0 6.f 0 1 0 0 6.10 0 1 2.177792549133301 7936 6.11 0 0 3.9340572357177734 14336 6.12 0 0 7.727175712585449 28160 6.13 0 0 114.01904964447021 394996 6.14 0 0 0 0 6.15 0 0 88.56490707397461 307353 6.16 0 0 0 0 6.17 0 0 7.6217451095581055 27776 6.18 0 0 3.933901786804199 14336 6.19 0 1 0 0 6.1a 0 1 0 0 6.1b 0 0 0 0 6.1c 0 0 88.36568355560303 306677 6.1d 0 0 0 0 6.1e 0 1 0 0 6.1f 0 0 92.21501541137695 320707 6.20 0 1 2.1074790954589844 7680 6.21 0 0 0 0 6.22 0 0 0 0 6.23 0 0 8.605427742004395 31360 6.24 0 0 7.938144683837891 28928 6.25 0 0 0 0 6.26 0 0 0 0 6.27 0 1 2.10748291015625 7680 6.28 0 0 0 0 6.29 0 0 2.1601409912109375 7872 6.2a 0 1 0 0 6.2b 0 0 0 0 6.2c 0 0 5.479369163513184 19968 6.2d 0 0 0 0 6.2e 0 0 0 0 6.2f 0 0 0 0 6.30 0 0 117.55222415924072 407521 6.31 0 1 0 0 6.32 0 1 0 0 6.33 0 0 5.812973976135254 21184 6.34 0 0 0 0 6.35 0 0 0 0 6.36 0 0 5.865510940551758 21376 6.37 0 0 86.26362419128418 298993 6.38 0 0 93.97305393218994 327089 6.39 0 0 15.493829727172852 71787 6.3a 0 0 0 0 6.3b 0 0 4.056745529174805 14784 6.3c 0 0 4.039289474487305 14720 6.3d 0 0 0 0 6.3e 0 0 0 0 6.3f 0 0 0 0 6.40 0 0 2.1073970794677734 7680 6.41 0 1 4.004250526428223 14592 6.42 0 0 3.9866724014282227 14528 6.43 0 0 345.3690414428711 1197068 6.44 0 0 0 0 6.45 0 1 0 0 6.46 0 0 3.968973159790039 14464 6.47 0 0 0 0 6.48 0 0 0 0 6.49 0 0 263.9479990005493 914805 6.4a 0 0 94.751708984375 336275 6.4b 0 0 0 0 6.4c 0 0 0 0 6.4d 0 0 270.53627490997314 937581 6.4e 0 1 0 0 6.4f 0 0 0 0 6.50 0 0 1.8790569305419922 6848 6.51 0
[ceph-users] Understanding the total space in CephFS
Dear Ceph users, I'd need some help in understanding the total space in a CephFS. My cluster is currently built of 8 machines, the one with the smallest capacity has 8 TB of total disk space, and the total available raw space is 153 TB. I set up a 3x replicated metadata pool and a 6+2 erasure coded data pool with host failure domain for my CephFS. In this configuration every host holds a data chunk, so I would expect a total of about 48 TB of total storage space. I computed this value by noting that (roughly speaking and neglecting the metadata) 48 TB of data will need 48 TB of data chunks and 16 TB of coding chunks, for a total of 64 TB that evenly divided into my 8 machines gives an occupancy of 8 TB per host, which exactly saturates the smallest one. Assuming that the above is correct then I would expect that a df -h on a machine mounting the CephFS would report 48 TB of total space. Instead it started with something around 75 TB at the beginning, and it's slowly decreasing while I'm transferring data to the CephFS, being now at 62 TB. I cannot understand this behavior, nor if my assumptions about the total space are correct, so I'd need some help with this. Thanks, Nicola ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
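Roughly speaking, the size df reports for a CephFS is "stored data + MAX AVAIL of the data pool", and MAX AVAIL is derived from the fullest OSD (scaled by the 8/6 EC overhead and the full ratio), not from the theoretical 48 TB; that is why it shrinks as the most-loaded host fills up. A quick way to look at the inputs, nothing here is specific to this cluster:

ceph df detail                     # MAX AVAIL per pool, already divided by the EC overhead
ceph osd df tree                   # look at %USE: the fullest OSD bounds MAX AVAIL
ceph osd dump | grep ratio         # full/backfillfull/nearfull ratios used in the calculation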
[ceph-users] Re: Iinfinite backfill loop + number of pgp groups stuck at wrong value
Thank you Frank for the insight. I'd need to study a bit more the details of all of this, but for sure now I understand it a bit better. Nicola ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] rbd: Snapshot Only Permissions
Hi All, Is there any way to configure capabilities for a user to allow the client to *only* create/delete snapshots? I can't find anything which suggests this is possible on https://docs.ceph.com/en/latest/rados/operations/user-management/. Context: I'm writing a script to automatically create and delete snapshots. Ideally i'd like to restrict the permissions for this user so it can't do anything else with rbd images and give it the least privileges possible. thanks, Dan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
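As far as I know cephx caps cannot express "snapshot operations only" (there is no such profile), but the script's user can at least be confined to a single pool, or a namespace within it, so a bug cannot touch anything else. A hedged sketch with made-up names; note the user can still create, delete and roll back images inside that namespace, not just snapshots.

ceph auth get-or-create client.snapscript \
    mon 'profile rbd' \
    osd 'profile rbd pool=vms namespace=snaptest'

# then the script runs with that identity, e.g.
rbd --id snapscript snap create vms/snaptest/myimage@nightly-2022-10-14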