[ceph-users] Re: Impact of large PG splits
Setting osd_max_backfills to much more than 1 on HDD spinners seems anathema to me, and I recall reading others saying the same thing. Seek time is a major constraint on spinners, so keeping activity as contiguous as possible helps performance. Maybe pushing it to 2-3 is okay, but we haven't seen much throughput benefit. YMMV.

The major aggregate speed improver for us was increasing target_max_misplaced_ratio, because of the increased parallelism it induces. Changing osd_mclock_profile is also useful, at the following rough ratios, bearing in mind that it can impact client traffic:

* high_client_ops 100%
* balanced 150%
* high_recovery_ops 200%

I've just read the help again (thank you, whoever implemented ceph config help ...) and have been reminded that due to primary and non-primary reservations, setting osd_max_backfills to e.g. 1 means you could still see 2 shards doing recovery IO on the same OSD.

On 10/4/24 18:54, Eugen Block wrote:

Thank you for your input! We started the split with max_backfills = 1 and watched for a few minutes, then gradually increased it to 8. It's now backfilling at around 180 MB/s - not really much, but since client impact has to be avoided if possible, we decided to let that run for a couple of hours and then reevaluate the situation and maybe increase the backfills a bit more. Thanks!

Zitat von Gregory Orange :

We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs with NVMe RocksDB, used exclusively for RGW, holding about 60 billion objects. We are splitting for the same reason as you - improved balance. We also thought long and hard before we began, concerned about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could retain some control and stop the process fairly quickly if we weren't happy with the behaviour. It also serves to limit the performance impact on the cluster, but unfortunately it makes the whole process slower too. We now have the setting up to 1.5% and are seeing recovery of up to 10 GB/s, with no issues on the cluster. We could go higher, but we are not in a rush at this point.

Sometimes nearfull OSD warnings pile up and MAX AVAIL on the data pool in `ceph df` gets low enough that we want to interrupt the splitting. So we set pg_num to whatever its current value is (from ceph osd pool ls detail) and let things stabilise. The balancer then gets to work once the misplaced objects drop below the ratio, and things balance out: the number of nearfull OSDs usually drops to zero and MAX AVAIL goes up again. This behaviour works because, while they share the same threshold setting, the autoscaler only runs every minute and won't run while misplaced objects are over the threshold, whereas checks for the next PG to split happen much more frequently - so the balancer never wins that race.

We didn't know how long to expect it all to take, but decided that any improvement in PG size was worth starting. We now estimate it will take another 2-3 weeks to complete, for a total of 4-5 weeks.

We have lost a drive or two during the process, and of course degraded objects went up and more backfilling work got going. We paused splits for at least one of those, to make sure the degraded objects were sorted out as quickly as possible. We can't be sure it went any faster though - there's always a long tail on that sort of thing.

Inconsistent objects are found at least a couple of times a week, and to get them repairing we disable scrubs, wait until they're stopped, then set the repair going and re-enable scrubs. I don't know if this is specific to the current higher splitting load, but we hadn't noticed it before.

HTH,
Greg.

On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne. I believe the default 5% target_max_misplaced_ratio would work as well; we've had good experience with that in the past, without the autoscaler. I just haven't dealt with such large PGs before. I've been warning them for two years (when the PGs were only almost half this size) and now they've finally started to listen. Well, they would still ignore it if it weren't impacting all kinds of things now. ;-)

Thanks,
Eugen

Zitat von Janne Johansson :

Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

> I'm trying to estimate the possible impact when large PGs are
> split. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

If you ask for small increases of pg_num, it will only split that many PGs at a time, so while there will be a lot of data movement (50% because half of the data needs to go to a newly made PG, and on top of that the PGs per OSD will change, but the balancing can then also work better), it will not affect the whole cluster if you increase pg_num by, say, 8 at a time. As per the other reply, if you bump the number by a small amount, wait for HEALTH_OK, then bump some more, it will take a lot of calendar time but have rather small impact.
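For anyone wanting to try the same knobs, they can all be adjusted at runtime with ceph config. A rough sketch, assuming a cluster recent enough that mclock is the active scheduler; the values shown are illustrative, not recommendations:

    # show documentation, default and constraints for a setting
    ceph config help osd_max_backfills
    ceph config help osd_mclock_profile

    # per-OSD backfill concurrency; keep low (1-3) on HDD spinners
    ceph config set osd osd_max_backfills 1

    # bias mclock toward recovery -- expect some client impact; rough
    # relative recovery speed per the ratios above:
    #   high_client_ops 100%, balanced 150%, high_recovery_ops 200%
    ceph config set osd osd_mclock_profile high_recovery_ops

    # allow more PGs to be misplaced concurrently (an mgr option,
    # default 0.05 i.e. 5%); this drives the aggregate parallelism
    ceph config set mgr target_max_misplaced_ratio 0.015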
[ceph-users] Re: Issue about execute "ceph fs new"
Thanks for your information. I tried several solutions but none of them worked, so I reinstalled, and the issue hasn't appeared again. Something must have gone wrong during the original installation.
[ceph-users] Re: Regarding write on CephFS - Operation not permitted
The issue was fixed with the following command - the pool specified in the caps was incorrect before:

ceph auth caps client.king mon 'allow r' mds 'allow rw' osd 'allow rwx pool=cephfs-king-data'
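For anyone hitting the same thing, a quick sketch of how to verify this (using the client name from the post above): dump the client's current caps and compare the pool in the osd cap against the data pool the filesystem actually uses.

    # inspect the caps currently granted to the client
    ceph auth get client.king

    # list the filesystems and their data pools to compare against
    ceph fs ls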
[ceph-users] crushmap history
Hi all,

Do the mons store any crushmap history, and if so, how does one get at it, please?

I ask because we've recently encountered an issue in a medium-scale (~5PB raw) EC-based, RGW-focused cluster where "something" happened - we still don't know what - that suddenly caused us to see 94% of objects (5.4 billion of them) misplaced. We've tracked down the first log message of that pgmap state change:

Mar 29 10:30:31 mon1 bash[5804]: debug 2024-03-29T10:30:31.152+ 7f3b6e378700 0 log_channel(cluster) log [DBG] : pgmap v44327: 2273 pgs: 225 active+clean, 2038 active+remapped+backfill_wait, 10 active+remapped+backfilling; 1.6 PiB data, 2.1 PiB used, 2.2 PiB / 4.3 PiB avail; 5426274136/5752755429 objects misplaced (94.325%); 248 MiB/s, 109 objects/s recovering

This appears to have been preceded (aside from a single HTTP HEAD request coming into RGW) by a 5-minute gap in logs, where either journald couldn't keep up with debug messages or the mons were stuck. The last log before that seems to be a compaction event kicking off:

Mar 29 10:24:14 mon1 bash[25927]: ** Compaction Stats [L] **
Mar 29 10:24:14 mon1 bash[25927]: Priority  Files  Size  Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
Mar 29 10:24:14 mon1 bash[25927]: [Low/High/User priority rows follow; the column values are garbled in the archive.]

We're left wondering what on earth happened to cause such a huge redistribution of data in the cluster when we've made no corresponding changes, so we want to see if there are any breadcrumbs we can find. Appreciate any pointers!

--
Cheers,
~Blairo
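One avenue worth sketching here: the mons retain a trimmed window of old osdmap epochs, and each osdmap embeds the crushmap that was in force at that epoch. So if the epochs around the event haven't been trimmed yet, you can extract and diff the crushmaps. The epoch number below is a placeholder:

    # which osdmap epochs do the mons still hold? (if I recall correctly,
    # ceph report includes osdmap_first_committed / osdmap_last_committed)
    ceph report 2>/dev/null | grep -E 'osdmap_(first|last)_committed'

    # fetch the osdmap at a past epoch, extract and decompile its crushmap
    ceph osd getmap 123456 -o osdmap.123456
    osdmaptool osdmap.123456 --export-crush crush.123456
    crushtool -d crush.123456 -o crush.123456.txt

    # compare against the current crushmap
    ceph osd getcrushmap -o crush.current
    crushtool -d crush.current -o crush.current.txt
    diff crush.123456.txt crush.current.txt

If the crushmaps turn out identical, the trigger may be elsewhere in the osdmap (pg_num changes, upmap entries appearing or being removed, etc.), which `osdmaptool <file> --print` on the fetched maps can show.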
[ceph-users] Re: Impact of large PG splits
Thank you for your input! We started the split with max_backfills = 1 and watched for a few minutes, then gradually increased it to 8. It's now backfilling at around 180 MB/s - not really much, but since client impact has to be avoided if possible, we decided to let that run for a couple of hours and then reevaluate the situation and maybe increase the backfills a bit more. Thanks!

Zitat von Gregory Orange :

We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs with NVMe RocksDB, used exclusively for RGW, holding about 60 billion objects. We are splitting for the same reason as you - improved balance. We also thought long and hard before we began, concerned about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could retain some control and stop the process fairly quickly if we weren't happy with the behaviour. It also serves to limit the performance impact on the cluster, but unfortunately it makes the whole process slower too. We now have the setting up to 1.5% and are seeing recovery of up to 10 GB/s, with no issues on the cluster. We could go higher, but we are not in a rush at this point.

Sometimes nearfull OSD warnings pile up and MAX AVAIL on the data pool in `ceph df` gets low enough that we want to interrupt the splitting. So we set pg_num to whatever its current value is (from ceph osd pool ls detail) and let things stabilise. The balancer then gets to work once the misplaced objects drop below the ratio, and things balance out: the number of nearfull OSDs usually drops to zero and MAX AVAIL goes up again. This behaviour works because, while they share the same threshold setting, the autoscaler only runs every minute and won't run while misplaced objects are over the threshold, whereas checks for the next PG to split happen much more frequently - so the balancer never wins that race.

We didn't know how long to expect it all to take, but decided that any improvement in PG size was worth starting. We now estimate it will take another 2-3 weeks to complete, for a total of 4-5 weeks.

We have lost a drive or two during the process, and of course degraded objects went up and more backfilling work got going. We paused splits for at least one of those, to make sure the degraded objects were sorted out as quickly as possible. We can't be sure it went any faster though - there's always a long tail on that sort of thing.

Inconsistent objects are found at least a couple of times a week, and to get them repairing we disable scrubs, wait until they're stopped, then set the repair going and re-enable scrubs. I don't know if this is specific to the current higher splitting load, but we hadn't noticed it before.

HTH,
Greg.

On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne. I believe the default 5% target_max_misplaced_ratio would work as well; we've had good experience with that in the past, without the autoscaler. I just haven't dealt with such large PGs before. I've been warning them for two years (when the PGs were only almost half this size) and now they've finally started to listen. Well, they would still ignore it if it weren't impacting all kinds of things now. ;-)

Thanks,
Eugen

Zitat von Janne Johansson :

Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

> I'm trying to estimate the possible impact when large PGs are
> split. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

If you ask for small increases of pg_num, it will only split that many PGs at a time, so while there will be a lot of data movement (50% because half of the data needs to go to a newly made PG, and on top of that the PGs per OSD will change, but the balancing can then also work better), it will not affect the whole cluster if you increase pg_num by, say, 8 at a time.

As per the other reply, if you bump the number by a small amount, wait for HEALTH_OK, then bump some more, it will take a lot of calendar time but have rather small impact. My view of it is basically that this will be far less impactful than losing a whole OSD - and hopefully your cluster can survive that event - so it should be able to handle a slow trickle of PG splits too.

You can set a target number for the pool and let the autoscaler run a few splits at a time; there are some settings that control how aggressive the autoscaler will be, so it doesn't have to be manual/scripted. But it's not very hard to script it yourself if you are unsure about the amount of work the autoscaler will start at any given time.

--
May the most significant bit of your life be positive.

--
Gregory Orange
System Administrator, Scientific Platforms Team
Pawsey Supercomputing Centre, CSIRO
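A sketch of the "start at 1 and step up while watching" approach Eugen describes above. Values are illustrative; note that on mclock-scheduled OSDs, recent releases ignore manual changes to osd_max_backfills unless the override flag is enabled:

    # on recent releases, let osd_max_backfills take effect under mclock
    ceph config set osd osd_mclock_override_recovery_settings true

    # raise backfill concurrency one notch at a time
    ceph config set osd osd_max_backfills 2

    # between steps, watch the aggregate recovery rate and misplaced ratio
    ceph -s
    ceph pg stat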
[ceph-users] Re: Impact of large PG splits
> On 10 Apr 2024, at 01:00, Eugen Block wrote:
>
> I appreciate your message, it really sounds tough (9 months, really?!). But thanks for the reassurance :-)

Yes, the whole "make this project great again" effort took 16 months, I think. This was my work.

The first problem, once PGs passed 1M objects, was deletion [1]: it was simply impossible to delete objects from the "stray" PGs.

The second was that the code which cares about nearfull & backfillfull just didn't work for these OSDs [2], because it used the DATA field (the objects) instead of the RAW field (DATA plus the RocksDB database) in its computations.

The third was minor, but still a WTF: a statistics metric issue [3].

And last but not least (and still present in master): when a lock is acquired on an object that is absent on the primary OSD, the replica OSDs in the acting set crash [4]. This can ruin client IO until the OSDs restart and recover.

As of now, not all collection_list fixes have been merged [5], but since 14.2.22 it is much better than before...

> They don't have any other options so we'll have to start that process anyway, probably tomorrow. We'll see how it goes...

Yes, you just have to start, and then we'll see.

Thanks,
k

[1] https://tracker.ceph.com/issues/47044 + https://tracker.ceph.com/issues/45765 -> https://tracker.ceph.com/issues/50466
[2] https://tracker.ceph.com/issues/50533
[3] https://tracker.ceph.com/issues/52512
[4] https://tracker.ceph.com/issues/52513
[5] https://tracker.ceph.com/issues/58274
[ceph-users] Re: Impact of large PG splits
We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs with NVMe RocksDB, used exclusively for RGW, holding about 60 billion objects. We are splitting for the same reason as you - improved balance. We also thought long and hard before we began, concerned about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could retain some control and stop the process fairly quickly if we weren't happy with the behaviour. It also serves to limit the performance impact on the cluster, but unfortunately it makes the whole process slower too. We now have the setting up to 1.5% and are seeing recovery of up to 10 GB/s, with no issues on the cluster. We could go higher, but we are not in a rush at this point.

Sometimes nearfull OSD warnings pile up and MAX AVAIL on the data pool in `ceph df` gets low enough that we want to interrupt the splitting. So we set pg_num to whatever its current value is (from ceph osd pool ls detail) and let things stabilise. The balancer then gets to work once the misplaced objects drop below the ratio, and things balance out: the number of nearfull OSDs usually drops to zero and MAX AVAIL goes up again. This behaviour works because, while they share the same threshold setting, the autoscaler only runs every minute and won't run while misplaced objects are over the threshold, whereas checks for the next PG to split happen much more frequently - so the balancer never wins that race.

We didn't know how long to expect it all to take, but decided that any improvement in PG size was worth starting. We now estimate it will take another 2-3 weeks to complete, for a total of 4-5 weeks.

We have lost a drive or two during the process, and of course degraded objects went up and more backfilling work got going. We paused splits for at least one of those, to make sure the degraded objects were sorted out as quickly as possible. We can't be sure it went any faster though - there's always a long tail on that sort of thing.

Inconsistent objects are found at least a couple of times a week, and to get them repairing we disable scrubs, wait until they're stopped, then set the repair going and re-enable scrubs. I don't know if this is specific to the current higher splitting load, but we hadn't noticed it before.

HTH,
Greg.

On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne. I believe the default 5% target_max_misplaced_ratio would work as well; we've had good experience with that in the past, without the autoscaler. I just haven't dealt with such large PGs before. I've been warning them for two years (when the PGs were only almost half this size) and now they've finally started to listen. Well, they would still ignore it if it weren't impacting all kinds of things now. ;-)

Thanks,
Eugen

Zitat von Janne Johansson :

Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

> I'm trying to estimate the possible impact when large PGs are
> split. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

If you ask for small increases of pg_num, it will only split that many PGs at a time, so while there will be a lot of data movement (50% because half of the data needs to go to a newly made PG, and on top of that the PGs per OSD will change, but the balancing can then also work better), it will not affect the whole cluster if you increase pg_num by, say, 8 at a time.

As per the other reply, if you bump the number by a small amount, wait for HEALTH_OK, then bump some more, it will take a lot of calendar time but have rather small impact. My view of it is basically that this will be far less impactful than losing a whole OSD - and hopefully your cluster can survive that event - so it should be able to handle a slow trickle of PG splits too.

You can set a target number for the pool and let the autoscaler run a few splits at a time; there are some settings that control how aggressive the autoscaler will be, so it doesn't have to be manual/scripted. But it's not very hard to script it yourself if you are unsure about the amount of work the autoscaler will start at any given time.

--
May the most significant bit of your life be positive.

--
Gregory Orange
System Administrator, Scientific Platforms Team
Pawsey Supercomputing Centre, CSIRO
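For completeness, the two procedures Greg describes translate to something like the following sketch; the pool name is a placeholder and the PG id is the example from earlier in the thread, so this is not an exact transcript of their runbook:

    # --- pause splitting: pin pg_num at its current value ---
    ceph osd pool ls detail | grep data-pool     # note the current pg_num
    ceph osd pool set data-pool pg_num 11264     # re-set it to that same value

    # --- repair an inconsistent PG without scrubs competing ---
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # wait for in-flight scrubs to finish (ceph -s shows scrubbing PGs), then:
    ceph pg repair 86.3ff
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub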
[ceph-users] Re: Impact of large PG splits
Thank you, Janne. I believe the default 5% target_max_misplaced_ratio would work as well; we've had good experience with that in the past, without the autoscaler. I just haven't dealt with such large PGs before. I've been warning them for two years (when the PGs were only almost half this size) and now they've finally started to listen. Well, they would still ignore it if it weren't impacting all kinds of things now. ;-)

Thanks,
Eugen

Zitat von Janne Johansson :

Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

> I'm trying to estimate the possible impact when large PGs are
> split. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

If you ask for small increases of pg_num, it will only split that many PGs at a time, so while there will be a lot of data movement (50% because half of the data needs to go to a newly made PG, and on top of that the PGs per OSD will change, but the balancing can then also work better), it will not affect the whole cluster if you increase pg_num by, say, 8 at a time.

As per the other reply, if you bump the number by a small amount, wait for HEALTH_OK, then bump some more, it will take a lot of calendar time but have rather small impact. My view of it is basically that this will be far less impactful than losing a whole OSD - and hopefully your cluster can survive that event - so it should be able to handle a slow trickle of PG splits too.

You can set a target number for the pool and let the autoscaler run a few splits at a time; there are some settings that control how aggressive the autoscaler will be, so it doesn't have to be manual/scripted. But it's not very hard to script it yourself if you are unsure about the amount of work the autoscaler will start at any given time.

--
May the most significant bit of your life be positive.
[ceph-users] Re: Impact of large PG splits
Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

> I'm trying to estimate the possible impact when large PGs are
> split. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

If you ask for small increases of pg_num, it will only split that many PGs at a time, so while there will be a lot of data movement (50% because half of the data needs to go to a newly made PG, and on top of that the PGs per OSD will change, but the balancing can then also work better), it will not affect the whole cluster if you increase pg_num by, say, 8 at a time.

As per the other reply, if you bump the number by a small amount, wait for HEALTH_OK, then bump some more, it will take a lot of calendar time but have rather small impact. My view of it is basically that this will be far less impactful than losing a whole OSD - and hopefully your cluster can survive that event - so it should be able to handle a slow trickle of PG splits too.

You can set a target number for the pool and let the autoscaler run a few splits at a time; there are some settings that control how aggressive the autoscaler will be, so it doesn't have to be manual/scripted. But it's not very hard to script it yourself if you are unsure about the amount of work the autoscaler will start at any given time.

--
May the most significant bit of your life be positive.
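If you'd rather script the "small bump, wait for HEALTH_OK, bump again" cycle than hand it to the autoscaler, a minimal sketch; pool name, target, and step size are assumptions to adjust:

    #!/bin/bash
    # step pg_num toward a target in small increments, waiting for the
    # cluster to return to HEALTH_OK after each step
    POOL=data-pool
    TARGET=4096
    STEP=8

    # "ceph osd pool get <pool> pg_num" prints e.g. "pg_num: 2048"
    current=$(ceph osd pool get "$POOL" pg_num | awk '{print $2}')
    while [ "$current" -lt "$TARGET" ]; do
        next=$(( current + STEP < TARGET ? current + STEP : TARGET ))
        ceph osd pool set "$POOL" pg_num "$next"
        # wait for backfill from this step to finish before the next bump
        until ceph health | grep -q HEALTH_OK; do
            sleep 60
        done
        current=$next
    done

On Nautilus and later you can also just set pg_num to the final target: the mgr raises pgp_num gradually, throttled by target_max_misplaced_ratio, so the explicit loop mainly buys you checkpoints at which the cluster is fully healthy.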