[ceph-users] Re: Rocksdb compaction and OSD timeout
Hey Konstantin, forgot to mention - indeed, clusters with a 4K bluestore min alloc size are more likely to be exposed to the issue. The key point is the difference between the bluestore and bluefs allocation sizes. The issue is likely to pop up when user and DB data are collocated but different allocation units are in use. As a result, the allocator needs to locate properly aligned chunks for BlueFS among a bunch of unsuitable, misaligned ones, which might be inefficient in the current implementation and cause the slowdown.

Thanks, Igor

On 12/09/2023 15:47, Konstantin Shalygin wrote: Hi Igor, On 12 Sep 2023, at 15:28, Igor Fedotov wrote: The default hybrid allocator (as well as the AVL one it's based on) could take a dramatically long time to allocate pretty large (hundreds of MBs) 64K-aligned chunks for BlueFS. On the original cluster this was exposed as 20-30 sec OSD stalls. For the chunks, does this mean bluestore min alloc size? This cluster was deployed pre-Pacific (64k) and not redeployed with the Pacific default (4k)? Thanks, k
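For anyone checking whether their own OSDs sit in this configuration, a rough sketch of how to compare the two allocation sizes (option names as in Pacific and later; note that the min_alloc_size an existing OSD was actually created with is what matters, not just the current config value):

    # Defaults applied when an OSD is (re)created: 4K since Pacific, 64K before
    ceph config get osd bluestore_min_alloc_size_hdd
    ceph config get osd bluestore_min_alloc_size_ssd
    # BlueFS allocation unit when DB data shares the main device (64K by default)
    ceph config get osd bluefs_shared_alloc_size

The value an existing OSD was built with is printed in its startup log, in a line like "_open_super_meta min_alloc_size 0x1000".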
[ceph-users] Re: Rocksdb compaction and OSD timeout
Hi Igor,

> On 12 Sep 2023, at 15:28, Igor Fedotov wrote:
>
> The default hybrid allocator (as well as the AVL one it's based on) could take a dramatically long time to allocate pretty large (hundreds of MBs) 64K-aligned chunks for BlueFS. On the original cluster this was exposed as 20-30 sec OSD stalls.

For the chunks, does this mean bluestore min alloc size? This cluster was deployed pre-Pacific (64k) and not redeployed with the Pacific default (4k)?

Thanks, k
[ceph-users] Re: Rocksdb compaction and OSD timeout
Hi all, as promised, here is a postmortem analysis of what happened. The following ticket (https://tracker.ceph.com/issues/62815) with accompanying materials provides a low-level overview of the issue. In a few words it is as follows:

The default hybrid allocator (as well as the AVL one it's based on) could take a dramatically long time to allocate pretty large (hundreds of MBs) 64K-aligned chunks for BlueFS. On the original cluster this was exposed as 20-30 sec OSD stalls. This is apparently not specific to the recent 16.2.14 Pacific release, as I have seen it at least once before, but https://github.com/ceph/ceph/pull/51773 made it more likely to pop up: from now on, RocksDB can preallocate huge WALs in a single shot. The issue is definitely bound to aged/fragmented main OSD volumes that collocate DB ones; I don't expect it to pop up for standalone DB/WAL volumes.

As already mentioned in this thread, the proposed workaround is to switch bluestore_allocator to bitmap. This might cause a minor overall performance drop, so I'm not sure one should apply it unconditionally. I'd like to apologize for any inconvenience this may have caused. We're currently working on a proper fix...

Thanks, Igor

On 07/09/2023 10:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.
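For reference, applying the workaround could look roughly like this (a sketch; the allocator setting is only read at startup, so each OSD needs a restart, and the unit name assumes a non-cephadm deployment):

    # Switch the allocator for all OSDs; takes effect on restart
    ceph config set osd bluestore_allocator bitmap
    # Restart OSDs one at a time (or one failure domain at a time)
    systemctl restart ceph-osd@<id>
    # Optionally, check fragmentation via the admin socket, if your release exposes it
    ceph daemon osd.<id> bluestore allocator score block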
[ceph-users] Re: Rocksdb compaction and OSD timeout
The bluestore configuration was 100% default when we did the upgrade and the issue happened. We provided Igor with an OSD dump and a DB dump last Friday, so hopefully you can figure something out from it.

On 9/8/23 02:48, Konstantin Shalygin wrote: Does this cluster use the default settings, or was something changed for Bluestore? You can check this via `ceph config diff`. As Mark said, it would be nice to have a tracker ticket, if this really is a release problem. Thanks, k

On 7 Sep 2023, at 20:22, J-P Methot wrote: We went from 16.2.13 to 16.2.14. Also, the timeout is 15 seconds because that's the default in Ceph - basically, 15 seconds before Ceph shows a warning that an OSD is timing out. We may have found the solution, but it would be, in fact, related to bluestore_allocator and not the compaction process. I'll post the actual resolution when we confirm 100% that it works.

--
Jean-Philippe Méthot
Senior OpenStack system administrator / Administrateur système OpenStack sénior
PlanetHoster inc.
[ceph-users] Re: Rocksdb compaction and OSD timeout
Does this cluster use the default settings, or was something changed for Bluestore? You can check this via `ceph config diff`. As Mark said, it would be nice to have a tracker ticket, if this really is a release problem.

Thanks, k

> On 7 Sep 2023, at 20:22, J-P Methot wrote:
>
> We went from 16.2.13 to 16.2.14
>
> Also, the timeout is 15 seconds because that's the default in Ceph. Basically, 15 seconds before Ceph shows a warning that an OSD is timing out.
>
> We may have found the solution, but it would be, in fact, related to bluestore_allocator and not the compaction process. I'll post the actual resolution when we confirm 100% that it works.
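A quick sketch of the check Konstantin suggests - `config diff` is exposed through the daemon admin socket, and `ceph config dump` shows anything set centrally:

    # On the OSD host: show options that differ from the built-in defaults
    ceph daemon osd.0 config diff
    # Cluster-wide: settings stored in the mon config database
    ceph config dump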
[ceph-users] Re: Rocksdb compaction and OSD timeout
On 07-09-2023 19:20, J-P Methot wrote: We went from 16.2.13 to 16.2.14. Also, the timeout is 15 seconds because that's the default in Ceph - basically, 15 seconds before Ceph shows a warning that an OSD is timing out. We may have found the solution, but it would be, in fact, related to bluestore_allocator and not the compaction process. I'll post the actual resolution when we confirm 100% that it works.

I'm very curious about what comes out of this, so yeah, please share your findings. We have re-provisioned our cluster over the past couple of months and have kept the "bitmap" allocator for bluestore (non-default). We have one OSD running hybrid for comparison. We want to see if there are any (big) differences in fragmentation over time between the two.

Gr. Stefan
[ceph-users] Re: Rocksdb compaction and OSD timeout
I also see the dreaded timeout. I found this to be a bcache problem; you can use the blktrace tools to capture I/O data for analysis.

Sent from my Xiaomi. On 7 Sep 2023, 22:52, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote:
> Hi,
>
> We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.

Does the OSD also restart after it logs the timeouts? You might want to perform an offline compaction every $timeperiod to fix any potential RocksDB degradation. That's what we do. What kind of workload do you run (i.e. RBD, CephFS, RGW)? Do you also see these timeouts occur during deep scrubs?

Gr. Stefan
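For what it's worth, a minimal blktrace capture of the device under an OSD might look like the following (the device name is just an example):

    # Trace the block device for 30 seconds
    blktrace -d /dev/bcache0 -w 30 -o osd_trace
    # Decode the per-CPU trace files and also dump binary data for btt
    blkparse -i osd_trace -d events.bin > events.txt
    # Per-I/O latency breakdown (queueing vs. device time)
    btt -i events.bin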
[ceph-users] Re: Rocksdb compaction and OSD timeout
Oh, that's very good to know. I'm sure Igor will respond here, but do you know which PR this was related to? (Possibly https://github.com/ceph/ceph/pull/50321.) If we think there's a regression here, we should get it into the tracker ASAP.

Mark

On 9/7/23 13:45, J-P Methot wrote: To be quite honest, I will not pretend I have a low-level understanding of what was going on. There is very little documentation as to what the bluestore allocator actually does, and we had to rely on Igor's help to find the solution, so my understanding of the situation is limited. What I understand is as follows:

- Our workload requires us to move around, delete, and write a fairly high amount of RBD data across the cluster.
- The AVL allocator doesn't seem to like that, and the changes added to it in 16.2.14 made it worse than before.
- It made the OSDs become unresponsive and lag quite a bit whenever high amounts of data were written or deleted, which is all the time.
- We basically changed the allocator to bitmap and, as we speak, this seems to have solved the problem.

I understand that this is not ideal as it's apparently less performant, but here it's the difference between a cluster that gives me enough I/Os to work properly and a cluster that murders my performance. I hope this helps. Feel free to ask us if you need further details and I'll see what I can do.

On 9/7/23 13:59, Mark Nelson wrote: Ok, good to know. Please feel free to update us here with what you are seeing in the allocator. It might also be worth opening a tracker ticket. I did some work in the AVL allocator a while back where we were repeating the linear search from the same offset on every allocation, getting stuck, and falling back to fast search over and over, leading to significant allocation fragmentation. That got fixed, but I wouldn't be surprised if we have some other sub-optimal behaviors we don't know about.

Mark

On 9/7/23 12:28, J-P Methot wrote: Hi, By this point, we're 95% sure that, contrary to our previous beliefs, it's an issue with changes to the bluestore_allocator and not the compaction process. That said, I will keep this email in mind, as we will want to test optimizations to compaction in our test environment.

On 9/7/23 12:32, Mark Nelson wrote: Hello, There are two things that might help you here. One is to try the new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef and that we backported to Pacific in 16.2.13. So far this appears to be a huge win for avoiding tombstone accumulation during iteration, which is often the issue behind threadpool timeouts due to RocksDB. Manual compaction can help, but if you are hitting a case where there's concurrent iteration and deletion with no writes, tombstones will accumulate quickly with no compactions taking place, and you'll eventually end up back in the same place. The default sliding-window and trigger settings are fairly conservative to avoid excessive compaction, so it may require some tuning to hit the right sweet spot on your cluster. I know of at least one site that's using this feature with more aggressive settings than the default, and it had an extremely positive impact on their cluster.

The other thing that can help improve compaction performance in general is enabling lz4 compression in RocksDB. I plan to make this the default behavior in Squid, assuming we don't run into any issues in testing. There are several sites using this in production now, and the benefits have been dramatic relative to the costs. We're seeing significantly faster compactions and about 2.2x lower space requirements for the DB (RGW workload). There may be a slight CPU cost and read/index-listing performance impact, but even in testing on NVMe clusters this was quite low (maybe a couple of percent).

Mark

On 9/7/23 10:21, J-P Methot wrote: Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations, so we did offline compactions on all our OSDs. It fixed nothing, and we are going through the logs to try and figure this out. To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs. We mostly run RBD workloads. Deep scrubs or no deep scrubs doesn't appear to change anything; deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday.

On 9/7/23 10:49, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD.
[ceph-users] Re: Rocksdb compaction and OSD timeout
To be quite honest, I will not pretend I have a low-level understanding of what was going on. There is very little documentation as to what the bluestore allocator actually does, and we had to rely on Igor's help to find the solution, so my understanding of the situation is limited. What I understand is as follows:

- Our workload requires us to move around, delete, and write a fairly high amount of RBD data across the cluster.
- The AVL allocator doesn't seem to like that, and the changes added to it in 16.2.14 made it worse than before.
- It made the OSDs become unresponsive and lag quite a bit whenever high amounts of data were written or deleted, which is all the time.
- We basically changed the allocator to bitmap and, as we speak, this seems to have solved the problem.

I understand that this is not ideal as it's apparently less performant, but here it's the difference between a cluster that gives me enough I/Os to work properly and a cluster that murders my performance. I hope this helps. Feel free to ask us if you need further details and I'll see what I can do.

On 9/7/23 13:59, Mark Nelson wrote: Ok, good to know. Please feel free to update us here with what you are seeing in the allocator. It might also be worth opening a tracker ticket. I did some work in the AVL allocator a while back where we were repeating the linear search from the same offset on every allocation, getting stuck, and falling back to fast search over and over, leading to significant allocation fragmentation. That got fixed, but I wouldn't be surprised if we have some other sub-optimal behaviors we don't know about.

Mark

On 9/7/23 12:28, J-P Methot wrote: Hi, By this point, we're 95% sure that, contrary to our previous beliefs, it's an issue with changes to the bluestore_allocator and not the compaction process. That said, I will keep this email in mind, as we will want to test optimizations to compaction in our test environment.

On 9/7/23 12:32, Mark Nelson wrote: Hello, There are two things that might help you here. One is to try the new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef and that we backported to Pacific in 16.2.13. So far this appears to be a huge win for avoiding tombstone accumulation during iteration, which is often the issue behind threadpool timeouts due to RocksDB. Manual compaction can help, but if you are hitting a case where there's concurrent iteration and deletion with no writes, tombstones will accumulate quickly with no compactions taking place, and you'll eventually end up back in the same place. The default sliding-window and trigger settings are fairly conservative to avoid excessive compaction, so it may require some tuning to hit the right sweet spot on your cluster. I know of at least one site that's using this feature with more aggressive settings than the default, and it had an extremely positive impact on their cluster.

The other thing that can help improve compaction performance in general is enabling lz4 compression in RocksDB. I plan to make this the default behavior in Squid, assuming we don't run into any issues in testing. There are several sites using this in production now, and the benefits have been dramatic relative to the costs. We're seeing significantly faster compactions and about 2.2x lower space requirements for the DB (RGW workload). There may be a slight CPU cost and read/index-listing performance impact, but even in testing on NVMe clusters this was quite low (maybe a couple of percent).

Mark

On 9/7/23 10:21, J-P Methot wrote: Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations, so we did offline compactions on all our OSDs. It fixed nothing, and we are going through the logs to try and figure this out. To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs. We mostly run RBD workloads. Deep scrubs or no deep scrubs doesn't appear to change anything; deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday.

On 9/7/23 10:49, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.
[ceph-users] Re: Rocksdb compaction and OSD timeout
Ok, good to know. Please feel free to update us here with what you are seeing in the allocator. It might also be worth opening a tracker ticket. I did some work in the AVL allocator a while back where we were repeating the linear search from the same offset on every allocation, getting stuck, and falling back to fast search over and over, leading to significant allocation fragmentation. That got fixed, but I wouldn't be surprised if we have some other sub-optimal behaviors we don't know about.

Mark

On 9/7/23 12:28, J-P Methot wrote: Hi, By this point, we're 95% sure that, contrary to our previous beliefs, it's an issue with changes to the bluestore_allocator and not the compaction process. That said, I will keep this email in mind, as we will want to test optimizations to compaction in our test environment.

On 9/7/23 12:32, Mark Nelson wrote: Hello, There are two things that might help you here. One is to try the new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef and that we backported to Pacific in 16.2.13. So far this appears to be a huge win for avoiding tombstone accumulation during iteration, which is often the issue behind threadpool timeouts due to RocksDB. Manual compaction can help, but if you are hitting a case where there's concurrent iteration and deletion with no writes, tombstones will accumulate quickly with no compactions taking place, and you'll eventually end up back in the same place. The default sliding-window and trigger settings are fairly conservative to avoid excessive compaction, so it may require some tuning to hit the right sweet spot on your cluster. I know of at least one site that's using this feature with more aggressive settings than the default, and it had an extremely positive impact on their cluster.

The other thing that can help improve compaction performance in general is enabling lz4 compression in RocksDB. I plan to make this the default behavior in Squid, assuming we don't run into any issues in testing. There are several sites using this in production now, and the benefits have been dramatic relative to the costs. We're seeing significantly faster compactions and about 2.2x lower space requirements for the DB (RGW workload). There may be a slight CPU cost and read/index-listing performance impact, but even in testing on NVMe clusters this was quite low (maybe a couple of percent).

Mark

On 9/7/23 10:21, J-P Methot wrote: Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations, so we did offline compactions on all our OSDs. It fixed nothing, and we are going through the logs to try and figure this out. To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs. We mostly run RBD workloads. Deep scrubs or no deep scrubs doesn't appear to change anything; deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday.

On 9/7/23 10:49, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.

Does the OSD also restart after it logs the timeouts? You might want to perform an offline compaction every $timeperiod to fix any potential RocksDB degradation. That's what we do. What kind of workload do you run (i.e. RBD, CephFS, RGW)? Do you also see these timeouts occur during deep scrubs?

Gr. Stefan

--
Best Regards,
Mark Nelson
Head of Research and Development, Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com
We are hiring: https://www.clyso.com/jobs/
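For anyone who wants to experiment with the tuning Mark describes, the knobs look roughly like this (a sketch - option names to the best of my knowledge; verify with `ceph config help <option>` on your release before relying on them):

    # Enable compact-on-deletion for the OSD's RocksDB column families
    ceph config set osd rocksdb_cf_compact_on_deletion true
    # Fire a compaction when more than <trigger> tombstones are seen within
    # a sliding window of <window> consecutive keys; lowering the window or
    # the trigger makes the feature more aggressive
    ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 32768
    ceph config set osd rocksdb_cf_compact_on_deletion_trigger 16384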
[ceph-users] Re: Rocksdb compaction and OSD timeout
Hi, By this point, we're 95% sure that, contrary to our previous beliefs, it's an issue with changes to the bluestore_allocator and not the compaction process. That said, I will keep this email in mind, as we will want to test optimizations to compaction in our test environment.

On 9/7/23 12:32, Mark Nelson wrote: Hello, There are two things that might help you here. One is to try the new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef and that we backported to Pacific in 16.2.13. So far this appears to be a huge win for avoiding tombstone accumulation during iteration, which is often the issue behind threadpool timeouts due to RocksDB. Manual compaction can help, but if you are hitting a case where there's concurrent iteration and deletion with no writes, tombstones will accumulate quickly with no compactions taking place, and you'll eventually end up back in the same place. The default sliding-window and trigger settings are fairly conservative to avoid excessive compaction, so it may require some tuning to hit the right sweet spot on your cluster. I know of at least one site that's using this feature with more aggressive settings than the default, and it had an extremely positive impact on their cluster.

The other thing that can help improve compaction performance in general is enabling lz4 compression in RocksDB. I plan to make this the default behavior in Squid, assuming we don't run into any issues in testing. There are several sites using this in production now, and the benefits have been dramatic relative to the costs. We're seeing significantly faster compactions and about 2.2x lower space requirements for the DB (RGW workload). There may be a slight CPU cost and read/index-listing performance impact, but even in testing on NVMe clusters this was quite low (maybe a couple of percent).

Mark

On 9/7/23 10:21, J-P Methot wrote: Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations, so we did offline compactions on all our OSDs. It fixed nothing, and we are going through the logs to try and figure this out. To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs. We mostly run RBD workloads. Deep scrubs or no deep scrubs doesn't appear to change anything; deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday.

On 9/7/23 10:49, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.

Does the OSD also restart after it logs the timeouts? You might want to perform an offline compaction every $timeperiod to fix any potential RocksDB degradation. That's what we do. What kind of workload do you run (i.e. RBD, CephFS, RGW)? Do you also see these timeouts occur during deep scrubs?

Gr. Stefan

--
Jean-Philippe Méthot
Senior OpenStack system administrator / Administrateur système OpenStack sénior
PlanetHoster inc.
[ceph-users] Re: Rocksdb compaction and OSD timeout
We went from 16.2.13 to 16.2.14. Also, the timeout is 15 seconds because that's the default in Ceph - basically, 15 seconds before Ceph shows a warning that an OSD is timing out. We may have found the solution, but it would be, in fact, related to bluestore_allocator and not the compaction process. I'll post the actual resolution when we confirm 100% that it works.

On 9/7/23 12:18, Konstantin Shalygin wrote: Hi, On 7 Sep 2023, at 18:21, J-P Methot wrote: Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations, so we did offline compactions on all our OSDs. It fixed nothing, and we are going through the logs to try and figure this out. To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs. We mostly run RBD workloads. Deep scrubs or no deep scrubs doesn't appear to change anything; deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday. What was your previous release version? What are your OSD drive models? Are the timeouts always 15s? Not 7s, not 17s? Thanks, k

--
Jean-Philippe Méthot
Senior OpenStack system administrator / Administrateur système OpenStack sénior
PlanetHoster inc.
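The 15 seconds J-P refers to is the OSD op thread timeout; a quick sketch for checking it, assuming the stock option names:

    # Warning threshold for the heartbeat_map 'had timed out' message (default 15s)
    ceph config get osd osd_op_thread_timeout
    # Hard limit after which the OSD aborts (default 150s)
    ceph config get osd osd_op_thread_suicide_timeout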
[ceph-users] Re: Rocksdb compaction and OSD timeout
Hello, There are two things that might help you here. One is to try the new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef and that we backported to Pacific in 16.2.13. So far this appears to be a huge win for avoiding tombstone accumulation during iteration, which is often the issue behind threadpool timeouts due to RocksDB. Manual compaction can help, but if you are hitting a case where there's concurrent iteration and deletion with no writes, tombstones will accumulate quickly with no compactions taking place, and you'll eventually end up back in the same place. The default sliding-window and trigger settings are fairly conservative to avoid excessive compaction, so it may require some tuning to hit the right sweet spot on your cluster. I know of at least one site that's using this feature with more aggressive settings than the default, and it had an extremely positive impact on their cluster.

The other thing that can help improve compaction performance in general is enabling lz4 compression in RocksDB. I plan to make this the default behavior in Squid, assuming we don't run into any issues in testing. There are several sites using this in production now, and the benefits have been dramatic relative to the costs. We're seeing significantly faster compactions and about 2.2x lower space requirements for the DB (RGW workload). There may be a slight CPU cost and read/index-listing performance impact, but even in testing on NVMe clusters this was quite low (maybe a couple of percent).

Mark

On 9/7/23 10:21, J-P Methot wrote: Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations, so we did offline compactions on all our OSDs. It fixed nothing, and we are going through the logs to try and figure this out. To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs. We mostly run RBD workloads. Deep scrubs or no deep scrubs doesn't appear to change anything; deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday.

On 9/7/23 10:49, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.

Does the OSD also restart after it logs the timeouts? You might want to perform an offline compaction every $timeperiod to fix any potential RocksDB degradation. That's what we do. What kind of workload do you run (i.e. RBD, CephFS, RGW)? Do you also see these timeouts occur during deep scrubs?

Gr. Stefan

--
Best Regards,
Mark Nelson
Head of Research and Development, Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com
We are hiring: https://www.clyso.com/jobs/
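In case anyone wants to try the lz4 setting ahead of it becoming a default, a hedged sketch - on releases that support it, the annex option appends to the stock RocksDB option string rather than replacing it:

    # Append compression to the existing RocksDB options
    ceph config set osd bluestore_rocksdb_options_annex "compression=kLZ4Compression"
    # Takes effect at OSD restart; existing SST files are only compressed
    # as they get rewritten by later compactions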
[ceph-users] Re: Rocksdb compaction and OSD timeout
Hi,

> On 7 Sep 2023, at 18:21, J-P Methot wrote:
>
> Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations. So we did offline compactions on all our OSDs. It fixed nothing and we are going through the logs to try and figure this out.
>
> To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs.
>
> We mostly run RBD workloads.
>
> Deep scrubs or no deep scrubs doesn't appear to change anything. Deactivating scrubs altogether did not impact performance in any way.
>
> Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday.

What was your previous release version? What are your OSD drive models? Are the timeouts always 15s? Not 7s, not 17s?

Thanks, k
[ceph-users] Re: Rocksdb compaction and OSD timeout
Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations, so we did offline compactions on all our OSDs. It fixed nothing, and we are going through the logs to try and figure this out.

To answer your question, no, the OSD doesn't restart after it logs the timeout. It manages to get back online by itself, at the cost of sluggish performance for the cluster and high iowait on VMs. We mostly run RBD workloads. Deep scrubs or no deep scrubs doesn't appear to change anything; deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this has only been happening since we upgraded to the latest Pacific, yesterday.

On 9/7/23 10:49, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.

Does the OSD also restart after it logs the timeouts? You might want to perform an offline compaction every $timeperiod to fix any potential RocksDB degradation. That's what we do. What kind of workload do you run (i.e. RBD, CephFS, RGW)? Do you also see these timeouts occur during deep scrubs?

Gr. Stefan

--
Jean-Philippe Méthot
Senior OpenStack system administrator / Administrateur système OpenStack sénior
PlanetHoster inc.
[ceph-users] Re: Rocksdb compaction and OSD timeout
On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.

Does the OSD also restart after it logs the timeouts? You might want to perform an offline compaction every $timeperiod to fix any potential RocksDB degradation. That's what we do. What kind of workload do you run (i.e. RBD, CephFS, RGW)? Do you also see these timeouts occur during deep scrubs?

Gr. Stefan
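For reference, the offline compaction Stefan mentions can be done with ceph-kvstore-tool on a stopped OSD (paths and unit names assume a default non-cephadm deployment):

    # Stop the OSD, compact its RocksDB in place, then start it again
    systemctl stop ceph-osd@12
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
    systemctl start ceph-osd@12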
[ceph-users] Re: Rocksdb compaction and OSD timeout
On an HDD-based Quincy 17.2.5 cluster (with DB/WAL on datacenter-class NVMe with enhanced power-loss protection), I sometimes (once or twice per week) see log entries similar to what I reproduced below (a bit trimmed, one entry per line):

Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:41:54.429+0000 7f83d813d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.00954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:41:54.429+0000 7f83d793c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.00954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:41:54.469+0000 7f83d713b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.00954s
Wed 2023-09-06 22:42:04 UTC mds02 ceph-mgr@mds02.service[4954]: 2023-09-06T22:42:04.735+0000 7ffab1a20700 0 log_channel(cluster) log [DBG] : pgmap v227303: 5449 pgs: 1 active+clean+laggy, 7 active+clean+snaptrim, 34 active+remapped+backfilling, 13 active+clean+scrubbing, 4738 active+clean, 19 active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0 PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 103 MiB/s rd, 25 MiB/s wr, 2.76k op/s; 923967567/22817724437 objects misplaced (4.049%); 128 MiB/s, 142 objects/s recovering
Wed 2023-09-06 22:42:06 UTC mds02 ceph-mgr@mds02.service[4954]: 2023-09-06T22:42:06.767+0000 7ffab1a20700 0 log_channel(cluster) log [DBG] : pgmap v227304: 5449 pgs: 2 active+clean+laggy, 7 active+clean+snaptrim, 34 active+remapped+backfilling, 13 active+clean+scrubbing, 4737 active+clean, 19 active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0 PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 88 MiB/s rd, 23 MiB/s wr, 3.85k op/s; 923967437/22817727283 objects misplaced (4.049%); 115 MiB/s, 126 objects/s recovering
Wed 2023-09-06 22:42:15 UTC mds02 ceph-mgr@mds02.service[4954]: 2023-09-06T22:42:15.311+0000 7ffab1a20700 0 log_channel(cluster) log [DBG] : pgmap v227308: 5449 pgs: 1 active+remapped+backfill_wait+laggy, 3 active+clean+laggy, 7 active+clean+snaptrim, 34 active+remapped+backfilling, 13 active+clean+scrubbing, 4736 active+clean, 19 active+clean+scrubbing+deep, 636 active+remapped+backfill_wait; 2.0 PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 67 MiB/s rd, 13 MiB/s wr, 3.21k op/s; 923966538/22817734196 objects misplaced (4.049%); 134 MiB/s, 137 objects/s recovering
Wed 2023-09-06 22:42:22 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:22.708+0000 7f83b9ec1700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.00954s
Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@12.service[5527]: 2023-09-06T22:42:24.637+0000 7f4fbfcc3700 -1 osd.12 1031871 heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back 2023-09-06T22:41:54.079627+0000 front 2023-09-06T22:41:54.079742+0000 (oldest deadline 2023-09-06T22:42:24.579224+0000)
Wed 2023-09-06 22:42:24 UTC ceph-osd08 ceph-osd@126.service[5505]: 2023-09-06T22:42:24.671+0000 7feb033d8700 -1 osd.126 1031871 heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back 2023-09-06T22:41:53.996908+0000 front 2023-09-06T22:41:53.997184+0000 (oldest deadline 2023-09-06T22:42:24.496552+0000)
Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@75.service[5532]: 2023-09-06T22:42:24.797+0000 7fe15bc77700 -1 osd.75 1031871 heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back 2023-09-06T22:41:49.384037+0000 front 2023-09-06T22:41:49.384031+0000 (oldest deadline 2023-09-06T22:42:24.683399+0000)
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.669+0000 7f8f72216700 1 mon.mds01@0(leader).osd e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.1 is reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.669+0000 7f8f72216700 0 log_channel(cluster) log [DBG] : osd.39 reported failed by osd.1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 1 mon.mds01@0(leader).osd e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.408 is reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 0 log_channel(cluster) log [DBG] : osd.39 reported failed by osd.408
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 1 mon.mds01@0(leader).osd e1031871 we have enough reporters to mark osd.39 down
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 0 log_channel(cluster) log [INF] : osd.39 failed (root=default,host=ceph-osd09) (2 reporters from different host after 32.232905 >= grace 30.00)
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:26.445+0000 7f8f74a1b700 0
[ceph-users] Re: Rocksdb compaction and OSD timeout
We're talking about automatic online compaction here, not running the command.

On 9/7/23 04:04, Konstantin Shalygin wrote: Hi, On 7 Sep 2023, at 10:05, J-P Methot wrote: We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is: does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that. You run online compaction for these OSDs (the `ceph osd compact ${osd_id}` command), right? k

--
Jean-Philippe Méthot
Senior OpenStack system administrator / Administrateur système OpenStack sénior
PlanetHoster inc.
[ceph-users] Re: Rocksdb compaction and OSD timeout
Hi,

> On 7 Sep 2023, at 10:05, J-P Methot wrote:
>
> We're running the latest Pacific on our production cluster and we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s" error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question is, does the cluster detecting that an OSD has timed out interrupt the compaction process? This seems to be what's happening, but it's not immediately obvious. We are currently facing an infinite loop of random OSDs timing out, and if the compaction process is interrupted without finishing, it may explain that.

You run online compaction for these OSDs (the `ceph osd compact ${osd_id}` command), right?

k
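For reference, on current releases manual online compaction can be triggered like this (a sketch; compaction is I/O-heavy, so it is best done one OSD at a time):

    # Over the network, ask a running OSD to compact its RocksDB
    ceph tell osd.0 compact
    # Or locally, via the OSD's admin socket
    ceph daemon osd.0 compact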