[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Igor Fedotov

Hey Konstantin,

Forgot to mention - indeed, clusters having a 4K bluestore min alloc size 
are more likely to be exposed to the issue. The key point is the 
difference between the bluestore and bluefs allocation sizes. The issue 
is likely to pop up when user and DB data are collocated but different 
allocation units are in use. As a result, the allocator needs to locate 
properly aligned chunks for BlueFS among a bunch of inappropriate, 
misaligned chunks, which might be inefficient in the current 
implementation and cause the slowdown.
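
For reference, the relevant allocation-unit settings can be checked with 
something like the commands below (a sketch using the standard option 
names; note these config values only apply to newly created OSDs - the 
min alloc size of an existing OSD was frozen when it was created):

  # defaults applied at OSD creation time
  ceph config get osd bluestore_min_alloc_size_hdd   # 64K pre-Pacific, 4K in Pacific
  ceph config get osd bluestore_min_alloc_size_ssd
  # allocation unit BlueFS requests on the shared main device
  ceph config get osd bluefs_shared_alloc_size       # 64K by default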



Thanks,

Igor

On 12/09/2023 15:47, Konstantin Shalygin wrote:

Hi Igor,


On 12 Sep 2023, at 15:28, Igor Fedotov  wrote:

The default hybrid allocator (as well as the AVL one it's based on) could take 
a dramatically long time to allocate pretty large (hundreds of MBs) 64K-aligned 
chunks for BlueFS. On the original cluster this was exposed as 20-30 sec OSD 
stalls.

For the chunks, does this mean the bluestore min alloc size?
This cluster was deployed pre-Pacific (64k) and not redeployed with the Pacific 
default (4k)?


Thanks,
k
Sent from my iPhone



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Konstantin Shalygin
Hi Igor,

> On 12 Sep 2023, at 15:28, Igor Fedotov  wrote:
> 
> The default hybrid allocator (as well as the AVL one it's based on) could take 
> a dramatically long time to allocate pretty large (hundreds of MBs) 64K-aligned 
> chunks for BlueFS. On the original cluster this was exposed as 20-30 sec OSD 
> stalls.

For the chunks, does this mean the bluestore min alloc size?
This cluster was deployed pre-Pacific (64k) and not redeployed with the Pacific 
default (4k)?


Thanks,
k
Sent from my iPhone

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Igor Fedotov

Hi all,

As promised, here is a postmortem analysis of what happened.

The following ticket (https://tracker.ceph.com/issues/62815) and its 
accompanying materials provide a low-level overview of the issue.


In a few words it is as follows:

The default hybrid allocator (as well as the AVL one it's based on) could 
take a dramatically long time to allocate pretty large (hundreds of MBs) 
64K-aligned chunks for BlueFS. On the original cluster this was exposed as 
20-30 sec OSD stalls.


This is apparently not specific to the recent 16.2.14 Pacific release, as 
I saw it at least once before, but 
https://github.com/ceph/ceph/pull/51773 made it more likely to pop up: 
RocksDB can now preallocate huge WALs in a single shot.


The issue is definitely bound to aged/fragmented main OSD volumes that 
also hold the DB (collocated). I don't expect it to pop up for standalone 
DB/WALs.


As already mentioned in this thread, the proposed workaround is to 
switch bluestore_allocator to bitmap. This might cause a minor overall 
performance drop, so I'm not sure one should apply it unconditionally.
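
For anyone who wants to try it, a minimal sketch of applying the 
workaround (the allocator is only picked up at OSD start, so a rolling 
restart is required; the systemctl unit name below assumes a 
non-containerized deployment):

  # switch the allocator for all OSDs (takes effect on OSD restart)
  ceph config set osd bluestore_allocator bitmap
  ceph config get osd bluestore_allocator
  # then restart OSDs one failure domain at a time, e.g.
  systemctl restart ceph-osd@<id>.service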


I'd like to apologize for the inconvenience this may have caused. 
We're currently working on a proper fix...


Thanks,

Igor

On 07/09/2023 10:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed 
out after 15.00954s' error. We have reasons to believe this 
happens each time the RocksDB compaction process is launched on an 
OSD. My question is, does the cluster detecting that an OSD has timed 
out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-11 Thread J-P Methot
The bluestore configuration was 100% default when we did the upgrade and 
the issue happened. We have provided Igor with an OSD dump and a db dump 
last Friday, so hopefully you can figure out something from it.


On 9/8/23 02:48, Konstantin Shalygin wrote:
Does this cluster use the default settings, or was something changed for 
Bluestore?


You can check this via `ceph config diff`


As Mark said, it would be nice to have a tracker, if this is really a 
release problem.


Thanks,
k
Sent from my iPhone


On 7 Sep 2023, at 20:22, J-P Methot  wrote:

We went from 16.2.13 to 16.2.14

Also, timeout is 15 seconds because it's the default in Ceph. 
Basically, 15 seconds before Ceph shows a warning that OSD is timing out.


We may have found the solution, but it would be, in fact, related to 
bluestore_allocator and not the compaction process. I'll post the 
actual resolution when we confirm 100% that it works.



--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-08 Thread Konstantin Shalygin
Does this cluster use the default settings, or was something changed for Bluestore?

You can check this via `ceph config diff`


As Mark said, it would be nice to have a tracker, if this is really a release problem.

Thanks,
k
Sent from my iPhone

> On 7 Sep 2023, at 20:22, J-P Methot  wrote:
> 
> We went from 16.2.13 to 16.2.14
> 
> Also, timeout is 15 seconds because it's the default in Ceph. Basically, 15 
> seconds before Ceph shows a warning that OSD is timing out.
> 
> We may have found the solution, but it would be, in fact, related to 
> bluestore_allocator and not the compaction process. I'll post the actual 
> resolution when we confirm 100% that it works.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-08 Thread Stefan Kooman

On 07-09-2023 19:20, J-P Methot wrote:

We went from 16.2.13 to 16.2.14

Also, timeout is 15 seconds because it's the default in Ceph. Basically, 
15 seconds before Ceph shows a warning that OSD is timing out.


We may have found the solution, but it would be, in fact, related to 
bluestore_allocator and not the compaction process. I'll post the actual 
resolution when we confirm 100% that it works.


I'm very curious about what comes out of this, so yeah, please share 
your findings. We have re-provisioned our cluster over the past couple of 
months, and have kept the "bitmap" allocator for bluestore (non-default). We 
have one OSD running hybrid for comparison. We want to see if there are 
any (big) differences in fragmentation over time between the two.
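
For that comparison, a rough sketch of how per-OSD fragmentation can be 
sampled via the admin socket (assuming your release exposes the 
allocator score/dump commands; the score is a 0-1 rating, higher means 
more fragmented):

  # fragmentation rating of the main (block) device allocator
  ceph daemon osd.<id> bluestore allocator score block
  # full free-extent dump, to inspect chunk sizes and alignment
  ceph daemon osd.<id> bluestore allocator dump block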


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread xiaowenhao111
I also see the dreaded timeout. I find this is a bcache problem. You can use 
the blktrace tools to capture I/O data for analysis.
Sent from my Xiaomi

On 7 Sep 2023, at 22:52, Stefan Kooman wrote:
On 07-09-2023 09:05, J-P Methot wrote:
> Hi,
> 
> We're running latest Pacific on our production cluster and we've been 
> seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out 
> after 15.00954s' error. We have reasons to believe this happens each 
> time the RocksDB compaction process is launched on an OSD. My question 
> is, does the cluster detecting that an OSD has timed out interrupt the 
> compaction process? This seems to be what's happening, but it's not 
> immediately obvious. We are currently facing an infinite loop of random 
> OSDs timing out and if the compaction process is interrupted without 
> finishing, it may explain that.
> 
Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to fix 
any potential RocksDB degradation. That's what we do. What kind of 
workload do you run (i.e. RBD, CephFS, RGW)?

Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson
Oh that's very good to know.  I'm sure Igor will respond here, but do 
you know which PR this was related to? (possibly 
https://github.com/ceph/ceph/pull/50321)


If we think there's a regression here we should get it into the tracker 
ASAP.



Mark


On 9/7/23 13:45, J-P Methot wrote:
To be quite honest, I will not pretend I have a low level 
understanding of what was going on. There is very little documentation 
as to what the bluestore allocator actually does and we had to rely on 
Igor's help to find the solution, so my understanding of the situation 
is limited. What I understand is as follows:


-Our workload requires us to move around, delete, write a fairly high 
amount of RBD data around the cluster.


-The AVL allocator doesn't seem to like that and changes added to it 
in 16.2.14 made it worse than before.


-It made the OSDs become unresponsive and lag quite a bit whenever 
high amounts of data was written or deleted, which is, all the time.


-We basically changed the allocator to bitmap and, as we speak, this 
seems to have solved the problem. I understand that this is not ideal 
as it's apparently less performant, but here it's the difference 
between a cluster that gives me enough I/Os to work properly and a 
cluster that murders my performances.


I hope this helps. Feel free to ask us if you need further details and 
I'll see what I can do.


On 9/7/23 13:59, Mark Nelson wrote:
Ok, good to know.  Please feel free to update us here with what you 
are seeing in the allocator.  It might also be worth opening a 
tracker ticket as well.  I did some work in the AVL allocator a while 
back where we were repeating the linear search from the same offset 
every allocation, getting stuck, and falling back to fast search over 
and over leading to significant allocation fragmentation. That got 
fixed, but I wouldn't be surprised if we have some other sub-optimal 
behaviors we don't know about.



Mark


On 9/7/23 12:28, J-P Methot wrote:

Hi,

By this point, we're 95% sure that, contrary to our previous 
beliefs, it's an issue with changes to the bluestore_allocator and 
not the compaction process. That said, I will keep this email in 
mind as we will want to test optimizations to compaction on our test 
environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the 
new "rocksdb_cf_compaction_on_deletion" feature that I added in 
Reef and we backported to Pacific in 16.2.13.  So far this appears 
to be a huge win for avoiding tombstone accumulation during 
iteration which is often the issue with threadpool timeouts due to 
rocksdb.  Manual compaction can help, but if you are hitting a case 
where there's concurrent iteration and deletions with no writes, 
tombstones will accumulate quickly with no compactions taking place 
and you'll eventually end up back in the same place. The default 
sliding window and trigger settings are fairly conservative to 
avoid excessive compaction, so it may require some tuning to hit 
the right sweet spot on your cluster. I know of at least one site 
that's using this feature with more aggressive settings than 
default and had an extremely positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make 
this the default behavior in Squid assuming we don't run into any 
issues in testing.  There are several sites that are using this now 
in production and the benefits have been dramatic relative to the 
costs.  We're seeing significantly faster compactions and about 
2.2x lower space requirement for the DB (RGW workload). There may 
be a slight CPU cost and read/index listing performance impact, but 
even with testing on NVMe clusters this was quite low (maybe a 
couple of percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the 
common performance degradation after huge deletes operation. So we 
did do offline compactions on all our OSDs. It fixed nothing and 
we are going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs 
the timeout. It manages to get back online by itself, at the cost 
of sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any 
way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' 
had timed out after 15.00954s' error. We have reasons to 
believe this 

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
To be quite honest, I will not pretend I have a low level understanding 
of what was going on. There is very little documentation as to what the 
bluestore allocator actually does and we had to rely on Igor's help to 
find the solution, so my understanding of the situation is limited. What 
I understand is as follows:


-Our workload requires us to move around, delete, write a fairly high 
amount of RBD data around the cluster.


-The AVL allocator doesn't seem to like that and changes added to it in 
16.2.14 made it worse than before.


-It made the OSDs become unresponsive and lag quite a bit whenever high 
amounts of data was written or deleted, which is, all the time.


-We basically changed the allocator to bitmap and, as we speak, this 
seems to have solved the problem. I understand that this is not ideal as 
it's apparently less performant, but here it's the difference between a 
cluster that gives me enough I/Os to work properly and a cluster that 
murders my performances.


I hope this helps. Feel free to ask us if you need further details and 
I'll see what I can do.


On 9/7/23 13:59, Mark Nelson wrote:
Ok, good to know.  Please feel free to update us here with what you 
are seeing in the allocator.  It might also be worth opening a tracker 
ticket as well.  I did some work in the AVL allocator a while back 
where we were repeating the linear search from the same offset every 
allocation, getting stuck, and falling back to fast search over and 
over leading to significant allocation fragmentation. That got fixed, 
but I wouldn't be surprised if we have some other sub-optimal 
behaviors we don't know about.



Mark


On 9/7/23 12:28, J-P Methot wrote:

Hi,

By this point, we're 95% sure that, contrary to our previous beliefs, 
it's an issue with changes to the bluestore_allocator and not the 
compaction process. That said, I will keep this email in mind as we 
will want to test optimizations to compaction on our test environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the 
new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef 
and we backported to Pacific in 16.2.13.  So far this appears to be 
a huge win for avoiding tombstone accumulation during iteration 
which is often the issue with threadpool timeouts due to rocksdb.  
Manual compaction can help, but if you are hitting a case where 
there's concurrent iteration and deletions with no writes, 
tombstones will accumulate quickly with no compactions taking place 
and you'll eventually end up back in the same place. The default 
sliding window and trigger settings are fairly conservative to avoid 
excessive compaction, so it may require some tuning to hit the right 
sweet spot on your cluster. I know of at least one site that's using 
this feature with more aggressive settings than default and had an 
extremely positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make this 
the default behavior in Squid assuming we don't run into any issues 
in testing.  There are several sites that are using this now in 
production and the benefits have been dramatic relative to the 
costs.  We're seeing significantly faster compactions and about 2.2x 
lower space requirement for the DB (RGW workload). There may be a 
slight CPU cost and read/index listing performance impact, but even 
with testing on NVMe clusters this was quite low (maybe a couple of 
percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the 
common performance degradation after huge deletes operation. So we 
did do offline compactions on all our OSDs. It fixed nothing and we 
are going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs 
the timeout. It manages to get back online by itself, at the cost 
of sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' 
had timed out after 15.00954s' error. We have reasons to 
believe this happens each time the RocksDB compaction process is 
launched on an OSD. My question is, does the cluster detecting 
that an OSD has timed out interrupt the compaction process? This 
seems to be what's happening, but it's not immediately obvious. 
We are currently facing an infinite loop of 

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson
Ok, good to know.  Please feel free to update us here with what you are 
seeing in the allocator.  It might also be worth opening a tracker 
ticket as well.  I did some work in the AVL allocator a while back where 
we were repeating the linear search from the same offset every 
allocation, getting stuck, and falling back to fast search over and over 
leading to significant allocation fragmentation.  That got fixed, but I 
wouldn't be surprised if we have some other sub-optimal behaviors we 
don't know about.



Mark


On 9/7/23 12:28, J-P Methot wrote:

Hi,

By this point, we're 95% sure that, contrary to our previous beliefs, 
it's an issue with changes to the bluestore_allocator and not the 
compaction process. That said, I will keep this email in mind as we 
will want to test optimizations to compaction on our test environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the new 
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and 
we backported to Pacific in 16.2.13.  So far this appears to be a 
huge win for avoiding tombstone accumulation during iteration which 
is often the issue with threadpool timeouts due to rocksdb.  Manual 
compaction can help, but if you are hitting a case where there's 
concurrent iteration and deletions with no writes, tombstones will 
accumulate quickly with no compactions taking place and you'll 
eventually end up back in the same place. The default sliding window 
and trigger settings are fairly conservative to avoid excessive 
compaction, so it may require some tuning to hit the right sweet spot 
on your cluster. I know of at least one site that's using this 
feature with more aggressive settings than default and had an 
extremely positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make this 
the default behavior in Squid assuming we don't run into any issues 
in testing.  There are several sites that are using this now in 
production and the benefits have been dramatic relative to the 
costs.  We're seeing significantly faster compactions and about 2.2x 
lower space requirement for the DB (RGW workload). There may be a 
slight CPU cost and read/index listing performance impact, but even 
with testing on NVMe clusters this was quite low (maybe a couple of 
percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are 
going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs 
the timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had 
timed out after 15.00954s' error. We have reasons to believe 
this happens each time the RocksDB compaction process is launched 
on an OSD. My question is, does the cluster detecting that an OSD 
has timed out interrupt the compaction process? This seems to be 
what's happening, but it's not immediately obvious. We are 
currently facing an infinite loop of random OSDs timing out and if 
the compaction process is interrupted without finishing, it may 
explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod 
to fix any potential RocksDB degradation. That's what we do. What 
kind of workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan



--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot

Hi,

By this point, we're 95% sure that, contrary to our previous beliefs, 
it's an issue with changes to the bluestore_allocator and not the 
compaction process. That said, I will keep this email in mind as we will 
want to test optimizations to compaction on our test environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the new 
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and 
we backported to Pacific in 16.2.13.  So far this appears to be a huge 
win for avoiding tombstone accumulation during iteration which is 
often the issue with threadpool timeouts due to rocksdb.  Manual 
compaction can help, but if you are hitting a case where there's 
concurrent iteration and deletions with no writes, tombstones will 
accumulate quickly with no compactions taking place and you'll 
eventually end up back in the same place. The default sliding window 
and trigger settings are fairly conservative to avoid excessive 
compaction, so it may require some tuning to hit the right sweet spot 
on your cluster. I know of at least one site that's using this feature 
with more aggressive settings than default and had an extremely 
positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make this 
the default behavior in Squid assuming we don't run into any issues in 
testing.  There are several sites that are using this now in 
production and the benefits have been dramatic relative to the costs.  
We're seeing significantly faster compactions and about 2.2x lower 
space requirement for the DB (RGW workload). There may be a slight CPU 
cost and read/index listing performance impact, but even with testing 
on NVMe clusters this was quite low (maybe a couple of percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are 
going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had 
timed out after 15.00954s' error. We have reasons to believe 
this happens each time the RocksDB compaction process is launched 
on an OSD. My question is, does the cluster detecting that an OSD 
has timed out interrupt the compaction process? This seems to be 
what's happening, but it's not immediately obvious. We are 
currently facing an infinite loop of random OSDs timing out and if 
the compaction process is interrupted without finishing, it may 
explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to 
fix any potential RocksDB degradation. That's what we do. What kind 
of workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan



--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot

We went from 16.2.13 to 16.2.14

Also, timeout is 15 seconds because it's the default in Ceph. Basically, 
15 seconds before Ceph shows a warning that OSD is timing out.
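
For reference, a quick sketch of the settings involved, as far as I 
know (the 15 s value is the op thread timeout behind the heartbeat_map 
warning; the suicide timeout is the point at which the OSD gives up and 
aborts):

  ceph config get osd osd_op_thread_timeout          # default 15 (seconds)
  ceph config get osd osd_op_thread_suicide_timeout  # default 150 (seconds)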


We may have found the solution, but it would be, in fact, related to 
bluestore_allocator and not the compaction process. I'll post the actual 
resolution when we confirm 100% that it works.


On 9/7/23 12:18, Konstantin Shalygin wrote:

Hi,


On 7 Sep 2023, at 18:21, J-P Methot  wrote:

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are 
going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


What is your previous release version? What are your OSD drive models?
Are the timeouts always 15s? Not 7s, not 17s?


Thanks,
k


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson

Hello,

There are two things that might help you here.  One is to try the new 
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and we 
backported to Pacific in 16.2.13.  So far this appears to be a huge win 
for avoiding tombstone accumulation during iteration which is often the 
issue with threadpool timeouts due to rocksdb.  Manual compaction can 
help, but if you are hitting a case where there's concurrent iteration 
and deletions with no writes, tombstones will accumulate quickly with no 
compactions taking place and you'll eventually end up back in the same 
place. The default sliding window and trigger settings are fairly 
conservative to avoid excessive compaction, so it may require some 
tuning to hit the right sweet spot on your cluster. I know of at least 
one site that's using this feature with more aggressive settings than 
default and had an extremely positive impact on their cluster.
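
If someone wants to experiment with it, a hedged sketch (the exact 
bluestore option names below are from memory - please confirm them with 
`ceph config ls` on your release before setting anything, and treat the 
window/trigger numbers as illustrative starting points only):

  # discover the exact option names shipped with your release
  ceph config ls | grep compaction_on_deletion
  # then, assuming the usual names, enable the feature and tune it
  ceph config set osd bluestore_rocksdb_cf_compaction_on_deletion true
  # a smaller window and/or lower trigger makes it more aggressive
  ceph config set osd bluestore_rocksdb_cf_compaction_on_deletion_sliding_window 32768
  ceph config set osd bluestore_rocksdb_cf_compaction_on_deletion_trigger 8192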


The other thing that can help improve compaction performance in general 
is enabling lz4 compression in RocksDB.  I plan to make this the default 
behavior in Squid assuming we don't run into any issues in testing.  
There are several sites that are using this now in production and the 
benefits have been dramatic relative to the costs.  We're seeing 
significantly faster compactions and about 2.2x lower space requirement 
for the DB (RGW workload). There may be a slight CPU cost and read/index 
listing performance impact, but even with testing on NVMe clusters this 
was quite low (maybe a couple of percent).
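
For the record, a hedged sketch of how one might turn it on today, 
assuming the default option string still carries 
compression=kNoCompression (setting bluestore_rocksdb_options replaces 
the whole string, so start from the current value; the change only 
applies on OSD restart):

  # current effective rocksdb option string
  CUR=$(ceph config get osd bluestore_rocksdb_options)
  # swap the compression setting and write it back, then double-check
  ceph config set osd bluestore_rocksdb_options "${CUR/kNoCompression/kLZ4Compression}"
  ceph config get osd bluestore_rocksdb_options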



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev team. 
He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are going 
through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we upgraded 
to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had 
timed out after 15.00954s' error. We have reasons to believe 
this happens each time the RocksDB compaction process is launched on 
an OSD. My question is, does the cluster detecting that an OSD has 
timed out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to 
fix any potential RocksDB degradation. That's what we do. What kind 
of workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan



--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Konstantin Shalygin
Hi,

> On 7 Sep 2023, at 18:21, J-P Methot  wrote:
> 
> Since my post, we've been speaking with a member of the Ceph dev team. He 
> did, at first, believe it was an issue linked to the common performance 
> degradation after huge deletes operation. So we did do offline compactions on 
> all our OSDs. It fixed nothing and we are going through the logs to try and 
> figure this out.
> 
> To answer your question, no the OSD doesn't restart after it logs the 
> timeout. It manages to get back online by itself, at the cost of sluggish 
> performances for the cluster and high iowait on VMs.
> 
> We mostly run RBD workloads.
> 
> Deep scrubs or no deep scrubs doesn't appear to change anything. Deactivating 
> scrubs altogether did not impact performances in any way.
> 
> Furthermore, I'll stress that this is only happening since we upgraded to the 
> latest Pacific, yesterday.

What is your previous release version? What are your OSD drive models?
Are the timeouts always 15s? Not 7s, not 17s?


Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot

Hi,

Since my post, we've been speaking with a member of the Ceph dev team. 
He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are going 
through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we upgraded 
to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed 
out after 15.00954s' error. We have reasons to believe this 
happens each time the RocksDB compaction process is launched on an 
OSD. My question is, does the cluster detecting that an OSD has timed 
out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to 
fix any potential RocksDB degradation. That's what we do. What kind of 
workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Stefan Kooman

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out 
after 15.00954s' error. We have reasons to believe this happens each 
time the RocksDB compaction process is launched on an OSD. My question 
is, does the cluster detecting that an OSD has timed out interrupt the 
compaction process? This seems to be what's happening, but it's not 
immediately obvious. We are currently facing an infinite loop of random 
OSDs timing out and if the compaction process is interrupted without 
finishing, it may explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to fix 
any potential RocksDB degradation. That's what we do. What kind of 
workload do you run (i.e. RBD, CephFS, RGW)?
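
For reference, a minimal sketch of an offline compaction pass for a 
single OSD (non-containerized paths shown; adjust for your deployment):

  systemctl stop ceph-osd@<id>.service
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
  systemctl start ceph-osd@<id>.service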


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Alexander E. Patrakov
On an HDD-based Quincy 17.2.5 cluster (with DB/WAL on datacenter-class
NVMe with enhanced power loss protection), I sometimes (once or twice
per week) see log entries similar to what I reproduced below (a bit
trimmed):

Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:41:54.429+ 7f83d813d700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after
15.00954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:41:54.429+ 7f83d793c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after
15.00954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:41:54.469+ 7f83d713b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after
15.00954s

Wed 2023-09-06 22:42:04 UTC mds02 ceph-mgr@mds02.service[4954]:
2023-09-06T22:42:04.735+ 7ffab1a20700  0 log_channel(cluster) log
[DBG] : pgmap v227303: 5449 pgs: 1 active+clean+laggy, 7
active+clean+snaptrim, 34 active+remapped+backfilling, 13
active+clean+scrubbing, 4738 active+clean, 19
active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0
PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 103 MiB/s rd, 25
MiB/s wr, 2.76k op/s; 923967567/22817724437 objects misplaced
(4.049%); 128 MiB/s, 142 objects/s recovering

Wed 2023-09-06 22:42:06 UTC mds02 ceph-mgr@mds02.service[4954]:
2023-09-06T22:42:06.767+ 7ffab1a20700  0 log_channel(cluster) log
[DBG] : pgmap v227304: 5449 pgs: 2 active+clean+laggy, 7
active+clean+snaptrim, 34 active+remapped+backfilling, 13
active+clean+scrubbing, 4737 active+clean, 19
active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0
PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 88 MiB/s rd, 23 MiB/s
wr, 3.85k op/s; 923967437/22817727283 objects misplaced (4.049%); 115
MiB/s, 126 objects/s recovering

Wed 2023-09-06 22:42:15 UTC mds02 ceph-mgr@mds02.service[4954]:
2023-09-06T22:42:15.311+ 7ffab1a20700  0 log_channel(cluster) log
[DBG] : pgmap v227308: 5449 pgs: 1
active+remapped+backfill_wait+laggy, 3 active+clean+laggy, 7
active+clean+snaptrim, 34 active+remapped+backfilling, 13
active+clean+scrubbing, 4736 active+clean, 19
active+clean+scrubbing+deep, 636 active+remapped+backfill_wait; 2.0
PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 67 MiB/s rd, 13 MiB/s
wr, 3.21k op/s; 923966538/22817734196 objects misplaced (4.049%); 134
MiB/s, 137 objects/s recovering

Wed 2023-09-06 22:42:22 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:22.708+ 7f83b9ec1700  1 heartbeat_map
reset_timeout 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out
after 15.00954s

Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@12.service[5527]:
2023-09-06T22:42:24.637+ 7f4fbfcc3700 -1 osd.12 1031871
heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back
2023-09-06T22:41:54.079627+ front 2023-09-06T22:41:54.079742+
(oldest deadline 2023-09-06T22:42:24.579224+)
Wed 2023-09-06 22:42:24 UTC ceph-osd08 ceph-osd@126.service[5505]:
2023-09-06T22:42:24.671+ 7feb033d8700 -1 osd.126 1031871
heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back
2023-09-06T22:41:53.996908+ front 2023-09-06T22:41:53.997184+
(oldest deadline 2023-09-06T22:42:24.496552+)
Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@75.service[5532]:
2023-09-06T22:42:24.797+ 7fe15bc77700 -1 osd.75 1031871
heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back
2023-09-06T22:41:49.384037+ front 2023-09-06T22:41:49.384031+
(oldest deadline 2023-09-06T22:42:24.683399+)

Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.669+ 7f8f72216700  1 mon.mds01@0(leader).osd
e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.1 is
reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.669+ 7f8f72216700  0 log_channel(cluster) log
[DBG] : osd.39 reported failed by osd.1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  1 mon.mds01@0(leader).osd
e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.408 is
reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  0 log_channel(cluster) log
[DBG] : osd.39 reported failed by osd.408
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  1 mon.mds01@0(leader).osd
e1031871  we have enough reporters to mark osd.39 down
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  0 log_channel(cluster) log
[INF] : osd.39 failed (root=default,host=ceph-osd09) (2 reporters from
different host after 32.232905 >= grace 30.00)
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:26.445+ 7f8f74a1b700  0 

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
We're talking about automatic online compaction here, not running the 
command.


On 9/7/23 04:04, Konstantin Shalygin wrote:

Hi,


On 7 Sep 2023, at 10:05, J-P Methot  wrote:

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed 
out after 15.00954s' error. We have reasons to believe this 
happens each time the RocksDB compaction process is launched on an 
OSD. My question is, does the cluster detecting that an OSD has timed 
out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.


You run the online compaction for these OSDs (the `ceph osd compact 
${osd_id}` command), right?




k


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Konstantin Shalygin
Hi,

> On 7 Sep 2023, at 10:05, J-P Methot  wrote:
> 
> We're running latest Pacific on our production cluster and we've been seeing 
> the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 
> 15.00954s' error. We have reasons to believe this happens each time the 
> RocksDB compaction process is launched on an OSD. My question is, does the 
> cluster detecting that an OSD has timed out interrupt the compaction process? 
> This seems to be what's happening, but it's not immediately obvious. We are 
> currently facing an infinite loop of random OSDs timing out and if the 
> compaction process is interrupted without finishing, it may explain that.

You run the online compaction for these OSDs (the `ceph osd compact ${osd_id}` 
command), right?



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io