[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Konstantin Shalygin
Does this cluster use the default settings, or was something changed for Bluestore?

You can check this via `ceph config diff`
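For example, something like this should surface any non-default Bluestore/RocksDB settings (osd.0 is just a placeholder; the per-daemon form has to run on that OSD's host):

ceph config dump                                                  # options stored in the cluster config database
ceph daemon osd.0 config diff | grep -i -e bluestore -e rocksdb   # per-daemon diff against built-in defaults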


As Mark said, it would be nice to have a tracker, if this really is a release problem

Thanks,
k
Sent from my iPhone

> On 7 Sep 2023, at 20:22, J-P Methot  wrote:
> 
> We went from 16.2.13 to 16.2.14
> 
> Also, timeout is 15 seconds because it's the default in Ceph. Basically, 15 
> seconds before Ceph shows a warning that OSD is timing out.
> 
> We may have found the solution, but it would be, in fact, related to 
> bluestore_allocator and not the compaction process. I'll post the actual 
> resolution when we confirm 100% that it works.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Stefan Kooman

On 07-09-2023 19:20, J-P Methot wrote:

We went from 16.2.13 to 16.2.14

Also, timeout is 15 seconds because it's the default in Ceph. Basically, 
15 seconds before Ceph shows a warning that OSD is timing out.


We may have found the solution, but it would be, in fact, related to 
bluestore_allocator and not the compaction process. I'll post the actual 
resolution when we confirm 100% that it works.


I'm very curious about what comes out of this, so yeah, please share 
your findings. We have re-provisioned our cluster over the past couple of months, 
and have kept the "bitmap" allocator for bluestore (non-default). We 
have one OSD running hybrid for comparison. We want to see if there are 
any (big) differences in fragmentation over time between the two.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread xiaowenhao111
I also see the dreaded timeout. I find this is a bcache problem. You can use the blktrace tools to capture I/O data for analysis.
Sent from my Xiaomi. On 7 Sep 2023, at 22:52, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote:
> Hi,
> 
> We're running latest Pacific on our production cluster and we've been 
> seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out 
> after 15.00954s' error. We have reasons to believe this happens each 
> time the RocksDB compaction process is launched on an OSD. My question 
> is, does the cluster detecting that an OSD has timed out interrupt the 
> compaction process? This seems to be what's happening, but it's not 
> immediately obvious. We are currently facing an infinite loop of random 
> OSDs timing out and if the compaction process is interrupted without 
> finishing, it may explain that.
> 
Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to fix 
any potential RocksDB degradation. That's what we do. What kind of 
workload do you run (i.e. RBD, CephFS, RGW)?

Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-07 Thread Mark Nelson

Hi Rok,

We're still trying to catch what's causing the memory growth, so it's hard 
to guess which releases are affected.  We know it's happening 
intermittently on at least one live Pacific cluster.  If you have the 
ability to catch it while it's happening, there are several 
approaches/tools that might aid in diagnosing it. Container deployments 
are a bit tougher to get debugging tools working in, though, which as far 
as I know has slowed down existing attempts at diagnosing the issue.
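If it helps anyone trying to catch it in the act, even something as crude as 
periodically logging the ceph-mgr resident set size will narrow down when the 
growth starts (plain shell on the active mgr host, nothing Ceph-specific):

# log ceph-mgr RSS (KiB) and uptime once a minute
while sleep 60; do
    date +%FT%T
    ps -C ceph-mgr -o pid=,rss=,etime=,cmd=
done | tee -a mgr-rss.log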


Mark

On 9/7/23 05:55, Rok Jaklič wrote:

Hi,

we have also experienced several ceph-mgr oom kills on ceph v16.2.13 on
120T/200T data.

Is there any tracker about the problem?

Does upgrade to 17.x "solves" the problem?

Kind regards,
Rok



On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta  wrote:


Dear Cephers,

Today brought us an eventful CTL meeting: it looks like Jitsi recently
started
requiring user authentication
 (anonymous users
will get a "Waiting for a moderator" modal), but authentication didn't work
against Google or GitHub accounts, so we had to move to the good old Google
Meet.

As a result of this, Neha has kindly set up a new private Slack channel
(#clt) to allow for quicker communication among CLT members (if you usually
attend the CLT meeting and have not been added, please ping any CLT member
to request that).

Now, let's move on to the important stuff:

*The latest Pacific Release (v16.2.14)*

*The Bad*
The 14th drop of the Pacific release has landed with a few hiccups:

- Some .deb packages were made available to downloads.ceph.com before
the release process completion. Although this is not the first time it
happens, we want to ensure this is the last one, so we'd like to gather
ideas to improve the release publishing process. Neha encouraged
everyone
to share ideas here:
   - https://tracker.ceph.com/issues/62671
   - https://tracker.ceph.com/issues/62672
   - v16.2.14 also hit issues during the ceph-container stage. Laura
wanted to raise awareness of its current setbacks
 and collect ideas to tackle
them:
   - Enforce reviews and mandatory CI checks
   - Rework the current approach to use simple Dockerfiles
   
   - Call the Ceph community for help: ceph-container is currently
   maintained part-time by a single contributor (Guillaume Abrioux).
This
   sub-project would benefit from the sound expertise on containers
among Ceph
   users. If you have ever considered contributing to Ceph, but felt a
bit
   intimidated by C++, Paxos and race conditions, ceph-container is a
good
   place to shed your fear.


*The Good*
Not everything about v16.2.14 was going to be bleak: David Orman brought us
really good news. They tested v16.2.14 on a large production cluster
(10gbit/s+ RGW and ~13PiB raw) and found that it solved a major issue
affecting RGW in Pacific .

*The Ugly*
During that testing, they noticed that ceph-mgr was occasionally OOM killed
(nothing new to 16.2.14, as it was previously reported). They already
tried:

- Disabling modules (like the restful one, which was a suspect)
- Enabling debug 20
- Turning the pg autoscaler off

Debugging will continue to characterize this issue:

- Enable profiling (Mark Nelson)
- Try Bloomberg's Python mem profiler
 (Matthew Leonard)


*Infrastructure*

*Reminder: Infrastructure Meeting Tomorrow. **11:30-12:30 Central Time*

Patrick brought up the following topics:

- Need to reduce the OVH spending ($72k/year, which is a good cut in the
Ceph Foundation budget, that's a lot less avocado sandwiches for the
next
Cephalocon):
   - Move services (e.g.: Chacra) to the Sepia lab
   - Re-use CentOS (and any spared/unused) machines for devel purposes
- Current Ceph sys admins are overloaded, so devel/community involvement
would be much appreciated.
- More to be discussed in tomorrow's meeting. Please join if you
   think you can help solve/improve the Ceph infrastructure!


*BTW*: today's CDM will be canceled, since no topics were proposed.

Kind Regards,

Ernesto
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson
Oh that's very good to know.  I'm sure Igor will respond here, but do 
you know which PR this was related to? (possibly 
https://github.com/ceph/ceph/pull/50321)


If we think there's a regression here we should get it into the tracker 
ASAP.



Mark


On 9/7/23 13:45, J-P Methot wrote:
To be quite honest, I will not pretend I have a low level 
understanding of what was going on. There is very little documentation 
as to what the bluestore allocator actually does and we had to rely on 
Igor's help to find the solution, so my understanding of the situation 
is limited. What I understand is as follows:


-Our workload requires us to move around, delete, write a fairly high 
amount of RBD data around the cluster.


-The AVL allocator doesn't seem to like that and changes added to it 
in 16.2.14 made it worse than before.


-It made the OSDs become unresponsive and lag quite a bit whenever 
high amounts of data was written or deleted, which is, all the time.


-We basically changed the allocator to bitmap and, as we speak, this 
seems to have solved the problem. I understand that this is not ideal 
as it's apparently less performant, but here it's the difference 
between a cluster that gives me enough I/Os to work properly and a 
cluster that murders my performances.


I hope this helps. Feel free to ask us if you need further details and 
I'll see what I can do.


On 9/7/23 13:59, Mark Nelson wrote:
Ok, good to know.  Please feel free to update us here with what you 
are seeing in the allocator.  It might also be worth opening a 
tracker ticket as well.  I did some work in the AVL allocator a while 
back where we were repeating the linear search from the same offset 
every allocation, getting stuck, and falling back to fast search over 
and over leading to significant allocation fragmentation. That got 
fixed, but I wouldn't be surprised if we have some other sub-optimal 
behaviors we don't know about.



Mark


On 9/7/23 12:28, J-P Methot wrote:

Hi,

By this point, we're 95% sure that, contrary to our previous 
beliefs, it's an issue with changes to the bluestore_allocator and 
not the compaction process. That said, I will keep this email in 
mind as we will want to test optimizations to compaction on our test 
environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the 
new "rocksdb_cf_compaction_on_deletion" feature that I added in 
Reef and we backported to Pacific in 16.2.13.  So far this appears 
to be a huge win for avoiding tombstone accumulation during 
iteration which is often the issue with threadpool timeouts due to 
rocksdb.  Manual compaction can help, but if you are hitting a case 
where there's concurrent iteration and deletions with no writes, 
tombstones will accumulate quickly with no compactions taking place 
and you'll eventually end up back in the same place. The default 
sliding window and trigger settings are fairly conservative to 
avoid excessive compaction, so it may require some tuning to hit 
the right sweet spot on your cluster. I know of at least one site 
that's using this feature with more aggressive settings than 
default and had an extremely positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make 
this the default behavior in Squid assuming we don't run into any 
issues in testing.  There are several sites that are using this now 
in production and the benefits have been dramatic relative to the 
costs.  We're seeing significantly faster compactions and about 
2.2x lower space requirement for the DB (RGW workload). There may 
be a slight CPU cost and read/index listing performance impact, but 
even with testing on NVMe clusters this was quite low (maybe a 
couple of percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the 
common performance degradation after huge deletes operation. So we 
did do offline compactions on all our OSDs. It fixed nothing and 
we are going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs 
the timeout. It manages to get back online by itself, at the cost 
of sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any 
way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' 
had timed out after 15.00954s' error. We have reasons to 
believe this happens each time the RocksDB compaction process is launched on an OSD. [...]

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
To be quite honest, I will not pretend I have a low level understanding 
of what was going on. There is very little documentation as to what the 
bluestore allocator actually does and we had to rely on Igor's help to 
find the solution, so my understanding of the situation is limited. What 
I understand is as follows:


- Our workload requires us to move around, delete, and write a fairly high 
amount of RBD data across the cluster.


- The AVL allocator doesn't seem to like that, and changes added to it in 
16.2.14 made it worse than before.


- It made the OSDs become unresponsive and lag quite a bit whenever high 
amounts of data were written or deleted, which is all the time.


- We basically changed the allocator to bitmap and, as we speak, this 
seems to have solved the problem. I understand that this is not ideal as 
it's apparently less performant, but here it's the difference between a 
cluster that gives me enough I/Os to work properly and a cluster that 
murders my performance.
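For anyone who wants to try the same change, it's small; a minimal sketch, 
assuming the cluster config database is used and that OSDs are restarted one 
at a time (the allocator is only read at OSD startup):

ceph config set osd bluestore_allocator bitmap
# then restart OSDs one by one, letting the cluster settle in between, e.g.:
ceph orch daemon restart osd.0        # cephadm deployments
# or: systemctl restart ceph-osd@0    # package-based deployments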


I hope this helps. Feel free to ask us if you need further details and 
I'll see what I can do.


On 9/7/23 13:59, Mark Nelson wrote:
Ok, good to know.  Please feel free to update us here with what you 
are seeing in the allocator.  It might also be worth opening a tracker 
ticket as well.  I did some work in the AVL allocator a while back 
where we were repeating the linear search from the same offset every 
allocation, getting stuck, and falling back to fast search over and 
over leading to significant allocation fragmentation. That got fixed, 
but I wouldn't be surprised if we have some other sub-optimal 
behaviors we don't know about.



Mark


On 9/7/23 12:28, J-P Methot wrote:

Hi,

By this point, we're 95% sure that, contrary to our previous beliefs, 
it's an issue with changes to the bluestore_allocator and not the 
compaction process. That said, I will keep this email in mind as we 
will want to test optimizations to compaction on our test environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the 
new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef 
and we backported to Pacific in 16.2.13.  So far this appears to be 
a huge win for avoiding tombstone accumulation during iteration 
which is often the issue with threadpool timeouts due to rocksdb.  
Manual compaction can help, but if you are hitting a case where 
there's concurrent iteration and deletions with no writes, 
tombstones will accumulate quickly with no compactions taking place 
and you'll eventually end up back in the same place. The default 
sliding window and trigger settings are fairly conservative to avoid 
excessive compaction, so it may require some tuning to hit the right 
sweet spot on your cluster. I know of at least one site that's using 
this feature with more aggressive settings than default and had an 
extremely positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make this 
the default behavior in Squid assuming we don't run into any issues 
in testing.  There are several sites that are using this now in 
production and the benefits have been dramatic relative to the 
costs.  We're seeing significantly faster compactions and about 2.2x 
lower space requirement for the DB (RGW workload). There may be a 
slight CPU cost and read/index listing performance impact, but even 
with testing on NVMe clusters this was quite low (maybe a couple of 
percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the 
common performance degradation after huge deletes operation. So we 
did do offline compactions on all our OSDs. It fixed nothing and we 
are going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs 
the timeout. It manages to get back online by itself, at the cost 
of sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' 
had timed out after 15.00954s' error. We have reasons to 
believe this happens each time the RocksDB compaction process is 
launched on an OSD. My question is, does the cluster detecting 
that an OSD has timed out interrupt the compaction process? This 
seems to be what's happening, but it's not immediately obvious. 
We are currently facing an infinite loop of random OSDs timing out and if the compaction process is interrupted without finishing, it may explain that. [...]

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson
Ok, good to know.  Please feel free to update us here with what you are 
seeing in the allocator.  It might also be worth opening a tracker 
ticket as well.  I did some work in the AVL allocator a while back where 
we were repeating the linear search from the same offset every 
allocation, getting stuck, and falling back to fast search over and over 
leading to significant allocation fragmentation.  That got fixed, but I 
wouldn't be surprised if we have some other sub-optimal behaviors we 
don't know about.



Mark


On 9/7/23 12:28, J-P Methot wrote:

Hi,

By this point, we're 95% sure that, contrary to our previous beliefs, 
it's an issue with changes to the bluestore_allocator and not the 
compaction process. That said, I will keep this email in mind as we 
will want to test optimizations to compaction on our test environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the new 
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and 
we backported to Pacific in 16.2.13.  So far this appears to be a 
huge win for avoiding tombstone accumulation during iteration which 
is often the issue with threadpool timeouts due to rocksdb.  Manual 
compaction can help, but if you are hitting a case where there's 
concurrent iteration and deletions with no writes, tombstones will 
accumulate quickly with no compactions taking place and you'll 
eventually end up back in the same place. The default sliding window 
and trigger settings are fairly conservative to avoid excessive 
compaction, so it may require some tuning to hit the right sweet spot 
on your cluster. I know of at least one site that's using this 
feature with more aggressive settings than default and had an 
extremely positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make this 
the default behavior in Squid assuming we don't run into any issues 
in testing.  There are several sites that are using this now in 
production and the benefits have been dramatic relative to the 
costs.  We're seeing significantly faster compactions and about 2.2x 
lower space requirement for the DB (RGW workload). There may be a 
slight CPU cost and read/index listing performance impact, but even 
with testing on NVMe clusters this was quite low (maybe a couple of 
percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are 
going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs 
the timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had 
timed out after 15.00954s' error. We have reasons to believe 
this happens each time the RocksDB compaction process is launched 
on an OSD. My question is, does the cluster detecting that an OSD 
has timed out interrupt the compaction process? This seems to be 
what's happening, but it's not immediately obvious. We are 
currently facing an infinite loop of random OSDs timing out and if 
the compaction process is interrupted without finishing, it may 
explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod 
to fix any potential RocksDB degradation. That's what we do. What 
kind of workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan



--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot

Hi,

By this point, we're 95% sure that, contrary to our previous beliefs, 
it's an issue with changes to the bluestore_allocator and not the 
compaction process. That said, I will keep this email in mind as we will 
want to test optimizations to compaction on our test environment.


On 9/7/23 12:32, Mark Nelson wrote:

Hello,

There are two things that might help you here.  One is to try the new 
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and 
we backported to Pacific in 16.2.13.  So far this appears to be a huge 
win for avoiding tombstone accumulation during iteration which is 
often the issue with threadpool timeouts due to rocksdb.  Manual 
compaction can help, but if you are hitting a case where there's 
concurrent iteration and deletions with no writes, tombstones will 
accumulate quickly with no compactions taking place and you'll 
eventually end up back in the same place. The default sliding window 
and trigger settings are fairly conservative to avoid excessive 
compaction, so it may require some tuning to hit the right sweet spot 
on your cluster. I know of at least one site that's using this feature 
with more aggressive settings than default and had an extremely 
positive impact on their cluster.


The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make this 
the default behavior in Squid assuming we don't run into any issues in 
testing.  There are several sites that are using this now in 
production and the benefits have been dramatic relative to the costs.  
We're seeing significantly faster compactions and about 2.2x lower 
space requirement for the DB (RGW workload). There may be a slight CPU 
cost and read/index listing performance impact, but even with testing 
on NVMe clusters this was quite low (maybe a couple of percent).



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are 
going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had 
timed out after 15.00954s' error. We have reasons to believe 
this happens each time the RocksDB compaction process is launched 
on an OSD. My question is, does the cluster detecting that an OSD 
has timed out interrupt the compaction process? This seems to be 
what's happening, but it's not immediately obvious. We are 
currently facing an infinite loop of random OSDs timing out and if 
the compaction process is interrupted without finishing, it may 
explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to 
fix any potential RocksDB degradation. That's what we do. What kind 
of workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan



--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot

We went from 16.2.13 to 16.2.14

Also, the timeout is 15 seconds because that's the default in Ceph: basically, 
15 seconds before Ceph logs a warning that the OSD op thread is timing out.


We may have found the solution; it would, in fact, be related to 
bluestore_allocator and not the compaction process. I'll post the actual 
resolution once we confirm 100% that it works.
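For reference, the default can be confirmed with something like the following; 
osd.0 is a placeholder and the daemon command runs on that OSD's host:

ceph config help osd_op_thread_timeout             # shows the option and its default of 15
ceph daemon osd.0 config get osd_op_thread_timeout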


On 9/7/23 12:18, Konstantin Shalygin wrote:

Hi,


On 7 Sep 2023, at 18:21, J-P Methot  wrote:

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are 
going through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we 
upgraded to the latest Pacific, yesterday.


What is your previous release version? What is your OSD drives models?
The timeout are always 15s? Not 7s, not 17s?


Thanks,
k


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson

Hello,

There are two things that might help you here.  One is to try the new 
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and we 
backported to Pacific in 16.2.13.  So far this appears to be a huge win 
for avoiding tombstone accumulation during iteration which is often the 
issue with threadpool timeouts due to rocksdb.  Manual compaction can 
help, but if you are hitting a case where there's concurrent iteration 
and deletions with no writes, tombstones will accumulate quickly with no 
compactions taking place and you'll eventually end up back in the same 
place. The default sliding window and trigger settings are fairly 
conservative to avoid excessive compaction, so it may require some 
tuning to hit the right sweet spot on your cluster. I know of at least 
one site that's using this feature with more aggressive settings than 
default and had an extremely positive impact on their cluster.
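If you want to experiment with it, a rough sketch of enabling it via the 
config database follows. I'm writing the sliding-window/trigger option names 
from memory, so verify them with `ceph config help` before applying, and the 
numbers are purely illustrative:

ceph config set osd rocksdb_cf_compaction_on_deletion true
# assumed names for the tuning knobs -- double-check before use
ceph config set osd rocksdb_cf_compaction_on_deletion_sliding_window 32768
ceph config set osd rocksdb_cf_compaction_on_deletion_trigger 8192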


The other thing that can help improve compaction performance in general 
is enabling lz4 compression in RocksDB.  I plan to make this the default 
behavior in Squid assuming we don't run into any issues in testing.  
There are several sites that are using this now in production and the 
benefits have been dramatic relative to the costs.  We're seeing 
significantly faster compactions and about 2.2x lower space requirement 
for the DB (RGW workload). There may be a slight CPU cost and read/index 
listing performance impact, but even with testing on NVMe clusters this 
was quite low (maybe a couple of percent).
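If you want to try it, it's a RocksDB option rather than a dedicated Ceph 
setting; a sketch of one way to apply it (dump your current 
bluestore_rocksdb_options first and re-set the whole string with the 
compression key added, rather than copying the line below verbatim; OSDs pick 
it up on restart):

ceph daemon osd.0 config get bluestore_rocksdb_options
ceph config set osd bluestore_rocksdb_options "compression=kLZ4Compression,<existing options here>"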



Mark


On 9/7/23 10:21, J-P Methot wrote:

Hi,

Since my post, we've been speaking with a member of the Ceph dev team. 
He did, at first, believe it was an issue linked to the common 
performance degradation after huge deletes operation. So we did do 
offline compactions on all our OSDs. It fixed nothing and we are going 
through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performances for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we upgraded 
to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've 
been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had 
timed out after 15.00954s' error. We have reasons to believe 
this happens each time the RocksDB compaction process is launched on 
an OSD. My question is, does the cluster detecting that an OSD has 
timed out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to 
fix any potential RocksDB degradation. That's what we do. What kind 
of workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan



--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Konstantin Shalygin
Hi,

> On 7 Sep 2023, at 18:21, J-P Methot  wrote:
> 
> Since my post, we've been speaking with a member of the Ceph dev team. He 
> did, at first, believe it was an issue linked to the common performance 
> degradation after huge deletes operation. So we did do offline compactions on 
> all our OSDs. It fixed nothing and we are going through the logs to try and 
> figure this out.
> 
> To answer your question, no the OSD doesn't restart after it logs the 
> timeout. It manages to get back online by itself, at the cost of sluggish 
> performances for the cluster and high iowait on VMs.
> 
> We mostly run RBD workloads.
> 
> Deep scrubs or no deep scrubs doesn't appear to change anything. Deactivating 
> scrubs altogether did not impact performances in any way.
> 
> Furthermore, I'll stress that this is only happening since we upgraded to the 
> latest Pacific, yesterday.

What was your previous release version? What are your OSD drive models?
Are the timeouts always 15s? Not 7s, not 17s?


Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Upgrading OS [and ceph release] nondestructively for oldish Ceph cluster

2023-09-07 Thread Sam Skipsey
Hello all,

We've had a Nautilus [latest releases] cluster for some years now, and are 
planning the upgrade process - both moving off Centos7 [ideally to a RHEL9 
compatible spin like Alma 9 or Rocky 9] and also moving to a newer Ceph release 
[ideally Pacific or higher to avoid too many later upgrades needed].

As far as ceph release upgrades go, I understand the process in general.

What I'm less certain about (and more nervous about from a potential data loss 
perspective) is the OS upgrade.
For Ceph bluestore OSDs, I assume all the relevant metadata is on the OSD disk 
[or on the separate disk configured for RocksDB etc if you have nvme], and none 
is on the OS itself?
For Mons and Mgrs, what stuff do I need to retain across the OS upgrade to have 
things "just work" [since they're relatively stateless, I assume mostly the 
/etc/ceph/ stuff and ceph cluster keys?]
For the MDS, I assume it's similar to the MGRs? The MDS, IIRC, mainly works as a 
caching layer, so I assume there's not much state that can be lost permanently?
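For concreteness, what I'm planning to copy off each node before the reinstall 
looks roughly like this (a sketch, assuming a plain package-based deployment, 
with the daemons on the node stopped; adjust paths if yours differ):

tar czf /root/ceph-node-backup.tgz /etc/ceph /var/lib/ceph
# /etc/ceph holds ceph.conf and keyrings; /var/lib/ceph holds the mon store,
# mgr/mds data and the (small) bluestore OSD metadata directories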

Has anyone gone through this process who would be happy to share their 
experience? (There's not a lot on this on the wider internet - lots on 
upgrading ceph, much less on the OS)

Sam
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot

Hi,

Since my post, we've been speaking with a member of the Ceph dev team. 
He did, at first, believe it was an issue linked to the common 
performance degradation after huge delete operations. So we did 
offline compactions on all our OSDs. It fixed nothing and we are going 
through the logs to try and figure this out.


To answer your question, no the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performance for the cluster and high iowait on VMs.


We mostly run RBD workloads.

Deep scrubs or no deep scrubs doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performances in any way.


Furthermore, I'll stress that this is only happening since we upgraded 
to the latest Pacific, yesterday.


On 9/7/23 10:49, Stefan Kooman wrote:

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed 
out after 15.00954s' error. We have reasons to believe this 
happens each time the RocksDB compaction process is launched on an 
OSD. My question is, does the cluster detecting that an OSD has timed 
out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to 
fix any potential RocksDB degradation. That's what we do. What kind of 
workload do you run (i.e. RBD, CephFS, RGW)?


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Stefan Kooman

On 07-09-2023 09:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out 
after 15.00954s' error. We have reasons to believe this happens each 
time the RocksDB compaction process is launched on an OSD. My question 
is, does the cluster detecting that an OSD has timed out interrupt the 
compaction process? This seems to be what's happening, but it's not 
immediately obvious. We are currently facing an infinite loop of random 
OSDs timing out and if the compaction process is interrupted without 
finishing, it may explain that.



Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to fix 
any potential RocksDB degradation. That's what we do. What kind of 
workload do you run (i.e. RBD, CephFS, RGW)?
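For reference, an offline compaction of a single OSD looks roughly like this; 
the id/path are placeholders and the OSD has to be stopped while it runs:

systemctl stop ceph-osd@12
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
systemctl start ceph-osd@12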


Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] failure domain and rack awareness

2023-09-07 Thread Reza Bakhshayeshi
Hello,

What is the best strategy regarding failure domain and rack awareness when
there are only 2 physical racks and we need 3 replicas of data?

In this scenario, what is your point of view on creating 4 artificial racks,
at least to be able to manage deliberate node maintenance in a more
efficient way?
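For concreteness, the "artificial racks" idea would look something like this 
in CRUSH terms, purely as an illustration (bucket and host names are 
placeholders):

ceph osd crush add-bucket rack1a rack
ceph osd crush add-bucket rack1b rack
ceph osd crush add-bucket rack2a rack
ceph osd crush add-bucket rack2b rack
ceph osd crush move rack1a root=default      # repeat for the other logical racks
ceph osd crush move host01 rack=rack1a       # repeat for each host
ceph osd crush rule create-replicated rack-rule default rack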

Regards,
Reza
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Permissions of the .snap directory do not inherit ACLs in 17.2.6

2023-09-07 Thread Eugen Block
Your description seems to match my observations trying to create  
cephfs snapshots via dashboard. In latest Octopus it works, in Pacific  
16.2.13 and Quincy 17.2.6 it doesn't, in Reef 18.2.0 it works again.



Quoting MARTEL Arnaud:


Hi Eugen,



We have a lot of shared directories in cephfs and each directory has  
a specific ACL to grant access to several groups (for read and/or  
for read/write access).


Here are the complete steps to reproduce the problem in 17.2.6 with only 
one group, GIPSI, in the ACL:


# mkdir /mnt/ceph/test
# chown root:nogroup /mnt/ceph/test
# chmod 770 /mnt/ceph/test
# setfacl --set="u::rwx,g::rwx,o::-,d:m::rwx,m::rwx,d:g:GIPSI:rwx,g:GIPSI:rwx" /mnt/ceph/test/

# getfacl /mnt/ceph/test
# file: mnt/ceph/test
# owner: root
# group: nogroup
user::rwx
group::rwx
group:GIPSI:rwx
mask::rwx
other::---
default:user::rwx
default:group::rwx
default:group:GIPSI:rwx
default:mask::rwx
default:other::---

# touch /mnt/ceph/test/foo
# getfacl /mnt/ceph/test/foo
# file: mnt/ceph/test/foo
# owner: root
# group: root
user::rw-
group::rwx   #effective:rw-
group:GIPSI:rwx  #effective:rw-
mask::rw-
other::---

# mkdir /mnt/ceph/ec42/test/.snap/snaptest
# getfacl /mnt/ceph/test/.snap
# file: mnt/ceph/test/.snap
# owner: root
# group: nogroup
user::rwx
group::rwx
other::---





As a result, no member of the GIPSI group is able to access the snapshots…

And no user complained about access to the snapshots 
before our upgrade, so I suppose that the ACL of the .snap directory 
was OK in Pacific (> 16.2.9)




Arnaud



On 04/09/2023 12:59, Eugen Block wrote:

I'm wondering if I did something wrong or if I'm missing something. I
tried to reproduce the described steps from the bug you mentioned, and
from Nautilus to Reef (I have a couple of test clusters) the getfacl
output always shows the same output for the .snap directory:

$ getfacl /mnt/cephfs/test/.snap/
getfacl: Removing leading '/' from absolute path names
# file: mnt/cephfs/test/.snap/
# owner: root
# group: root
user::rwx
group::rwx
other::---

So in my tests it never actually shows the "users" group acl. But you
wrote that it worked with Pacific for you, I'm confused...

Quoting MARTEL Arnaud <arnaud.mar...@cea.fr>:

Hi,

I'm facing the same situation as described in bug #57084
(https://tracker.ceph.com/issues/57084) since I upgraded from
16.2.13 to 17.2.6

for example:

root@faiserver:~# getfacl /mnt/ceph/default/
# file: mnt/ceph/default/
# owner: 99
# group: nogroup
# flags: -s-
user::rwx
user:s-sac-acquisition:rwx
group::rwx
group:acquisition:r-x
group:SAC_R:r-x
mask::rwx
other::---
default:user::rwx
default:user:s-sac-acquisition:rwx
default:group::rwx
default:group:acquisition:r-x
default:group:SAC_R:r-x
default:mask::rwx
default:other::---

root@faiserver:~# getfacl /mnt/ceph/default/.snap
# file: mnt/ceph/default/.snap
# owner: 99
# group: nogroup
# flags: -s-
user::rwx
group::rwx
other::r-x

Before creating a new bug report, could you tell me if someone has
the same problem with 17.2.6 ??

Kind regards,
Arnaud

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?

2023-09-07 Thread ceph-mail
Thanks all for the advice, very helpful!

The node also had a mon, which happily slotted right back into the cluster. The 
node's been up and running for a number of days now, but the systemd OSD 
processes don't seem to be trying continuously; they're never progressing or 
getting a newer map.

As mentioned, the cluster is otherwise healthy (only these OSDs, which are down 
and out), and I have spare capacity and no issue with min_size. And they've 
been out for a long time (months) so it's reasonable to guess that most PGs may 
have been touched.

So, based on the advice, my plan is the following:

  1.  Set norebalance
  2.  One by one, do this for each OSD
 *   Purge the OSD from the dashboard
 *   cephadm ceph-volume lvm zap
 *   cephadm may automatically find and add the OSD, otherwise I'll add it 
manually
  3.  Use pgremapper to prevent the 
OSDs from being filled
  4.  unset norebalance
  5.  Let the balancer gently flow data back into the OSDs over the next hours, 
days, weeks.

Thanks all!




From: Richard Bade 'hitrich at gmail.com' 
Sent: Thursday, September 7, 2023 01:25
To: ceph-m...@rikdvk.mailer.me 
Subject: Re: [ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?

Yes, I agree with Anthony. If your cluster is healthy and you don't
*need* to bring them back in it's going to be less work and time to
just deploy them as new.

I usually set norebalance, purge the osds in ceph, remove the vg from
the disks and re-deploy. Then unset norebalance at the end once
everything is peered and happy. This is so that it doesn't start
moving stuff around when you purge.

Rich

On Thu, 7 Sept 2023 at 02:21, Anthony D'Atri  wrote:
>
> Resurrection usually only makes sense if fate or a certain someone resulted 
> in enough overlapping removed OSDs that you can't meet min_size.  I've had to 
> a couple of times :-/
>
> If an OSD is down for more than a short while, backfilling a redeployed OSD 
> will likely be faster than waiting for it to peer and do deltas -- if it can 
> at all.
>
> > On Sep 6, 2023, at 10:16, Malte Stroem  wrote:
> >
> > Hi ceph-m...@rikdvk.mailer.me,
> >
> > you could squeeze the OSDs back in but it does not make sense.
> >
> > Just clean the disks with dd for example and add them as new disks to your 
> > cluster.
> >
> > Best,
> > Malte
> >
> > Am 04.09.23 um 09:39 schrieb ceph-m...@rikdvk.mailer.me:
> >> Hello,
> >> I have a ten node cluster with about 150 OSDs. One node went down a while 
> >> back, several months. The OSDs on the node have been marked as down and 
> >> out since.
> >> I am now in the position to return the node to the cluster, with all the 
> >> OS and OSD disks. When I boot up the now working node, the OSDs do not 
> >> start.
> >> Essentially , it seems to complain with "fail[ing]to load OSD map for 
> >> [various epoch]s, got 0 bytes".
> >> I'm guessing the OSDs on disk maps are so old, they can't get back into 
> >> the cluster?
> >> My questions are whether it's possible or worth it to try to squeeze these 
> >> OSDs back in or to just replace them. And if I should just replace them, 
> >> what's the best way? Manually remove [1] and recreate? Replace [2]? Purge 
> >> in dashboard?
> >> [1] 
> >> https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#removing-osds-manual
> >> [2] 
> >> https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#replacing-an-osd
> >> Many thanks!
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Alexander E. Patrakov
On an HDD-based Quincy 17.2.5 cluster (with DB/WAL on datacenter-class
NVMe with enhanced power loss protection), I sometimes (once or twice
per week) see log entries similar to what I reproduced below (a bit
trimmed):

Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:41:54.429+ 7f83d813d700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after
15.00954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:41:54.429+ 7f83d793c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after
15.00954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:41:54.469+ 7f83d713b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after
15.00954s

Wed 2023-09-06 22:42:04 UTC mds02 ceph-mgr@mds02.service[4954]:
2023-09-06T22:42:04.735+ 7ffab1a20700  0 log_channel(cluster) log
[DBG] : pgmap v227303: 5449 pgs: 1 active+clean+laggy, 7
active+clean+snaptrim, 34 active+remapped+backfilling, 13
active+clean+scrubbing, 4738 active+clean, 19
active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0
PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 103 MiB/s rd, 25
MiB/s wr, 2.76k op/s; 923967567/22817724437 objects misplaced
(4.049%); 128 MiB/s, 142 objects/s recovering

Wed 2023-09-06 22:42:06 UTC mds02 ceph-mgr@mds02.service[4954]:
2023-09-06T22:42:06.767+ 7ffab1a20700  0 log_channel(cluster) log
[DBG] : pgmap v227304: 5449 pgs: 2 active+clean+laggy, 7
active+clean+snaptrim, 34 active+remapped+backfilling, 13
active+clean+scrubbing, 4737 active+clean, 19
active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0
PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 88 MiB/s rd, 23 MiB/s
wr, 3.85k op/s; 923967437/22817727283 objects misplaced (4.049%); 115
MiB/s, 126 objects/s recovering

Wed 2023-09-06 22:42:15 UTC mds02 ceph-mgr@mds02.service[4954]:
2023-09-06T22:42:15.311+ 7ffab1a20700  0 log_channel(cluster) log
[DBG] : pgmap v227308: 5449 pgs: 1
active+remapped+backfill_wait+laggy, 3 active+clean+laggy, 7
active+clean+snaptrim, 34 active+remapped+backfilling, 13
active+clean+scrubbing, 4736 active+clean, 19
active+clean+scrubbing+deep, 636 active+remapped+backfill_wait; 2.0
PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 67 MiB/s rd, 13 MiB/s
wr, 3.21k op/s; 923966538/22817734196 objects misplaced (4.049%); 134
MiB/s, 137 objects/s recovering

Wed 2023-09-06 22:42:22 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:22.708+ 7f83b9ec1700  1 heartbeat_map
reset_timeout 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out
after 15.00954s

Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@12.service[5527]:
2023-09-06T22:42:24.637+ 7f4fbfcc3700 -1 osd.12 1031871
heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back
2023-09-06T22:41:54.079627+ front 2023-09-06T22:41:54.079742+
(oldest deadline 2023-09-06T22:42:24.579224+)
Wed 2023-09-06 22:42:24 UTC ceph-osd08 ceph-osd@126.service[5505]:
2023-09-06T22:42:24.671+ 7feb033d8700 -1 osd.126 1031871
heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back
2023-09-06T22:41:53.996908+ front 2023-09-06T22:41:53.997184+
(oldest deadline 2023-09-06T22:42:24.496552+)
Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@75.service[5532]:
2023-09-06T22:42:24.797+ 7fe15bc77700 -1 osd.75 1031871
heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back
2023-09-06T22:41:49.384037+ front 2023-09-06T22:41:49.384031+
(oldest deadline 2023-09-06T22:42:24.683399+)

Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.669+ 7f8f72216700  1 mon.mds01@0(leader).osd
e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.1 is
reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.669+ 7f8f72216700  0 log_channel(cluster) log
[DBG] : osd.39 reported failed by osd.1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  1 mon.mds01@0(leader).osd
e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.408 is
reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  0 log_channel(cluster) log
[DBG] : osd.39 reported failed by osd.408
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  1 mon.mds01@0(leader).osd
e1031871  we have enough reporters to mark osd.39 down
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:25.905+ 7f8f72216700  0 log_channel(cluster) log
[INF] : osd.39 failed (root=default,host=ceph-osd09) (2 reporters from
different host after 32.232905 >= grace 30.00)
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:26.445+ 7f8f74a1b700  0 log_channe

[ceph-users] Re: Debian/bullseye build for reef

2023-09-07 Thread Matthew Vernon

Hi,

On 21/08/2023 17:16, Josh Durgin wrote:
We weren't targeting bullseye once we discovered the compiler version 
problem, the focus shifted to bookworm. If anyone would like to help 
maintaining debian builds, or looking into these issues, it would be 
welcome:


https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129 
https://tracker.ceph.com/issues/61845 


I've made some progress on building on bookworm now, and have updated 
the ticket; the failure now seems to be the tree missing 
src/pybind/mgr/dashboard/frontend/dist rather than anything relating to 
C++ issues...


HTH,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-07 Thread Rok Jaklič
Hi,

we have also experienced several ceph-mgr oom kills on ceph v16.2.13 on
120T/200T data.

Is there any tracker about the problem?

Does upgrade to 17.x "solves" the problem?

Kind regards,
Rok



On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta  wrote:

> Dear Cephers,
>
> Today brought us an eventful CTL meeting: it looks like Jitsi recently
> started
> requiring user authentication
>  (anonymous users
> will get a "Waiting for a moderator" modal), but authentication didn't work
> against Google or GitHub accounts, so we had to move to the good old Google
> Meet.
>
> As a result of this, Neha has kindly set up a new private Slack channel
> (#clt) to allow for quicker communication among CLT members (if you usually
> attend the CLT meeting and have not been added, please ping any CLT member
> to request that).
>
> Now, let's move on the important stuff:
>
> *The latest Pacific Release (v16.2.14)*
>
> *The Bad*
> The 14th drop of the Pacific release has landed with a few hiccups:
>
>- Some .deb packages were made available to downloads.ceph.com before
>the release process completion. Although this is not the first time it
>happens, we want to ensure this is the last one, so we'd like to gather
>ideas to improve the release publishing process. Neha encouraged
> everyone
>to share ideas here:
>   - https://tracker.ceph.com/issues/62671
>   - https://tracker.ceph.com/issues/62672
>   - v16.2.14 also hit issues during the ceph-container stage. Laura
>wanted to raise awareness of its current setbacks
> and collect ideas to tackle
>them:
>   - Enforce reviews and mandatory CI checks
>   - Rework the current approach to use simple Dockerfiles
>   
>   - Call the Ceph community for help: ceph-container is currently
>   maintained part-time by a single contributor (Guillaume Abrioux).
> This
>   sub-project would benefit from the sound expertise on containers
> among Ceph
>   users. If you have ever considered contributing to Ceph, but felt a
> bit
>   intimidated by C++, Paxos and race conditions, ceph-container is a
> good
>   place to shed your fear.
>
>
> *The Good*
> Not everything about v16.2.14 was going to be bleak: David Orman brought us
> really good news. They tested v16.2.14 on a large production cluster
> (10gbit/s+ RGW and ~13PiB raw) and found that it solved a major issue
> affecting RGW in Pacific .
>
> *The Ugly*
> During that testing, they noticed that ceph-mgr was occasionally OOM killed
> (nothing new to 16.2.14, as it was previously reported). They already
> tried:
>
>- Disabling modules (like the restful one, which was a suspect)
>- Enabling debug 20
>- Turning the pg autoscaler off
>
> Debugging will continue to characterize this issue:
>
>- Enable profiling (Mark Nelson)
>- Try Bloomberg's Python mem profiler
> (Matthew Leonard)
>
>
> *Infrastructure*
>
> *Reminder: Infrastructure Meeting Tomorrow. **11:30-12:30 Central Time*
>
> Patrick brought up the following topics:
>
>- Need to reduce the OVH spending ($72k/year, which is a good cut in the
>Ceph Foundation budget, that's a lot less avocado sandwiches for the
> next
>Cephalocon):
>   - Move services (e.g.: Chacra) to the Sepia lab
>   - Re-use CentOS (and any spared/unused) machines for devel purposes
>- Current Ceph sys admins are overloaded, so devel/community involvement
>would be much appreciated.
>- More to be discussed in tomorrow's meeting. Please join if you
>think you can help solve/improve the Ceph infrastrucru!
>
>
> *BTW*: today's CDM will be canceled, since no topics were proposed.
>
> Kind Regards,
>
> Ernesto
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
We're talking about automatic online compaction here, not running the 
command.


On 9/7/23 04:04, Konstantin Shalygin wrote:

Hi,


On 7 Sep 2023, at 10:05, J-P Methot  wrote:

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed 
out after 15.00954s' error. We have reasons to believe this 
happens each time the RocksDB compaction process is launched on an 
OSD. My question is, does the cluster detecting that an OSD has timed 
out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.


You run the online compacting for this OSD's (`ceph osd compact 
${osd_id}` command), right?




k


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Awful new dashboard in Reef

2023-09-07 Thread Nicola Mori
My cluster has 104 OSDs, so I don't think this can be a factor in the 
malfunction.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Awful new dashboard in Reef

2023-09-07 Thread Nigel Williams
On Thu, 7 Sept 2023 at 18:05, Nicola Mori  wrote:

> Is it just me, or are my impressions shared by someone else? Is
> there anything that can be done to improve the situation?
>

I wonder about the implementation choice for this dashboard. With our Reef
cluster it seems to get stuck during refresh, or stutters while trying to
update. This is on a 1200-OSD cluster; I'm not sure if that number is too
high for it.

I have around a dozen web-based dashboards open in Firefox, all the usual
suspects: LibreNMS, Netbox, AAP, OME and so on. They all work and are
responsive, refresh their pages, but the Ceph one bogs down. It attempts to
show alert popups but they get backlogged and appear as a stream of popups
one after another. I gave up on the dashboard and continue to use the CLI.
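
For anyone in the same situation, a minimal CLI fallback (standard 
commands; the refresh interval is arbitrary):

   watch -n 5 ceph -s      # overall health plus PG state breakdown
   ceph health detail      # details behind any warnings
   ceph pg stat            # one-line PG state summary
   ceph osd pool stats     # per-pool recovery and client I/O rates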

I would really appreciate a snappy, responsive, information-dense dashboard
view - maybe in the S-release?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Konstantin Shalygin
Hi,

> On 7 Sep 2023, at 10:05, J-P Methot  wrote:
> 
> We're running latest Pacific on our production cluster and we've been seeing 
> the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 
> 15.00954s' error. We have reasons to believe this happens each time the 
> RocksDB compaction process is launched on an OSD. My question is, does the 
> cluster detecting that an OSD has timed out interrupt the compaction process? 
> This seems to be what's happening, but it's not immediately obvious. We are 
> currently facing an infinite loop of random OSDs timing out and if the 
> compaction process is interrupted without finishing, it may explain that.

You run online compaction for these OSDs (the `ceph osd compact ${osd_id}` 
command), right?



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Awful new dashboard in Reef

2023-09-07 Thread Nicola Mori

Dear Ceph users,

I just upgraded my cluster to Reef, and with the new version also came a 
revamped dashboard. Unfortunately, the new dashboard is really awful to me:


1) It's no longer possible to see the status of the PGs: in the old 
dashboard it was very easy to see, e.g., how many PGs were recovering, how 
many were scrubbing, etc. by clicking on the PG Status widget. Now the 
interface shows just how many are OK and how many are "working", without 
details, and I have to go to the command line to understand what's 
happening (not really comfortable on mobile).


2) The new timeline graphs do not work properly: changing the time frame 
sometimes produces empty graphs.


3) The instant values in Cluster utilization are refreshed so slowly 
that I cannot properly monitor the cluster's behavior in real time.


Is it just me, or are my impressions shared by someone else? Is 
there anything that can be done to improve the situation?

Thanks,

Nicola


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?

2023-09-07 Thread Frank Schilder
Hi, I did something like that in the past. If you have a sufficient amount of 
cold data and you can bring the OSDs back with their original IDs, recovery 
can be significantly faster than rebalancing. It really depends on how trivial 
the version update per object is; in my case it could re-use thousands of 
clean objects per dirty object. If you are unsure, it's probably best to do a 
wipe + rebalance.

What can take quite a while at the beginning is the osdmap update if they were 
down for such a long time: the first boot, until they show up as "in", will 
take a while. Set the norecover and norebalance flags until you see in the OSD 
log that they have the latest OSD map version.
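
A rough sketch of that sequence (standard flag and command names, nothing 
release-specific):

   ceph osd set norecover
   ceph osd set norebalance

   # boot the returning OSDs, then compare the epoch they report in their
   # logs with the cluster's current osdmap epoch:
   ceph osd dump | head -1

   # once they have caught up, release the flags
   ceph osd unset norecover
   ceph osd unset norebalance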

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Malte Stroem 
Sent: Wednesday, September 6, 2023 4:16 PM
To: ceph-m...@rikdvk.mailer.me; ceph-users@ceph.io
Subject: [ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?

Hi ceph-m...@rikdvk.mailer.me,

you could squeeze the OSDs back in, but it does not make sense.

Just clean the disks (with dd, for example) and add them as new disks to 
your cluster.
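
A hedged sketch of that approach (the device name is a placeholder; 
double-check it before wiping anything):

   # wipe leftover Ceph metadata from the old OSD device
   ceph-volume lvm zap /dev/sdX --destroy
   # or, more bluntly, overwrite the start of the device
   dd if=/dev/zero of=/dev/sdX bs=1M count=1024 oflag=direct

   # then add it back as a brand-new OSD
   ceph-volume lvm create --data /dev/sdX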

Best,
Malte

Am 04.09.23 um 09:39 schrieb ceph-m...@rikdvk.mailer.me:
> Hello,
>
> I have a ten node cluster with about 150 OSDs. One node went down a while 
> back, several months. The OSDs on the node have been marked as down and out 
> since.
>
> I am now in the position to return the node to the cluster, with all the OS 
> and OSD disks. When I boot up the now working node, the OSDs do not start.
>
> Essentially, it seems to complain with "fail[ing] to load OSD map for 
> [various epoch]s, got 0 bytes".
>
> I'm guessing the OSDs on disk maps are so old, they can't get back into the 
> cluster?
>
> My questions are whether it's possible or worth it to try to squeeze these 
> OSDs back in or to just replace them. And if I should just replace them, 
> what's the best way? Manually remove [1] and recreate? Replace [2]? Purge in 
> dashboard?
>
> [1] 
> https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#removing-osds-manual
> [2] 
> https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#replacing-an-osd
>
> Many thanks!
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot

Hi,

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out 
after 15.00954s' error. We have reasons to believe this happens each 
time the RocksDB compaction process is launched on an OSD. My question 
is, does the cluster detecting that an OSD has timed out interrupt the 
compaction process? This seems to be what's happening, but it's not 
immediately obvious. We are currently facing an infinite loop of random 
OSDs timing out and if the compaction process is interrupted without 
finishing, it may explain that.
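
For reference, the 15 s threshold in that warning comes from the OSD op 
thread timeout (osd_op_thread_timeout, default 15 s, with a matching 
suicide timeout); a quick way to check the values in effect, using osd.0 
as an example:

   ceph config get osd osd_op_thread_timeout
   ceph config get osd osd_op_thread_suicide_timeout
   # or directly against a running daemon's admin socket
   ceph daemon osd.0 config get osd_op_thread_timeout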


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io