[ceph-users] Re: OSD crash on Onode::put

2023-01-16 Thread Dongdong Tao
Hi Frank,

I don't have an operational workaround, but the patch
https://github.com/ceph/ceph/pull/46911/commits/f43f596aac97200a70db7a70a230eb9343018159
is simple and can be applied cleanly.

Yes, restarting the OSD will clear the pool entries. You can restart it when
the bluestore_onode items are very low (e.g. less than 10) if that really
helps, but I think you'll need to tune and monitor performance until you find
a number that is most suitable for your cluster.

But it can't help with the crash, since in general, the crash itself is
basically a restart.
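
As a rough sketch, such a threshold-based check and restart could look like
the following; osd.123, the 100000-item threshold and the restart commands are
placeholders that depend on your own tuning and on how your OSDs are deployed:

# check the onode cache item count for one OSD
items=$(ceph tell osd.123 dump_mempools | jq '.mempool.by_pool.bluestore_cache_onode.items')
if [ "$items" -lt 100000 ]; then        # placeholder threshold
    systemctl restart ceph-osd@123      # package-based deployment
    # ceph orch daemon restart osd.123  # cephadm-based deployment
fi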

Regards,
Dongdong

On Tue, Jan 10, 2023 at 8:21 PM Serkan Çoban  wrote:

> Slot 19 is inside the chassis? Do you check chassis temperature? I
> sometimes have more failure rate in chassis HDDs than in front of the
> chassis. In our case it was related to the temperature difference.
>
> On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder  wrote:
> >
> > Following up on my previous post, we have identical OSD hosts. The very
> strange observation now is, that all outlier OSDs are in exactly the same
> disk slot on these hosts. We have 5 problematic OSDs and they are all in
> slot 19 on 5 different hosts. This is an extremely strange and unlikely
> co-incidence.
> >
> > Are there any specific conditions for this problem to be present or
> amplified that could have to do with hardware?
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14


[ceph-users] Re: OSD crash on Onode::put

2023-01-13 Thread Frank Schilder
Hi Igor,

my approach here, before doing something crazy like a daily cron job for
restarting OSDs, is to do at least a minimum of threat analysis. How much of a
problem is it really? Here I'm mostly guided by performance loss. As far as I
know, the onode cache should be one of the most important caches for
performance - of course, only if the hit rate is decent, and that number I
can't pull out. Since I can't check the hit rate, the second best thing is to
see how an OSD's item count compares with the average, how it develops on a
restarted OSD and so on, to get an idea of what is normal, what is degraded
and what requires action.

As far as I can tell after this relatively short amount of time, the item leak
is a rather mild problem on our cluster. The few OSDs that were exceptional are
all OSDs that were newly deployed and not restarted since backfill completed.
It seems that backfill is an operation that leaves a measurable amount of
cache_other behind that is never cleaned up. Otherwise, a restart every 2-3
months might be warranted. Since we plan to upgrade to Pacific this summer,
this means not too much needs to be done. I will just keep an eye on onode item
counts and restart one or the other OSD when warranted.

About "it's just a restart": most of the time it is. However, there was just
recently a case where a restart meant the complete loss of an OSD. The bug
causing the restart corrupted the RocksDB beyond repair. Therefore, I think
it's always worth checking, doing some threat analysis and preventing
unintended restarts if possible.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 12 January 2023 13:07:11
To: Frank Schilder; Dongdong Tao; ceph-users@ceph.io
Cc: d...@ceph.io
Subject: Re: [ceph-users] Re: OSD crash on Onode::put

Hi Frank,

IMO all the logic below is a bit of overkill and no one can provide 100% valid
guidance on specific numbers atm. Generally I agree with Dongdong's point that
a crash is effectively an OSD restart, and hence there is not much sense in
performing such a restart manually - well, the rationale might be to do it
gracefully and avoid some potential issues though...

Anyway, I'd rather recommend doing periodic(!) manual OSD restarts, e.g. on a
daily basis at off-peak hours, instead of using tricks with mempool stats
analysis.


Thanks,

Igor


On 1/10/2023 1:15 PM, Frank Schilder wrote:

Hi Dongdong and Igor,

thanks for pointing to this issue. I guess if it's a memory leak issue (well, a
cache pool trim issue), checking for some indicator and an OSD restart should
be a work-around? Dongdong promised a work-around but talks only about a patch
(fix).

Looking at the tracker items, my conclusion is that unusually low values of
.mempool.by_pool.bluestore_cache_onode.items of an OSD might be such an
indicator. I just ran a very simple check on all our OSDs:

for o in $(ceph osd ls); do
  n_onode="$(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items")"
  echo -n "$o: "
  ((n_onode<10)) && echo "$n_onode"
done; echo ""

and found 2 with seemingly very unusual values:

: 3098
1112: 7403

Comparing two OSDs with same disk on the same host gives:

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
3200
1971200
260924
900303680

# ceph daemon osd.1030 dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
60281
37133096
8908591
255862680

OSD  does look somewhat bad. Shortly after restarting this OSD I get

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
20775
12797400
803582
24017100

So, the above procedure seems to work and, yes, there seems to be a leak of 
items in cache_other that pushes other pools down to 0. There seem to be 2 
useful indicators:

- very low .mempool.by_pool.bluestore_cache_onode.items
- very high 
.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items

Here a command to get both numbers with OSD ID in an awk-friendly format:

for o in $(ceph osd ls); do
  printf "%6d %8d %7.2f\n" "$o" $(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items")
done

Pipe it to a file and do things like:

awk '$2<5 || $3>200' FILE


[ceph-users] Re: OSD crash on Onode::put

2023-01-13 Thread Frank Schilder
Hi Anthony and Serkan,

I think Anthony had the right idea. I forgot that we re-deployed a number of
OSDs on existing drives and also did a PG split over Christmas. The relatively
few disks that stick out with cache_other usage all seem to be these newly
deployed OSDs. So it looks like the cache_other item leakage is rather mild in
normal operations, but can be substantial after backfilling new disks.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 11 January 2023 12:21:30
To: Serkan Çoban; Anthony D'Atri
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD crash on Onode::put

Hi Anthony and Serkan,

I checked the drive temperatures and there is nothing special about this slot. 
The disks in this slot are from different vendors and were not populated 
incrementally. It might be a very weird coincidence. I seem to have an OSD 
developing this problem in another slot on a different host now. Let's see what 
happens in the future. No reason to turn superstitious :)

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: OSD crash on Onode::put

2023-01-12 Thread Igor Fedotov

Hi Frank,

IMO all the logic below is a bit of overkill and no one can provide 100% valid
guidance on specific numbers atm. Generally I agree with Dongdong's point that
a crash is effectively an OSD restart, and hence there is not much sense in
performing such a restart manually - well, the rationale might be to do it
gracefully and avoid some potential issues though...

Anyway, I'd rather recommend doing periodic(!) manual OSD restarts, e.g. on a
daily basis at off-peak hours, instead of using tricks with mempool stats
analysis.
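
A minimal sketch of such a periodic restart, assuming a non-containerized
deployment with systemd-managed ceph-osd@ units (cephadm users would use
"ceph orch daemon restart osd.N" instead); the schedule, the OSD list and the
health check are deliberately simple placeholders:

#!/bin/bash
# restart-osds.sh -- restart the given OSDs one at a time, e.g. from a nightly
# off-peak cron entry like: 30 3 * * * root /usr/local/sbin/restart-osds.sh 17 23 42
for id in "$@"; do
    # skip the OSD if stopping it would make PGs unavailable
    ceph osd ok-to-stop "$id" || { echo "skipping osd.$id (not ok to stop)"; continue; }
    systemctl restart "ceph-osd@$id"
    # wait until the cluster has settled before touching the next OSD
    until ceph health | grep -q HEALTH_OK; do sleep 30; done
done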



Thanks,

Igor


On 1/10/2023 1:15 PM, Frank Schilder wrote:

Hi Dongdong and Igor,

thanks for pointing to this issue. I guess if it's a memory leak issue (well, a
cache pool trim issue), checking for some indicator and an OSD restart should
be a work-around? Dongdong promised a work-around but talks only about a patch
(fix).

Looking at the tracker items, my conclusion is that unusually low values of
.mempool.by_pool.bluestore_cache_onode.items of an OSD might be such an
indicator. I just ran a very simple check on all our OSDs:

for o in $(ceph osd ls); do
  n_onode="$(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items")"
  echo -n "$o: "
  ((n_onode<10)) && echo "$n_onode"
done; echo ""

and found 2 with seemingly very unusual values:

: 3098
1112: 7403

Comparing two OSDs with same disk on the same host gives:

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
3200
1971200
260924
900303680

# ceph daemon osd.1030 dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
60281
37133096
8908591
255862680

OSD  does look somewhat bad. Shortly after restarting this OSD I get

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
20775
12797400
803582
24017100

So, the above procedure seems to work and, yes, there seems to be a leak of 
items in cache_other that pushes other pools down to 0. There seem to be 2 
useful indicators:

- very low .mempool.by_pool.bluestore_cache_onode.items
- very high 
.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items

Here a command to get both numbers with OSD ID in an awk-friendly format:

for o in $(ceph osd ls); do
  printf "%6d %8d %7.2f\n" "$o" $(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items")
done

Pipe it to a file and do things like:

awk '$2<5 || $3>200' FILE

For example, I still get:

# awk '$2<5 || $3>200' cache_onode.txt
  1092    49225   43.74
  1093    46193   43.70
  1098    47550   43.47
  1101    48873   43.34
  1102    48008   43.31
  1103    48152   43.29
  1105    49235   43.59
  1107    46694   43.35
  1109    48511   43.08
  1113    14612  739.46
  1114    13199  693.76
  1116    45300  205.70

flagging 3 more outliers.

Would it be possible to provide a bit of guidance to everyone about when to 
consider restarting an OSD? What values of the above variables are critical and 
what are tolerable? Of course a proper fix would be better, but I doubt that 
everyone is willing to apply a patch. Therefore, some guidance on how to 
mitigate this problem to acceptable levels might be useful. I'm thinking here 
how few onode items are acceptable before performance drops painfully.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____________________
From: Igor Fedotov
Sent: 09 January 2023 13:34:42
To: Dongdong Tao; ceph-users@ceph.io
Cc: d...@ceph.io
Subject: [ceph-users] Re: OSD crash on Onode::put

Hi Dongdong,

thanks a lot for your post, it's really helpful.


Thanks,

Igor

On 1/5/2023 6:12 AM, Dongdong Tao wrote:

I see many users recently reporting that they have been struggling
with this Onode::put race condition issue [1] on both the latest
Octopus and Pacific.
Igor opened a PR [2] to address this issue; I've reviewed it
carefully, and it looks good to me. I'm hoping this could get some
priority from the community.

For those who had been hitting this issue, I would like to share a
workaround that could unblock you:

During the investigation of this issue, I found this race condition
always happens after the bluestore onode cache size becomes 0.
Setting debug_bluestore = 1/30 will allow you to see the cache size
afte

[ceph-users] Re: OSD crash on Onode::put

2023-01-11 Thread Frank Schilder
Hi Anthony and Serkan,

I checked the drive temperatures and there is nothing special about this slot. 
The disks in this slot are from different vendors and were not populated 
incrementally. It might be a very weird coincidence. I seem to have an OSD 
developing this problem in another slot on a different host now. Let's see what 
happens in the future. No reason to turn superstitious :)

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: OSD crash on Onode::put

2023-01-11 Thread Frank Schilder
Hi Dongdong.

> is simple and can be applied cleanly.

I understand this statement from a developer's perspective. Now, try to explain
to a user with a cephadm-deployed, containerized cluster how to build a
container from source, how to point cephadm at this container, and what to do
for the next upgrade. I think "simple" depends on context. Applying a patch to
a production system is currently an expert operation, I'm afraid.

If you have instructions for building a ceph-container with the patch applied, 
I would be very interested. I was asking for a source container for exactly 
this reason. As far as I can tell from the conversation, this is quite a 
project in itself. The thread was "Re: Building ceph packages in containers? 
[was: Ceph debian/ubuntu packages build]", but I can't find it on the mailing 
list any more. There seems to be an archived version: 
https://www.spinics.net/lists/ceph-users/msg73231.html

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dongdong Tao 
Sent: 11 January 2023 04:30:14
To: Frank Schilder
Cc: Igor Fedotov; ceph-users@ceph.io; cobanser...@gmail.com
Subject: Re: [ceph-users] Re: OSD crash on Onode::put

Hi Frank,

I don't have an operational workaround, but the patch
https://github.com/ceph/ceph/pull/46911/commits/f43f596aac97200a70db7a70a230eb9343018159
is simple and can be applied cleanly.

Yes, restarting the OSD will clear the pool entries. You can restart it when the
bluestore_onode items are very low (e.g. less than 10) if that really helps, but
I think you'll need to tune and monitor performance until you find a number that
is most suitable for your cluster.

But it can't help with the crash, since in general, the crash itself is 
basically a restart.

Regards,
Dongdong

On Tue, Jan 10, 2023 at 8:21 PM Serkan Çoban <cobanser...@gmail.com> wrote:
Slot 19 is inside the chassis? Do you check chassis temperature? I
sometimes have more failure rate in chassis HDDs than in front of the
chassis. In our case it was related to the temperature difference.

On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Following up on my previous post, we have identical OSD hosts. The very 
> strange observation now is, that all outlier OSDs are in exactly the same 
> disk slot on these hosts. We have 5 problematic OSDs and they are all in slot 
> 19 on 5 different hosts. This is an extremely strange and unlikely 
> co-incidence.
>
> Are there any specific conditions for this problem to be present or amplified 
> that could have to do with hardware?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14


[ceph-users] Re: OSD crash on Onode::put

2023-01-10 Thread Anthony D'Atri
Could this be a temporal coincidence? E.g. each host got a different model of
drive in slot 19 via an incremental expansion.

> On Jan 10, 2023, at 05:27, Frank Schilder  wrote:
> 
> Following up on my previous post, we have identical OSD hosts. The very 
> strange observation now is, that all outlier OSDs are in exactly the same 
> disk slot on these hosts. We have 5 problematic OSDs and they are all in slot 
> 19 on 5 different hosts. This is an extremely strange and unlikely 
> co-incidence.
> 
> Are there any specific conditions for this problem to be present or amplified 
> that could have to do with hardware?
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14


[ceph-users] Re: OSD crash on Onode::put

2023-01-10 Thread Serkan Çoban
Is slot 19 inside the chassis? Did you check the chassis temperature? I
sometimes see a higher failure rate for HDDs inside the chassis than for those
in the front of the chassis. In our case it was related to the temperature
difference.

On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder  wrote:
>
> Following up on my previous post, we have identical OSD hosts. The very 
> strange observation now is, that all outlier OSDs are in exactly the same 
> disk slot on these hosts. We have 5 problematic OSDs and they are all in slot 
> 19 on 5 different hosts. This is an extremely strange and unlikely 
> co-incidence.
>
> Are there any specific conditions for this problem to be present or amplified 
> that could have to do with hardware?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14


[ceph-users] Re: OSD crash on Onode::put

2023-01-10 Thread Frank Schilder
Following up on my previous post, we have identical OSD hosts. The very strange
observation now is that all outlier OSDs are in exactly the same disk slot on
these hosts. We have 5 problematic OSDs and they are all in slot 19 on 5
different hosts. This is an extremely strange and unlikely coincidence.

Are there any specific conditions for this problem to be present or amplified 
that could have to do with hardware?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: OSD crash on Onode::put

2023-01-10 Thread Frank Schilder
Hi Dongdong and Igor,

thanks for pointing to this issue. I guess if it's a memory leak issue (well, a
cache pool trim issue), checking for some indicator and an OSD restart should
be a work-around? Dongdong promised a work-around but talks only about a patch
(fix).

Looking at the tracker items, my conclusion is that unusually low values of
.mempool.by_pool.bluestore_cache_onode.items of an OSD might be such an
indicator. I just ran a very simple check on all our OSDs:

for o in $(ceph osd ls); do
  n_onode="$(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items")"
  echo -n "$o: "
  ((n_onode<10)) && echo "$n_onode"
done; echo ""

and found 2 with seemingly very unusual values:

: 3098
1112: 7403

Comparing two OSDs with same disk on the same host gives:

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
3200
1971200
260924
900303680

# ceph daemon osd.1030 dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
60281
37133096
8908591
255862680

OSD  does look somewhat bad. Shortly after restarting this OSD I get

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
20775
12797400
803582
24017100

So, the above procedure seems to work and, yes, there seems to be a leak of 
items in cache_other that pushes other pools down to 0. There seem to be 2 
useful indicators:

- very low .mempool.by_pool.bluestore_cache_onode.items
- very high 
.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items

Here a command to get both numbers with OSD ID in an awk-friendly format:

for o in $(ceph osd ls); do
  printf "%6d %8d %7.2f\n" "$o" $(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items")
done

Pipe it to a file and do things like:

awk '$2<5 || $3>200' FILE

For example, I still get:

# awk '$2<5 || $3>200' cache_onode.txt
  1092    49225   43.74
  1093    46193   43.70
  1098    47550   43.47
  1101    48873   43.34
  1102    48008   43.31
  1103    48152   43.29
  1105    49235   43.59
  1107    46694   43.35
  1109    48511   43.08
  1113    14612  739.46
  1114    13199  693.76
  1116    45300  205.70

flagging 3 more outliers.

Would it be possible to provide a bit of guidance to everyone about when to 
consider restarting an OSD? What values of the above variables are critical and 
what are tolerable? Of course a proper fix would be better, but I doubt that 
everyone is willing to apply a patch. Therefore, some guidance on how to 
mitigate this problem to acceptable levels might be useful. I'm thinking here 
how few onode items are acceptable before performance drops painfully.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____________________
From: Igor Fedotov 
Sent: 09 January 2023 13:34:42
To: Dongdong Tao; ceph-users@ceph.io
Cc: d...@ceph.io
Subject: [ceph-users] Re: OSD crash on Onode::put

Hi Dongdong,

thanks a lot for your post, it's really helpful.


Thanks,

Igor

On 1/5/2023 6:12 AM, Dongdong Tao wrote:
>
> I see many users recently reporting that they have been struggling
> with this Onode::put race condition issue[1] on both the latest
> Octopus and pacific.
> Igor opened a PR [2]  to address this issue, I've reviewed it
> carefully, and looks good to me. I'm hoping this could get some
> priority from the community.
>
> For those who had been hitting this issue, I would like to share a
> workaround that could unblock you:
>
> During the investigation of this issue, I found this race condition
> always happens after the bluestore onode cache size becomes 0.
> Setting debug_bluestore = 1/30 will allow you to see the cache size
> after the crash:
> ---
> 2022-10-25T00:47:26.562+ 7f424f78e700 30
> bluestore.MempoolThread(0x564a9dae2a68) _resize_shards
> max_shard_onodes: 0 max_shard_buffer: 8388608
> ---
>
> This is apparently wrong as this means the bluestore metadata cache is
> basically disabled,
> but it makes much sense to explain why we are hitting the race
> condition so easily -- An onode will be trimmed right away after it's
> unpinned.
>
> Keep going with the investigation, it turns out the culprit fo

[ceph-users] Re: OSD crash on Onode::put

2023-01-09 Thread Igor Fedotov

Hi Dongdong,

thanks a lot for your post, it's really helpful.


Thanks,

Igor

On 1/5/2023 6:12 AM, Dongdong Tao wrote:


I see many users recently reporting that they have been struggling
with this Onode::put race condition issue [1] on both the latest
Octopus and Pacific.
Igor opened a PR [2] to address this issue; I've reviewed it
carefully, and it looks good to me. I'm hoping this could get some
priority from the community.


For those who had been hitting this issue, I would like to share a 
workaround that could unblock you:


During the investigation of this issue, I found this race condition 
always happens after the bluestore onode cache size becomes 0.
Setting debug_bluestore = 1/30 will allow you to see the cache size 
after the crash:

---
2022-10-25T00:47:26.562+ 7f424f78e700 30 
bluestore.MempoolThread(0x564a9dae2a68) _resize_shards 
max_shard_onodes: 0 max_shard_buffer: 8388608

---
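
For reference, one way to set that debug level on a running cluster (a sketch;
osd.123 is a placeholder, and you will probably want to drop back to the
default afterwards to limit the memory used by the in-memory log):

# raise the in-memory log level for bluestore on all OSDs
ceph config set osd debug_bluestore 1/30
# or only on a single running OSD
ceph tell osd.123 config set debug_bluestore 1/30
# revert to the default when done
ceph config rm osd debug_bluestore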

This is apparently wrong, as it means the bluestore metadata cache is
basically disabled, but it goes a long way towards explaining why we are
hitting the race condition so easily -- an onode will be trimmed right away
after it's unpinned.


Continuing the investigation, it turns out the culprit for the 0-sized cache
is a leak in the bluestore_cache_other mempool.
Please refer to the bug tracker [3], which has the details of the leak issue;
it was already fixed by [4], and the next Pacific point release will have it.

But it was never backported to Octopus.
So if you are hitting the same issue:
For those who are on Octopus, you can manually backport this patch to
fix the leak and prevent the race condition from happening (a rough
sketch follows below).
For those who are on Pacific, you can wait for the next Pacific point
release.
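
A rough sketch of that manual backport for Octopus (untested here; the exact
build and packaging steps depend on your distro, so treat this only as an
outline and consult the Ceph build documentation):

# fetch the source and apply the fix on top of the Octopus release you run
git clone https://github.com/ceph/ceph.git && cd ceph
git checkout v15.2.17          # placeholder: use your exact Octopus tag
git submodule update --init --recursive
git cherry-pick f43f596aac97200a70db7a70a230eb9343018159   # the commit from PR 46911
# the cherry-pick may need minor conflict resolution; then rebuild packages,
# e.g. on Debian/Ubuntu with the in-tree helper if present in your checkout:
./make-debs.sh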


By the way, I'm backporting the fix to the Ubuntu Octopus and Pacific packages
through this SRU [5], so it will land in Ubuntu's packages soon.


[1] https://tracker.ceph.com/issues/56382
[2] https://github.com/ceph/ceph/pull/47702
[3] https://tracker.ceph.com/issues/56424
[4] https://github.com/ceph/ceph/pull/46911
[5] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1996010

Cheers,
Dongdong



--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263


