[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
Thanks everyone for your feedback. I created a ticket and added most of our 
internal post-mortem research:

https://tracker.ceph.com/issues/63636

Cheers, Denis

> On 24 Nov 2023, at 09:01, Denis Krienbühl  wrote:
> 
> Hi
> 
> We’ve recently had a serious outage at work, after a host had a network 
> problem: 
> 
> - We rebooted a single host in a cluster of fifteen hosts across three racks.
> - The single host had a bad network configuration after booting, causing it 
> to send some packets to the wrong network.
> - One network still worked and offered a connection to the mons.
> - The other network connection was bad. Packets were refused, not dropped.
> - Due to osd_fast_fail_on_connection_refused=true, the broken host forced the 
> mons to take all other OSDs down (immediate failure).
> - Only after shutting down the faulty host was it possible to start the OSDs 
> that had been taken down and restore the cluster.
> 
> We have since solved the problem by removing the default route that caused 
> the packets to end up in the wrong network, where they were summarily 
> rejected by a firewall. That is, we made sure that packets would be dropped 
> in the future, not rejected.
> 
> Still, I figured I’ll send this experience of ours to this mailing list, as 
> this seems to be something others might encounter as well.
> 
> In the following PR, that introduced osd_fast_fail_on_connection_refused, 
> there’s this description:
> 
>> This changeset adds additional handler (handle_refused()) to the dispatchers
>> and code that detects when connection attempt fails with ECONNREFUSED error
>> (connection refused) which is a clear indication that host is alive, but
>> daemon isn't, so daemons can instantly mark the other side as undoubtly
>> downed without the need for grace timer.
> 
> And this comment:
> 
>> As for flapping, we discussed it on ceph-devel ml
>> and came to conclusion that it requires either broken firewall or network
>> configuration to cause this, and these are more serious issues that should
>> be resolved first before worrying about OSDs flapping (either way, flapping
>> OSDs could be good for getting someone's attention).
> 
> https://github.com/ceph/ceph/pull/8558
> 
> It has left us wondering if these are the right assumptions. An ECONNREFUSED 
> condition can bring down a whole cluster, and I wonder if there should be 
> some kind of safe-guard to ensure that this is avoided. One badly configured 
> host should generally not be able to do that, and if the packets are dropped, 
> instead of refused, the cluster notices that the OSD down reports come only 
> from one host, and acts accordingly.
> 
> What do you think? Does this warrant a change in Ceph? I’m happy to provide 
> details and create a ticket.
> 
> Cheers,
> 
> Denis
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
Hi Frank.

> On 24 Nov 2023, at 14:27, Frank Schilder  wrote:
> 
> I have to ask a clarifying question. If I understand the intent of 
> osd_fast_fail_on_connection_refused correctly, an OSD that receives a 
> connection_refused should get marked down fast to avoid unnecessarily long 
> wait times. And *only* OSDs that receive connection refused.
> 
> In your case, did booting up the server actually create a network route for 
> all other OSDs to the wrong network as well? In other words, did it act as a 
> gateway and all OSDs received connection refused messages and not just the 
> ones on the critical host? If so, your observation would be expected. If not, 
> then there is something wrong with the down reporting that should be looked 
> at.

No, the server has two networks through which to reach OSDs and mons. Say north 
and south. South was down and the traffic destined to it made it through the 
default gateway to an unrelated host that would bounce everything with 
“connection refused”.

North was still up, and through it the other OSDs and mons could also be 
reached.

So the host that was booted had the wrong configuration.

The other hosts of the cluster were unaffected and their network configuration 
remained as is, though their packets would no longer have reached the OSDs on the 
booted host via south; by my understanding, those would have been dropped.

I’ll be sure to create a detailed ticket and post it to this thread. I’m not sure 
I’ll manage to do that today, but after what I’ve heard, I think this should at 
least be looked at in detail, and I’ll provide as much info as I can.

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl

> On 24 Nov 2023, at 11:49, Burkhard Linke 
>  wrote:
> 
> This should not be case in the reported situation unless setting 
> osd_fast_fail_on_connection_refused=true
>  changes this behaviour.


In our tests it does change the behavior. Normally the mons take 
mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account; in 
our tests they do so when an OSD heartbeat is dropped and the OSD is still 
able to talk to the mons.
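
For reference, those thresholds and the fast-fail flag can be checked on a running 
cluster; a minimal sketch, assuming they are managed through the central config 
rather than ceph.conf:

    ceph config get mon mon_osd_min_down_reporters
    ceph config get mon mon_osd_reporter_subtree_level
    ceph config get osd osd_fast_fail_on_connection_refused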

However, if the OSD heartbeat is rejected, in our case because of an unrelated 
firewall change, the OSD sends an immediate failure to the mon:
https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434


The mon then propagates that failure, without taking any other reports into 
consideration:

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367

This is fine when a single OSD goes down and everything else is okay. It then 
has the intended effect of getting rid of the OSD fast. The assumption 
presumably being: If a host can answer with a rejection to the OSD heartbeat, 
it is only the OSD that is affected.

In our case however, a network change caused rejections from an entirely 
different host (a gateway), while a network path to the mons was still 
available. In this case, Ceph does not apply the safe-guards it usually does.
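
As a stop-gap, the fast-fail path can be switched off so that refused heartbeats go 
through the usual grace-timer and reporter logic again; a sketch, not a general 
recommendation:

    ceph config set osd osd_fast_fail_on_connection_refused false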
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
Thanks Frank.

I see it the same way. I’ll be sure to create a ticket with all the details and 
steps to reproduce the issue.

Denis

> On 24 Nov 2023, at 10:24, Frank Schilder  wrote:
> 
> Hi Denis,
> 
> I would agree with you that a single misconfigured host should not take out 
> healthy hosts under any circumstances. I'm not sure if your incident is 
> actually covered by the devs' comments; it is quite possible that you observed 
> an unintended side effect that is a bug in handling the connection error. I 
> think the intention is to quickly shut down the OSDs with connection refused 
> (where timeouts are not required) and not other OSDs.
> 
> A bug report with tracker seems warranted.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Denis Krienbühl 
> Sent: Friday, November 24, 2023 9:01 AM
> To: ceph-users
> Subject: [ceph-users] Full cluster outage when ECONNREFUSED is triggered
> 
> Hi
> 
> We’ve recently had a serious outage at work, after a host had a network 
> problem:
> 
> - We rebooted a single host in a cluster of fifteen hosts across three racks.
> - The single host had a bad network configuration after booting, causing it 
> to send some packets to the wrong network.
> - One network still worked and offered a connection to the mons.
> - The other network connection was bad. Packets were refused, not dropped.
> - Due to osd_fast_fail_on_connection_refused=true, the broken host forced the 
> mons to take all other OSDs down (immediate failure).
> - Only after shutting down the faulty host was it possible to start the OSDs 
> that had been taken down and restore the cluster.
> 
> We have since solved the problem by removing the default route that caused 
> the packets to end up in the wrong network, where they were summarily 
> rejected by a firewall. That is, we made sure that packets would be dropped 
> in the future, not rejected.
> 
> Still, I figured I’ll send this experience of ours to this mailing list, as 
> this seems to be something others might encounter as well.
> 
> In the following PR, that introduced osd_fast_fail_on_connection_refused, 
> there’s this description:
> 
>> This changeset adds additional handler (handle_refused()) to the dispatchers
>> and code that detects when connection attempt fails with ECONNREFUSED error
>> (connection refused) which is a clear indication that host is alive, but
>> daemon isn't, so daemons can instantly mark the other side as undoubtly
>> downed without the need for grace timer.
> 
> And this comment:
> 
>> As for flapping, we discussed it on ceph-devel ml
>> and came to conclusion that it requires either broken firewall or network
>> configuration to cause this, and these are more serious issues that should
>> be resolved first before worrying about OSDs flapping (either way, flapping
>> OSDs could be good for getting someone's attention).
> 
> https://github.com/ceph/ceph/pull/8558
> 
> It has left us wondering if these are the right assumptions. An ECONNREFUSED 
> condition can bring down a whole cluster, and I wonder if there should be 
> some kind of safe-guard to ensure that this is avoided. One badly configured 
> host should generally not be able to do that, and if the packets are dropped, 
> instead of refused, the cluster notices that the OSD down reports come only 
> from one host, and acts accordingly.
> 
> What do you think? Does this warrant a change in Ceph? I’m happy to provide 
> details and create a ticket.
> 
> Cheers,
> 
> Denis
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Denis Krienbühl
Hi

We’ve recently had a serious outage at work, after a host had a network 
problem: 

- We rebooted a single host in a cluster of fifteen hosts across three racks.
- The single host had a bad network configuration after booting, causing it to 
send some packets to the wrong network.
- One network still worked and offered a connection to the mons.
- The other network connection was bad. Packets were refused, not dropped.
- Due to osd_fast_fail_on_connection_refused=true, the broken host forced the 
mons to take all other OSDs down (immediate failure).
- Only after shutting down the faulty host was it possible to start the OSDs that 
had been taken down and restore the cluster.

We have since solved the problem by removing the default route that caused the 
packets to end up in the wrong network, where they were summarily rejected by a 
firewall. That is, we made sure that packets would be dropped in the future, 
not rejected.
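
One way to guarantee drop-instead-of-reject semantics without relying on the 
upstream firewall is a local blackhole route for the affected network; a sketch 
with a placeholder prefix (192.0.2.0/24 stands in for the actual cluster network):

    ip route add blackhole 192.0.2.0/24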

Still, I figured I’ll send this experience of ours to this mailing list, as 
this seems to be something others might encounter as well.

In the following PR, that introduced osd_fast_fail_on_connection_refused, 
there’s this description:

> This changeset adds additional handler (handle_refused()) to the dispatchers
> and code that detects when connection attempt fails with ECONNREFUSED error
> (connection refused) which is a clear indication that host is alive, but
> daemon isn't, so daemons can instantly mark the other side as undoubtly
> downed without the need for grace timer.

And this comment:

> As for flapping, we discussed it on ceph-devel ml
> and came to conclusion that it requires either broken firewall or network
> configuration to cause this, and these are more serious issues that should
> be resolved first before worrying about OSDs flapping (either way, flapping
> OSDs could be good for getting someone's attention).

https://github.com/ceph/ceph/pull/8558

It has left us wondering if these are the right assumptions. An ECONNREFUSED 
condition can bring down a whole cluster, and I wonder if there should be some 
kind of safe-guard to ensure that this is avoided. One badly configured host 
should generally not be able to do that, and if the packets are dropped, instead 
of refused, the cluster notices that the OSD down reports come only from one 
host, and acts accordingly.

What do you think? Does this warrant a change in Ceph? I’m happy to provide 
details and create a ticket.

Cheers,

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-23 Thread Denis Krienbühl
Thanks Frédéric, we’ve done that in the meantime to work around issue #47866.

The error has been reproduced and there’s a PR associated with the issue:

https://tracker.ceph.com/issues/47866

Cheers,

Denis

> On 23 Nov 2020, at 11:56, Frédéric Nass  
> wrote:
> 
> Hi Denis,
> 
> You might want to look at rgw_gc_obj_min_wait from [1] and try increasing the 
> default value of 7200s (2 hours) to whatever suits your need < 2^64.
> Just remember that at some point you'll have to get these objects processed by 
> the gc. Or manually through the API [2].
> 
> One thing that comes to mind regarding the "last night's missing object" is 
> that maybe it was re-written as multi-part, the re-write failed somehow, and the 
> object was then enlisted by the gc. But that supposes this particular object 
> sometimes gets re-written, which may not be the case.
> 
> Regards,
> 
> Frédéric.
> 
> [1] https://docs.ceph.com/en/latest/radosgw/config-ref/
> [2] 
> https://docs.ceph.com/en/latest/dev/radosgw/admin/adminops_nonimplemented/#manually-processes-garbage-collection-items
> 
> Le 18/11/2020 à 11:27, Denis Krienbühl a écrit :
>> By the way, since there’s some probability that this is a GC refcount issue, 
>> would it be possible and sane to somehow slow the GC down or disable it 
>> altogether? Is that something we could implement on our end as a stop-gap 
>> measure to prevent dataloss?
>> 
>>> On 18 Nov 2020, at 10:46, Denis Krienbühl  wrote:
>>> 
>>> I can now confirm that last night’s missing object was a multi-part file.
>>> 
>>>> On 18 Nov 2020, at 10:01, Janek Bevendorff 
>>>>  wrote:
>>>> 
>>>> Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME 
>>>> --object=OBJECTNAME (forgot the "object" there)
>>>> 
>>>> On 18/11/2020 09:58, Janek Bevendorff wrote:
>>>>>> The object, a Docker layer, that went missing has not been touched in 2 
>>>>>> months. It worked for a while, but then suddenly went missing.
>>>>> Was the object a multipart object? You can check by running radosgw-admin 
>>>>> stat --bucket=BUCKETNAME --object=OBJECTNAME. It should say something 
>>>>> "ns": "multipart" in the output. If it says "ns": "shadow", it's a 
>>>>> single-part object.
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Denis Krienbühl
Thanks, we are currently scanning our object storage. It looks like we can 
detect the missing objects that return “No Such Key” by looking at all 
“__multipart_” objects returned by radosgw-admin bucket radoslist and checking 
whether they exist using rados stat. We are currently not looking at shadow 
objects, as our approach already yields more instances of this problem.
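
Roughly, the scan looks like the following sketch (BUCKETNAME is a placeholder, and 
default.rgw.buckets.data is an assumed data pool name; adjust both to the actual 
deployment):

    pool=default.rgw.buckets.data   # assumption: default RGW data pool name
    radosgw-admin bucket radoslist --bucket=BUCKETNAME \
      | grep '__multipart_' \
      | while IFS= read -r obj; do
          # rados stat exits non-zero when the RADOS object is gone
          rados -p "$pool" stat "$obj" >/dev/null 2>&1 || echo "missing: $obj"
        done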

> On 19 Nov 2020, at 09:09, Janek Bevendorff  
> wrote:
> 
>> - The head object had a size of 0.
>> - There was an object with a ’shadow’ in its name, belonging to that path.
> That is normal. What is not normal is if there are NO shadow objects.
> 
> On 18/11/2020 10:06, Denis Krienbühl wrote:
>> It looks like a single-part object. But we did replace that object last 
>> night from backup, so I can’t know for sure if the lost one was like that.
>> 
>> Another engineer that looked at the Rados objects last night did notice two 
>> things:
>> 
>> - The head object had a size of 0.
>> - There was an object with a ’shadow’ in its name, belonging to that path.
>> 
>> I’m not knowledgeable about Rados, so I’m not sure this is helpful.
>> 
>>> On 18 Nov 2020, at 10:01, Janek Bevendorff  
>>> wrote:
>>> 
>>> Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME 
>>> --object=OBJECTNAME (forgot the "object" there)
>>> 
>>> On 18/11/2020 09:58, Janek Bevendorff wrote:
>>>>> The object, a Docker layer, that went missing has not been touched in 2 
>>>>> months. It worked for a while, but then suddenly went missing.
>>>> Was the object a multipart object? You can check by running radosgw-admin 
>>>> stat --bucket=BUCKETNAME --object=OBJECTNAME. It should say something 
>>>> "ns": "multipart" in the output. If it says "ns": "shadow", it's a 
>>>> single-part object.
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-18 Thread Denis Krienbühl
By the way, since there’s some probability that this is a GC refcount issue, 
would it be possible and sane to somehow slow the GC down or disable it 
altogether? Is that something we could implement on our end as a stop-gap 
measure to prevent dataloss?

> On 18 Nov 2020, at 10:46, Denis Krienbühl  wrote:
> 
> I can now confirm that last night’s missing object was a multi-part file.
> 
>> On 18 Nov 2020, at 10:01, Janek Bevendorff  
>> wrote:
>> 
>> Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME 
>> --object=OBJECTNAME (forgot the "object" there)
>> 
>> On 18/11/2020 09:58, Janek Bevendorff wrote:
>>>> 
>>>> The object, a Docker layer, that went missing has not been touched in 2 
>>>> months. It worked for a while, but then suddenly went missing.
>>> Was the object a multipart object? You can check by running radosgw-admin 
>>> stat --bucket=BUCKETNAME --object=OBJECTNAME. It should say something "ns": 
>>> "multipart" in the output. If it says "ns": "shadow", it's a single-part 
>>> object.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-18 Thread Denis Krienbühl
I can now confirm that last night’s missing object was a multi-part file.

> On 18 Nov 2020, at 10:01, Janek Bevendorff  
> wrote:
> 
> Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME --object=OBJECTNAME 
> (forgot the "object" there)
> 
> On 18/11/2020 09:58, Janek Bevendorff wrote:
>>> 
>>> The object, a Docker layer, that went missing has not been touched in 2 
>>> months. It worked for a while, but then suddenly went missing.
>> Was the object a multipart object? You can check by running radosgw-admin 
>> stat --bucket=BUCKETNAME --object=OBJECTNAME. It should say something "ns": 
>> "multipart" in the output. If it says "ns": "shadow", it's a single-part 
>> object.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-18 Thread Denis Krienbühl
It looks like a single-part object. But we did replace that object last night 
from backup, so I can’t know for sure if the lost one was like that.

Another engineer that looked at the Rados objects last night did notice two 
things:

- The head object had a size of 0.
- There was an object with a ’shadow’ in its name, belonging to that path.

I’m not knowledgeable about Rados, so I’m not sure this is helpful.

> On 18 Nov 2020, at 10:01, Janek Bevendorff  
> wrote:
> 
> Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME --object=OBJECTNAME 
> (forgot the "object" there)
> 
> On 18/11/2020 09:58, Janek Bevendorff wrote:
>>> 
>>> The object, a Docker layer, that went missing has not been touched in 2 
>>> months. It worked for a while, but then suddenly went missing.
>> Was the object a multipart object? You can check by running radosgw-admin 
>> stat --bucket=BUCKETNAME --object=OBJECTNAME. It should say something "ns": 
>> "multipart" in the output. If it says "ns": "shadow", it's a single-part 
>> object.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to Improve RGW Bucket Stats Performance

2020-11-13 Thread Denis Krienbühl
Hi!

To bill our customers we regularly call radosgw-admin bucket stats --uid .

Since upgrading from Mimic to Octopus (with a short stop at Nautilus), we’ve 
been seeing much slower response times for this command.

It went from less than a minute for our largest customers, to 5 minutes (with 
some variance depending on load).

Assuming this is not a bug, is there any way to get these stats quicker?

Ceph seems to do this in a single call here, which seems to me like something you 
could spread out over time (keep a counter somewhere and just return the latest 
value on request).

One thing we did notice is that we get a lot of these when the stats are 
synced:

2020-11-13T14:56:17.288+0100 7f15347e0700  0 check_bucket_shards: 
resharding needed: stats.num_objects=5776982 shard max_objects=320

Could that hint at a problem in our configuration?
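
One way to see how close individual buckets are to the resharding threshold is the 
limit check, assuming it behaves here as documented (SOMEUSER is a placeholder, and 
the --uid filter is optional):

    radosgw-admin bucket limit check --uid=SOMEUSER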

Anything else we could maybe tune to get this time down?

Appreciate any hints and I hope everyone is about to have a great weekend.

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to run ceph_osd_dump

2020-11-11 Thread Denis Krienbühl
Hi Eugen

That works. Apart from the release notes, there’s also documentation that has 
this wrong:
https://docs.ceph.com/en/latest/rados/operations/monitoring/#network-performance-checks
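
For the record, the command that does work targets an OSD admin socket rather than 
the mgr, e.g. (osd.2 is only an example id):

    ceph daemon osd.2 dump_osd_network 0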

Thank you!

Denis

> On 12 Nov 2020, at 08:15, Eugen Block  wrote:
> 
> Hi,
> 
> although the Nautilus v14.2.5 release notes [1] state that this command is 
> available for both mgr and osd it doesn't seem to apply to mgr. But you 
> should be able to run it for an osd daemon.
> 
> Regards,
> Eugen
> 
> 
> [1] https://docs.ceph.com/en/latest/releases/nautilus/
> 
> 
> Zitat von Denis Krienbühl :
> 
>> Hi
>> 
>> We’ve recently encountered the following errors:
>> 
>>  [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 
>> 2752.832ms)
>>  Slow OSD heartbeats on back from osd.2 [nvme-a] to osd.290 [nvme-c] 
>> 2752.832 msec
>>  ...
>>  Truncated long network list.  Use ceph daemon mgr.# dump_osd_network 
>> for more information
>> 
>> To get more information we wanted to run the dump_osd_network command, but 
>> it doesn’t seem to be a valid command:
>> 
>> ceph daemon /var/run/ceph/ceph-mgr.$(hostname).asok dump_osd_network 0
>> 
>>  no valid command found; 10 closest matches:
>>  0
>>  1
>>  2
>>  abort
>>  assert
>>  config diff
>>  config diff get 
>>  config get 
>>  config help []
>>  config set  ...
>>  admin_socket: invalid command
>> 
>> Other commands, like ceph daemon dump_cache work, so it seems to hit the 
>> right socket.
>> 
>> What am I doing wrong?
>> 
>> Cheers,
>> 
>> Denis
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to run ceph_osd_dump

2020-11-11 Thread Denis Krienbühl
Hi

We’ve recently encountered the following errors:

[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 
2752.832ms)
Slow OSD heartbeats on back from osd.2 [nvme-a] to osd.290 [nvme-c] 
2752.832 msec
...
Truncated long network list.  Use ceph daemon mgr.# dump_osd_network 
for more information

To get more information we wanted to run the dump_osd_network command, but it 
doesn’t seem to be a valid command:

ceph daemon /var/run/ceph/ceph-mgr.$(hostname).asok dump_osd_network 0

no valid command found; 10 closest matches:
0
1
2
abort
assert
config diff
config diff get 
config get 
config help []
config set  ...
admin_socket: invalid command

Other commands, like ceph daemon dump_cache work, so it seems to hit the right 
socket.

What am I doing wrong?

Cheers,

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW seems to not clean up after some requests

2020-11-02 Thread Denis Krienbühl
Hi Abhishek

> On 2 Nov 2020, at 14:54, Abhishek Lekshmanan  wrote:
> 
> There isn't much in terms of code changes in the scheduler from
> v15.2.4->5. Does the perf dump (`ceph daemon perf dump 
> `) on RGW socket show any throttle counts?

I know, I was wondering if this somehow might have an influence, but I’m likely 
wrong:
https://github.com/ceph/ceph/commit/c43f71056322e1a149a444735bf65d80fec7a7ae 


As for the perf counters, I don’t see anything interesting. I dumped the 
current state, but I don’t know how interesting this is:
https://gist.github.com/href/a42c30e001789f005e9aa748f6f858fc 


At the moment we don’t see any errors, but I do already count 135 incomplete 
requests in the current log (out of 3 Million).

This number is typical for most days, where we’ll see something like 150 such 
requests. Our working theory is that out of the 1024 maximum outstanding 
requests of the throttler, ~150 get lost every day to those incomplete 
requests, until our need for up to 400 requests per instance can no longer be 
met (first a few will be over the watermark, then more, then all).
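
The incomplete-request count above comes from comparing request starts against 
completions in the RGW log; a rough sketch of that check (the log path is an 
assumption, and the grep patterns are simply the phrases quoted below):

    log=/var/log/ceph/ceph-client.rgw.log   # assumption: adjust to the actual log path
    started=$(grep -c 'starting new request' "$log")
    finished=$(grep -c 'req done' "$log")
    echo "incomplete requests: $((started - finished))"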

For those incomplete requests we know that the following line is executed, 
producing “starting new request”:
https://github.com/ceph/ceph/blob/8f393c0fc1886a369d213d5e5791c10cb1591828/src/rgw/rgw_process.cc#L187
 


However, it never reaches “req done” in the same function:
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_process.cc#L350 


That entry, and the “beast” entry is missing for those few requests.

Cheers, Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW seems to not clean up after some requests

2020-11-02 Thread Denis Krienbühl
Hi everyone

We have faced some RGW outages recently, with the RGW returning HTTP 503. First 
for a few, then for most, then all requests - in the course of 1-2 hours. This 
seems to have started since we have updated from 15.2.4 to 15.2.5.

The line that accompanies these outages in the log is the following:

s3:list_bucket Scheduling request failed with -2218

It first pops up a few times here and there, until it eventually applies to all 
requests. It seems to indicate that the throttler has reached the limit of open 
connections.

As we run a pair of HAProxy instances in front of RGW, which limit the number 
of connections to the two RGW instances to 400, this limit should never be 
reached. We do use RGW metadata sync between the instances, which could account 
for some extra connections, but if I look at open TCP connections between the 
instances I can count no more than 20 at any given time.
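
For reference, the open-connection count was taken with something along these lines 
(7480 is an assumed RGW frontend port; adjust it to the actual beast configuration):

    ss -tn state established '( sport = :7480 or dport = :7480 )' | tail -n +2 | wc -l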

I also noticed that some requests in the RGW log seem to never complete. 
That is, I can find a ‘starting new request’ line, but no associated ‘req done’ 
or ‘beast’ line.

I don’t think there are any hung connections around, as they are killed by 
HAProxy after a short timeout.

Looking at the code, it seems as if the throttler in use (SimpleThrottler) 
eventually reaches the maximum count of 1024 connections 
(outstanding_requests), and never recovers. I believe that the request_complete 
function is not called in all cases, but I am not familiar with the Ceph 
codebase, so I am not sure.

See 
https://github.com/ceph/ceph/blob/cc17681b478594aa39dd80437256a54e388432f0/src/rgw/rgw_dmclock_async_scheduler.h#L166-L214
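
The 1024 figure corresponds to rgw_max_concurrent_requests, so as a stop-gap the 
limit could presumably be raised while the leak is investigated; a sketch, not a 
recommendation for a permanent value:

    ceph config set client.rgw rgw_max_concurrent_requests 2048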
 


Does anyone see the same phenomenon? Could this be a bug in the request 
handling of RGW, or am I wrong in my assumptions?

For now we’re just restarting our RGWs regularly, which seems to keep the 
problem at bay.

Thanks for any hints.

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: virtual machines crashes after upgrade to octopus

2020-09-24 Thread Denis Krienbühl
I’m interested in the following as well. Any chance you could point us to a 
specific commit, Jason?

> On 14 Sep 2020, at 13:55, Jason Dillaman  wrote:
> 
> Can you try the latest development release of Octopus [1]? A librbd
> crash fix has been sitting in that branch for about a month now to be
> included in the next point release.

> On 22 Sep 2020, at 11:48, Michael Bisig  wrote:
> 
> We also facing the problem and we would like to upgrade the clients to the 
> specific release.
> @jason can you point us to the respective commit and the point release that 
> contains the fix?


Cheers, Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RandomCrashes on OSDs Attached to Mon Hosts with Octopus

2020-09-01 Thread Denis Krienbühl
Hi Igor

To bring this thread to a conclusion: We managed to stop the random crashes by 
restarting each of the OSDs manually.

After upgrading the cluster we reshuffled a lot of our data by changing PG 
counts. It seems like the memory reserved during that time was never released 
back to the OS.

Though we did not see any change in swap usage, with swap page in/out actually 
being lower than before the upgrade, the OSDs did not climb back to the memory 
levels they had used before the restart in the days that followed. We also stopped 
seeing random crashes.

I can’t say definitively what the error was, but for us these random crashes were 
solved by restarting all OSDs. Maybe this helps somebody else searching for 
this error in the future.

Thanks again for your help!

Denis

> On 27 Aug 2020, at 13:46, Denis Krienbühl  wrote:
> 
> Hi Igor
> 
> Just to clarify:
> 
>>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>>> occurrences I could find were the ones that precede the crashes.
>> 
>> Are you able to find multiple _verify_csum precisely?
> 
> There are no “_verify_csum” entries whatsoever. I wrote that wrongly.
> I could only find “checksum mismatch” right when the crash happens.
> 
> Sorry for the confusion.
> 
> I will keep tracking those counters and have a look at monitor/osd memory 
> tracking.
> 
> Cheers,
> 
> Denis
> 
>> On 27 Aug 2020, at 13:39, Igor Fedotov  wrote:
>> 
>> Hi Denis
>> 
>> please see my comments inline.
>> 
>> 
>> Thanks,
>> 
>> Igor
>> 
>> On 8/27/2020 10:06 AM, Denis Krienbühl wrote:
>>> Hi Igor,
>>> 
>>> Thanks for your input. I tried to gather as much information as I could to
>>> answer your questions. Hopefully we can get to the bottom of this.
>>> 
>>>> 0) What is backing disks layout for OSDs in question (main device type?, 
>>>> additional DB/WAL devices?).
>>> Everything is on a single Intel NVMe P4510 using dmcrypt with 2 OSDs per 
>>> NVMe
>>> device. There is no additional DB/WAL device and there are no HDDs involved.
>>> 
>>> Also note that we use 40 OSDs per host with a memory target of 
>>> 6'174'015'488.
>>> 
>>>> 1) Please check all the existing logs for OSDs at "failing" nodes for 
>>>> other checksum errors (as per my comment #38)
>>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>>> occurrences I could find were the ones that precede the crashes.
>> 
>> Are you able to find multiple _verify_csum precisely?
>> 
>> If so this means data read failures were observed at user data not RocksDB 
>> one. Which backs the hypothesis about interim  disk read
>> 
>> errors as a root cause. User data reading has quite a different access stack 
>> and is able to retry after such errors hence they aren't that visible.
>> 
>> But having checksum failures for both DB and user data points to the same 
>> root cause at lower layers (kernel, I/O stack etc).
>> 
>> It might be interesting whether _verify_csum and RocksDB csum were happening 
>> nearly at the same period of time. Not even for a single OSD but for 
>> different OSDs of the same node.
>> 
>> This might indicate that the node was suffering from some disease at that time. 
>> Anything suspicious from system-wide logs for this time period?
>> 
>>> 
>>>> 2) Check if BlueFS spillover is observed for any failing OSDs.
>>> As everything is on the same device, there can be no spillover, right?
>> Right
>>> 
>>>> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs 
>>>> at nodes in question. See comments 38-42 on the details. Any non-zero 
>>>> values?
>>> I monitored this over night by repeatedly polling this performance counter 
>>> over
>>> all OSDs on the mons. Only one OSD, which has crashed in the past, has had a
>>> value of 1 since I started measuring. All the other OSDs, including the ones
>>> that crashed over night, have a value of 0. Before and after the crash.
>> 
>> Even a single occurrence isn't expected - this counter should always be 
>> equal to 0. And presumably these are peak hours when the cluster is exposed 
>> to the issue at most. Night is likely to be not the the peak period though. 
>> So please keep tracking...
>> 
>> 
>>> 
>>>> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.

[ceph-users] Re: RandomCrashes on OSDs Attached to Mon Hosts with Octopus

2020-08-27 Thread Denis Krienbühl
Hi Igor

Just to clarify:

>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>> occurrences I could find were the ones that precede the crashes.
> 
> Are you able to find multiple _verify_csum precisely?

There are no “_verify_csum” entries whatsoever. I wrote that wrongly.
I could only find “checksum mismatch” right when the crash happens.

Sorry for the confusion.

I will keep tracking those counters and have a look at monitor/osd memory 
tracking.

Cheers,

Denis

> On 27 Aug 2020, at 13:39, Igor Fedotov  wrote:
> 
> Hi Denis
> 
> please see my comments inline.
> 
> 
> Thanks,
> 
> Igor
> 
> On 8/27/2020 10:06 AM, Denis Krienbühl wrote:
>> Hi Igor,
>> 
>> Thanks for your input. I tried to gather as much information as I could to
>> answer your questions. Hopefully we can get to the bottom of this.
>> 
>>> 0) What is backing disks layout for OSDs in question (main device type?, 
>>> additional DB/WAL devices?).
>> Everything is on a single Intel NVMe P4510 using dmcrypt with 2 OSDs per NVMe
>> device. There is no additional DB/WAL device and there are no HDDs involved.
>> 
>> Also note that we use 40 OSDs per host with a memory target of 6'174'015'488.
>> 
>>> 1) Please check all the existing logs for OSDs at "failing" nodes for other 
>>> checksum errors (as per my comment #38)
>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>> occurrences I could find were the ones that precede the crashes.
> 
> Are you able to find multiple _verify_csum precisely?
> 
> If so this means data read failures were observed at user data not RocksDB 
> one. Which backs the hypothesis about interim  disk read
> 
> errors as a root cause. User data reading has quite a different access stack 
> and is able to retry after such errors hence they aren't that visible.
> 
> But having checksum failures for both DB and user data points to the same 
> root cause at lower layers (kernel, I/O stack etc).
> 
> It might be interesting whether _verify_csum and RocksDB csum were happening 
> nearly at the same period of time. Not even for a single OSD but for 
> different OSDs of the same node.
> 
> This might indicate that the node was suffering from some disease at that time. 
> Anything suspicious from system-wide logs for this time period?
> 
>> 
>>> 2) Check if BlueFS spillover is observed for any failing OSDs.
>> As everything is on the same device, there can be no spillover, right?
> Right
>> 
>>> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs 
>>> at nodes in question. See comments 38-42 on the details. Any non-zero 
>>> values?
>> I monitored this over night by repeatedly polling this performance counter 
>> over
>> all OSDs on the mons. Only one OSD, which has crashed in the past, has had a
>> value of 1 since I started measuring. All the other OSDs, including the ones
>> that crashed over night, have a value of 0. Before and after the crash.
> 
> Even a single occurrence isn't expected - this counter should always be equal 
> to 0. And presumably these are peak hours when the cluster is exposed to the 
> issue at most. Night is likely to be not the the peak period though. So 
> please keep tracking...
> 
> 
>> 
>>> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
>> The memory use of those nodes is pretty constant with ~6GB free, ~25GB 
>> available of 256GB.
>> There are also only a handful of pages being swapped, if at all.
>> 
>>> a hypothesis why mon hosts are affected only  - higher memory utilization 
>>> at these nodes is what causes disk reading failures to appear. RAM leakage 
>>> (or excessive utilization) in MON processes or something?
>> Since the memory usage is rather constant I'm not sure this is the case; I think 
>> we would see more of an up/down pattern. However we are not yet monitoring all 
>> processes, and that would be something I'd like to get some data on, but I'm not 
>> sure this is the right course of action at the moment.
> 
> Given the fact that colocation with monitors is probably the clue - suggest 
> to track  MON and OSD process at least.
> 
> And high memory pressure is just a working hypothesis for these disk failures 
> root cause. Something else (e.g. high disk utilization) might be another 
> trigger or it might just be wrong...
> 
> So please just pay some attention to this.
> 
>> 
>> What do you think, is it still plausible that we see a memory utilization 
>> problem, even though there's little variance in the memory usage patterns?

[ceph-users] Re: RandomCrashes on OSDs Attached to Mon Hosts with Octopus

2020-08-27 Thread Denis Krienbühl
Hi Igor,

Thanks for your input. I tried to gather as much information as I could to
answer your questions. Hopefully we can get to the bottom of this.

> 0) What is backing disks layout for OSDs in question (main device type?, 
> additional DB/WAL devices?).

Everything is on a single Intel NVMe P4510 using dmcrypt with 2 OSDs per NVMe
device. There is no additional DB/WAL device and there are no HDDs involved.

Also note that we use 40 OSDs per host with a memory target of 6'174'015'488.

> 1) Please check all the existing logs for OSDs at "failing" nodes for other 
> checksum errors (as per my comment #38)

I grepped the logs for "checksum mismatch" and "_verify_csum". The only
occurrences I could find were the ones that precede the crashes.

> 2) Check if BlueFS spillover is observed for any failing OSDs.

As everything is on the same device, there can be no spillover, right?

> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs at 
> nodes in question. See comments 38-42 on the details. Any non-zero values?

I monitored this over night by repeatedly polling this performance counter over
all OSDs on the mons. Only one OSD, which has crashed in the past, has had a
value of 1 since I started measuring. All the other OSDs, including the ones
that crashed over night, have a value of 0. Before and after the crash.
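
The polling itself was just a loop over the OSD admin sockets on each host, roughly 
like this sketch (the socket path glob is the stock one and may differ per 
deployment):

    for sock in /var/run/ceph/ceph-osd.*.asok; do
        printf '%s: ' "$sock"
        ceph daemon "$sock" perf dump | grep -o '"bluestore_reads_with_retries": [0-9]*'
    done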

> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.

The memory use of those nodes is pretty constant, with ~6GB free and ~25GB 
available of 256GB.
There are also only a handful of pages being swapped, if at all.

> a hypothesis why mon hosts are affected only  - higher memory utilization at 
> these nodes is what causes disk reading failures to appear. RAM leakage (or 
> excessive utilization) in MON processes or something?

Since the memory usage is rather constant I'm not sure this is the case; I think 
we would see more of an up/down pattern. However we are not yet monitoring all 
processes, and that would be something I'd like to get some data on, but I'm not 
sure this is the right course of action at the moment.

What do you think, is it still plausible that we see a memory utilization
problem, even though there's little variance in the memory usage patterns?

The approaches we currently consider are to upgrade our kernel and to lower the 
memory target somewhat.
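
Lowering the target would amount to something like the following (the 4 GiB value 
is only illustrative, not what we have settled on):

    ceph config set osd osd_memory_target 4294967296   # 4 GiB, illustrative value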

Cheers,

Denis


> On 26 Aug 2020, at 15:29, Igor Fedotov  wrote:
> 
> Hi Denis,
> 
> this reminds me the following ticket: https://tracker.ceph.com/issues/37282
> 
> Please note they mentioned co-location with mon in comment #29.
> 
> 
> Working hypothesis for this ticket is the interim disk read failures which 
> cause RocksDB checksum failures. Earlier we observed such a problem for main 
> device. Presumably it's heavy memory pressure which causes kernel to be 
> failing this way.  See my comment #38 there.
> 
> So I'd like to see answers/comments for the following questions:
> 
> 0) What is backing disks layout for OSDs in question (main device type?, 
> additional DB/WAL devices?).
> 
> 1) Please check all the existing logs for OSDs at "failing" nodes for other 
> checksum errors (as per my comment #38)
> 
> 2) Check if BlueFS spillover is observed for any failing OSDs.
> 
> 3) Check "bluestore_reads_with_retries" performance counters for all OSDs at 
> nodes in question. See comments 38-42 on the details. Any non-zero values?
> 
> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> 
> 
> 
> 
> On 8/26/2020 3:47 PM, Denis Krienbühl wrote:
>> Hi!
>> 
>> We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). 
>> Since
>> then, our largest cluster is experiencing random crashes on OSDs attached to 
>> the
>> mon hosts.
>> 
>> This is the crash we are seeing (cut for brevity, see links in post 
>> scriptum):
>> 
>>{
>>"ceph_version": "15.2.4",
>>"utsname_release": "4.15.0-72-generic",
>>"assert_condition": "r == 0",
>>"assert_func": "void 
>> BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
>>    "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
>>    "assert_line": 11430,
>>    "assert_thread_name": "bstore_kv_sync",
>>    "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In function 'void 
>> BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 

[ceph-users] RandomCrashes on OSDs Attached to Mon Hosts with Octopus

2020-08-26 Thread Denis Krienbühl
Hi!

We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). Since
then, our largest cluster is experiencing random crashes on OSDs attached to the
mon hosts.

This is the crash we are seeing (cut for brevity, see links in post scriptum):

   {
   "ceph_version": "15.2.4",
   "utsname_release": "4.15.0-72-generic",
   "assert_condition": "r == 0",
   "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, 
bool)",
    "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
    "assert_line": 11430,
    "assert_thread_name": "bstore_kv_sync",
    "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In function 'void 
BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 
2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: 11430: FAILED ceph_assert(r == 0)\n",
   "backtrace": [
   "(()+0x12890) [0x7fc576875890]",
   "(gsignal()+0xc7) [0x7fc575527e97]",
   "(abort()+0x141) [0x7fc575529801]",
   "(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a5) [0x559ef9ae97b5]",
   "(ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x559ef9ae993f]",
   "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) 
[0x559efa0245b0]",
   "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
   "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
   "(()+0x76db) [0x7fc57686a6db]",
   "(clone()+0x3f) [0x7fc57560a88f]"
   ]
   }

Right before the crash occurs, we see the following message in the crash log:

   -3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: 
[db/db_impl_compaction_flush.cc:2212] Waiting after background compaction 
error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 
 in db/815839.sst offset 67107066 size 3808, Accumulated background error 
counts: 1
   -2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common 
error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 
 in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:

In short, we see a Rocksdb corruption error after background compaction, when 
this happens.

When an OSD crashes, which happens about 10-15 times a day, it restarts and
resumes work without any further problems.

We are pretty confident that this is not a hardware issue, due to the following 
facts:

* The crashes occur on 5 different hosts over 3 different racks.
* There is no smartctl/dmesg output that could explain it.
* It usually happens to a different OSD that did not crash before.

Still we checked the following on a few OSDs/hosts:

* We can do a manual compaction, both offline and online.
* We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the OSDs.
* We manually compacted a number of OSDs, one of which crashed hours later.

The only thing we have noticed so far: It only happens to OSDs that are attached
to a mon host. *None* of the non-mon host OSDs have had a crash!

Does anyone have a hint what could be causing this? We currently have no good
theory that could explain this, much less have a fix or workaround.

Any help would be greatly appreciated.

Denis

Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt 

Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt 


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Disproportionate Metadata Size

2020-05-13 Thread Denis Krienbühl
Sure, the db device has a size of 22.5G, the primary deice has 100G.

Here’s the complete ceph osd df output of one of the OSDs experiencing this 
issue:

ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP META    AVAIL   %USE  VAR  PGS
14   hdd 0.11960      1.0 122 GiB 118 GiB 2.4 GiB  0 B 116 GiB 4.2 GiB 96.55 3.06 195

I pasted the full output here, since this might not be so readable in the 
e-mail:
https://pastebin.com/NWYTHwxh

The OSD in question has the ID 14.

Let me know if there’s anything else I can provide you with.

Cheers,

Denis

> On 13 May 2020, at 11:49, Eugen Block  wrote:
> 
> Hi Daniel,
> 
> I had the exact same issue in a (virtual) Luminous cluster without much data 
> in it. The root cause was that my OSDs were too small (10 GB only) and the 
> rocksDB also grew until manual compaction. I had configured the small OSDs 
> intentionally because it was never supposed to store lots of data. Can you 
> provide some more details like 'ceph osd df'?
> 
> Manual compaction did help, but then I recreated the OSDs with 20 GB each and 
> the issue didn't occur after that.
> 
> Regards,
> Eugen
> 
> 
> Zitat von Denis Krienbühl :
> 
>> Hi
>> 
>> On one of our Ceph clusters, some OSDs have been marked as full. Since this 
>> is a staging cluster that does not have much data on it, this is strange.
>> 
>> Looking at the full OSDs through “ceph osd df” I figured out that the space 
>> is mostly used by metadata:
>> 
>>SIZE: 122 GiB
>>USE: 118 GiB
>>DATA: 2.4 GiB
>>META: 116 GiB
>> 
>> We run mimic, and for the affected OSDs we use a db device (nvme) in 
>> addition to the primary device (hdd).
>> 
>> In the logs we see the following errors:
>> 
>>2020-05-12 17:10:26.089 7f183f604700  1 bluefs _allocate failed to 
>> allocate 0x40 on bdev 1, free 0x0; fallback to bdev 2
>>2020-05-12 17:10:27.113 7f183f604700  1 
>> bluestore(/var/lib/ceph/osd/ceph-8) _balance_bluefs_freespace gifting 
>> 0x180a00~40 to bluefs
>>2020-05-12 17:10:27.153 7f183f604700  1 bluefs add_block_extent bdev 2 
>> 0x180a00~40
>> 
>> We assume it is an issue with Rocksdb, as the following call will quickly 
>> fix the problem:
>> 
>>ceph daemon osd.8 compact
>> 
>> The question is, why is this happening? I would think that “compact" is 
>> something that runs automatically from time to time, but I’m not sure.
>> 
>> Is it on us to run this regularly?
>> 
>> Any pointers are welcome. I’m quite new to Ceph :)
>> 
>> Cheers,
>> 
>> Denis
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Disproportionate Metadata Size

2020-05-13 Thread Denis Krienbühl
Hi

On one of our Ceph clusters, some OSDs have been marked as full. Since this is 
a staging cluster that does not have much data on it, this is strange.

Looking at the full OSDs through “ceph osd df” I figured out that the space is 
mostly used by metadata:

SIZE: 122 GiB
USE: 118 GiB
DATA: 2.4 GiB
META: 116 GiB

We run mimic, and for the affected OSDs we use a db device (nvme) in addition 
to the primary device (hdd).

In the logs we see the following errors:

2020-05-12 17:10:26.089 7f183f604700  1 bluefs _allocate failed to allocate 
0x40 on bdev 1, free 0x0; fallback to bdev 2
2020-05-12 17:10:27.113 7f183f604700  1 bluestore(/var/lib/ceph/osd/ceph-8) 
_balance_bluefs_freespace gifting 0x180a00~40 to bluefs
2020-05-12 17:10:27.153 7f183f604700  1 bluefs add_block_extent bdev 2 
0x180a00~40

We assume it is an issue with Rocksdb, as the following call will quickly fix 
the problem:

ceph daemon osd.8 compact

The question is, why is this happening? I would think that “compact" is 
something that runs automatically from time to time, but I’m not sure.

Is it on us to run this regularly?
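
If it does turn out to be on us, a periodic compaction across the local OSDs would 
look roughly like this sketch (deriving the ids from the OSD data directory names 
is an assumption about the layout):

    for dir in /var/lib/ceph/osd/ceph-*; do
        id=${dir##*-}
        ceph daemon "osd.$id" compact
    done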

Any pointers are welcome. I’m quite new to Ceph :)

Cheers,

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io