[ceph-users] Re: Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-09 Thread Janek Bevendorff



I meant this one: https://tracker.ceph.com/issues/55395


Ah, alright, almost forgot about that one.


Is there an "unmanaged: true" statement in this output?
ceph orch ls osd --export


No, it only contains the managed services that I configured.

Just out of curiosity, is there a "service_name" in your unit.meta for 
that OSD?


grep service_name /var/lib/ceph/{fsid}/osd.{id}/unit.meta


Indeed! It says "osd" for all the unmanaged OSDs. When I change it to 
the name of my managed service and restart the daemon, it shows up in 
ceph orch ps --service-name. I checked whether cephadm deploy perhaps 
has an undocumented flag for setting the service name, but couldn't find 
any. I could run deploy, change the service name and then restart the 
service, but that's quite ugly. Any better ideas?
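
For reference, the workaround described above boils down to roughly the following (a sketch only: the jq edit and the spec name "osd.my-spec" are placeholders, not commands taken from this thread, and jq is assumed to be available):

# Rough sketch: point an adopted OSD daemon at a managed service spec.
fsid=$(ceph fsid)          # or hard-code the cluster fsid
osd_id=96                  # placeholder OSD id
meta=/var/lib/ceph/${fsid}/osd.${osd_id}/unit.meta
# rewrite the service_name field in unit.meta (it is plain JSON)
jq '.service_name = "osd.my-spec"' "$meta" > "${meta}.tmp" && mv "${meta}.tmp" "$meta"
# restart the daemon so it reports under the new service name
systemctl restart ceph-${fsid}@osd.${osd_id}.service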


Janek






Quoting Janek Bevendorff:


Hi Eugen,

I stopped one OSD (which was deployed by ceph orch before) and this 
is what the MGR log says:


2023-11-09T13:35:36.941+ 7f067f1f0700  0 [cephadm DEBUG 
cephadm.services.osd] osd id 96 daemon already exists


Before and after that are JSON dumps of the LVM properties of all 
OSDs. I get the same messages when I delete all files under 
/var/lib/ceph/{fsid}/osd.96 and the OSD service symlink in
/etc/systemd/system/.


ceph cephadm osd activate --verbose only shows this:

[{'flags': 8,
  'help': 'Start OSD containers for existing OSDs',
  'module': 'mgr',
  'perm': 'rw',
  'sig': [argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=cephadm),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=osd),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=activate),
  argdesc(, req=True, 
name=host, n=N, numseen=0)]}]
Submitting command:  {'prefix': 'cephadm osd activate', 'host': 
['XXX'], 'target': ('mon-mgr', '')}
submit {"prefix": "cephadm osd activate", "host": ["XXX"], "target": 
["mon-mgr", ""]} to mon-mgr

Created no osd(s) on host XXX; already created?

I suspect that it doesn't work for OSDs that are not explicitly 
marked as managed by ceph orch. But how do I do that?



I also commented the tracker issue you referred to.


Which issue exactly do you mean?

Janek




Thanks,
Eugen

Quoting Janek Bevendorff:

Actually, ceph cephadm osd activate doesn't do what I expected it 
to do. It  seems to be looking for new OSDs to create instead of 
looking for existing OSDs to activate. Hence, it does nothing on my 
hosts and only prints 'Created no osd(s) on host XXX; already 
created?' So this wouldn't be an option either, even if I were 
willing to deploy the admin key on the OSD hosts.



On 07/11/2023 11:41, Janek Bevendorff wrote:

Hi,

We have our cluster RAM-booted, so we start from a clean slate 
after every reboot. That means I need to redeploy all OSD daemons 
as well. At the moment, I run cephadm deploy via Salt on the 
rebooted node, which brings the deployed OSDs back up, but the 
problem with this is that the deployed OSD shows up as 'unmanaged' 
in ceph orch ps afterwards.


I could simply skip the cephadm call and wait for the Ceph 
orchestrator to reconcile and auto-activate the disks, but that 
can take up to 15 minutes, which is unacceptable. Running ceph 
cephadm osd activate is not an option either, since I don't have 
the admin keyring deployed on the OSD hosts (I could do that, but 
I don't want to).


How can I manually activate the OSDs after a reboot and hand over 
control to the Ceph orchestrator afterwards? I checked the 
deployments in /var/lib/ceph/, but the only difference I 
found between my manual cephadm deployment and what ceph orch does 
is that the device links to /dev/mapper/ceph--... instead of 
/dev/ceph-...


Any hints appreciated!

Janek











[ceph-users] Re: Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-09 Thread Janek Bevendorff

Hi Eugen,

I stopped one OSD (which was deployed by ceph orch before) and this is 
what the MGR log says:


2023-11-09T13:35:36.941+ 7f067f1f0700  0 [cephadm DEBUG 
cephadm.services.osd] osd id 96 daemon already exists


Before and after that are JSON dumps of the LVM properties of all OSDs. 
I get the same messages when I delete all files under 
/var/lib/ceph/{fsid}/osd.96 and the OSD service symlink in
/etc/systemd/system/.


ceph cephadm osd activate --verbose only shows this:

[{'flags': 8,
  'help': 'Start OSD containers for existing OSDs',
  'module': 'mgr',
  'perm': 'rw',
  'sig': [argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=cephadm),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=osd),
  argdesc(, req=True, 
name=prefix, n=1, numseen=0, prefix=activate),
  argdesc(, req=True, 
name=host, n=N, numseen=0)]}]
Submitting command:  {'prefix': 'cephadm osd activate', 'host': ['XXX'], 
'target': ('mon-mgr', '')}
submit {"prefix": "cephadm osd activate", "host": ["XXX"], "target": 
["mon-mgr", ""]} to mon-mgr

Created no osd(s) on host XXX; already created?

I suspect that it doesn't work for OSDs that are not explicitly marked 
as managed by ceph orch. But how do I do that?



I also commented the tracker issue you referred to.


Which issue exactly do you mean?

Janek




Thanks,
Eugen

Quoting Janek Bevendorff:

Actually, ceph cephadm osd activate doesn't do what I expected it to 
do. It  seems to be looking for new OSDs to create instead of looking 
for existing OSDs to activate. Hence, it does nothing on my hosts and 
only prints 'Created no osd(s) on host XXX; already created?' So this 
wouldn't be an option either, even if I were willing to deploy the 
admin key on the OSD hosts.



On 07/11/2023 11:41, Janek Bevendorff wrote:

Hi,

We have our cluster RAM-booted, so we start from a clean slate after 
every reboot. That means I need to redeploy all OSD daemons as well. 
At the moment, I run cephadm deploy via Salt on the rebooted node, 
which brings the deployed OSDs back up, but the problem with this is 
that the deployed OSD shows up as 'unmanaged' in ceph orch ps 
afterwards.


I could simply skip the cephadm call and wait for the Ceph 
orchestrator to reconcile and auto-activate the disks, but that can 
take up to 15 minutes, which is unacceptable. Running ceph cephadm 
osd activate is not an option either, since I don't have the admin 
keyring deployed on the OSD hosts (I could do that, but I don't want 
to).


How can I manually activate the OSDs after a reboot and hand over 
control to the Ceph orchestrator afterwards? I checked the 
deployments in /var/lib/ceph/, but the only difference I found 
between my manual cephadm deployment and what ceph orch does is that 
the device links to /dev/mapper/ceph--... instead of /dev/ceph-...


Any hints appreciated!

Janek





--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de





[ceph-users] Re: Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-07 Thread Janek Bevendorff
Actually, ceph cephadm osd activate doesn't do what I expected it to do. 
It  seems to be looking for new OSDs to create instead of looking for 
existing OSDs to activate. Hence, it does nothing on my hosts and only 
prints 'Created no osd(s) on host XXX; already created?' So this 
wouldn't be an option either, even if I were willing to deploy the admin 
key on the OSD hosts.



On 07/11/2023 11:41, Janek Bevendorff wrote:

Hi,

We have our cluster RAM-booted, so we start from a clean slate after 
every reboot. That means I need to redeploy all OSD daemons as well. 
At the moment, I run cephadm deploy via Salt on the rebooted node, 
which brings the deployed OSDs back up, but the problem with this is 
that the deployed OSD shows up as 'unmanaged' in ceph orch ps afterwards.


I could simply skip the cephadm call and wait for the Ceph 
orchestrator to reconcile and auto-activate the disks, but that can 
take up to 15 minutes, which is unacceptable. Running ceph cephadm osd 
activate is not an option either, since I don't have the admin keyring 
deployed on the OSD hosts (I could do that, but I don't want to).


How can I manually activate the OSDs after a reboot and hand over 
control to the Ceph orchestrator afterwards? I checked the deployments 
in /var/lib/ceph/, but the only difference I found between my 
manual cephadm deployment and what ceph orch does is that the device 
links to /dev/mapper/ceph--... instead of /dev/ceph-...


Any hints appreciated!

Janek 




[ceph-users] Redeploy ceph orch OSDs after reboot, but don't mark as 'unmanaged'

2023-11-07 Thread Janek Bevendorff

Hi,

We have our cluster RAM-booted, so we start from a clean slate after 
every reboot. That means I need to redeploy all OSD daemons as well. At 
the moment, I run cephadm deploy via Salt on the rebooted node, which 
brings the deployed OSDs back up, but the problem with this is that the 
deployed OSD shows up as 'unmanaged' in ceph orch ps afterwards.


I could simply skip the cephadm call and wait for the Ceph orchestrator 
to reconcile and auto-activate the disks, but that can take up to 15 
minutes, which is unacceptable. Running ceph cephadm osd activate is not 
an option either, since I don't have the admin keyring deployed on the 
OSD hosts (I could do that, but I don't want to).


How can I manually activate the OSDs after a reboot and hand over 
control to the Ceph orchestrator afterwards? I checked the deployments 
in /var/lib/ceph/, but the only difference I found between my 
manual cephadm deployment and what ceph orch does is that the device 
links to /dev/mapper/ceph--... instead of /dev/ceph-...


Any hints appreciated!

Janek





[ceph-users] RadosGW load balancing with Kubernetes + ceph orch

2023-10-23 Thread Janek Bevendorff

Hey all,

My Ceph cluster is managed mostly by cephadm / ceph orch to avoid circular dependencies in our infrastructure deployment. Our RadosGW endpoints, however, are managed by Kubernetes, since it provides proper load balancing and service health checks.


This leaves me in the unsatisfactory situation that Ceph complains about 
'stray' RGW daemons in the cluster. The only two solutions I found were 
a) to turn off the warning, which applies to all daemons and not just the 
RGWs (not pretty!), or b) to move the deployment out of Kubernetes. For 
the latter, I could define external Endpoints in Kubernetes, so that I 
still have load balancing, but then I don't have proper health checks any 
more. That means if one of the RGW endpoints goes down, requests to our 
S3 endpoint will intermittently time out in round-robin fashion (not 
pretty at all!).
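
For reference, option a) would presumably amount to something like this (assuming it is the cephadm module's stray-daemon check that fires the warning):

ceph config set mgr mgr/cephadm/warn_on_stray_daemons false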


Can you think of a better option to solve this? I would already be 
satisfied with turning off the warning for RGW daemons only, but there 
doesn't seem to be a config option for that.


Thanks
Janek





[ceph-users] Ceph orch OSD redeployment after boot on stateless RAM root

2023-10-23 Thread Janek Bevendorff

Hi,

I recently moved from a manual Ceph deployment using Saltstack to a 
hybrid of Saltstack and cephadm / ceph orch. We are provisioning our 
Ceph hosts using a stateless PXE RAM root, so I definitely need 
Saltstack to bootstrap at least the Ceph APT repository and the MON/MGR 
deployment. After that, ceph orch can take over and deploy the remaining 
daemons.


The MONs/MGRs are deployed after each reboot with

cephadm deploy --name mon.{{ ceph.node_id }} --fsid {{ ceph.conf.global.fsid }} --config /etc/ceph/ceph.conf
cephadm deploy --name mgr.{{ ceph.node_id }} --fsid {{ ceph.conf.global.fsid }} --config /etc/ceph/ceph.conf


(the MON store is provided in /var/lib/ceph/{{ ceph.conf.global.fsid 
}}/mon.{{ ceph.node_id }}).


Since cephadm ceph-volume lvm activate --all is broken (see 
https://tracker.ceph.com/issues/55395), I am activating each OSD 
individually like this:


cephadm deploy --name osd.{{ osd_id }} --fsid {{ ceph.conf.global.fsid }} --osd-fsid {{ osd_fsid }} --config /etc/ceph/ceph.conf
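
For completeness, a rough sketch of how the osd_id / osd_fsid pairs could be pulled straight out of ceph-volume on the host and fed into the same deploy call (the jq filter is an assumption, not part of the actual Salt state):

# Hypothetical helper: enumerate local OSDs and deploy each one.
fsid={{ ceph.conf.global.fsid }}
cephadm ceph-volume -- lvm list --format json |
  jq -r 'to_entries[] | .value[0].tags | "\(."ceph.osd_id") \(."ceph.osd_fsid")"' |
  while read -r osd_id osd_fsid; do
    cephadm deploy --name osd.${osd_id} --fsid ${fsid} \
      --osd-fsid ${osd_fsid} --config /etc/ceph/ceph.conf
  done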


Now my question: Is there a better way to do this and can ceph orch take 
care of this in the same way it deploys my MDS?


All OSDs are listed as <unmanaged> in ceph orch ls (I think this is by 
design?) and I cannot find a way to activate them automatically via ceph 
orch when the host boots up. I tried


ceph cephadm osd activate HOSTNAME,

but all I get is "Created no osd(s) on host HOSTNAME; already created?"

The docs only talk about how I can create new OSDs, but not how I can 
automatically redeploy existing OSDs after a fresh boot. It seems like 
it is generally assumed that OSD deployments are persistent and next 
time the host boots, systemd simply activates the existing units.


I'd be glad about any hints!
Janek





[ceph-users] Re: pgs incossistent every day same osd

2023-09-26 Thread Janek Bevendorff
Yes. If you've seen this reoccur multiple times, you can expect it will 
only get worse with time. You should replace the disk soon. Very often 
these disks are also starting to slow down other operations in the 
cluster as the read times increase.



On 26/09/2023 13:17, Jorge JP wrote:

Hello,

First, sorry for my english...

For a few weeks now, I have been receiving daily HEALTH_ERR notifications from my Ceph cluster. The notifications are related to inconsistent PGs and always concern the same OSD.

I ran a smartctl test on the disk assigned to that OSD and the result is "passed".

Should I replace the disk with a new one?

Regards!



--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de





[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-26 Thread Janek Bevendorff
I have had defer_client_eviction_on_laggy_osds set to false for a while 
and I haven't had any further warnings so far (obviously), but also all 
the other problems with laggy clients bringing our MDS to a crawl over 
time seem to have gone. So at least on our cluster, the new configurable 
seems to do more harm than good. I can see why it's there, but the 
implementation appears to be rather buggy.


I also set mds_session_blocklist_on_timeout to false, because I had the 
impression that clients were being blocklisted too quickly.
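
For reference, the two settings mentioned above amount to roughly the following (option names as they exist in Pacific; treat this as a sketch rather than a recipe):

ceph config set mds defer_client_eviction_on_laggy_osds false
ceph config set mds mds_session_blocklist_on_timeout false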



On 21/09/2023 09:24, Janek Bevendorff wrote:


Hi,

I took a snapshot of MDS.0's logs. We have five active MDS in total, 
each one reporting laggy OSDs/clients, but I cannot find anything 
related to that in the log snippet. Anyhow, I uploaded the log for 
your reference with ceph-post-file ID 
79b5138b-61d7-4ba7-b0a9-c6f02f47b881.


This is what ceph status looks like after a couple of days. This is 
not normal:


HEALTH_WARN
55 client(s) laggy due to laggy OSDs
8 clients failing to respond to capability release
1 clients failing to advance oldest client/flush tid
5 MDSs report slow requests

(55 clients are actually "just" 11 unique client IDs, but each MDS 
makes their own report.)


osd mon_osd_laggy_halflife is not configured on our cluster, so it's 
the default of 3600.



Janek


On 20/09/2023 13:17, Dhairya Parmar wrote:

Hi Janek,

The PR Venky mentioned uses the OSD's laggy parameters (laggy_interval and 
laggy_probability) to determine whether an OSD is laggy. These laggy 
parameters only reset to 0 if the interval between the last modification 
of the OSDMap and the timestamp at which the OSD was marked down exceeds 
the grace interval threshold, which is the value we get from 
`mon_osd_laggy_halflife * 48`. mon_osd_laggy_halflife is configurable and 
defaults to 3600, so the laggy parameters reset to 0 only once that 
interval exceeds 172800. I'd recommend taking a look at your configured 
value (using: ceph config get osd mon_osd_laggy_halflife).

There is also a "hack" to reset the parameters manually(*Not 
recommended, just

for info*): set mon_osd_laggy_weight to 1 using `ceph config set osd
mon_osd_laggy_weight 1` and reboot the OSD(s) which is/are being said 
laggy and

you will see the lagginess go away.


*Dhairya Parmar*

Associate Software Engineer, CephFS

Red Hat Inc. <https://www.redhat.com/>

dpar...@redhat.com

<https://www.redhat.com/>



On Wed, Sep 20, 2023 at 3:25 PM Venky Shankar  
wrote:


Hey Janek,

I took a closer look at various places where the MDS would consider a
client as laggy and it seems like a wide variety of reasons are taken
into consideration and not all of them might be a reason to defer
client
eviction, so the warning is a bit misleading. I'll post a PR for
this. In
the meantime, could you share the debug logs stated in my
previous email?

On Wed, Sep 20, 2023 at 3:07 PM Venky Shankar
 wrote:

> Hi Janek,
>
> On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <
> janek.bevendo...@uni-weimar.de> wrote:
>
>> Hi Venky,
>>
>> As I said: There are no laggy OSDs. The maximum ping I have
for any OSD
>> in ceph osd perf is around 60ms (just a handful, probably
aging disks). The
>> vast majority of OSDs have ping times of less than 1ms. Same
for the host
>> machines, yet I'm still seeing this message. It seems that the
affected
>> hosts are usually the same, but I have absolutely no clue why.
>>
>
> It's possible that you are running into a bug which does not
clear the
> laggy clients list which the MDS sends to monitors via beacons.
Could you
> help us out with debug mds logs (by setting debug_mds=20) for
the active
> mds for around 15-20 seconds and share the logs please? Also
reset the log
> level once done since it can hurt performance.
>
> # ceph config set mds.<> debug_mds 20
>
> and reset via
>
> # ceph config rm mds.<> debug_mds
>
>
>> Janek
>>
>>
>> On 19/09/2023 12:36, Venky Shankar wrote:
>>
>> Hi Janek,
>>
>> On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <
>> janek.bevendo...@uni-weimar.de> wrote:
>>
>>> Thanks! However, I still don't really understand why I am
seeing this.
>>>
>>
>> This is due to a changes that was merged recently in pacific
>>
>> https://github.com/ceph/ceph/pull/52270
>>
>> The MDS would not evict laggy clients if the OSDs report as
laggy. Laggy
>> OSDs can cause cephfs clients to not flush dirty data (during
cap revo

[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-21 Thread Janek Bevendorff

Hi,

I took a snapshot of MDS.0's logs. We have five active MDS in total, 
each one reporting laggy OSDs/clients, but I cannot find anything 
related to that in the log snippet. Anyhow, I uploaded the log for your 
reference with ceph-post-file ID 79b5138b-61d7-4ba7-b0a9-c6f02f47b881.


This is what ceph status looks like after a couple of days. This is not 
normal:


HEALTH_WARN
55 client(s) laggy due to laggy OSDs
8 clients failing to respond to capability release
1 clients failing to advance oldest client/flush tid
5 MDSs report slow requests

(55 clients are actually "just" 11 unique client IDs, but each MDS makes 
their own report.)


osd mon_osd_laggy_halflife is not configured on our cluster, so it's the 
default of 3600.



Janek


On 20/09/2023 13:17, Dhairya Parmar wrote:

Hi Janek,

The PR Venky mentioned uses the OSD's laggy parameters (laggy_interval and 
laggy_probability) to determine whether an OSD is laggy. These laggy 
parameters only reset to 0 if the interval between the last modification 
of the OSDMap and the timestamp at which the OSD was marked down exceeds 
the grace interval threshold, which is the value we get from 
`mon_osd_laggy_halflife * 48`. mon_osd_laggy_halflife is configurable and 
defaults to 3600, so the laggy parameters reset to 0 only once that 
interval exceeds 172800. I'd recommend taking a look at your configured 
value (using: ceph config get osd mon_osd_laggy_halflife).

There is also a "hack" to reset the parameters manually(*Not 
recommended, just

for info*): set mon_osd_laggy_weight to 1 using `ceph config set osd
mon_osd_laggy_weight 1` and reboot the OSD(s) which is/are being said 
laggy and

you will see the lagginess go away.


*Dhairya Parmar*

Associate Software Engineer, CephFS

Red Hat Inc. <https://www.redhat.com/>

dpar...@redhat.com

<https://www.redhat.com/>



On Wed, Sep 20, 2023 at 3:25 PM Venky Shankar  wrote:

Hey Janek,

I took a closer look at various places where the MDS would consider a
client as laggy and it seems like a wide variety of reasons are taken
into consideration and not all of them might be a reason to defer
client
eviction, so the warning is a bit misleading. I'll post a PR for
this. In
the meantime, could you share the debug logs stated in my previous
email?

On Wed, Sep 20, 2023 at 3:07 PM Venky Shankar
 wrote:

> Hi Janek,
>
> On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <
> janek.bevendo...@uni-weimar.de> wrote:
>
>> Hi Venky,
>>
>> As I said: There are no laggy OSDs. The maximum ping I have for
any OSD
>> in ceph osd perf is around 60ms (just a handful, probably aging
disks). The
>> vast majority of OSDs have ping times of less than 1ms. Same
for the host
>> machines, yet I'm still seeing this message. It seems that the
affected
>> hosts are usually the same, but I have absolutely no clue why.
>>
>
> It's possible that you are running into a bug which does not
clear the
> laggy clients list which the MDS sends to monitors via beacons.
Could you
> help us out with debug mds logs (by setting debug_mds=20) for
the active
> mds for around 15-20 seconds and share the logs please? Also
reset the log
> level once done since it can hurt performance.
>
> # ceph config set mds.<> debug_mds 20
>
> and reset via
>
> # ceph config rm mds.<> debug_mds
>
>
>> Janek
>>
>>
>> On 19/09/2023 12:36, Venky Shankar wrote:
>>
>> Hi Janek,
>>
>> On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <
>> janek.bevendo...@uni-weimar.de> wrote:
>>
>>> Thanks! However, I still don't really understand why I am
seeing this.
>>>
>>
>> This is due to a changes that was merged recently in pacific
>>
>> https://github.com/ceph/ceph/pull/52270
>>
>> The MDS would not evict laggy clients if the OSDs report as
laggy. Laggy
>> OSDs can cause cephfs clients to not flush dirty data (during
cap revokes
>> by the MDS) and thereby showing up as laggy and getting evicted
by the MDS.
>> This behaviour was changed and therefore you get warnings that
some client
>> are laggy but they are not evicted since the OSDs are laggy.
>>
>>
>>> The first time I had this, one of the clients was a remote
user dialling
>>> in via VPN, which could indeed be laggy. But I am also seeing
it from
>>> neighbouring hosts that are on the same physical network with
reliable ping
>>>

[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-19 Thread Janek Bevendorff

Hi Venky,

As I said: There are no laggy OSDs. The maximum ping I have for any OSD 
in ceph osd perf is around 60ms (just a handful, probably aging disks). 
The vast majority of OSDs have ping times of less than 1ms. Same for the 
host machines, yet I'm still seeing this message. It seems that the 
affected hosts are usually the same, but I have absolutely no clue why.


Janek


On 19/09/2023 12:36, Venky Shankar wrote:

Hi Janek,

On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff 
 wrote:


Thanks! However, I still don't really understand why I am seeing this.


This is due to a change that was merged recently in Pacific:

https://github.com/ceph/ceph/pull/52270

The MDS would not evict laggy clients if the OSDs report as laggy. 
Laggy OSDs can cause cephfs clients to not flush dirty data (during 
cap revokes by the MDS), thereby showing up as laggy and getting 
evicted by the MDS. This behaviour was changed, and therefore you now 
get warnings that some clients are laggy but they are not evicted, 
since the OSDs are laggy.


The first time I had this, one of the clients was a remote user
dialling in via VPN, which could indeed be laggy. But I am also
seeing it from neighbouring hosts that are on the same physical
network with reliable ping times way below 1ms. How is that
considered laggy?

Are some of your OSDs reporting as laggy? This can be checked via `perf dump`:

> ceph tell mds.<> perf dump
(search for op_laggy/osd_laggy)
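
A quick way to pull out just those counters (the mds id is a placeholder):

ceph tell mds.<id> perf dump | grep -E 'op_laggy|osd_laggy'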


On 18/09/2023 18:07, Laura Flores wrote:

Hi Janek,

There was some documentation added about it here:
https://docs.ceph.com/en/pacific/cephfs/health-messages/

There is a description of what it means, and it's tied to an mds
configurable.

On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff
 wrote:

Hey all,

Since the upgrade to Ceph 16.2.14, I keep seeing the
following warning:

10 client(s) laggy due to laggy OSDs

ceph health detail shows it as:

[WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
 mds.***(mds.3): Client *** is laggy; not evicted because
some
OSD(s) is/are laggy
 more of this...

When I restart the client(s) or the affected MDS daemons, the
message
goes away and then comes back after a while. ceph osd perf
does not list
any laggy OSDs (a few with 10-60ms ping, but overwhelmingly <
1ms), so
I'm at a total loss as to what this even means.

I have never seen this message before nor was I able to find
anything
about it. Do you have any idea what this message actually
means and how
I can get rid of it?

Thanks
Janek




-- 


Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage <https://ceph.io>

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com <mailto:lflo...@redhat.com>
M: +17087388804 



-- 
Bauhaus-Universität Weimar

Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de  <http://www.webis.de>




--
Cheers,
Venky


--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de




[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-18 Thread Janek Bevendorff

Thanks! However, I still don't really understand why I am seeing this.

The first time I had this, one of the clients was a remote user dialling 
in via VPN, which could indeed be laggy. But I am also seeing it from 
neighbouring hosts that are on the same physical network with reliable 
ping times way below 1ms. How is that considered laggy?



On 18/09/2023 18:07, Laura Flores wrote:

Hi Janek,

There was some documentation added about it here: 
https://docs.ceph.com/en/pacific/cephfs/health-messages/


There is a description of what it means, and it's tied to an mds 
configurable.


On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff 
 wrote:


Hey all,

Since the upgrade to Ceph 16.2.14, I keep seeing the following
warning:

10 client(s) laggy due to laggy OSDs

ceph health detail shows it as:

[WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
 mds.***(mds.3): Client *** is laggy; not evicted because some
OSD(s) is/are laggy
 more of this...

When I restart the client(s) or the affected MDS daemons, the message
goes away and then comes back after a while. ceph osd perf does
not list
any laggy OSDs (a few with 10-60ms ping, but overwhelmingly <
1ms), so
I'm at a total loss as to what this even means.

I have never seen this message before nor was I able to find anything
about it. Do you have any idea what this message actually means
and how
I can get rid of it?

Thanks
Janek




--

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage <https://ceph.io>

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com <mailto:lflo...@redhat.com>
M: +17087388804 




--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de




[ceph-users] CephFS warning: clients laggy due to laggy OSDs

2023-09-18 Thread Janek Bevendorff

Hey all,

Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:

10 client(s) laggy due to laggy OSDs

ceph health detail shows it as:

[WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
    mds.***(mds.3): Client *** is laggy; not evicted because some 
OSD(s) is/are laggy

    more of this...

When I restart the client(s) or the affected MDS daemons, the message 
goes away and then comes back after a while. ceph osd perf does not list 
any laggy OSDs (a few with 10-60ms ping, but overwhelmingly < 1ms), so 
I'm at a total loss as to what this even means.


I have never seen this message before nor was I able to find anything 
about it. Do you have any idea what this message actually means and how 
I can get rid of it?


Thanks
Janek





[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-19 Thread Janek Bevendorff

Hi Patrick,


The event log size of 3/5 MDS is also very high, still. mds.1, mds.3,
and mds.4 report between 4 and 5 million events, mds.0 around 1.4
million and mds.2 between 0 and 200,000. The numbers have been constant
since my last MDS restart four days ago.

I ran your ceph-gather.sh script a couple of times, but it only dumps
mds.0. Should I modify it to dump mds.3 instead so you can have a look?

Yes, please.


The session load on mds.3 had already resolved itself after a few days, 
so I cannot reproduce it any more. Right now, mds.0 has the highest load 
and a steadily growing event log, but it's not crazy (yet). Nonetheless, 
I've sent you my dumps with upload ID 
b95ee882-21e1-4ea1-a419-639a86acc785. The older dumps are from when 
mds.3 was under load, but they are all from mds.0. I also attached a 
newer batch, which I created just a few minutes ago.


Janek


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-12 Thread Janek Bevendorff
Good news: We haven't had any new fill-ups so far. On the contrary, the 
pool size is as small as it's ever been (200GiB).


Bad news: The MDS are still acting strangely. I have very uneven session 
load and I don't know where it comes from. ceph_mds_sessions_total_load 
reports a number of 1.4 million on mds.3, whereas all the others are 
mostly idle. I checked the client list on that rank, but the heaviest 
client has about 8k caps, which isn't very much at all. Most have 0 or 
1. I don't see any blocked ops in flight. I don't think this is to do 
with the disabled balancer, because I've seen this pattern before.


The event log size of 3/5 MDS is also very high, still. mds.1, mds.3, 
and mds.4 report between 4 and 5 million events, mds.0 around 1.4 
million and mds.2 between 0 and 200,000. The numbers have been constant 
since my last MDS restart four days ago.


I ran your ceph-gather.sh script a couple of times, but it only dumps 
mds.0. Should I modify it to dump mds.3 instead so you can have a look?


Janek


On 10/06/2023 15:23, Patrick Donnelly wrote:

On Fri, Jun 9, 2023 at 3:27 AM Janek Bevendorff
 wrote:

Hi Patrick,


I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.

I have it here still. Any other way I can send it to you?

Nevermind, I found the machine it was stored on. It was a
misconfiguration caused by post-lab-outage rebuilds.


Extremely unlikely.

Okay, taking your word for it. But something seems to be stalling
journal trimming. We had a similar thing yesterday evening, but at much
smaller scale without noticeable pool size increase. I only got an alert
that the ceph_mds_log_ev Prometheus metric started going up again for a
single MDS. It grew past 1M events, so I restarted it. I also restarted
the other MDS and they all immediately jumped to above 5M events and
stayed there. They are, in fact, still there and have decreased only
very slightly in the morning. The pool size is totally within a normal
range, though, at 290GiB.

Please keep monitoring it. I think you're not the only cluster to
experience this.


So clearly (a) an incredible number of journal events are being logged
and (b) trimming is slow or unable to make progress. I'm looking into
why but you can help by running the attached script when the problem
is occurring so I can investigate. I'll need a tarball of the outputs.

How do I send it to you if not via ceph-post-file?

It should work soon next week. We're moving the drop.ceph.com service
to a standalone VM soonish.


Also, in the off-chance this is related to the MDS balancer, please
disable it since you're using ephemeral pinning:

ceph config set mds mds_bal_interval 0

Done.

Thanks for your help!
Janek


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de




--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-09 Thread Janek Bevendorff

Hi Patrick,


I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.


I have it here still. Any other way I can send it to you?


Extremely unlikely.


Okay, taking your word for it. But something seems to be stalling 
journal trimming. We had a similar thing yesterday evening, but at much 
smaller scale without noticeable pool size increase. I only got an alert 
that the ceph_mds_log_ev Prometheus metric started going up again for a
single MDS. It grew past 1M events, so I restarted it. I also restarted 
the other MDS and they all immediately jumped to above 5M events and 
stayed there. They are, in fact, still there and have decreased only 
very slightly in the morning. The pool size is totally within a normal 
range, though, at 290GiB.



So clearly (a) an incredible number of journal events are being logged
and (b) trimming is slow or unable to make progress. I'm looking into
why but you can help by running the attached script when the problem
is occurring so I can investigate. I'll need a tarball of the outputs.


How do I send it to you if not via ceph-post-file?


Also, in the off-chance this is related to the MDS balancer, please
disable it since you're using ephemeral pinning:

ceph config set mds mds_bal_interval 0


Done.

Thanks for your help!
Janek


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-06 Thread Janek Bevendorff
I guess the mailing list didn't preserve the embedded image. Here's an 
Imgur link: https://imgur.com/a/WSmAOaG


I checked the logs as far back as we have them. The issue started 
appearing only after my last Ceph upgrade on 2 May, which introduced the 
new corruption assertion.



On 06/06/2023 09:16, Janek Bevendorff wrote:
I checked our Prometheus logs and the number of log events of 
individual MONs is indeed starting to increase dramatically all of a 
sudden, seemingly at random. I attached a picture of the curves.


The first instance you see there was when our metadata store filled 
up entirely. The second, smaller one was the more controlled fill-up. 
The last instance, with only one runaway MDS, is what I have just reported.


My unqualified wild guess is that the new safeguard to prevent the MDS 
from committing corrupt dentries is holding up the queue, so all of a 
sudden events are starting to pile up until the store is full.





On 05/06/2023 18:03, Janek Bevendorff wrote:
That said, our MON store size has also been growing slowly from 900MB 
to 5.4GB. But we also have a few remapped PGs right now. Not sure if 
that would have an influence.



On 05/06/2023 17:48, Janek Bevendorff wrote:

Hi Patrick, hi Dan!

I got the MDS back and I think the issue is connected to the "newly 
corrupt dentry" bug [1]. Even though I couldn't see any particular 
reason for the SIGABRT at first, I then noticed one of these awfully 
familiar stack traces.


I rescheduled the two broken MDS ranks on two machines with 1.5TB 
RAM each (just to make sure it's not that) and then let them do 
their thing. The routine goes as follows: both replay the journal, 
then rank 4 goes into the "resolve" state, but as soon as rank 3 
also starts resolving, they both crash.


Then I set

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

and this time I was able to recover the ranks, even though "resolve" 
and "clientreplay" took forever. I uploaded a compressed log of rank 
3 using ceph-post-file [2]. It's a log of several crash cycles, 
including the final successful attempt after changing the settings. 
The log decompresses to 815MB. I didn't censor any paths and they 
are not super-secret, but please don't share.


While writing this, the metadata pool size has reduced from 6TiB 
back to 440GiB. I am starting to think that the fill-ups may also be 
connected to the corruption issue. I also noticed that ranks 3 and 4 
always have huge journals. An inspection using cephfs-journal-tool 
takes forever and consumes 50GB of memory in the 
process. Listing the events in the journal is impossible without 
running out of RAM. Ranks 0, 1, and 2 don't have this problem and 
this wasn't a problem for ranks 3 and 4 either before the fill-ups 
started happening.


Hope that helps getting to the bottom of this. I reset the guardrail 
settings in the meantime.


Cheers
Janek


[1] "Newly corrupt dentry" ML link: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3


[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6



On 05/06/2023 16:08, Janek Bevendorff wrote:
I just had the problem again that MDS were constantly reporting 
slow metadata IO and the pool was slowly growing. Hence I restarted 
the MDS and now ranks 4 and 5 don't come up again.


Every time, they get to the resolve stage, then crash with a SIGABRT 
without an error message (not even at debug_mds = 20). Any idea 
what the reason could be? I checked whether they have enough RAM, 
which seems to be the case (unless they try to allocate tens of GB 
in one allocation).


Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name 
wasn’t. ;-)


Yes, I have five active MDS and five hot standbys. Static pinning 
isn’t really an option for our directory structure, so we’re 
using ephemeral pins.


Janek


On 31. May 2023, at 18:44, Dan van der Ster 
 wrote:


Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds 
metadata

pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it.
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a 
fix for

the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-06 Thread Janek Bevendorff
I checked our Prometheus logs and the number of log events of individual 
MONs is indeed starting to increase dramatically all of a sudden, 
seemingly at random. I attached a picture of the curves.


The first instance you see there was when our metadata store filled up 
entirely. The second, smaller one was the more controlled fill-up. The 
last instance, with only one runaway MDS, is what I have just reported.


My unqualified wild guess is that the new safeguard to prevent the MDS 
from committing corrupt dentries is holding up the queue, so all of a 
sudden events are starting to pile up until the store is full.





On 05/06/2023 18:03, Janek Bevendorff wrote:
That said, our MON store size has also been growing slowly from 900MB 
to 5.4GB. But we also have a few remapped PGs right now. Not sure if 
that would have an influence.



On 05/06/2023 17:48, Janek Bevendorff wrote:

Hi Patrick, hi Dan!

I got the MDS back and I think the issue is connected to the "newly 
corrupt dentry" bug [1]. Even though I couldn't see any particular 
reason for the SIGABRT at first, I then noticed one of these awfully 
familiar stack traces.


I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM 
each (just to make sure it's not that) and then let them do their 
thing. The routine goes as follows: both replay the journal, then 
rank 4 goes into the "resolve" state, but as soon as rank 3 also 
starts resolving, they both crash.


Then I set

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

and this time I was able to recover the ranks, even though "resolve" 
and "clientreplay" took forever. I uploaded a compressed log of rank 
3 using ceph-post-file [2]. It's a log of several crash cycles, 
including the final successful attempt after changing the settings. 
The log decompresses to 815MB. I didn't censor any paths and they are 
not super-secret, but please don't share.


While writing this, the metadata pool size has reduced from 6TiB back 
to 440GiB. I am starting to think that the fill-ups may also be 
connected to the corruption issue. I also noticed that ranks 3 and 4 
always have huge journals. An inspection using cephfs-journal-tool 
takes forever and consumes 50GB of memory in the 
process. Listing the events in the journal is impossible without 
running out of RAM. Ranks 0, 1, and 2 don't have this problem and 
this wasn't a problem for ranks 3 and 4 either before the fill-ups 
started happening.


Hope that helps getting to the bottom of this. I reset the guardrail 
settings in the meantime.


Cheers
Janek


[1] "Newly corrupt dentry" ML link: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3


[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6



On 05/06/2023 16:08, Janek Bevendorff wrote:
I just had the problem again that MDS were constantly reporting slow 
metadata IO and the pool was slowly growing. Hence I restarted the 
MDS and now ranks 4 and 5 don't come up again.


Every time, they get to the resolve stage, then crash with a SIGABRT 
without an error message (not even at debug_mds = 20). Any idea what 
the reason could be? I checked whether they have enough RAM, which 
seems to be the case (unless they try to allocate tens of GB in one 
allocation).


Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name 
wasn’t. ;-)


Yes, I have five active MDS and five hot standbys. Static pinning 
isn’t really an option for our directory structure, so we’re using 
ephemeral pins.


Janek


On 31. May 2023, at 18:44, Dan van der Ster 
 wrote:


Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds 
metadata

pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it.
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix 
for

the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:
I checked our logs from yesterday, the PG scaling only started 
today,
perhaps triggered by the snapshot trimming. I disabled it, but it 
didn't

change anything.

What did change something was restarting the MDS one by one, 
which had
got far behind with trimming their caches and with a bunch of

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-05 Thread Janek Bevendorff
That said, our MON store size has also been growing slowly from 900MB to 
5.4GB. But we also have a few remapped PGs right now. Not sure if that 
would have an influence.



On 05/06/2023 17:48, Janek Bevendorff wrote:

Hi Patrick, hi Dan!

I got the MDS back and I think the issue is connected to the "newly 
corrupt dentry" bug [1]. Even though I couldn't see any particular 
reason for the SIGABRT at first, I then noticed one of these awfully 
familiar stack traces.


I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM 
each (just to make sure it's not that) and then let them do their 
thing. The routine goes as follows: both replay the journal, then rank 
4 goes into the "resolve" state, but as soon as rank 3 also starts 
resolving, they both crash.


Then I set

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

and this time I was able to recover the ranks, even though "resolve" 
and "clientreplay" took forever. I uploaded a compressed log of rank 3 
using ceph-post-file [2]. It's a log of several crash cycles, 
including the final successful attempt after changing the settings. 
The log decompresses to 815MB. I didn't censor any paths and they are 
not super-secret, but please don't share.


While writing this, the metadata pool size has reduced from 6TiB back 
to 440GiB. I am starting to think that the fill-ups may also be 
connected to the corruption issue. I also noticed that ranks 3 and 4 
always have huge journals. An inspection using cephfs-journal-tool 
takes forever and consumes 50GB of memory in the process. Listing the 
events in the journal is impossible without running out of RAM. Ranks 
0, 1, and 2 don't have this problem and this wasn't a problem for 
ranks 3 and 4 either before the fill-ups started happening.


Hope that helps getting to the bottom of this. I reset the guardrail 
settings in the meantime.


Cheers
Janek


[1] "Newly corrupt dentry" ML link: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3


[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6



On 05/06/2023 16:08, Janek Bevendorff wrote:
I just had the problem again that MDS were constantly reporting slow 
metadata IO and the pool was slowly growing. Hence I restarted the 
MDS and now ranks 4 and 5 don't come up again.


Every time, they get to the resolve stage, then crash with a SIGABRT 
without an error message (not even at debug_mds = 20). Any idea what 
the reason could be? I checked whether they have enough RAM, which 
seems to be the case (unless they try to allocate tens of GB in one 
allocation).


Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name 
wasn’t. ;-)


Yes, I have five active MDS and five hot standbys. Static pinning 
isn’t really an option for our directory structure, so we’re using 
ephemeral pins.


Janek


On 31. May 2023, at 18:44, Dan van der Ster 
 wrote:


Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it.
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:

I checked our logs from yesterday, the PG scaling only started today,
perhaps triggered by the snapshot trimming. I disabled it, but it 
didn't

change anything.

What did change something was restarting the MDS one by one, which 
had
got far behind with trimming their caches and with a bunch of 
stuck ops.

After restarting them, the pool size decreased quickly to 600GiB. I
noticed the same behaviour yesterday, though yesterday it was more
extreme and restarting the MDS took about an hour and I had to 
increase

the heartbeat timeout. This time, it took only half a minute per MDS,
probably because it wasn't that extreme yet and I had reduced the
maximum cache size. Still looks like a bug to me.


On 31/05/2023 11:18, Janek Bevendorff wrote:

Another thing I just noticed is that the auto-scaler is trying to
scale the pool down to 128 PGs. That could also result in large
fluctuations, but this big?? In any case, it looks like a bug to me.
Whateve

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-05 Thread Janek Bevendorff

Hi Patrick, hi Dan!

I got the MDS back and I think the issue is connected to the "newly 
corrupt dentry" bug [1]. Even though I couldn't see any particular 
reason for the SIGABRT at first, I then noticed one of these awfully 
familiar stack traces.


I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM 
each (just to make sure it's not that) and then let them do their thing. 
The routine goes as follows: both replay the journal, then rank 4 goes 
into the "resolve" state, but as soon as rank 3 also starts resolving, 
they both crash.


Then I set

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

and this time I was able to recover the ranks, even though "resolve" and 
"clientreplay" took forever. I uploaded a compressed log of rank 3 using 
ceph-post-file [2]. It's a log of several crash cycles, including the 
final successful attempt after changing the settings. The log 
decompresses to 815MB. I didn't censor any paths and they are not 
super-secret, but please don't share.


While writing this, the metadata pool size has reduced from 6TiB back to 
440GiB. I am starting to think that the fill-ups may also be connected 
to the corruption issue. I also noticed that ranks 3 and 4 always 
have huge journals. An inspection using cephfs-journal-tool takes forever 
and consumes 50GB of memory in the process. Listing the events in the 
journal is impossible without running out of RAM. Ranks 0, 1, and 2 
don't have this problem and this wasn't a problem for ranks 3 and 4 
either before the fill-ups started happening.
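
For reference, the journal inspection amounts to something along these lines (file system name and rank are placeholders; treat it as a sketch):

cephfs-journal-tool --rank=<fs_name>:3 journal inspect
cephfs-journal-tool --rank=<fs_name>:3 event get list   # this is the step that runs out of RAM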


Hope that helps getting to the bottom of this. I reset the guardrail 
settings in the meantime.


Cheers
Janek


[1] "Newly corrupt dentry" ML link: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3


[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6



On 05/06/2023 16:08, Janek Bevendorff wrote:
I just had the problem again that MDS were constantly reporting slow 
metadata IO and the pool was slowly growing. Hence I restarted the MDS 
and now ranks 4 and 5 don't come up again.


Every time, they get to the resolve stage, then crash with a SIGABRT 
without an error message (not even at debug_mds = 20). Any idea what 
the reason could be? I checked whether they have enough RAM, which 
seems to be the case (unless they try to allocate tens of GB in one 
allocation).


Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name 
wasn’t. ;-)


Yes, I have five active MDS and five hot standbys. Static pinning 
isn’t really an option for our directory structure, so we’re using 
ephemeral pins.


Janek


On 31. May 2023, at 18:44, Dan van der Ster 
 wrote:


Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it.
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:

I checked our logs from yesterday, the PG scaling only started today,
perhaps triggered by the snapshot trimming. I disabled it, but it 
didn't

change anything.

What did change something was restarting the MDS one by one, which had
got far behind with trimming their caches and with a bunch of stuck 
ops.

After restarting them, the pool size decreased quickly to 600GiB. I
noticed the same behaviour yesterday, though yesterday it was more 
extreme and restarting the MDS took about an hour and I had to 
increase

the heartbeat timeout. This time, it took only half a minute per MDS,
probably because it wasn't that extreme yet and I had reduced the
maximum cache size. Still looks like a bug to me.


On 31/05/2023 11:18, Janek Bevendorff wrote:

Another thing I just noticed is that the auto-scaler is trying to
scale the pool down to 128 PGs. That could also result in large
fluctuations, but this big?? In any case, it looks like a bug to me.
Whatever is happening here, there should be safeguards with regard to
the pool's capacity.

Here's the current state of the pool in ceph osd pool ls detail:

pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 
crush_rule

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-05 Thread Janek Bevendorff
I just had the problem again that MDS were constantly reporting slow 
metadata IO and the pool was slowly growing. Hence I restarted the MDS 
and now ranks 4 and 5 don't come up again.


Every time they get to the resolve stage, they crash with a SIGABRT 
without an error message (not even at debug_mds = 20). Any idea what the 
reason could be? I checked whether they have enough RAM, which seems to 
be the case (unless they try to allocate tens of GB in one allocation).


Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name wasn’t. ;-)

Yes, I have five active MDS and five hot standbys. Static pinning isn’t really 
an option for our directory structure, so we’re using ephemeral pins.

Janek



On 31. May 2023, at 18:44, Dan van der Ster  wrote:

Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it.
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:

I checked our logs from yesterday, the PG scaling only started today,
perhaps triggered by the snapshot trimming. I disabled it, but it didn't
change anything.

What did change something was restarting the MDS one by one, which had
got far behind with trimming their caches and with a bunch of stuck ops.
After restarting them, the pool size decreased quickly to 600GiB. I
noticed the same behaviour yesterday, though yesterday it was more
extreme and restarting the MDS took about an hour and I had to increase
the heartbeat timeout. This time, it took only half a minute per MDS,
probably because it wasn't that extreme yet and I had reduced the
maximum cache size. Still looks like a bug to me.


On 31/05/2023 11:18, Janek Bevendorff wrote:

Another thing I just noticed is that the auto-scaler is trying to
scale the pool down to 128 PGs. That could also result in large
fluctuations, but this big?? In any case, it looks like a bug to me.
Whatever is happening here, there should be safeguards with regard to
the pool's capacity.

Here's the current state of the pool in ceph osd pool ls detail:

pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule
5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128
pgp_num_target 128 autoscale_mode on last_change 1359013 lfor
0/1358620/1358618 flags hashpspool,nodelete stripe_width 0
expected_num_objects 300 recovery_op_priority 5 recovery_priority
2 application cephfs

Janek


On 31/05/2023 10:10, Janek Bevendorff wrote:

Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:

Hi,

Perhaps this is a known issue and I was simply too dumb to find it,
but we are having problems with our CephFS metadata pool filling up
over night.

Our cluster has a small SSD pool of around 15TB which hosts our
CephFS metadata pool. Usually, that's more than enough. The normal
size of the pool ranges between 200 and 800GiB (which is quite a lot
of fluctuation already). Yesterday, we suddenly had the pool
fill up entirely and the only way to fix it was to add more
capacity. I increased the pool size to 18TB by adding more SSDs and
could resolve the problem. After a couple of hours of reshuffling,
the pool size finally went back to 230GiB.

But then we had another fill-up tonight to 7.6TiB. Luckily, I had
adjusted the weights so that not all disks could fill up entirely
like last time, so it ended there.

I wasn't really able to identify the problem yesterday, but under
the more controllable scenario today, I could check the MDS logs at
debug_mds=10 and to me it seems like the problem is caused by
snapshot trimming. The logs contain a lot of snapshot-related
messages for paths that haven't been touched in a long time. The
messages all look something like this:

May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first
cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x100 ...

May 31 09:25:03 XXX ceph-mds[3268481

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff
Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name wasn’t. ;-)

Yes, I have five active MDS and five hot standbys. Static pinning isn’t really 
an option for our directory structure, so we’re using ephemeral pins.

Janek


> On 31. May 2023, at 18:44, Dan van der Ster  wrote:
> 
> Hi Janek,
> 
> A few questions and suggestions:
> - Do you have multi-active MDS? In my experience back in nautilus if
> something went wrong with mds export between mds's, the mds
> log/journal could grow unbounded like you observed until that export
> work was done. Static pinning could help if you are not using it
> already.
> - You definitely should disable the pg autoscaling on the mds metadata
> pool (and other pools imho) -- decide the correct number of PGs for
> your pools and leave it.
> - Which version are you running? You said nautilus but wrote 16.2.12
> which is pacific... If you're running nautilus v14 then I recommend
> disabling pg autoscaling completely -- IIRC it does not have a fix for
> the OSD memory growth "pg dup" issue which can occur during PG
> splitting/merging.
> 
> Cheers, Dan
> 
> __
> Clyso GmbH | https://www.clyso.com
> 
> 
> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
>  wrote:
>> 
>> I checked our logs from yesterday, the PG scaling only started today,
>> perhaps triggered by the snapshot trimming. I disabled it, but it didn't
>> change anything.
>> 
>> What did change something was restarting the MDS one by one, which had
>> got far behind with trimming their caches and with a bunch of stuck ops.
>> After restarting them, the pool size decreased quickly to 600GiB. I
>> noticed the same behaviour yesterday, though yesterday it was more
>> extreme and restarting the MDS took about an hour and I had to increase
>> the heartbeat timeout. This time, it took only half a minute per MDS,
>> probably because it wasn't that extreme yet and I had reduced the
>> maximum cache size. Still looks like a bug to me.
>> 
>> 
>> On 31/05/2023 11:18, Janek Bevendorff wrote:
>>> Another thing I just noticed is that the auto-scaler is trying to
>>> scale the pool down to 128 PGs. That could also result in large
>>> fluctuations, but this big?? In any case, it looks like a bug to me.
>>> Whatever is happening here, there should be safeguards with regard to
>>> the pool's capacity.
>>> 
>>> Here's the current state of the pool in ceph osd pool ls detail:
>>> 
>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule
>>> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128
>>> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor
>>> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0
>>> expected_num_objects 300 recovery_op_priority 5 recovery_priority
>>> 2 application cephfs
>>> 
>>> Janek
>>> 
>>> 
>>> On 31/05/2023 10:10, Janek Bevendorff wrote:
>>>> Forgot to add: We are still on Nautilus (16.2.12).
>>>> 
>>>> 
>>>> On 31/05/2023 09:53, Janek Bevendorff wrote:
>>>>> Hi,
>>>>> 
>>>>> Perhaps this is a known issue and I was simply too dumb to find it,
>>>>> but we are having problems with our CephFS metadata pool filling up
>>>>> over night.
>>>>> 
>>>>> Our cluster has a small SSD pool of around 15TB which hosts our
>>>>> CephFS metadata pool. Usually, that's more than enough. The normal
>>>>> size of the pool ranges between 200 and 800GiB (which is quite a lot
>>>>> of fluctuation already). Yesterday, we suddenly had the pool
>>>>> fill up entirely and the only way to fix it was to add more
>>>>> capacity. I increased the pool size to 18TB by adding more SSDs and
>>>>> could resolve the problem. After a couple of hours of reshuffling,
>>>>> the pool size finally went back to 230GiB.
>>>>> 
>>>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had
>>>>> adjusted the weights so that not all disks could fill up entirely
>>>>> like last time, so it ended there.
>>>>> 
>>>>> I wasn't really able to identify the problem yesterday, but under
>>>>> the more controllable scenario today, I could check the MDS logs at
>>>>> debug_mds=10 and to me it seems like the problem is caused by
>>>>> snapshot trimming. The logs contain a lot of snapshot-related
>>>>

[ceph-users] Re: MDS corrupt (also RADOS-level copy?)

2023-05-31 Thread Janek Bevendorff
Forgot to say: As for your corrupt rank 0, you should check the logs 
with a higher debug level. Looks like you were less lucky than we were. 
Your journal position may be incorrect. This could be fixed by editing 
the journal header. You might also try to tell your MDS to skip corrupt 
entries. None of these operations are safe, though.
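
Regarding the RADOS-level copy you asked about below: as far as I 
understand it, that just means backing up the raw journal objects. The 
journal of rank N is stored in RADOS objects named after its inode 
(0x200 + N), e.g. 202.00000000, 202.00000001, ... for rank 2. For your 
mds_ssd pool something like this might do it (untested sketch; make sure 
the backup directory exists first):

rados -p mds_ssd ls | grep '^202\.' > journal-objects.txt
while read obj; do rados -p mds_ssd get "$obj" "backup/$obj"; done < journal-objects.txt

That copies every 202.* object out of the pool without going through 
cephfs-journal-tool at all.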



On 31/05/2023 16:41, Janek Bevendorff wrote:

Hi Jake,

Very interesting. This sounds very much like what we have been 
experiencing the last two days. We also had a sudden fill-up of the 
metadata pool, which repeated last night. See my question here: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/


I also noticed that I couldn't dump the current journal using the 
cephfs-journal-tool, as it would eat up all my RAM (probably not 
surprising with a journal that seems to be filling up a 16TiB pool).


Note: I did NOT need to reset the journal (and you probably don't need 
to either). I did, however, have to add extra capacity and balance out 
the data. After an MDS restart, the pool quickly cleared out again. 
The first MDS restart took an hour or so and I had to increase the MDS 
lag timeout (mds_beacon_grace), otherwise the MONs kept killing the 
MDS during the resolve phase. I set it to 1600 to be on the safe side.


While your MDS are recovering, you may want to set debug_mds to 10 for 
one of your MDS and check the logs. My logs were being spammed with 
snapshot-related messages, but I cannot really make sense of them. 
Still hoping for a reply on the ML.


In any case, once you are recovered, I recommend you adjust the 
weights of some of your OSDs to be much lower than others as a 
temporary safeguard. This way, only some OSDs would fill up and 
trigger your FULL watermark should this thing repeat.


Janek


On 31/05/2023 16:13, Jake Grimmett wrote:

Dear All,

we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:

<https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/>

Symptoms: MDS SSD pool (2TB) filled completely over the weekend, 
normally uses less than 400GB, resulting in MDS crash.


We added 4 x extra SSD to increase pool capacity to 3.5TB, however 
MDS did not recover


# ceph fs status
cephfs2 - 0 clients
===
RANK   STATE MDS ACTIVITY   DNS    INOS   DIRS   CAPS
 0 failed
 1    resolve  wilma-s3    8065   8063   8047  0
 2    resolve  wilma-s2 901k   802k  34.4k 0
  POOL TYPE USED  AVAIL
    mds_ssd  metadata  2296G  3566G
primary_fs_data    data   0   3566G
    ec82pool   data    2168T  3557T
STANDBY MDS
  wilma-s1
  wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then 
immediately fail.


Following the disaster-recovery-experts guide, the first step we did 
was to export the MDS journals, e.g:


# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

so far so good, however when we try to backup the final MDS the 
process consumes all available RAM (470GB) and needs to be killed 
after 14 minutes.


# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied 
to MDS 2

# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event 
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 
event recover_dentries summary"


at this point, we tried to follow the instructions and make a RADOS 
level copy of the journal data, however the link in the docs doesn't 
explain how to do this and just points to 
<http://tracker.ceph.com/issues/9902>


At this point we are tempted to reset the journal on MDS 2, but 
wanted to get a feeling from others about how dangerous this could be?


We have a backup, but as there is 1.8PB of data, it's going to take a 
few weeks to restore


any ideas gratefully received.

Jake



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS corrupt (also RADOS-level copy?)

2023-05-31 Thread Janek Bevendorff

Hi Jake,

Very interesting. This sounds very much like what we have been 
experiencing the last two days. We also had a sudden fill-up of the 
metadata pool, which repeated last night. See my question here: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/


I also noticed that I couldn't dump the current journal using the 
cephfs-journal-tool, as it would eat up all my RAM (probably not 
surprising with a journal that seems to be filling up a 16TiB pool).


Note: I did NOT need to reset the journal (and you probably don't need 
to either). I did, however, have to add extra capacity and balance out 
the data. After an MDS restart, the pool quickly cleared out again. The 
first MDS restart took an hour or so and I had to increase the MDS lag 
timeout (mds_beacon_grace), otherwise the MONs kept killing the MDS 
during the resolve phase. I set it to 1600 to be on the safe side.
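
(For reference, raising it is something like

ceph config set global mds_beacon_grace 1600

-- setting it globally rather than only in the mds section should make 
sure the MONs, which enforce the grace period, pick it up as well. And 
remember to set it back afterwards.)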


While your MDS are recovering, you may want to set debug_mds to 10 for 
one of your MDS and check the logs. My logs were being spammed with 
snapshot-related messages, but I cannot really make sense of them. Still 
hoping for a reply on the ML.


In any case, once you are recovered, I recommend you adjust the weights 
of some of your OSDs to be much lower than others as a temporary 
safeguard. This way, only some OSDs would fill up and trigger your FULL 
watermark should this thing repeat.
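
(Concretely that's the override reweight, e.g.

ceph osd reweight osd.12 0.5

where osd.12 and 0.5 are just placeholders -- pick a few of the 
metadata-pool OSDs and a value that fits your cluster. ceph osd crush 
reweight would change the CRUSH weight permanently instead, which is not 
what you want for a temporary safeguard.)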


Janek


On 31/05/2023 16:13, Jake Grimmett wrote:

Dear All,

we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:



Symptoms: MDS SSD pool (2TB) filled completely over the weekend, 
normally uses less than 400GB, resulting in MDS crash.


We added 4 x extra SSD to increase pool capacity to 3.5TB, however MDS 
did not recover


# ceph fs status
cephfs2 - 0 clients
===
RANK   STATE MDS ACTIVITY   DNS    INOS   DIRS   CAPS
 0 failed
 1    resolve  wilma-s3    8065   8063   8047  0
 2    resolve  wilma-s2 901k   802k  34.4k 0
  POOL TYPE USED  AVAIL
    mds_ssd  metadata  2296G  3566G
primary_fs_data    data   0   3566G
    ec82pool   data    2168T  3557T
STANDBY MDS
  wilma-s1
  wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then 
immediately fail.


Following the disaster-recovery-experts guide, the first step we did 
was to export the MDS journals, e.g:


# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

so far so good, however when we try to backup the final MDS the 
process consumes all available RAM (470GB) and needs to be killed 
after 14 minutes.


# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied to 
MDS 2

# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event 
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 
event recover_dentries summary"


at this point, we tried to follow the instructions and make a RADOS 
level copy of the journal data, however the link in the docs doesn't 
explain how to do this and just points to <http://tracker.ceph.com/issues/9902>



At this point we are tempted to reset the journal on MDS 2, but wanted 
to get a feeling from others about how dangerous this could be?


We have a backup, but as there is 1.8PB of data, it's going to take a 
few weeks to restore


any ideas gratefully received.

Jake



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff
I checked our logs from yesterday, the PG scaling only started today, 
perhaps triggered by the snapshot trimming. I disabled it, but it didn't 
change anything.


What did change something was restarting the MDS one by one, which had 
got far behind with trimming their caches and with a bunch of stuck ops. 
After restarting them, the pool size decreased quickly to 600GiB. I 
noticed the same behaviour yesterday, though yesterday it was more 
extreme and restarting the MDS took about an hour and I had to increase 
the heartbeat timeout. This time, it took only half a minute per MDS, 
probably because it wasn't that extreme yet and I had reduced the 
maximum cache size. Still looks like a bug to me.



On 31/05/2023 11:18, Janek Bevendorff wrote:
Another thing I just noticed is that the auto-scaler is trying to 
scale the pool down to 128 PGs. That could also result in large 
fluctuations, but this big?? In any case, it looks like a bug to me. 
Whatever is happening here, there should be safeguards with regard to 
the pool's capacity.


Here's the current state of the pool in ceph osd pool ls detail:

pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 
5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 
pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 
0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 
expected_num_objects 300 recovery_op_priority 5 recovery_priority 
2 application cephfs


Janek


On 31/05/2023 10:10, Janek Bevendorff wrote:

Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, 
but we are having problems with our CephFS metadata pool filling up 
over night.


Our cluster has a small SSD pool of around 15TB which hosts our 
CephFS metadata pool. Usually, that's more than enough. The normal 
size of the pool ranges between 200 and 800GiB (which is quite a lot 
of fluctuation already). Yesterday, we suddenly had the pool 
fill up entirely and the only way to fix it was to add more 
capacity. I increased the pool size to 18TB by adding more SSDs and 
could resolve the problem. After a couple of hours of reshuffling, 
the pool size finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely 
like last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under 
the more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by 
snapshot trimming. The logs contain a lot of snapshot-related 
messages for paths that haven't been touched in a long time. The 
messages all look something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first 
cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 
request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 
waiter=0 authpin=0 tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving 
realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular 
snapshots.


I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff
Another thing I just noticed is that the auto-scaler is trying to scale 
the pool down to 128 PGs. That could also result in large fluctuations, 
but this big?? In any case, it looks like a bug to me. Whatever is 
happening here, there should be safeguards with regard to the pool's 
capacity.


Here's the current state of the pool in ceph osd pool ls detail:

pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 5 
object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 
pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 
0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 
expected_num_objects 300 recovery_op_priority 5 recovery_priority 2 
application cephfs
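
For now, the obvious mitigation seems to be to turn the autoscaler off 
for this pool and pin pg_num at its current value, something like:

ceph osd pool set cephfs.storage.meta pg_autoscale_mode off
ceph osd pool set cephfs.storage.meta pg_num 495

(pool name and pg_num taken from the output above; whether 495 is the 
right long-term number is a separate question.)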


Janek


On 31/05/2023 10:10, Janek Bevendorff wrote:

Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, 
but we are having problems with our CephFS metadata pool filling up 
over night.


Our cluster has a small SSD pool of around 15TB which hosts our 
CephFS metadata pool. Usually, that's more than enough. The normal 
size of the pool ranges between 200 and 800GiB (which is quite a lot 
of fluctuation already). Yesterday, we suddenly had the pool fill 
up entirely and the only way to fix it was to add more capacity. I 
increased the pool size to 18TB by adding more SSDs and could resolve 
the problem. After a couple of hours of reshuffling, the pool size 
finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely 
like last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under the 
more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by 
snapshot trimming. The logs contain a lot of snapshot-related 
messages for paths that haven't been touched in a long time. The 
messages all look something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first 
cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 
authpin=0 tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving 
realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular snapshots.

I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible that the MDS are trimming old snapshots without taking 
care not to fill up the entire metadata pool?


Cheers
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff

Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, 
but we are having problems with our CephFS metadata pool filling up 
over night.


Our cluster has a small SSD pool of around 15TB which hosts our CephFS 
metadata pool. Usually, that's more than enough. The normal size of 
the pool ranges between 200 and 800GiB (which is quite a lot of 
fluctuation already). Yesterday, we suddenly had the pool fill up 
entirely and the only way to fix it was to add more capacity. I 
increased the pool size to 18TB by adding more SSDs and could resolve 
the problem. After a couple of hours of reshuffling, the pool size 
finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely like 
last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under the 
more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by snapshot 
trimming. The logs contain a lot of snapshot-related messages for 
paths that haven't been touched in a long time. The messages all look 
something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first 
cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 
tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving 
realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular snapshots.

I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible that the MDS are trimming old snapshots without taking 
care not to fill up the entire metadata pool?


Cheers
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, but 
we are having problems with our CephFS metadata pool filling up over night.


Our cluster has a small SSD pool of around 15TB which hosts our CephFS 
metadata pool. Usually, that's more than enough. The normal size of the 
pool ranges between 200 and 800GiB (which is quite a lot of fluctuation 
already). Yesterday, we suddenly had the pool fill up entirely and 
the only way to fix it was to add more capacity. I increased the pool 
size to 18TB by adding more SSDs and could resolve the problem. After a 
couple of hours of reshuffling, the pool size finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely like 
last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under the 
more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by snapshot 
trimming. The logs contain a lot of snapshot-related messages for paths 
that haven't been touched in a long time. The messages all look 
something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap, 
joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 
tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular snapshots.

I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible that the MDS are trimming old snapshots without taking 
care not to fill up the entire metadata pool?


Cheers
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS "newly corrupt dentry" after patch version upgrade

2023-05-12 Thread Janek Bevendorff
If it is thrown while decoding the file name, then somebody probably 
managed to store files with non-UTF-8 characters in the name. Although I 
don't really know how this can happen. Perhaps some OS quirk.


On 10/05/2023 22:33, Patrick Donnelly wrote:

Hi Janek,

All this indicates is that you have some files with binary keys  that
cannot be decoded as utf-8. Unfortunately, the rados python library
assumes that omap keys can be decoded this way. I have a ticket here:

https://tracker.ceph.com/issues/59716

I hope to have a fix soon.

On Thu, May 4, 2023 at 3:15 AM Janek Bevendorff
 wrote:

After running the tool for 11 hours straight, it exited with the
following exception:

Traceback (most recent call last):
File "/home/webis/first-damage.py", line 156, in 
  traverse(f, ioctx)
File "/home/webis/first-damage.py", line 84, in traverse
  for (dnk, val) in it:
File "rados.pyx", line 1389, in rados.OmapIterator.__next__
File "rados.pyx", line 318, in rados.decode_cstr
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 8:
invalid start byte

Does that mean that the last inode listed in the output file is corrupt?
Any way I can fix it?

The output file has 14 million lines. We have about 24.5 million objects
in the metadata pool.

Janek


On 03/05/2023 14:20, Patrick Donnelly wrote:

On Wed, May 3, 2023 at 4:33 AM Janek Bevendorff
 wrote:

Hi Patrick,


I'll try that tomorrow and let you know, thanks!

I was unable to reproduce the crash today. Even with
mds_abort_on_newly_corrupt_dentry set to true, all MDS booted up
correctly (though they took forever to rejoin with logs set to 20).

To me it looks like the issue has resolved itself overnight. I had run a
recursive scrub on the file system and another snapshot was taken, in
case any of those might have had an effect on this. It could also be the
case that the (supposedly) corrupt journal entry has simply been
committed now and hence doesn't trigger the assertion any more. Is there
any way I can verify this?

You can run:

https://github.com/ceph/ceph/blob/main/src/tools/cephfs/first-damage.py

Just do:

python3 first-damage.py --memo run.1 <metadata pool>

No need to do any of the other steps if you just want a read-only check.
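
(Assuming your metadata pool is the cephfs.storage.meta pool mentioned on 
the list, that would be something like

python3 first-damage.py --memo run.1 cephfs.storage.meta

for the read-only check.)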


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de




--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS "newly corrupt dentry" after patch version upgrade

2023-05-04 Thread Janek Bevendorff
After running the tool for 11 hours straight, it exited with the 
following exception:


Traceback (most recent call last):
  File "/home/webis/first-damage.py", line 156, in 
    traverse(f, ioctx)
  File "/home/webis/first-damage.py", line 84, in traverse
    for (dnk, val) in it:
  File "rados.pyx", line 1389, in rados.OmapIterator.__next__
  File "rados.pyx", line 318, in rados.decode_cstr
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 8: 
invalid start byte


Does that mean that the last inode listed in the output file is corrupt? 
Any way I can fix it?


The output file has 14 million lines. We have about 24.5 million objects 
in the metadata pool.


Janek


On 03/05/2023 14:20, Patrick Donnelly wrote:

On Wed, May 3, 2023 at 4:33 AM Janek Bevendorff
 wrote:

Hi Patrick,


I'll try that tomorrow and let you know, thanks!

I was unable to reproduce the crash today. Even with
mds_abort_on_newly_corrupt_dentry set to true, all MDS booted up
correctly (though they took forever to rejoin with logs set to 20).

To me it looks like the issue has resolved itself overnight. I had run a
recursive scrub on the file system and another snapshot was taken, in
case any of those might have had an effect on this. It could also be the
case that the (supposedly) corrupt journal entry has simply been
committed now and hence doesn't trigger the assertion any more. Is there
any way I can verify this?

You can run:

https://github.com/ceph/ceph/blob/main/src/tools/cephfs/first-damage.py

Just do:

python3 first-damage.py --memo run.1 <metadata pool>

No need to do any of the other steps if you just want a read-only check.


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS "newly corrupt dentry" after patch version upgrade

2023-05-03 Thread Janek Bevendorff

Hi Patrick,


I'll try that tomorrow and let you know, thanks!


I was unable to reproduce the crash today. Even with 
mds_abort_on_newly_corrupt_dentry set to true, all MDS booted up 
correctly (though they took forever to rejoin with logs set to 20).


To me it looks like the issue has resolved itself overnight. I had run a 
recursive scrub on the file system and another snapshot was taken, in 
case any of those might have had an effect on this. It could also be the 
case that the (supposedly) corrupt journal entry has simply been 
committed now and hence doesn't trigger the assertion any more. Is there 
any way I can verify this?


Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS "newly corrupt dentry" after patch version upgrade

2023-05-02 Thread Janek Bevendorff

Hi Patrick,


Please be careful resetting the journal. It was not necessary. You can
try to recover the missing inode using cephfs-data-scan [2].


Yes. I did that very reluctantly after trying everything else as a last 
resort. But since it only gave me another error, I restored the previous 
state. Downgrading to the previous version only came to mind minutes 
before Dan wrote that there's a new assertion in 16.2.12 (I didn't 
expect a corruption issue to be "fixable" like that).




Thanks for the report. Unfortunately this looks like a false positive.
You're not using snapshots, right?


Or fortunately for me? We have an automated snapshot schedule which 
creates snapshots of certain top-level directories daily. Our main 
folder is /storage, which had this issue.



In any case, if you can reproduce it again with:


ceph config mds debug_mds 20
ceph config mds debug_ms 1


I'll try that tomorrow and let you know, thanks!


and upload the logs using ceph-post-file [1], that would be helpful to
understand what happened.

After that you can disable the check as Dan pointed out:

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

NOTE FOR OTHER READERS OF THIS MAIL: it is not recommended to blindly
set these configs as the MDS is trying to catch legitimate metadata
corruption.

[1] https://docs.ceph.com/en/quincy/man/8/ceph-post-file/
[2] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/



--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS "newly corrupt dentry" after patch version upgrade

2023-05-02 Thread Janek Bevendorff

Thanks!

I tried downgrading to 16.2.10 and was able to get it running again, but 
after a reboot, got a warning that two of the OSDs on that host had 
broken Bluestore compression. Restarting the two OSDs again got rid of 
it, but that's still a bit concerning.



On 02/05/2023 16:48, Dan van der Ster wrote:

Hi Janek,

That assert is part of a new corruption check added in 16.2.12 -- see
https://github.com/ceph/ceph/commit/1771aae8e79b577acde749a292d9965264f20202

The abort is controlled by a new option:

+Option("mds_abort_on_newly_corrupt_dentry", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
+.set_default(true)
+.set_description("MDS will abort if dentry is detected newly corrupted."),

So in theory you could switch that off, but it is concerning that the
metadata is corrupted already.
I'm cc'ing Patrick who has been working on this issue.

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com

On Tue, May 2, 2023 at 7:32 AM Janek Bevendorff
 wrote:

Hi,

After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS
fails to start. After replaying the journal, it just crashes with

[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
#0x1/storage [2,head] auth (dversion lock)

Immediately after the upgrade, I had it running shortly, but then it
decided to crash for unknown reasons and I cannot get it back up.

We have five ranks in total, the other four seem to be fine. I backed up
the journal and tried to run cephfs-journal-tool --rank=cephfs.storage:0
event recover_dentries summary, but it never finishes and only eats up a lot
of RAM. I stopped it after an hour and 50GB RAM.

Resetting the journal makes the MDS crash with a missing inode error on
another top-level directory, so I re-imported the backed-up journal. Is
there any way to recover from this without rebuilding the whole file system?

Thanks
Janek


Here's the full crash log:


May 02 16:16:53 xxx077 ceph-mds[3047358]:-29>
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 Finished
replaying journal
May 02 16:16:53 xxx077 ceph-mds[3047358]:-28>
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 making mds
journal writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:-27>
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.journaler.mdlog(ro)
set_writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:-26>
2023-05-02T16:16:52.761+0200 7f51f878b700  2 mds.0.1711712 i am not
alone, moving to state resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:-25>
2023-05-02T16:16:52.761+0200 7f51f878b700  3 mds.0.1711712 request_state
up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:-24>
2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077
set_want_state: up:replay -> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:-23>
2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077 Sending
beacon up:resolve seq 15
May 02 16:16:53 xxx077 ceph-mds[3047358]:-22>
2023-05-02T16:16:52.761+0200 7f51f878b700 10 monclient:
_send_mon_message to mon.xxx056 at v2:141.54.133.56:3300/0
May 02 16:16:53 xxx077 ceph-mds[3047358]:-21>
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: tick
May 02 16:16:53 xxx077 ceph-mds[3047358]:-20>
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2023-05-02T16:16:23.118186+0200)
May 02 16:16:53 xxx077 ceph-mds[3047358]:-19>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.xxx077 Updating MDS map
to version 1711713 from mon.1
May 02 16:16:53 xxx077 ceph-mds[3047358]:-18>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712
handle_mds_map i am now mds.0.1711712
May 02 16:16:53 xxx077 ceph-mds[3047358]:-17>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712
handle_mds_map state change up:replay --> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:-16>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 resolve_start
May 02 16:16:53 xxx077 ceph-mds[3047358]:-15>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 reopen_log
May 02 16:16:53 xxx077 ceph-mds[3047358]:-14>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:-13>
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:-12>
2023-05-02T16:16:53.373+0200 7f5202fa0700 10 monclient: get_auth_request
con 0x5574fe74c400 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:-11>
2023-05-02T16:16:53.373+0200 7f52037a1700 10 monclient: get_auth_request
con 0x5574fe40fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:-10>
2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request
con 0x5574f932fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -9>
2023-05-02T16:

[ceph-users] MDS "newly corrupt dentry" after patch version upgrade

2023-05-02 Thread Janek Bevendorff

Hi,

After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS 
fails to start. After replaying the journal, it just crashes with


[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry 
#0x1/storage [2,head] auth (dversion lock)


Immediately after the upgrade, I had it running shortly, but then it 
decided to crash for unknown reasons and I cannot get it back up.


We have five ranks in total, the other four seem to be fine. I backed up 
the journal and tried to run cephfs-journal-tool --rank=cephfs.storage:0 
event recover_dentries summary, but it never finishes only eats up a lot 
of RAM. I stopped it after an hour and 50GB RAM.


Resetting the journal makes the MDS crash with a missing inode error on 
another top-level directory, so I re-imported the backed-up journal. Is 
there any way to recover from this without rebuilding the whole file system?


Thanks
Janek


Here's the full crash log:


May 02 16:16:53 xxx077 ceph-mds[3047358]:    -29> 
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 Finished 
replaying journal
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -28> 
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 making mds 
journal writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -27> 
2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.journaler.mdlog(ro) 
set_writeable
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -26> 
2023-05-02T16:16:52.761+0200 7f51f878b700  2 mds.0.1711712 i am not 
alone, moving to state resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -25> 
2023-05-02T16:16:52.761+0200 7f51f878b700  3 mds.0.1711712 request_state 
up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -24> 
2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077 
set_want_state: up:replay -> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -23> 
2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077 Sending 
beacon up:resolve seq 15
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -22> 
2023-05-02T16:16:52.761+0200 7f51f878b700 10 monclient: 
_send_mon_message to mon.xxx056 at v2:141.54.133.56:3300/0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -21> 
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: tick
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -20> 
2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2023-05-02T16:16:23.118186+0200)
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -19> 
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.xxx077 Updating MDS map 
to version 1711713 from mon.1
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -18> 
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 
handle_mds_map i am now mds.0.1711712
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -17> 
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 
handle_mds_map state change up:replay --> up:resolve
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -16> 
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 resolve_start
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -15> 
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 reopen_log
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -14> 
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set 
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -13> 
2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set 
is 1,2,3,4
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -12> 
2023-05-02T16:16:53.373+0200 7f5202fa0700 10 monclient: get_auth_request 
con 0x5574fe74c400 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -11> 
2023-05-02T16:16:53.373+0200 7f52037a1700 10 monclient: get_auth_request 
con 0x5574fe40fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]:    -10> 
2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request 
con 0x5574f932fc00 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -9> 
2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request 
con 0x5574ffce2000 auth_method 0
May 02 16:16:53 xxx077 ceph-mds[3047358]: -8> 
2023-05-02T16:16:53.377+0200 7f5202fa0700  5 mds.beacon.xxx077 received 
beacon reply up:resolve seq 15 rtt 0.616008
May 02 16:16:53 xxx077 ceph-mds[3047358]: -7> 
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map 
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -6> 
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map 
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -5> 
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map 
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -4> 
2023-05-02T16:16:53.393+0200 7f51fff9a700  5 mds.xxx077 handle_mds_map 
old map epoch 1711713 <= 1711713, discarding
May 02 16:16:53 xxx077 ceph-mds[3047358]: -3> 
2023-05-02T16:16:53.545+0200 7f51fff9a700 -1 

[ceph-users] Re: Containerized radosgw crashes randomly at startup

2022-05-31 Thread Janek Bevendorff

Okay, after writing this mail, I might have found what's wrong.

The message

monclient(hunting): handle_auth_bad_method server allowed_methods [2] 
but i only support [2]


makes no sense, but it pointed me to something else when I had a pod 
that refused to start even after deleting it multiple times. I noticed 
that it was always scheduled on the very same host, so something about 
the host itself must have caused it. All radosgw daemons are 
bootstrapped using a bootstrap key and the resulting auth key is 
persisted to /var/lib/ceph/radosgw/ on the host. After deleting that 
directory on all hosts and restarting the deployment, all pods came back 
up again. So I guess something was wrong with the keys stored on some of 
the host machines.
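
(In other words, a cleanup roughly like

rm -rf /var/lib/ceph/radosgw/*

on each affected host, followed by deleting the radosgw pods so they 
re-bootstrap and persist a fresh key. The path is the one from the -k 
argument in the log above; adjust it if your hostPath mount differs.)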


Janek


On 31/05/2022 11:08, Janek Bevendorff wrote:

Hi,

This is an issue I've been having since at least Ceph 15.x and I 
haven't found a way around it yet. I have a bunch of radosgw nodes in 
a Kubernetes cluster (using the ceph/ceph-daemon Docker image) and 
once every few container restarts, the daemon decides to crash at 
startup for unknown reasons resulting in a crash loop. When I delete 
the entire pod and try again, it boots up fine most of the time (not 
always).


There is no obvious error message. When I set DEBUG=stayalive, all I 
get is:


2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: STAYALIVE: 
container will not die if a command fails.
2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: static: 
does not generate config

2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: SUCCESS
exec: PID 51: spawning /usr/bin/radosgw --cluster ceph --setuser ceph 
--setgroup ceph --default-log-to-stderr=true --err-to-stderr=true 
--default-log-to-file=false --foreground -n client.rgw.XXX -k 
/var/lib/ceph/radosgw/ceph-rgw.XXX/keyring

exec: Waiting 51 to quit
2022-05-31T08:51:39.355+ 7f23c0fe9700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [2]

failed to fetch mon config (--no-mon-config to skip)
teardown: managing teardown after SIGCHLD
teardown: Waiting PID 51 to terminate
teardown: Process 51 is terminated
/opt/ceph-container/bin/docker_exec.sh: line 14: warning: 
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
/opt/ceph-container/bin/docker_exec.sh: line 6: warning: 
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0

An issue occured and you asked me to stay alive.
You can connect to me with: sudo docker exec -i -t  /bin/bash
The current environment variables will be reloaded by this bash to be 
in a similar context.

When debugging is over stop me with: pkill sleep
I'll sleep endlessly waiting for you darling, bye bye
/opt/ceph-container/bin/docker_exec.sh: line 6: warning: 
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0



The actual error seems to be "warning: run_pending_traps: bad value in 
trap_list", but I have no idea how to fix that or why that even happens.


This is super annoying, because it means that over time, the number of 
live radosgw containers is reduced, because at some point, most pods 
are stuck in a CrashLoopBackOff state. I then have to manually delete 
all those pods so that they get rescheduled, which tends to work in 
about 3 out of 4 attempts or so.


The radosgw containers are running version 16.2.5 (the latest version 
available for the container image), the rest of the cluster is on 16.2.9.


Any help would be greatly appreciated.

Janek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Containerized radosgw crashes randomly at startup

2022-05-31 Thread Janek Bevendorff

Hi,

This is an issue I've been having since at least Ceph 15.x and I haven't 
found a way around it yet. I have a bunch of radosgw nodes in a 
Kubernetes cluster (using the ceph/ceph-daemon Docker image) and once 
every few container restarts, the daemon decides to crash at startup for 
unknown reasons resulting in a crash loop. When I delete the entire pod 
and try again, it boots up fine most of the time (not always).


There is no obvious error message. When I set DEBUG=stayalive, all I get is:

2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: STAYALIVE: 
container will not die if a command fails.
2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: static: does 
not generate config

2022-05-31 08:51:39  /opt/ceph-container/bin/entrypoint.sh: SUCCESS
exec: PID 51: spawning /usr/bin/radosgw --cluster ceph --setuser ceph 
--setgroup ceph --default-log-to-stderr=true --err-to-stderr=true 
--default-log-to-file=false --foreground -n client.rgw.XXX -k 
/var/lib/ceph/radosgw/ceph-rgw.XXX/keyring

exec: Waiting 51 to quit
2022-05-31T08:51:39.355+ 7f23c0fe9700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [2]

failed to fetch mon config (--no-mon-config to skip)
teardown: managing teardown after SIGCHLD
teardown: Waiting PID 51 to terminate
teardown: Process 51 is terminated
/opt/ceph-container/bin/docker_exec.sh: line 14: warning: 
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0
/opt/ceph-container/bin/docker_exec.sh: line 6: warning: 
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0

An issue occured and you asked me to stay alive.
You can connect to me with: sudo docker exec -i -t  /bin/bash
The current environment variables will be reloaded by this bash to be in 
a similar context.

When debugging is over stop me with: pkill sleep
I'll sleep endlessly waiting for you darling, bye bye
/opt/ceph-container/bin/docker_exec.sh: line 6: warning: 
run_pending_traps: bad value in trap_list[17]: 0x5616cd3318c0



The actual error seems to be "warning: run_pending_traps: bad value in 
trap_list", but I have no idea how to fix that or why that even happens.


This is super annoying, because it means that over time, the number of live 
radosgw containers shrinks as more and more pods end up stuck in a 
CrashLoopBackOff state. I then have to delete those pods manually so that they 
get rescheduled, which works in roughly 3 out of 4 attempts.
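
The cleanup itself is nothing fancy, just something along these lines, with 
the namespace being a placeholder for wherever the radosgw pods live:

  kubectl -n <namespace> get pods | grep CrashLoopBackOff
  kubectl -n <namespace> delete pod <pod-name>

The Deployment then schedules a fresh pod, which usually comes up fine.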


The radosgw containers are running version 16.2.5 (the latest version 
available for the container image), the rest of the cluster is on 16.2.9.


Any help would be greatly appreciated.

Janek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Release Index and Docker Hub images outdated

2022-05-31 Thread Janek Bevendorff
The quay.io/ceph/daemon:latest-pacific image is also stuck on 16.2.5. 
Only the quay.io/ceph/ceph:v16 image seems to be up to date, but I can't 
get it to start the daemons.



On 30/05/2022 14:54, Janek Bevendorff wrote:

Was this announced somewhere? Could this not wait till Pacific is EOL, so we 
can at least get updates for that?

At least the release index on docs.ceph.com should be updated, though. Not only are the 
Pacific releases listed there outdated, it also still lists hub.docker.com as an 
official source for images: https://docs.ceph.com/en/latest/install/containers/




On 30. May 2022, at 14:47, Robert Sander  wrote:

Am 30.05.22 um 13:16 schrieb Janek Bevendorff:


The image tags on Docker Hub are even more outdated and stop at v16.2.5. 
quay.io seems to be up to date.

Docker Hub does not get new images any more. The project has moved to quay.io.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Release Index and Docker Hub images outdated

2022-05-30 Thread Janek Bevendorff
Was this announced somewhere? Could this not wait till Pacific is EOL, so we 
can at least get updates for that?

At least the release index on docs.ceph.com should be updated, though. Not only are the 
Pacific releases listed there outdated, it also still lists hub.docker.com as an 
official source for images: https://docs.ceph.com/en/latest/install/containers/



> On 30. May 2022, at 14:47, Robert Sander  wrote:
> 
> Am 30.05.22 um 13:16 schrieb Janek Bevendorff:
> 
>> The image tags on Docker Hub are even more outdated and stop at v16.2.5. 
>> quay.io seems to be up to date.
> 
> Docker Hub does not get new images any more. The project has moved to quay.io.
> 
> Regards
> -- 
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> http://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Zwangsangaben lt. §35a GmbHG:
> HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Release Index and Docker Hub images outdated

2022-05-30 Thread Janek Bevendorff
Hi,

The release index on docs.ceph.com is outdated by two Pacific patch releases: 
https://docs.ceph.com/en/quincy/releases/index.html
The image tags on Docker Hub are even more outdated and stop at v16.2.5. 
quay.io seems to be up to date.

Can these be updated please? Thanks

Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON slow ops and growing MON store

2021-03-18 Thread Janek Bevendorff
We just had the same problem again after a power outage that took out 
62% of our cluster and three out of five MONs. Once everything was back 
up, the MONs started lagging and piling up slow ops while the MON store 
was growing to double-digit gigabytes. It was so bad that I couldn't 
even list the in-flight ops anymore, because ceph daemon mon.XXX ops did 
not return at all.


Like last time, after I restarted all five MONs, the store size 
decreased and everything went back to normal. I also had to restart the MGRs 
and MDSs afterwards. This is starting to look like a bug to me.


Janek


On 26/02/2021 15:24, Janek Bevendorff wrote:
Since the full cluster restart and disabling logging to syslog, it's 
not a problem any more (for now).


Unfortunately, just disabling clog_to_monitors didn't have the desired 
effect when I tried it yesterday. But I also believe that it is 
somehow related. I could not find any specific reason for yesterday's 
incident in the logs besides a few more RocksDB status and compact 
messages than usual, but that is more of a symptom than a cause.



On 26/02/2021 13:05, Mykola Golub wrote:

On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:


On the first MON, the command doesn’t even return, but I was able to
get a dump from the one I restarted most recently. The oldest ops
look like this:

 {
 "description": "log(1000 entries from seq 17876238 at 
2021-02-25T15:13:20.306487+0100)",

 "initiated_at": "2021-02-25T20:40:34.698932+0100",
 "age": 183.762551121,
 "duration": 183.762599201,

The mon stores cluster log messages in the mon db. You mentioned
problems with osds flooding with log messages. It looks related.

If you still observe the db growth you may try temporarily disabling
clog_to_monitors, i.e. set for all osds:

  clog_to_monitors = false

And see if it stops growing after this and if it helps with the slow
ops (it might make sense to restart mons if some look like they get
stuck). You can apply the config option on the fly (without restarting
the osds, e.g. with injectargs), but when re-enabling it you will
have to restart the osds to avoid crashes due to this bug [1].

[1] https://tracker.ceph.com/issues/48946
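
For reference, the two usual ways to apply this would be roughly the 
following; the config set form assumes the centralized config database 
available since Nautilus:

  ceph tell osd.\* injectargs '--clog_to_monitors=false'
  ceph config set osd clog_to_monitors false

injectargs only changes the running daemons, while config set persists across 
restarts.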


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inactive when host is down despite CRUSH failure domain being host

2021-03-10 Thread Janek Bevendorff
No, the pool is size 3. But you put me on the right track. The pool had 
an explicit min_size set that was equal to the size. No idea why I 
didn't check that in the first place. Reducing it to 2 seems to solve 
the problem. How embarrassing, thanks! :-D
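
In case someone else stumbles over this, checking and fixing it is a one-liner 
each, with the pool name as a placeholder:

  ceph osd pool get <pool> size
  ceph osd pool get <pool> min_size
  ceph osd pool set <pool> min_size 2

With size 3 and min_size 3, a single host going down already drops the PGs 
below min_size, so they go inactive, which is exactly what I saw.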


May I suggest giving this a better error description? 
undersized+degraded+peered is a rather opaque PG state, and ceph pg 
query could have mentioned at least something.


Janek


On 10/03/2021 17:11, Eugen Block wrote:

Hi,

I only took a quick look, but is that pool configured with size 2? The 
crush_rule says min_size 2 which would explain what you're describing.




Zitat von Janek Bevendorff :


Hi,

I am seeing a weird phenomenon that I am having trouble debugging. 
We have 16 OSDs per host, so when I reboot one node, 16 OSDs will be 
missing for a short time. Since our minimum CRUSH failure domain is 
host, this should not cause any problems. Unfortunately, I always 
have a handful (1-5) of PGs that become inactive nonetheless and are stuck 
in the state undersized+degraded+peered until the host and its OSDs 
are back up. The other 2000+ PGs that are also on these OSDs do not 
have this problem. In total, we have between 110 and 150 PGs per OSD 
with a configured maximum of 250, which should give us enough headroom.


The affected pools always seem to be RBD pools or at least I haven't 
seen it on our much larger RGW pool yet. The pool's CRUSH rule looks 
like this:


rule rbd-data {
    id 8
    type replicated
    min_size 2
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

ceph pg dump_stuck inactive gives me this:

PG_STAT  STATE                       UP          UP_PRIMARY  ACTING      ACTING_PRIMARY
115.3    undersized+degraded+peered  [194,267]   194         [194,267]   194
115.13   undersized+degraded+peered  [151,1122]  151         [151,1122]  151
116.12   undersized+degraded+peered  [288,726]   288         [288,726]   288


and when I query one of the inactive PGs, I see (among other things):

    "up": [
    288,
    726
    ],
    "acting": [
    288,
    726
    ],
    "acting_recovery_backfill": [
    "288",
    "726"
    ],

    "recovery_state": [
    {
    "name": "Started/Primary/Active",
    "enter_time": "2021-03-10T16:23:09.301174+0100",
    "might_have_unfound": [],
    "recovery_progress": {
    "backfill_targets": [],
    "waiting_on_backfill": [],
    "last_backfill_started": "MIN",
    "backfill_info": {
    "begin": "MIN",
    "end": "MIN",
    "objects": []
    },
    "peer_backfill_info": [],
    "backfills_in_flight": [],
    "recovering": [],
    "pg_backend": {
    "pull_from_peer": [],
    "pushing": []
    }
    }
    },
    {
    "name": "Started",
    "enter_time": "2021-03-10T16:23:08.297622+0100"
    }
    ],

So you can see that two out of the three OSDs on other hosts are indeed 
up and active. I also see the ceph-osd daemons running on 
those hosts, so the data is definitely there and the PG should be 
available. Do you have any idea why these PGs may be becoming 
inactive nonetheless? I am suspecting some kind of concurrency limit, 
but I wouldn't know which one that could be.


Thanks
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG inactive when host is down despite CRUSH failure domain being host

2021-03-10 Thread Janek Bevendorff

Hi,

I am seeing a weird phenomenon that I am having trouble debugging. We 
have 16 OSDs per host, so when I reboot one node, 16 OSDs will be 
missing for a short time. Since our minimum CRUSH failure domain is 
host, this should not cause any problems. Unfortunately, I always have 
a handful (1-5) of PGs that become inactive nonetheless and are stuck in the 
state undersized+degraded+peered until the host and its OSDs are back 
up. The other 2000+ PGs that are also on these OSDs do not have this 
problem. In total, we have between 110 and 150 PGs per OSD with a 
configured maximum of 250, which should give us enough headroom.


The affected pools always seem to be RBD pools or at least I haven't 
seen it on our much larger RGW pool yet. The pool's CRUSH rule looks 
like this:


rule rbd-data {
    id 8
    type replicated
    min_size 2
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
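
For reference, the rule and the pool parameters can be inspected like this, 
with the pool name as a placeholder:

  ceph osd crush rule dump rbd-data
  ceph osd pool get <rbd pool> all

The second command also prints size and min_size.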

ceph pg dump_stuck inactive gives me this:

PG_STAT  STATE                       UP          UP_PRIMARY  ACTING      ACTING_PRIMARY
115.3    undersized+degraded+peered  [194,267]   194         [194,267]   194
115.13   undersized+degraded+peered  [151,1122]  151         [151,1122]  151
116.12   undersized+degraded+peered  [288,726]   288         [288,726]   288


and when I query one of the inactive PGs, I see (among other things):

    "up": [
    288,
    726
    ],
    "acting": [
    288,
    726
    ],
    "acting_recovery_backfill": [
    "288",
    "726"
    ],

    "recovery_state": [
    {
    "name": "Started/Primary/Active",
    "enter_time": "2021-03-10T16:23:09.301174+0100",
    "might_have_unfound": [],
    "recovery_progress": {
    "backfill_targets": [],
    "waiting_on_backfill": [],
    "last_backfill_started": "MIN",
    "backfill_info": {
    "begin": "MIN",
    "end": "MIN",
    "objects": []
    },
    "peer_backfill_info": [],
    "backfills_in_flight": [],
    "recovering": [],
    "pg_backend": {
    "pull_from_peer": [],
    "pushing": []
    }
    }
    },
    {
    "name": "Started",
    "enter_time": "2021-03-10T16:23:08.297622+0100"
    }
    ],

So you can see that two out of the three OSDs on other hosts are indeed up 
and active. I also see the ceph-osd daemons running on those 
hosts, so the data is definitely there and the PG should be available. 
Do you have any idea why these PGs may be becoming inactive nonetheless? 
I am suspecting some kind of concurrency limit, but I wouldn't know 
which one that could be.


Thanks
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON slow ops and growing MON store

2021-02-26 Thread Janek Bevendorff
Since the full cluster restart and disabling logging to syslog, it's not 
a problem any more (for now).


Unfortunately, just disabling clog_to_monitors didn't have the wanted 
effect when I tried it yesterday. But I also believe that it is somehow 
related. I could not find any specific reason for the incident yesterday 
in the logs besides a few more RocksDB status and compact messages than 
usual, but that's more symptomatic.



On 26/02/2021 13:05, Mykola Golub wrote:

On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:


On the first MON, the command doesn’t even return, but I was able to
get a dump from the one I restarted most recently. The oldest ops
look like this:

 {
 "description": "log(1000 entries from seq 17876238 at 
2021-02-25T15:13:20.306487+0100)",
 "initiated_at": "2021-02-25T20:40:34.698932+0100",
 "age": 183.762551121,
 "duration": 183.762599201,

The mon stores cluster log messages in the mon db. You mentioned
problems with osds flooding with log messages. It looks related.

If you still observe the db growth you may try temporarily disabling
clog_to_monitors, i.e. set for all osds:

  clog_to_monitors = false

And see if it stops growing after this and if it helps with the slow
ops (it might make sense to restart mons if some look like they get
stuck). You can apply the config option on the fly (without restarting
the osds, e.g. with injectargs), but when re-enabling it you will
have to restart the osds to avoid crashes due to this bug [1].

[1] https://tracker.ceph.com/issues/48946


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON slow ops and growing MON store

2021-02-25 Thread Janek Bevendorff


> On 25. Feb 2021, at 22:17, Dan van der Ster  wrote:
> 
> Also did you solve your log spam issue here?
> https://tracker.ceph.com/issues/49161
> Surely these things are related?


No. But I noticed that DBG log spam only happens when log_to_syslog is enabled. 
systemd is smart enough to avoid filling up the disks/RAM, but it may still 
strain the whole system. I disabled that for now and enabled log_to_file again, 
which doesn’t ignore the configured debug level. I am pretty sure that’s a bug.
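
For anyone wanting to do the same, the switch is just the following (assuming 
the centralized config database):

  ceph config set global log_to_syslog false
  ceph config set global log_to_file true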

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON slow ops and growing MON store

2021-02-25 Thread Janek Bevendorff
Thanks, I’ll try that tomorrow.


> On 25. Feb 2021, at 21:59, Dan van der Ster  wrote:
> 
> Maybe the debugging steps in that insights tracker can be helpful
> anyway: https://tracker.ceph.com/issues/39955
> 
> -- dan
> 
> On Thu, Feb 25, 2021 at 9:27 PM Janek Bevendorff
>  wrote:
>> 
>> Thanks for the tip, but I do not have degraded PGs and the module is already 
>> disabled.
>> 
>> 
>> On 25. Feb 2021, at 21:17, Seena Fallah  wrote:
>> 
>> I had the same problem in my cluster and it was because of the insights mgr 
>> module that was storing lots of data in RocksDB because my cluster was 
>> degraded.
>> If you have degraded pgs try to disable insights module.
>> 
>> On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster  
>> wrote:
>>> 
>>>> "source": "osd.104...
>>> 
>>> What's happening on that osd? Is it something new which corresponds to when
>>> your mon started growing? Are other OSDs also flooding the mons with logs?
>>> 
>>> I'm mobile so can't check... Are those logging configs the defaults? If not
>>>  revert to default...
>>> 
>>> BTW do your mons have stable quorum or are they flapping with this load?
>>> 
>>> .. dan
>>> 
>>> 
>>> 
>>> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <
>>> janek.bevendo...@uni-weimar.de> wrote:
>>> 
>>>> Thanks, Dan.
>>>> 
>>>> On the first MON, the command doesn’t even return, but I was able to get a
>>>> dump from the one I restarted most recently. The oldest ops look like this:
>>>> 
>>>>{
>>>>"description": "log(1000 entries from seq 17876238 at
>>>> 2021-02-25T15:13:20.306487+0100)",
>>>>"initiated_at": "2021-02-25T20:40:34.698932+0100",
>>>>"age": 183.762551121,
>>>>"duration": 183.762599201,
>>>>"type_data": {
>>>>"events": [
>>>>{
>>>>"time": "2021-02-25T20:40:34.698932+0100",
>>>>"event": "initiated"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.698636+0100",
>>>>"event": "throttled"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.698932+0100",
>>>>"event": "header_read"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.701407+0100",
>>>>"event": "all_read"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.701455+0100",
>>>>"event": "dispatched"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.701458+0100",
>>>>"event": "mon:_ms_dispatch"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.701459+0100",
>>>>"event": "mon:dispatch_op"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.701459+0100",
>>>>"event": "psvc:dispatch"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.701490+0100",
>>>>"event": "logm:wait_for_readable"
>>>>},
>>>>{
>>>>"time": "2021-02-25T20:40:34.701491+0100",
>>>>"event": "logm:wait_for_readable/paxos"
>>>>},

[ceph-users] Re: MON slow ops and growing MON store

2021-02-25 Thread Janek Bevendorff
Thanks for the tip, but I do not have degraded PGs and the module is already 
disabled.


> On 25. Feb 2021, at 21:17, Seena Fallah  wrote:
> 
> I had the same problem in my cluster and it was because of the insights mgr 
> module that was storing lots of data in RocksDB because my cluster was 
> degraded. 
> If you have degraded pgs try to disable insights module.
> 
> On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster  <mailto:d...@vanderster.com>> wrote:
> > "source": "osd.104...
> 
> What's happening on that osd? Is it something new which corresponds to when
> your mon started growing? Are other OSDs also flooding the mons with logs?
> 
> I'm mobile so can't check... Are those logging configs the defaults? If not
>  revert to default...
> 
> BTW do your mons have stable quorum or are they flapping with this load?
> 
> .. dan
> 
> 
> 
> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <
> janek.bevendo...@uni-weimar.de <mailto:janek.bevendo...@uni-weimar.de>> wrote:
> 
> > Thanks, Dan.
> >
> > On the first MON, the command doesn’t even return, but I was able to get a
> > dump from the one I restarted most recently. The oldest ops look like this:
> >
> > {
> > "description": "log(1000 entries from seq 17876238 at
> > 2021-02-25T15:13:20.306487+0100)",
> > "initiated_at": "2021-02-25T20:40:34.698932+0100",
> > "age": 183.762551121,
> > "duration": 183.762599201,
> > "type_data": {
> > "events": [
> > {
> > "time": "2021-02-25T20:40:34.698932+0100",
> > "event": "initiated"
> > },
> > {
> > "time": "2021-02-25T20:40:34.698636+0100",
> > "event": "throttled"
> > },
> > {
> > "time": "2021-02-25T20:40:34.698932+0100",
> > "event": "header_read"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701407+0100",
> > "event": "all_read"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701455+0100",
> > "event": "dispatched"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701458+0100",
> > "event": "mon:_ms_dispatch"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701459+0100",
> > "event": "mon:dispatch_op"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701459+0100",
> > "event": "psvc:dispatch"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701490+0100",
> > "event": "logm:wait_for_readable"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701491+0100",
> > "event": "logm:wait_for_readable/paxos"
> > },
> > {
> > "time": "2021-02-25T20:40:34.701496+0100",
> > "event": "paxos:wait_for_readable"
> > },
> > {
> > "time": "2021-02-25T20:40:34.989198+0100",
> > "event": "callback finished"
> > },
> > {
> > "time": "2021-02-25T20:40:34.989199+0100",
> > "event": "psvc:dispatch"
> >  

[ceph-users] Re: MON slow ops and growing MON store

2021-02-25 Thread Janek Bevendorff
Nothing special is going on on that OSD as far as I can tell, and the OSD number of 
each op is different.
The config isn’t entirely default, but we have been using it successfully for 
quite a while. It basically just redirects everything to journald so that we 
don’t have log creep. I reverted it nonetheless.

The MONs have a stable quorum, but the store size is now so large (35GB by this 
time) that I am seeing the first daemon restarts.
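
To keep an eye on the store size and to force a compaction, something like the 
following works; the store path is the default for a package-based install and 
may differ elsewhere:

  du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
  ceph tell mon.$(hostname -s) compact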


> On 25. Feb 2021, at 21:10, Dan van der Ster  wrote:
> 
> > "source": "osd.104...
> 
> What's happening on that osd? Is it something new which corresponds to when 
> your mon started growing? Are other OSDs also flooding the mons with logs?
> 
> I'm mobile so can't check... Are those logging configs the defaults? If not 
>  revert to default...
> 
> BTW do your mons have stable quorum or are they flapping with this load?
> 
> .. dan
> 
> 
> 
> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff 
> mailto:janek.bevendo...@uni-weimar.de>> 
> wrote:
> Thanks, Dan.
> 
> On the first MON, the command doesn’t even return, but I was able to get a 
> dump from the one I restarted most recently. The oldest ops look like this:
> 
> {
> "description": "log(1000 entries from seq 17876238 at 
> 2021-02-25T15:13:20.306487+0100)",
> "initiated_at": "2021-02-25T20:40:34.698932+0100",
> "age": 183.762551121,
> "duration": 183.762599201,
> "type_data": {
> "events": [
> {
> "time": "2021-02-25T20:40:34.698932+0100",
> "event": "initiated"
> },
> {
> "time": "2021-02-25T20:40:34.698636+0100",
> "event": "throttled"
> },
> {
> "time": "2021-02-25T20:40:34.698932+0100",
> "event": "header_read"
> },
> {
> "time": "2021-02-25T20:40:34.701407+0100",
> "event": "all_read"
> },
> {
> "time": "2021-02-25T20:40:34.701455+0100",
> "event": "dispatched"
> },
> {
> "time": "2021-02-25T20:40:34.701458+0100",
> "event": "mon:_ms_dispatch"
> },
> {
> "time": "2021-02-25T20:40:34.701459+0100",
> "event": "mon:dispatch_op"
> },
> {
> "time": "2021-02-25T20:40:34.701459+0100",
> "event": "psvc:dispatch"
> },
> {
> "time": "2021-02-25T20:40:34.701490+0100",
> "event": "logm:wait_for_readable"
> },
> {
> "time": "2021-02-25T20:40:34.701491+0100",
> "event": "logm:wait_for_readable/paxos"
> },
> {
> "time": "2021-02-25T20:40:34.701496+0100",
> "event": "paxos:wait_for_readable"
> },
> {
> "time": "2021-02-25T20:40:34.989198+0100",
> "event": "callback finished"
> },
> {
> "time": "2021-02-25T20:40:34.989199+0100",
> "event": "psvc:dispatch"
> },
> {
> "time": "2021-02-25T20:40:34.989208+0100",
> "event": "logm:preprocess_query"
> },
> {
> "time": "2021-02-25T20:40:34.989208+0100",
> "e

[ceph-users] Re: MON slow ops and growing MON store

2021-02-25 Thread Janek Bevendorff
per second, mostly RocksDB stuff, but nothing that actually 
looks serious or even log-worthy. I noticed before that, despite logging 
being set to warning level, the cluster log keeps being written to the MON log. 
But it shouldn’t cause such massive stability issues, should it? The date on 
the log op is also weird: 15:13+0100 was hours ago.

Here’s my log config:

global  advanced  clog_to_syslog_level             warning
global  basic     err_to_syslog                    true
global  basic     log_to_file                      false
global  basic     log_to_stderr                    false
global  basic     log_to_syslog                    true
global  advanced  mon_cluster_log_file_level       error
global  advanced  mon_cluster_log_to_file          false
global  advanced  mon_cluster_log_to_stderr        false
global  advanced  mon_cluster_log_to_syslog        false
global  advanced  mon_cluster_log_to_syslog_level  warning



Ceph version is 15.2.8.

Janek


> On 25. Feb 2021, at 20:33, Dan van der Ster  wrote:
> 
> ceph daemon mon.`hostname -s` ops
> 
> That should show you the accumulating ops.
> 
> .. dan
> 
> 
> On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff 
> mailto:janek.bevendo...@uni-weimar.de>> 
> wrote:
> Hi,
> 
> All of a sudden, we are experiencing very concerning MON behaviour. We have 
> five MONs and all of them have thousands to tens of thousands of slow ops, 
> the oldest one blocking basically indefinitely (at least the timer keeps 
> creeping up). Additionally, the MON stores keep inflating heavily. Under 
> normal circumstances we have about 450-550MB there. Right now it's 27GB and 
> growing (rapidly).
> 
> I tried restarting all MONs, I disabled auto-scaling (just in case) and 
> checked the system load and hardware. I also restarted the MGR and MDS 
> daemons, but to no avail.
> 
> Is there any way I can debug this properly? I can’t seem to find how I can 
> actually view what ops are causing this and what client (if any) may be 
> responsible for it.
> 
> Thanks
> Janek
> ___
> ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io>
> To unsubscribe send an email to ceph-users-le...@ceph.io 
> <mailto:ceph-users-le...@ceph.io>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MON slow ops and growing MON store

2021-02-25 Thread Janek Bevendorff
Hi,

All of a sudden, we are experiencing very concerning MON behaviour. We have 
five MONs and all of them have thousands to tens of thousands of slow ops, 
the oldest one blocking basically indefinitely (at least the timer keeps 
creeping up). Additionally, the MON stores keep inflating heavily. Under normal 
circumstances we have about 450-550MB there. Right now it's 27GB and growing 
(rapidly).

I tried restarting all MONs, I disabled auto-scaling (just in case) and checked 
the system load and hardware. I also restarted the MGR and MDS daemons, but to 
no avail.

Is there any way I can debug this properly? I can’t seem to find how I can 
actually view what ops are causing this and what client (if any) may be 
responsible for it.

Thanks
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-15 Thread Janek Bevendorff




My current settings are:

mds   advanced  mds_beacon_grace 15.00


True. I might as well remove it completely, it's an artefact of earlier 
experiments.




This should be a global setting. It is used by the mons and mdss.


mds   basic mds_cache_memory_limit 4294967296
mds   advanced  mds_cache_trim_threshold 393216
global    advanced  mds_export_ephemeral_distributed true
mds   advanced  mds_recall_global_max_decay_threshold 393216
mds   advanced  mds_recall_max_caps 3
mds   advanced  mds_recall_max_decay_threshold 98304
mds   advanced  mds_recall_warning_threshold 196608
global    advanced  mon_compact_on_start true

I haven't had any noticeable slowdowns or crashes in a while with 3
active MDSs and 3 hot standbys.

Thanks for sharing the settings that worked for you.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-15 Thread Janek Bevendorff

My current settings are:

mds   advanced  mds_beacon_grace 15.00
mds   basic mds_cache_memory_limit 4294967296
mds   advanced  mds_cache_trim_threshold 393216
global    advanced  mds_export_ephemeral_distributed true
mds   advanced  mds_recall_global_max_decay_threshold 393216
mds   advanced  mds_recall_max_caps 3
mds   advanced  mds_recall_max_decay_threshold 98304
mds   advanced  mds_recall_warning_threshold 196608
global    advanced  mon_compact_on_start true

I haven't had any noticeable slowdowns or crashes in a while with 3 
active MDSs and 3 hot standbys.
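
In case it helps anyone reproduce this, the values above can be applied via 
the centralized config, e.g.:

  ceph config set mds mds_cache_memory_limit 4294967296
  ceph config set mds mds_cache_trim_threshold 393216
  ceph config set global mds_export_ephemeral_distributed true

The remaining settings from the list follow the same pattern.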



On 14/12/2020 22:33, Patrick Donnelly wrote:

On Mon, Dec 7, 2020 at 12:06 PM Patrick Donnelly  wrote:

Hi Dan & Janek,

On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster  wrote:

My understanding is that the recall thresholds (see my list below)
should be scaled proportionally. OTOH, I haven't played with the decay
rates (and don't know if there's any significant value to tuning
those).

I haven't gone through this thread yet but I want to note for those
reading that we do now have documentation (thanks for the frequent
pokes Janek!) for the recall configurations:

https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall

Please let us know if it's missing information or if something could
be more clear.

I also now have a PR open for updating the defaults based on these and
other discussions: https://github.com/ceph/ceph/pull/38574

Feedback welcome.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mgr's stop responding, dropping out of cluster with _check_auth_rotating

2020-12-10 Thread Janek Bevendorff
FYI, this is the ceph-exporter we're using at the moment: 
https://github.com/digitalocean/ceph_exporter


It's not as good, but it does the job mostly. Some more specific metrics 
are missing, but the majority is there.



On 10/12/2020 19:01, Janek Bevendorff wrote:
Do you have the prometheus module enabled? Turn that off, it's causing 
issues. I replaced it with another ceph exporter from Github and 
almost forgot about it.


Here's the relevant issue report: 
https://tracker.ceph.com/issues/39264#change-179946


On 10/12/2020 16:43, Welby McRoberts wrote:

Hi Folks

We've noticed that in a cluster of 21 nodes (5 mgrs & 504 OSDs with 24 
per node), the mgrs are, after a non-specific period of time, dropping 
out of the cluster. The logs only show the following:

debug 2020-12-10T02:02:50.409+ 7f1005840700  0 
log_channel(cluster) log

[DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB
used, 6.3 PiB / 6.3 PiB avail
debug 2020-12-10T03:20:59.223+ 7f10624eb700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2020-12-10T02:20:59.226159+)
debug 2020-12-10T03:21:00.223+ 7f10624eb700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2020-12-10T02:21:00.226310+)

The _check_auth_rotating repeats approximately every second. The 
instances

are all syncing their time with NTP and have no issues on that front. A
restart of the mgr fixes the issue.

It appears that this may be related to 
https://tracker.ceph.com/issues/39264.

The suggestion seems to be to disable prometheus metrics, however, this
obviously isn't realistic for a production environment where metrics are
critical for operations.

Please let us know what additional information we can provide to 
assist in

resolving this critical issue.

Cheers
Welby
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mgr's stop responding, dropping out of cluster with _check_auth_rotating

2020-12-10 Thread Janek Bevendorff
Do you have the prometheus module enabled? Turn that off, it's causing 
issues. I replaced it with another ceph exporter from Github and almost 
forgot about it.


Here's the relevant issue report: 
https://tracker.ceph.com/issues/39264#change-179946
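
Disabling the module is just:

  ceph mgr module disable prometheus

ceph mgr module ls shows what is currently enabled.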


On 10/12/2020 16:43, Welby McRoberts wrote:

Hi Folks

We've noticed that in a cluster of 21 nodes (5 mgrs & 504 OSDs with 24
per node), the mgrs are, after a non-specific period of time, dropping
out of the cluster. The logs only show the following:

debug 2020-12-10T02:02:50.409+ 7f1005840700  0 log_channel(cluster) log
[DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB
used, 6.3 PiB / 6.3 PiB avail
debug 2020-12-10T03:20:59.223+ 7f10624eb700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2020-12-10T02:20:59.226159+)
debug 2020-12-10T03:21:00.223+ 7f10624eb700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2020-12-10T02:21:00.226310+)

The _check_auth_rotating repeats approximately every second. The instances
are all syncing their time with NTP and have no issues on that front. A
restart of the mgr fixes the issue.

It appears that this may be related to https://tracker.ceph.com/issues/39264.
The suggestion seems to be to disable prometheus metrics, however, this
obviously isn't realistic for a production environment where metrics are
critical for operations.

Please let us know what additional information we can provide to assist in
resolving this critical issue.

Cheers
Welby
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-08 Thread Janek Bevendorff




Wow! Distributed epins :) Thanks for trying it. How many
sub-directories under the distributed epin'd directory? (There's a lot
of stability problems that are to be fixed in Pacific associated with
lots of subtrees so if you have too large of a directory, things could
get ugly!)


Yay, beta testing in production! ^^

We are talking millions, but the tree is very deep, not very wide. 
That's why it's so hard to maintain manual pins. I enabled it on a few 
levels of the tree, where the largest one has 117 direct descendants 
(but several million files below). So far, it's working all right, but 
it is very hard to see whether the setting is actually effective. I enabled 
it for testing purposes on a directory that was (at that time) rather 
busy with 3k MDS op/s and I could see a handful of new pins come and go 
in ceph tell mds.0 get subtrees, but most of our directories are rather 
idle most of the time and manually browsing the tree isn't enough to 
trigger any new observable epins, it seems. So for the main directories 
where it actually matters, I can only assume that it's working.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff

Hi Patrick,

I haven't gone through this thread yet but I want to note for those
reading that we do now have documentation (thanks for the frequent
pokes Janek!) for the recall configurations:

https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall

Please let us know if it's missing information or if something could
be more clear.
The documentation has helped a great deal already and I've been playing 
around with the settings quite a bit recently. What's missing, obviously, are 
recommended settings for individual scenarios (at least ballparks). But 
that is hard to come by without experimenting first (I wouldn't call our 
deployment massive, but it is very likely significantly above average, and I 
don't know what scale the developers are usually testing at). As I 
mentioned in the other thread, I am testing Dan's recommendations at the 
moment and will refine them for our purposes. The effects of individual 
tweaks are hard to assess without dedicated benchmarks (although "MDS 
not hanging up" is already somewhat of a benchmark :-)).

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-07 Thread Janek Bevendorff




This sounds like there is one or a few clients acquiring too many
caps. Have you checked this? Are there any messages about the OOM
killer? What config changes for the MDS have you made?


Yes, it's individual clients acquiring too many caps. I first ran the 
adjusted recall settings you suggested after we had gone through several 
bugs. Right now I am trying distributed ephemeral pinning with 3 MDSs and 
Dan's suggestion of 6x the default values for recall from the MDS 
documentation thread. So far, it's working quite well.



I'm hopeful your problems will be addressed by:
https://tracker.ceph.com/issues/47307

That does indeed sound a bit like it might fix these kind of issues.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff
Never mind, when I enable it on a more busy directory, I do see new 
ephemeral pins popping up. Just not on the directories I set it on 
originally. Let's see how that holds up.


On 07/12/2020 13:04, Janek Bevendorff wrote:
Thanks. I tried playing around a bit with 
mds_export_ephemeral_distributed just now, because it's pretty much 
the same thing that your script does manually. Unfortunately, it seems 
to have no effect.


I pinned all top-level directories to mds.0 and then enabled 
ceph.dir.pin.distributed for a few sub trees. Despite 
mds_export_ephemeral_distributed being set to true, all work is done 
by mds.0 now and I also don't see any additional pins in ceph tell 
mds.\* get subtrees.


Any ideas why that might be?


On 07/12/2020 10:49, Dan van der Ster wrote:

On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff
 wrote:



What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.

mds_max_caps_per_client. As I mentioned, some clients hit this limit
regularly and they aren't entirely idle. I will keep tuning the recall
settings, though.


This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.

We only see the warnings during normal operation. I remember having
massive issues with early Nautilus releases, but thanks to more
aggressive recall behaviour in newer releases, that is fixed. Back then
it was virtually impossible to keep the MDS within the bounds of its
memory limit. Nowadays, the warnings only appear when the MDS is really
stressed. In that situation, the whole FS performance is already
degraded massively and MDSs are likely to fail and run into the 
rejoin loop.



Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).

Okay, I will definitely re-evaluate options for pinning individual
directories, perhaps a small script can do it.

There is a new ephemeral pinning option in the latest latest releases,
but we didn't try it yet.
Here's our script -- it assumes the parent dir is pinned to zero or
that bal is disabled:

https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard 



Too many pins can cause problems -- we have something like 700 pins at
the moment and it's fine, though.

Cheers, Dan




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff
Thanks. I tried playing around a bit with 
mds_export_ephemeral_distributed just now, because it's pretty much the 
same thing that your script does manually. Unfortunately, it seems to 
have no effect.


I pinned all top-level directories to mds.0 and then enabled 
ceph.dir.pin.distributed for a few sub trees. Despite 
mds_export_ephemeral_distributed being set to true, all work is done by 
mds.0 now and I also don't see any additional pins in ceph tell mds.\* 
get subtrees.


Any ideas why that might be?
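
For completeness, this is roughly how the pins are set, via the directory 
vxattrs on a mounted client (the mount point is a placeholder):

  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/<top-level-dir>
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/<subtree>
  getfattr -n ceph.dir.pin.distributed /mnt/cephfs/<subtree>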


On 07/12/2020 10:49, Dan van der Ster wrote:

On Mon, Dec 7, 2020 at 10:39 AM Janek Bevendorff
 wrote:



What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.

mds_max_caps_per_client. As I mentioned, some clients hit this limit
regularly and they aren't entirely idle. I will keep tuning the recall
settings, though.


This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.

We only see the warnings during normal operation. I remember having
massive issues with early Nautilus releases, but thanks to more
aggressive recall behaviour in newer releases, that is fixed. Back then
it was virtually impossible to keep the MDS within the bounds of its
memory limit. Nowadays, the warnings only appear when the MDS is really
stressed. In that situation, the whole FS performance is already
degraded massively and MDSs are likely to fail and run into the rejoin loop.


Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).

Okay, I will definitely re-evaluate options for pinning individual
directories, perhaps a small script can do it.

There is a new ephemeral pinning option in the latest latest releases,
but we didn't try it yet.
Here's our script -- it assumes the parent dir is pinned to zero or
that bal is disabled:

https://github.com/cernceph/ceph-scripts/blob/master/tools/cephfs/cephfs-bal-shard

Too many pins can cause problems -- we have something like 700 pins at
the moment and it's fine, though.

Cheers, Dan




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff




What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.


mds_max_caps_per_client. As I mentioned, some clients hit this limit 
regularly and they aren't entirely idle. I will keep tuning the recall 
settings, though.



This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.


We only see the warnings during normal operation. I remember having 
massive issues with early Nautilus releases, but thanks to more 
aggressive recall behaviour in newer releases, that is fixed. Back then 
it was virtually impossible to keep the MDS within the bounds of its 
memory limit. Nowadays, the warnings only appear when the MDS is really 
stressed. In that situation, the whole FS performance is already 
degraded massively and MDSs are likely to fail and run into the rejoin loop.



Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single threaded and CPU bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).


Okay, I will definitely re-evaluate options for pinning individual 
directories, perhaps a small script can do it.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-07 Thread Janek Bevendorff

Thanks, Dan!

I have played with many thresholds, including the decay rates. It is 
indeed very difficult to assess their effects, since our workloads 
differ widely depending on what people are working on at the moment. I 
would need to develop a proper benchmarking suite to simulate the 
different heavy workloads we have.



We currently run with all those options scaled up 6x the defaults, and
we almost never have caps recall warnings these days, with a couple
thousand cephfs clients.


Under normal operation, we don't either. We had issues in the past with 
Ganesha and still do sometimes, but that's a bug in Ganesha and we don't 
really use it for anything but legacy clients anyway. Usually, recall 
works flawlessly, unless some client suddenly starts doing crazy shit. 
We have just a few clients who regularly keep tens of thousands of caps 
open, and had I not limited the number, it would be hundreds of 
thousands. Recalling them without threatening stability is not trivial, 
and at the least it degrades the performance for everybody else. Any 
pointers to handling this situation better are greatly appreciated. 
I will definitely try your config recommendations.



2. A user running VSCodium, keeping 15k caps open.. the opportunistic
caps recall eventually starts recalling those but the (el7 kernel)
client won't release them. Stopping Codium seems to be the only way to
release.


As I said, 15k is not much for us. The limits right now are 64k per 
client and a few hit that limit quite regularly. One of those clients is 
our VPN gateway, which, technically, is not a single client, but to the 
CephFS it looks like one due to source NAT. This is certainly something 
I want to tune further, so that clients are routed directly via their 
private IP instead of being NAT'ed. The other ones are our GPU deep 
learning servers (just three of them, but they can generate astounding 
numbers of iops) and the 135-node Hadoop cluster (which is hard to 
sustain for any single machine, so we prefer to use the S3 here).
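
The 64k limit corresponds to mds_max_caps_per_client, i.e. roughly (assuming 
64k means 65536):

  ceph config set mds mds_max_caps_per_client 65536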



Otherwise, 4GB is normally sufficient in our env for
mds_cache_memory_limit (3 active MDSs), however this is highly
workload dependent. If several clients are actively taking 100s of
thousands of caps, then the 4GB MDS needs to be ultra busy recalling
caps and latency increases. We saw this live a couple weeks ago: a few
users started doing intensive rsyncs, and some other users noticed an
MD latency increase; it was fixed immediately just by increasing the
mem limit to 8GB.


So you too have 3 active MDSs? Are you using directory pinning? We have 
a very deep and unbalanced directory structure, so I cannot really pin 
any top-level directory without skewing the load massively. From my 
experience, three MDSs without explicit pinning aren't much better or 
even worse than one. But perhaps you have different observations?




I agree some sort of tuning best practises should all be documented
somehow, even though it's complex and rather delicate.


Indeed!


Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-06 Thread Janek Bevendorff




(Only one of our test clusters saw this happen so far, during mimic
days, and this provoked us to move all MDSs to 64GB VMs, with mds
cache mem limit = 4GB, so there is a large amount of RAM available in
case it's needed.


Ours are running on machines with 128GB RAM. I tried limits between 4 
and 40GB. But the higher the limit, the higher the fall after a crash.


We used to have three MDSs, now I am testing out one to see if that's 
more stable. At the moment, it runs fine, but we have also outsourced 
all the heavy lifting to S3.




What do the crashes look like with the TF training? Do you have a tracker?


At some point the MDS becomes laggy and is killed, and not even the hot 
standby is able to resume. There is nothing special going on. You only 
notice that the FS is suddenly degraded and the MDS daemons are playing 
Russian roulette until systemd pulls the plug due to too many daemon 
failures. At that point I have to fail the remaining ones, run systemctl 
reset-failed, delete the openfiles objects and restart the daemons.
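
Spelled out, the recovery dance is roughly the following; unit names and the 
metadata pool name depend on the deployment:

  ceph mds fail <name>                       # for each remaining MDS
  systemctl reset-failed ceph-mds@<name>.service
  rados -p <cephfs metadata pool> ls | grep openfiles
  rados -p <cephfs metadata pool> rm mds0_openfiles.0   # per object found
  systemctl start ceph-mds@<name>.service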



How many client sessions do need to crash an MDS?


Depends. Surprisingly, it can be as little as one big node with 1.5TB of 
RAM and a few hungry GPUs.




-- Dan



I guess we're pretty lucky with our CephFS's because we have more than
1k clients and it is pretty solid (though the last upgrade had a
hiccup decreasing down to single active MDS).

-- Dan



On Fri, Dec 4, 2020 at 8:20 PM Janek Bevendorff
 wrote:

This is a very common issue. Deleting mdsX_openfiles.Y has become part of
my standard maintenance repertoire. As soon as you have a few more
clients and one of them starts opening and closing files in rapid
succession (or does other metadata-heavy things), it becomes very likely
that the MDS crashes and is unable to recover.

There have been numerous fixes in the past, which improved the overall
stability, but it is far from perfect. I am happy to see another patch
in that direction, but I believe more effort needs to be spent here. It
is way too easy to DoS the MDS from a single client. Our 78-node CephFS
beats our old NFS RAID server in terms of throughput, but latency and
stability are way behind.

Janek

On 04/12/2020 11:39, Dan van der Ster wrote:

Excellent!

For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)

Cheers, Dan

On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov  wrote:

Thank you very much! This solution helped:

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.

We are back online. Amazing!!!  :)


On 04.12.2020 12:20, Dan van der Ster wrote:

Please also make sure the mds_beacon_grace is high on the mon's too.

it doesn't matter which mds you select to be the running one.

Is the processing getting killed, restarted?
If you're confident that the mds is getting OOM killed during rejoin
step, then you might find this useful:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.

-- Dan

On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov  wrote:

Yes, MDS eats all memory+swap, stays like this for a moment and then
frees memory.

mds_beacon_grace was already set to 1800

Also on other it is seen this message: Map has assigned me to become a
standby.

Does it matter, which MDS we stop and which we leave running?

Anton


On 04.12.2020 11:53, Dan van der Ster wrote:

How many active MDS's did you have? (max_mds == 1, right?)

Stop the other two MDS's so you can focus on getting exactly one running.
Tail the log file and see what it is reporting.
Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
while it is rejoining.

Is that single MDS running out of memory during the rejoin phase?

-- dan

On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov  wrote:

Hello community,

we are on ceph 13.2.8 - today something happenned with one MDS and cephs
status tells, that filesystem is degraded. It won't mount either. I have
take server with MDS, that was not working down. There are 2 more MDS
servers, but they stay in "rejoin" state. Also only 1 is shown in
"services", even though there are 2.

Both running MDS servers have these lines in their logs:

heartbeat_map is_healthy 'MDSRank' had timed out after 15
mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
28.8979s ago); MDS internal heartbeat is not healthy!

On one of MDS nodes I enabled more detailed debug, so I am getting there
also:

mds.beacon.mds3 Sending beacon up:standby seq 178
mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.00068

Makes no sense and too much stress in my head... Anyone could help please?

Anton.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-05 Thread Janek Bevendorff

On 05/12/2020 09:26, Dan van der Ster wrote:

Hi Janek,

I'd love to hear your standard maintenance procedures. Are you
cleaning up those open files outside of "rejoin" OOMs ?


No, of course not. But those rejoin problems happen more often than I'd 
like them to. It has become much better with recent releases, but if one 
of the clients trains a TensorFlow model from files in the CephFS or 
our Hadoop cluster starts reading from it, the MDS will almost 
certainly crash or at least degrade massively in performance. S3 doesn't 
have these problems at all, obviously.


That said, our metadata pool resides on rotating platters at the moment 
and we plan to move it to SSDs, but that should only fix latency issues 
and not the crash and rejoin problems (btw it doesn't matter how long 
you set the heartbeat interval, the rejoining MDS will always be 
replaced by a standby before it's finished).





I guess we're pretty lucky with our CephFS's because we have more than
1k clients and it is pretty solid (though the last upgrade had a
hiccup decreasing down to single active MDS).

-- Dan



On Fri, Dec 4, 2020 at 8:20 PM Janek Bevendorff
 wrote:

This is a very common issue. Deleting mdsX_openfiles.Y has become part of
my standard maintenance repertoire. As soon as you have a few more
clients and one of them starts opening and closing files in rapid
succession (or does other metadata-heavy things), it becomes very likely
that the MDS crashes and is unable to recover.

There have been numerous fixes in the past, which improved the overall
stability, but it is far from perfect. I am happy to see another patch
in that direction, but I believe more effort needs to be spent here. It
is way too easy to DoS the MDS from a single client. Our 78-node CephFS
beats our old NFS RAID server in terms of throughput, but latency and
stability are way behind.

Janek

On 04/12/2020 11:39, Dan van der Ster wrote:

Excellent!

For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)

Cheers, Dan

On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov  wrote:

Thank you very much! This solution helped:

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.

We are back online. Amazing!!!  :)


On 04.12.2020 12:20, Dan van der Ster wrote:

Please also make sure the mds_beacon_grace is high on the mon's too.

it doesn't matter which mds you select to be the running one.

Is the process getting killed or restarted?
If you're confident that the mds is getting OOM killed during rejoin
step, then you might find this useful:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.

-- Dan

On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov  wrote:

Yes, the MDS eats all memory+swap, stays like this for a moment and then
frees the memory.

mds_beacon_grace was already set to 1800

Also, on the other MDS we see this message: Map has assigned me to become a
standby.

Does it matter which MDS we stop and which we leave running?

Anton


On 04.12.2020 11:53, Dan van der Ster wrote:

How many active MDS's did you have? (max_mds == 1, right?)

Stop the other two MDS's so you can focus on getting exactly one running.
Tail the log file and see what it is reporting.
Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
while it is rejoining.

Is that single MDS running out of memory during the rejoin phase?

-- dan

On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov  wrote:

Hello community,

we are on ceph 13.2.8 - today something happened with one MDS and ceph
status reports that the filesystem is degraded. It won't mount either. I have
taken down the server with the MDS that was not working. There are 2 more MDS
servers, but they stay in the "rejoin" state. Also, only 1 is shown in
"services", even though there are 2.

Both running MDS servers have these lines in their logs:

heartbeat_map is_healthy 'MDSRank' had timed out after 15
mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
28.8979s ago); MDS internal heartbeat is not healthy!

On one of MDS nodes I enabled more detailed debug, so I am getting there
also:

mds.beacon.mds3 Sending beacon up:standby seq 178
mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.00068

This makes no sense to me and is causing a lot of stress... Could anyone help, please?

Anton.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-04 Thread Janek Bevendorff
This is a very common issue. Deleting mdsX_openfiles.Y has become part of 
my standard maintenance repertoire. As soon as you have a few more 
clients and one of them starts opening and closing files in rapid 
succession (or does other metadata-heavy things), it becomes very likely 
that the MDS crashes and is unable to recover.
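
For reference, a rough sketch of that cleanup step, assuming all MDS daemons 
have been stopped first and that the metadata pool is called 
cephfs_metadata_pool as in the command quoted further down (adjust the pool 
name to your setup):

import subprocess

POOL = 'cephfs_metadata_pool'  # assumption: adjust to your CephFS metadata pool

# List all objects in the metadata pool and remove the open-file trackers,
# which are named mds<rank>_openfiles.<n> (e.g. mds0_openfiles.0).
names = subprocess.run(['rados', '-p', POOL, 'ls'],
                       capture_output=True, text=True,
                       check=True).stdout.splitlines()

for name in names:
    if name.startswith('mds') and '_openfiles.' in name:
        print('removing', name)
        subprocess.run(['rados', '-p', POOL, 'rm', name], check=True)

Only do this while no MDS is running; with a single rank it is of course just 
as easy to run the rados rm command by hand.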


There have been numerous fixes in the past, which improved the overall 
stability, but it is far from perfect. I am happy to see another patch 
in that direction, but I believe more effort needs to be spent here. It 
is way too easy to DoS the MDS from a single client. Our 78-node CephFS 
beats our old NFS RAID server in terms of throughput, but latency and 
stability are way behind.


Janek

On 04/12/2020 11:39, Dan van der Ster wrote:

Excellent!

For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)

Cheers, Dan

On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov  wrote:

Thank you very much! This solution helped:

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.

We are back online. Amazing!!!  :)


On 04.12.2020 12:20, Dan van der Ster wrote:

Please also make sure the mds_beacon_grace is high on the mon's too.

it doesn't matter which mds you select to be the running one.

Is the process getting killed or restarted?
If you're confident that the mds is getting OOM killed during rejoin
step, then you might find this useful:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.

-- Dan

On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov  wrote:

Yes, the MDS eats all memory+swap, stays like this for a moment and then
frees the memory.

mds_beacon_grace was already set to 1800

Also, on the other MDS we see this message: Map has assigned me to become a
standby.

Does it matter which MDS we stop and which we leave running?

Anton


On 04.12.2020 11:53, Dan van der Ster wrote:

How many active MDS's did you have? (max_mds == 1, right?)

Stop the other two MDS's so you can focus on getting exactly one running.
Tail the log file and see what it is reporting.
Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
while it is rejoining.

Is that single MDS running out of memory during the rejoin phase?

-- dan

On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov  wrote:

Hello community,

we are on ceph 13.2.8 - today something happened with one MDS and ceph
status reports that the filesystem is degraded. It won't mount either. I have
taken down the server with the MDS that was not working. There are 2 more MDS
servers, but they stay in the "rejoin" state. Also, only 1 is shown in
"services", even though there are 2.

Both running MDS servers have these lines in their logs:

heartbeat_map is_healthy 'MDSRank' had timed out after 15
mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
28.8979s ago); MDS internal heartbeat is not healthy!

On one of MDS nodes I enabled more detailed debug, so I am getting there
also:

mds.beacon.mds3 Sending beacon up:standby seq 178
mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.00068

This makes no sense to me and is causing a lot of stress... Could anyone help, please?

Anton.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Janek Bevendorff

We are doing that as well. But we need to be able to check specific buckets 
additionally. For that we use this second approach.

Since we double-check all output from our script anyway (to see if NoSuchKey 
actually happens), we can rule out false positives.

So far all the files detected this way actually have the issue (they show up in 
the list of S3 Objects, but return a 404 on GET).

For our purposes, I wrote a Python script that enumerates all objects:

import boto3
from tqdm import tqdm

# endpoint, access_key, secret_key and bucket come from our site configuration
s3 = boto3.resource('s3', endpoint_url=endpoint,
                    aws_access_key_id=access_key,
                    aws_secret_access_key=secret_key)

file_list = []
for f in tqdm(s3.Bucket(bucket).objects.all(), desc='Listing objects...', unit=' obj'):
    file_list.append(f.key)

And then runs a Spark job that reads 1 byte from every object, recording 
any botocore.exceptions.ClientError:


from botocore.exceptions import ClientError

def test_object_absent(s3, obj_name):
    try:
        s3.Object(bucket, obj_name).get()['Body'].read(1)
        return False
    except ClientError:
        return True

With 600 executors on 130 hosts, it takes about 30 seconds for a 300k 
object bucket.
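
The driver side of that job is nothing special; a minimal sketch of how it 
might be wired up, assuming a running PySpark session and the endpoint, 
access_key, secret_key, bucket and file_list variables from the script above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('s3-object-check').getOrCreate()

def check_partition(keys):
    # One boto3 resource per partition; boto3 sessions don't serialize well.
    import boto3
    s3 = boto3.resource('s3', endpoint_url=endpoint,
                        aws_access_key_id=access_key,
                        aws_secret_access_key=secret_key)
    for key in keys:
        if test_object_absent(s3, key):
            yield key

missing = (spark.sparkContext
           .parallelize(file_list, numSlices=600)
           .mapPartitions(check_partition)
           .collect())
print(len(missing), 'objects returned an error on GET')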



On 19/11/2020 09:21, Janek Bevendorff wrote:
I would recommend you get a dump with rados ls -p poolname (can be 
several GB, mine is 61GB) and grep (or ack, which is faster) for the 
names there to get an overview of what is there and what isn't. 
Looking up the names directly can easily give you the wrong picture, 
because it is kinda complicated to derive the correct RADOS name from 
an S3 object name and a single typo will give you a not-found error, 
even if the object is there.


On 19/11/2020 09:12, Denis Krienbühl wrote:
Thanks, we are currently scanning our object storage. It looks like 
we can detect the missing objects that return “No Such Key” looking 
at all “__multipart_” objects returned by radosgw-admin bucket 
radoslist, and checking if they exist using rados stat. We are 
currently not looking at shadow objects as our approach already 
yields more instances of this problem.


On 19 Nov 2020, at 09:09, Janek Bevendorff 
 wrote:



- The head object had a size of 0.
- There was an object with a ’shadow’ in its name, belonging to 
that path.

That is normal. What is not normal is if there are NO shadow objects.

On 18/11/2020 10:06, Denis Krienbühl wrote:
It looks like a single-part object. But we did replace that object 
last night from backup, so I can’t know for sure if the lost one 
was like that.


Another engineer that looked at the Rados objects last night did 
notice two things:


- The head object had a size of 0.
- There was an object with a ’shadow’ in its name, belonging to 
that path.


I’m not knowledgeable about Rados, so I’m not sure this is helpful.

On 18 Nov 2020, at 10:01, Janek Bevendorff 
 wrote:


Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME 
--object=OBJECTNAME (forgot the "object" there)


On 18/11/2020 09:58, Janek Bevendorff wrote:
The object, a Docker layer, that went missing has not been 
touched in 2 months. It worked for a while, but then suddenly 
went missing.
Was the object a multipart object? You can check by running 
radosgw-admin stat --bucket=BUCKETNAME --object=OBJECTNAME. It 
should say something like "ns": "multipart" in the output. If it says 
"ns": "shadow", it's a single-part object.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Janek Bevendorff
I would recommend you get a dump with rados ls -p poolname (can be 
several GB, mine is 61GB) and grep (or ack, which is faster) for the 
names there to get an overview of what is there and what isn't. Looking 
up the names directly can easily give you the wrong picture, because it 
is kinda complicated to derive the correct RADOS name from an S3 object 
name and a single typo will give you a not-found error, even if the 
object is there.


On 19/11/2020 09:12, Denis Krienbühl wrote:

Thanks, we are currently scanning our object storage. It looks like we can 
detect the missing objects that return “No Such Key” looking at all 
“__multipart_” objects returned by radosgw-admin bucket radoslist, and checking 
if they exist using rados stat. We are currently not looking at shadow objects 
as our approach already yields more instances of this problem.
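
A rough sketch of what that check can look like, assuming the radosgw-admin 
and rados binaries are available, with BUCKET and POOL as placeholders for the 
actual bucket name and RGW data pool:

import subprocess

BUCKET = 'mybucket'                # placeholder bucket name
POOL = 'default.rgw.buckets.data'  # placeholder RGW data pool

rados_names = subprocess.run(
    ['radosgw-admin', 'bucket', 'radoslist', '--bucket=' + BUCKET],
    capture_output=True, text=True, check=True).stdout.splitlines()

for name in rados_names:
    if '__multipart_' not in name:
        continue
    # rados stat exits non-zero if the backing object is gone
    res = subprocess.run(['rados', '-p', POOL, 'stat', name],
                         capture_output=True, text=True)
    if res.returncode != 0:
        print('missing:', name)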


On 19 Nov 2020, at 09:09, Janek Bevendorff  
wrote:


- The head object had a size of 0.
- There was an object with a ’shadow’ in its name, belonging to that path.

That is normal. What is not normal is if there are NO shadow objects.

On 18/11/2020 10:06, Denis Krienbühl wrote:

It looks like a single-part object. But we did replace that object last night 
from backup, so I can’t know for sure if the lost one was like that.

Another engineer that looked at the Rados objects last night did notice two 
things:

- The head object had a size of 0.
- There was an object with a ’shadow’ in its name, belonging to that path.

I’m not knowledgeable about Rados, so I’m not sure this is helpful.


On 18 Nov 2020, at 10:01, Janek Bevendorff  
wrote:

Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME --object=OBJECTNAME (forgot the 
"object" there)

On 18/11/2020 09:58, Janek Bevendorff wrote:

The object, a Docker layer, that went missing has not been touched in 2 months. 
It worked for a while, but then suddenly went missing.

Was the object a multipart object? You can check by running radosgw-admin stat --bucket=BUCKETNAME --object=OBJECTNAME. 
It should say something like "ns": "multipart" in the output. If it says "ns": 
"shadow", it's a single-part object.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Janek Bevendorff

- The head object had a size of 0.
- There was an object with a ’shadow’ in its name, belonging to that path.

That is normal. What is not normal is if there are NO shadow objects.

On 18/11/2020 10:06, Denis Krienbühl wrote:

It looks like a single-part object. But we did replace that object last night 
from backup, so I can’t know for sure if the lost one was like that.

Another engineer that looked at the Rados objects last night did notice two 
things:

- The head object had a size of 0.
- There was an object with a ’shadow’ in its name, belonging to that path.

I’m not knowledgeable about Rados, so I’m not sure this is helpful.


On 18 Nov 2020, at 10:01, Janek Bevendorff  
wrote:

Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME --object=OBJECTNAME (forgot the 
"object" there)

On 18/11/2020 09:58, Janek Bevendorff wrote:

The object, a Docker layer, that went missing has not been touched in 2 months. 
It worked for a while, but then suddenly went missing.

Was the object a multipart object? You can check by running radosgw-admin stat --bucket=BUCKETNAME --object=OBJECTNAME. 
It should say something like "ns": "multipart" in the output. If it says "ns": 
"shadow", it's a single-part object.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-18 Thread Janek Bevendorff
Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME 
--object=OBJECTNAME (forgot the "object" there)


On 18/11/2020 09:58, Janek Bevendorff wrote:


The object, a Docker layer, that went missing has not been touched in 
2 months. It worked for a while, but then suddenly went missing.
Was the object a multipart object? You can check by running 
radosgw-admin stat --bucket=BUCKETNAME --object=OBJECTNAME. It should 
say something like "ns": "multipart" in the output. If it says "ns": 
"shadow", it's a single-part object.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-18 Thread Janek Bevendorff
FYI: I have had radosgw-admin gc list --include-all running every three 
minutes for a day, but the list has stayed empty. I haven't seen 
any further data loss either. I will keep it running until the next 
time I see an object vanish.
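
The periodic capture itself is nothing fancy; a small sketch of it, assuming 
radosgw-admin and an admin keyring are available on the host and the log path 
is adjusted to taste:

import datetime
import subprocess
import time

while True:
    out = subprocess.run(['radosgw-admin', 'gc', 'list', '--include-all'],
                         capture_output=True, text=True, check=True).stdout
    stamp = datetime.datetime.now().isoformat(timespec='seconds')
    # Append each snapshot with a timestamp so that a vanished object can
    # later be correlated with gc activity around the time it disappeared.
    with open('/var/log/rgw-gc-list.log', 'a') as f:
        f.write('--- ' + stamp + ' ---\n' + out + '\n')
    time.sleep(180)  # every three minutes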



On 17/11/2020 09:22, Janek Bevendorff wrote:


I have run radosgw-admin gc list (without --include-all) a few times 
already, but the list was always empty. I will create a cron job 
running it every few minutes and writing out the results.


On 17/11/2020 02:22, Eric Ivancich wrote:
I’m wondering if anyone experiencing this bug would mind running 
`radosgw-admin gc list --include-all` on a schedule and saving the 
results. I’d like to know whether these tail objects are getting 
removed by the gc process. If we find that that’s the case then 
there’s the issue of how they got on the gc list.


Eric


On Nov 16, 2020, at 3:48 AM, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


As noted in the bug report, the issue has affected only multipart 
objects at this time. I have added some more remarks there.


And yes, multipart objects tend to have 0 byte head objects in 
general. The affected objects are simply missing all shadow objects, 
leaving us with nothing but the empty head object and a few metadata.



On 13/11/2020 20:14, Eric Ivancich wrote:

Thank you for the answers to those questions, Janek.

And in case anyone hasn’t seen it, we do have a tracker for this issue:

https://tracker.ceph.com/issues/47866

We may want to move most of the conversation to the comments there, 
so everything’s together.


I do want to follow up on your answer to Question 4, Janek:

On Nov 13, 2020, at 12:22 PM, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


4. Is anyone experiencing this issue willing to run their RGWs 
with 'debug_ms=1'? That would allow us to see a request from an 
RGW to either remove a tail object or decrement its reference 
counter (and when its counter reaches 0 it will be deleted).


I haven't had any new data loss in the last few days (at least I 
think so, I read 1 byte from all objects, but didn't compare 
checksums, so I cannot say if all objects are complete, but at 
least all are there).


With multipart uploads I believe this is a sufficient test, as the 
first bit of data is in the first tail object, and it’s tail 
objects that seem to be disappearing.


However if the object is not uploaded via multipart and if it does 
have tail (_shadow_) objects, then the initial data is stored in 
the head object. So this test would not be truly diagnostic. This 
could be done with a large object, for example, with `s3cmd put 
--disable-multipart …`.


Eric

--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-17 Thread Janek Bevendorff
I have run radosgw-admin gc list (without --include-all) a few times 
already, but the list was always empty. I will create a cron job running 
it every few minutes and writing out the results.


On 17/11/2020 02:22, Eric Ivancich wrote:
I’m wondering if anyone experiencing this bug would mind running 
`radosgw-admin gc list --include-all` on a schedule and saving the 
results. I’d like to know whether these tail objects are getting 
removed by the gc process. If we find that that’s the case then 
there’s the issue of how they got on the gc list.


Eric


On Nov 16, 2020, at 3:48 AM, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


As noted in the bug report, the issue has affected only multipart 
objects at this time. I have added some more remarks there.


And yes, multipart objects tend to have 0 byte head objects in 
general. The affected objects are simply missing all shadow objects, 
leaving us with nothing but the empty head object and a few metadata.



On 13/11/2020 20:14, Eric Ivancich wrote:

Thank you for the answers to those questions, Janek.

And in case anyone hasn’t seen it, we do have a tracker for this issue:

https://tracker.ceph.com/issues/47866

We may want to move most of the conversation to the comments there, 
so everything’s together.


I do want to follow up on your answer to Question 4, Janek:

On Nov 13, 2020, at 12:22 PM, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


4. Is anyone experiencing this issue willing to run their RGWs 
with 'debug_ms=1'? That would allow us to see a request from an 
RGW to either remove a tail object or decrement its reference 
counter (and when its counter reaches 0 it will be deleted).


I haven't had any new data loss in the last few days (at least I 
think so, I read 1 byte from all objects, but didn't compare 
checksums, so I cannot say if all objects are complete, but at 
least all are there).


With multipart uploads I believe this is a sufficient test, as the 
first bit of data is in the first tail object, and it’s tail objects 
that seem to be disappearing.


However if the object is not uploaded via multipart and if it does 
have tail (_shadow_) objects, then the initial data is stored in the 
head object. So this test would not be truly diagnostic. This could 
be done with a large object, for example, with `s3cmd put 
--disable-multipart …`.


Eric

--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-16 Thread Janek Bevendorff
As noted in the bug report, the issue has affected only multipart 
objects at this time. I have added some more remarks there.


And yes, multipart objects tend to have 0 byte head objects in general. 
The affected objects are simply missing all shadow objects, leaving us 
with nothing but the empty head object and a few metadata.



On 13/11/2020 20:14, Eric Ivancich wrote:

Thank you for the answers to those questions, Janek.

And in case anyone hasn’t seen it, we do have a tracker for this issue:

https://tracker.ceph.com/issues/47866

We may want to move most of the conversation to the comments there, so 
everything’s together.


I do want to follow up on your answer to Question 4, Janek:

On Nov 13, 2020, at 12:22 PM, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


4. Is anyone experiencing this issue willing to run their RGWs with 
'debug_ms=1'? That would allow us to see a request from an RGW to 
either remove a tail object or decrement its reference counter (and 
when its counter reaches 0 it will be deleted).


I haven't had any new data loss in the last few days (at least I 
think so, I read 1 byte from all objects, but didn't compare 
checksums, so I cannot say if all objects are complete, but at least 
all are there).


With multipart uploads I believe this is a sufficient test, as the 
first bit of data is in the first tail object, and it’s tail objects 
that seem to be disappearing.


However if the object is not uploaded via multipart and if it does 
have tail (_shadow_) objects, then the initial data is stored in the 
head object. So this test would not be truly diagnostic. This could be 
done with a large object, for example, with `s3cmd put 
--disable-multipart …`.
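
One way to extend the read test so that it also covers large single-part 
objects is to request the last byte of each object via a ranged GET, which 
makes RGW fetch a tail object whenever one exists. A sketch, reusing the s3 
resource, bucket and file_list from the enumeration script quoted earlier in 
this thread (0-byte objects would need special handling, since a suffix range 
on an empty object is rejected):

from botocore.exceptions import ClientError

def test_tail_readable(s3, obj_name):
    try:
        # Ranged GET for the final byte; for objects with tail objects this
        # is served from the last tail rather than from the head object.
        s3.Object(bucket, obj_name).get(Range='bytes=-1')['Body'].read()
        return True
    except ClientError:
        return False

missing_tails = [key for key in file_list if not test_tail_readable(s3, key)]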


Eric

--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-13 Thread Janek Bevendorff


1. It seems like those reporting this issue are seeing it strictly 
after upgrading to Octopus. From what version did each of these sites 
upgrade to Octopus? From Nautilus? Mimic? Luminous?


I upgraded from the latest Luminous release.




2. Does anyone have any lifecycle rules on a bucket experiencing this 
issue? If so, please describe.


Nope.




3. Is anyone making copies of the affected objects (to same or to a 
different bucket) prior to seeing the issue? And if they are making 
copies, does the destination bucket have lifecycle rules? And if they 
are making copies, are those copies ever being removed?


We are not making copies, but we have bucket ACLs in place, which allow 
different users to access the objects. I doubt this is the problem 
though, otherwise we probably would have lost terabytes upon terabytes 
and not 16 objects so far.




4. Is anyone experiencing this issue willing to run their RGWs with 
'debug_ms=1'? That would allow us to see a request from an RGW to 
either remove a tail object or decrement its reference counter (and 
when its counter reaches 0 it will be deleted).


I haven't had any new data loss in the last few days (at least I think 
so, I read 1 byte from all objects, but didn't compare checksums, so I 
cannot say if all objects are complete, but at least all are there).





Thanks,

Eric


On Nov 12, 2020, at 4:54 PM, huxia...@horebdata.cn 
<mailto:huxia...@horebdata.cn> wrote:


Looks like this is a very dangerous bug for data safety. Hope the bug 
will be quickly identified and fixed.


best regards,

Samuel



huxia...@horebdata.cn <mailto:huxia...@horebdata.cn>

From: Janek Bevendorff
Date: 2020-11-12 18:17
To:huxia...@horebdata.cn <mailto:huxia...@horebdata.cn>; EDH - Manuel 
Rios; Rafael Lopez

CC: Robin H. Johnson; ceph-users
Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3 
list/radosgw bk
I have never seen this on Luminous. I recently upgraded to Octopus 
and the issue started occurring only a few weeks later.


On 12/11/2020 16:37, huxia...@horebdata.cn 
<mailto:huxia...@horebdata.cn> wrote:
which Ceph versions are affected by this RGW bug/issues? Luminous, 
Mimic, Octopus, or the latest?


any idea?

samuel



huxia...@horebdata.cn <mailto:huxia...@horebdata.cn>

From: EDH - Manuel Rios
Date: 2020-11-12 14:27
To: Janek Bevendorff; Rafael Lopez
CC: Robin H. Johnson; ceph-users
Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 
list/radosgw bk
This same error caused us to wipe a full cluster of 300TB... it is likely 
related to some rados index/database bug, not to S3.


As Janek explained, it is a major issue, because the error happens silently 
and you can only detect it with S3, when you're going to delete/purge 
an S3 bucket and it drops NoSuchKey. The error is not related to S3 logic.


Hope this time the devs can take enough time to find and resolve the 
issue. The error happens with low EC profiles, even with replica x3 in 
some cases.


Regards



-Mensaje original-
De: Janek Bevendorff <mailto:janek.bevendo...@uni-weimar.de>>

Enviado el: jueves, 12 de noviembre de 2020 14:06
Para: Rafael Lopez <mailto:rafael.lo...@monash.edu>>
CC: Robin H. Johnson <mailto:robb...@gentoo.org>>; ceph-users <mailto:ceph-users@ceph.io>>
Asunto: [ceph-users] Re: NoSuchKey on key that is visible in s3 
list/radosgw bk


Here is a bug report concerning (probably) this exact issue:
https://tracker.ceph.com/issues/47866

I left a comment describing the situation and my (limited) 
experiences with it.



On 11/11/2020 10:04, Janek Bevendorff wrote:


Yeah, that seems to be it. There are 239 objects prefixed
.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none
of the multiparts from the other file to be found and the head object
is 0 bytes.

I checked another multipart object with an end pointer of 11.
Surprisingly, it had way more than 11 parts (39 to be precise) named
.1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I
could find them in the dump at least.

I have no idea why the objects disappeared. I ran a Spark job over all
buckets, read 1 byte of every object and recorded errors. Of the 78
buckets, two are missing objects. One bucket is missing one object,
the other 15. So, luckily, the incidence is still quite low, but the
problem seems to be expanding slowly.


On 10/11/2020 23:46, Rafael Lopez wrote:

Hi Janek,

What you said sounds right - an S3 single part obj won't have an S3
multipart string as part of the prefix. S3 multipart string looks
like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".

From memory, single part S3 objects that don't fit in a single rados
object are assigned a random prefix that has nothing to do with
the object name, and the rados tail/data objects (not the head
object) have that prefix.
As per your working example, the prefix for that would be
'.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow"

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-12 Thread Janek Bevendorff
I have never seen this on Luminous. I recently upgraded to Octopus and 
the issue started occurring only a few weeks later.



On 12/11/2020 16:37, huxia...@horebdata.cn wrote:
which Ceph versions are affected by this RGW bug/issues? Luminous, 
Mimic, Octopus, or the latest?


any idea?

samuel


huxia...@horebdata.cn

*From:* EDH - Manuel Rios <mailto:mrios...@easydatahost.com>
*Date:* 2020-11-12 14:27
*To:* Janek Bevendorff <mailto:janek.bevendo...@uni-weimar.de>;
Rafael Lopez <mailto:rafael.lo...@monash.edu>
*CC:* Robin H. Johnson <mailto:robb...@gentoo.org>; ceph-users
<mailto:ceph-users@ceph.io>
*Subject:* [ceph-users] Re: NoSuchKey on key that is visible in s3
list/radosgw bk
This same error caused us to wipe a full cluster of 300TB... it is likely
related to some rados index/database bug, not to S3.
As Janek explained, it is a major issue, because the error happens
silently and you can only detect it with S3, when you're going to
delete/purge an S3 bucket and it drops NoSuchKey. The error is not related
to S3 logic.
Hope this time the devs can take enough time to find and resolve the
issue. The error happens with low EC profiles, even with replica x3 in
some cases.
Regards
-Mensaje original-
De: Janek Bevendorff 
Enviado el: jueves, 12 de noviembre de 2020 14:06
Para: Rafael Lopez 
CC: Robin H. Johnson ; ceph-users

Asunto: [ceph-users] Re: NoSuchKey on key that is visible in s3
list/radosgw bk
Here is a bug report concerning (probably) this exact issue:
https://tracker.ceph.com/issues/47866
I left a comment describing the situation and my (limited)
experiences with it.
    On 11/11/2020 10:04, Janek Bevendorff wrote:
>
> Yeah, that seems to be it. There are 239 objects prefixed
> .8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are
none
> of the multiparts from the other file to be found and the head
object
> is 0 bytes.
>
> I checked another multipart object with an end pointer of 11.
> Surprisingly, it had way more than 11 parts (39 to be precise)
named
> .1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I
> could find them in the dump at least.
>
> I have no idea why the objects disappeared. I ran a Spark job
over all
> buckets, read 1 byte of every object and recorded errors. Of the 78
> buckets, two are missing objects. One bucket is missing one object,
> the other 15. So, luckily, the incidence is still quite low, but
the
> problem seems to be expanding slowly.
>
>
> On 10/11/2020 23:46, Rafael Lopez wrote:
>> Hi Janek,
>>
>> What you said sounds right - an S3 single part obj won't have
an S3
>> multipart string as part of the prefix. S3 multipart string looks
>> like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".
>>
>> From memory, single part S3 objects that don't fit in a single
rados
>> object are assigned a random prefix that has nothing to do with
>> the object name, and the rados tail/data objects (not the head
>> object) have that prefix.
>> As per your working example, the prefix for that would be
>> '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239)
"shadow"
>> objects with names containing that prefix, and if you add up the
>> sizes it should be the size of your S3 object.
>>
>> You should look at working and non working examples of both single
>> and multipart S3 objects, as they are probably all a bit different
>> when you look in rados.
>>
>> I agree it is a serious issue, because once objects are no
longer in
>> rados, they cannot be recovered. If it was a case that there was a
>> link broken or rados objects renamed, then we could work to
>> recover...but as far as I can tell, it looks like stuff is just
>> vanishing from rados. The only explanation I can think of is some
>> (rgw or rados) background process is incorrectly doing
something with
>> these objects (eg. renaming/deleting). I had thought perhaps it
was a
>> bug with the rgw garbage collector..but that is pure speculation.
>>
>> Once you can articulate the problem, I'd recommend logging a bug
>> tracker upstream.
>>
>>
>> On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff
>> > <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>>
>> Here's something else I noticed: when I stat objects that work
>> 

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-12 Thread Janek Bevendorff
Here is a bug report concerning (probably) this exact issue: 
https://tracker.ceph.com/issues/47866


I left a comment describing the situation and my (limited) experiences 
with it.



On 11/11/2020 10:04, Janek Bevendorff wrote:


Yeah, that seems to be it. There are 239 objects prefixed 
.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none 
of the multiparts from the other file to be found and the head object 
is 0 bytes.


I checked another multipart object with an end pointer of 11. 
Surprisingly, it had way more than 11 parts (39 to be precise) named 
.1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I 
could find them in the dump at least.


I have no idea why the objects disappeared. I ran a Spark job over all 
buckets, read 1 byte of every object and recorded errors. Of the 78 
buckets, two are missing objects. One bucket is missing one object, 
the other 15. So, luckily, the incidence is still quite low, but the 
problem seems to be expanding slowly.



On 10/11/2020 23:46, Rafael Lopez wrote:

Hi Janek,

What you said sounds right - an S3 single part obj won't have an S3 
multipart string as part of the prefix. S3 multipart string looks 
like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".


From memory, single part S3 objects that don't fit in a single rados 
object are assigned a random prefix that has nothing to do with 
the object name, and the rados tail/data objects (not the head 
object) have that prefix.
As per your working example, the prefix for that would be 
'.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow" 
objects with names containing that prefix, and if you add up the 
sizes it should be the size of your S3 object.


You should look at working and non working examples of both single 
and multipart S3 objects, as they are probably all a bit different 
when you look in rados.


I agree it is a serious issue, because once objects are no longer in 
rados, they cannot be recovered. If it was a case that there was a 
link broken or rados objects renamed, then we could work to 
recover...but as far as I can tell, it looks like stuff is just 
vanishing from rados. The only explanation I can think of is some 
(rgw or rados) background process is incorrectly doing something with 
these objects (eg. renaming/deleting). I had thought perhaps it was a 
bug with the rgw garbage collector..but that is pure speculation.


Once you can articulate the problem, I'd recommend logging a bug 
tracker upstream.



On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


Here's something else I noticed: when I stat objects that work
via radosgw-admin, the stat info contains a "begin_iter" JSON
object with RADOS key info like this


                    "key": {
                        "name":
"29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz",
                        "instance": "",
                        "ns": ""
                    }


and then "end_iter" with key info like this:


                    "key": {
                        "name":
".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
                        "instance": "",
                        "ns": "shadow"
                    }

However, when I check the broken 0-byte object, the "begin_iter"
and "end_iter" keys look like this:


                    "key": {
                        "name":

"29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1",
                        "instance": "",
                        "ns": "multipart"
                    }

[...]


                    "key": {
                        "name":

"29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19",
                        "instance": "",
                        "ns": "multipart"
                    }

So, it's the full name plus a suffix and the namespace is
multipart, not shadow (or empty). This in itself may just be an
artefact of whether the object was uploaded in one go or as a
multipart object, but the second difference is that I cannot find
any of the multipart objects in my pool's object name dump. I
can, however, find the shadow RADOS object of the intact S3 object.




--
*Rafael Lopez*
Devops Systems Engineer
Monash University eResearch Centre

T: +61 3 9905 9118 
E: rafael.lo...@monash.edu <mailto:rafael.lo...@monash.edu>


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-11 Thread Janek Bevendorff
Yeah, that seems to be it. There are 239 objects prefixed 
.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none of 
the multiparts from the other file to be found and the head object is 0 
bytes.


I checked another multipart object with an end pointer of 11. 
Surprisingly, it had way more than 11 parts (39 to be precise) named .1, 
.1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I could 
find them in the dump at least.


I have no idea why the objects disappeared. I ran a Spark job over all 
buckets, read 1 byte of every object and recorded errors. Of the 78 
buckets, two are missing objects. One bucket is missing one object, the 
other 15. So, luckily, the incidence is still quite low, but the problem 
seems to be expanding slowly.



On 10/11/2020 23:46, Rafael Lopez wrote:

Hi Janek,

What you said sounds right - an S3 single part obj won't have an S3 
multipart string as part of the prefix. S3 multipart string looks like 
"2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".


From memory, single part S3 objects that don't fit in a single rados 
object are assigned a random prefix that has nothing to do with 
the object name, and the rados tail/data objects (not the head object) 
have that prefix.
As per your working example, the prefix for that would be 
'.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow" 
objects with names containing that prefix, and if you add up the sizes 
it should be the size of your S3 object.
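
That size check can be scripted; a rough sketch, assuming the rados CLI is 
available, POOL is the RGW data pool, and PREFIX is the prefix of the working 
example mentioned above:

import subprocess

POOL = 'default.rgw.buckets.data'            # placeholder data pool name
PREFIX = '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'  # prefix of the working example

names = subprocess.run(['rados', '-p', POOL, 'ls'],
                       capture_output=True, text=True,
                       check=True).stdout.splitlines()

total = 0
for name in names:
    if PREFIX not in name:
        continue
    stat = subprocess.run(['rados', '-p', POOL, 'stat', name],
                          capture_output=True, text=True, check=True).stdout
    # rados stat prints "<object> mtime <timestamp>, size <bytes>"
    total += int(stat.rsplit('size', 1)[-1])

print('sum of tail object sizes:', total)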


You should look at working and non working examples of both single and 
multipart S3 objects, as they are probably all a bit different when 
you look in rados.


I agree it is a serious issue, because once objects are no longer in 
rados, they cannot be recovered. If it was a case that there was a 
link broken or rados objects renamed, then we could work to 
recover...but as far as I can tell, it looks like stuff is just 
vanishing from rados. The only explanation I can think of is some (rgw 
or rados) background process is incorrectly doing something with these 
objects (eg. renaming/deleting). I had thought perhaps it was a bug 
with the rgw garbage collector..but that is pure speculation.


Once you can articulate the problem, I'd recommend logging a bug 
tracker upstream.



On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


Here's something else I noticed: when I stat objects that work via
radosgw-admin, the stat info contains a "begin_iter" JSON object
with RADOS key info like this


                    "key": {
                        "name":
"29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz",
                        "instance": "",
                        "ns": ""
                    }


and then "end_iter" with key info like this:


                    "key": {
                        "name":
".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
                        "instance": "",
                        "ns": "shadow"
                    }

However, when I check the broken 0-byte object, the "begin_iter"
and "end_iter" keys look like this:


                    "key": {
                        "name":

"29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1",
                        "instance": "",
                        "ns": "multipart"
                    }

[...]


                    "key": {
                        "name":

"29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19",
                        "instance": "",
                        "ns": "multipart"
                    }

So, it's the full name plus a suffix and the namespace is
multipart, not shadow (or empty). This in itself may just be an
artefact of whether the object was uploaded in one go or as a
multipart object, but the second difference is that I cannot find
any of the multipart objects in my pool's object name dump. I can,
however, find the shadow RADOS object of the intact S3 object.




--
*Rafael Lopez*
Devops Systems Engineer
Monash University eResearch Centre

T: +61 3 9905 9118 
E: rafael.lo...@monash.edu <mailto:rafael.lo...@monash.edu>


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-10 Thread Janek Bevendorff
We are having the exact same problem (also Octopus). The object is listed by 
s3cmd, but trying to download it results in a 404 error. radosgw-admin object 
stat shows that the object still exists. Any further ideas how I can restore 
access to this object?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-10 Thread Janek Bevendorff
Here's something else I noticed: when I stat objects that work via 
radosgw-admin, the stat info contains a "begin_iter" JSON object with RADOS key 
info like this


"key": {
"name": 
"29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz",
"instance": "",
"ns": ""
}


and then "end_iter" with key info like this:


"key": {
"name": ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
"instance": "",
"ns": "shadow"
}


However, when I check the broken 0-byte object, the "begin_iter" and "end_iter" 
keys look like this:


"key": {
"name": 
"29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1",
"instance": "",
"ns": "multipart"
}

[...]


"key": {
"name": 
"29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19",
"instance": "",
"ns": "multipart"
}


So, it's the full name plus a suffix and the namespace is multipart, not shadow 
(or empty). This in itself may just be an artefact of whether the object was 
uploaded in one go or as a multipart object, but the second difference is that 
I cannot find any of the multipart objects in my pool's object name dump. I 
can, however, find the shadow RADOS object of the intact S3 object.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-10 Thread Janek Bevendorff
I found some of the data in the rados ls dump. We host some WARCs from 
the Internet Archive and one affected WARC still has its warc.os.cdx.gz 
file intact, while the actual warc.gz is gone.


A rados stat revealed

WIDE-20110903143858-01166.warc.os.cdx.gz mtime 
2019-07-14T17:48:39.00+0200, size 1060428


for the cdx.gz file, but

WIDE-20110903143858-01166.warc.gz mtime 2019-07-14T17:04:49.00+0200, 
size 0


for the warc.gz.

I couldn't find any of the suffixed multipart objects listed in 
radosgw-admin stat.


WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19: 
(2) No such file or directory



On 10/11/2020 10:14, Janek Bevendorff wrote:
Thanks for the reply. This issue seems to be VERY serious. New objects 
are disappearing every day. This is a silent, creeping data loss.


I couldn't find the object with rados stat, but I am now listing all 
the objects and will grep the dump to see if there is anything left.


Janek

On 09/11/2020 23:31, Rafael Lopez wrote:

Hi Mariusz, all

We have seen this issue as well, on redhat ceph 4 (I have an 
unresolved case open). In our case, `radosgw-admin stat` is not a 
sufficient check to guarantee that there are rados objects. You have 
to do a `rados stat` to know that.


In your case, the object is ~48M in size and appears to also use S3 
multipart.
This means that, when uploaded, S3 will slice it up into parts based on 
the S3 multipart size you use (5M default, I think 8M here). After 
that, rados further slices the incoming multipart-sized objects into 
rados objects of 4 MiB size (default).


The end result is you have a bunch of rados objects labelled with the 
'prefix' from the `radosgw-admin stat` you ran, as well as a head 
object (named the same as the S3 object you uploaded) that contains 
the metadata so rgw knows how to put the S3 object back together. In 
our case, the head object is there but the other rados pieces that 
hold the actual data seem to be gone, so `radosgw-admin stat` returns 
fine, but we get NoSuchKey when trying to download.


Try `rados -p {rgw buckets pool} stat 
255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4`, 
it will show you the rados stat of the head object, which will be 
much smaller than the S3 object.


To check if you actually have all rados objects for this 48M S3 
object, try searching for parts of the prefix or the whole prefix on 
a list of all rados objects in buckets pool.
FYI, the `rados ls` will list every rados object in the bucket, so it 
may be very large and take a long time if you have many objects.


rados -p {rgw buckets pool} ls > {tmpfile}
grep '2~NTy88SkDkXR9ifSrrRcw5WPDxqN3PO2' {tmpfile}
grep 'juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' 
{tmpfile}


The first grep is actually the S3 multipart ID string added to the 
prefix by rgw.


Rafael

On Tue, 10 Nov 2020 at 01:04, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


    We are having the exact same problem (also Octopus). The object is
    listed by s3cmd, but trying to download it results in a 404 error.
    radosgw-admin object stat shows that the object still exists. Any
    further ideas how I can restore access to this object?

    (Sorry if this is a duplicate, but it seems like the mailing list
    hasn't
    accepted my original mail).


    > Mariusz Gronczewski wrote:
    >
    >
    >> On 2020-07-27, at 21:31:33,
    >> "Robin H. Johnson" <robb...@gentoo.org> wrote:
    >>
    >>
    >>>
On Mon, Jul 27, 2020 at 08:02:23PM +0200, Mariusz Gronczewski wrote:
    >>>
    >>>> Hi,
    >>>> 
I've got a problem on Octopus (15.2.3, debian packages) install,

    >>>> bucket S3 index shows a file:
    >>>> s3cmd ls s3://upvid/255/38355 --recursive
    >>>> 2020-07-27 17:48  50584342
    >>>>

s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
    >>>> radosgw-admin bi list also shows it
    >>>> {
    >>>> "type": "plain",
    >>>> "idx":
    >>>>
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
    >>>> "entry": { "name":
    >>>>
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
    >>>> "instance": "", "ver": {
    >>>> "pool": 11,
    >>>> "epoch": 853842
    >>>> },
    >>>> "locator": "",
    >>>> 

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-10 Thread Janek Bevendorff
Thanks for the reply. This issue seems to be VERY serious. New objects 
are disappearing every day. This is a silent, creeping data loss.


I couldn't find the object with rados stat, but I am now listing all the 
objects and will grep the dump to see if there is anything left.


Janek

On 09/11/2020 23:31, Rafael Lopez wrote:

Hi Mariusz, all

We have seen this issue as well, on redhat ceph 4 (I have an 
unresolved case open). In our case, `radosgw-admin stat` is not a 
sufficient check to guarantee that there are rados objects. You have 
to do a `rados stat` to know that.


In your case, the object is ~48M in size and appears to also use S3 
multipart.
This means that, when uploaded, S3 will slice it up into parts based on 
the S3 multipart size you use (5M default, I think 8M here). After 
that, rados further slices the incoming multipart-sized objects into 
rados objects of 4 MiB size (default).
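
As a quick sanity check of that description against the manifest quoted 
further down in this thread (part_size 8388608, stripe_max_size 4194304, 
object size 50584342 bytes):

obj_size = 50584342     # "obj_size" in the manifest
part_size = 8388608     # "part_size" of the first rule (8 MiB)
stripe_max = 4194304    # "stripe_max_size" (4 MiB)

full_parts, last_part = divmod(obj_size, part_size)
print(full_parts, last_part)  # 6 full parts plus a 252694-byte final part,
                              # i.e. 7 parts, matching the "-7" etag suffix

# Each full 8 MiB part is striped into two 4 MiB rados objects, the small
# final part fits into one, and the multipart head object holds no data,
# so roughly this many rados tail objects would be expected:
tail_objects = full_parts * -(-part_size // stripe_max) + 1
print(tail_objects)           # 13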


The end result is you have a bunch of rados objects labelled with the 
'prefix' from the `radosgw-admin stat` you ran, as well as a head 
object (named the same as the S3 object you uploaded) that contains 
the metadata so rgw knows how to put the S3 object back together. In 
our case, the head object is there but the other rados pieces that 
hold the actual data seem to be gone, so `radosgw-admin stat` returns 
fine, but we get NoSuchKey when trying to download.


Try `rados -p {rgw buckets pool} stat 
255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4`, 
it will show you the rados stat of the head object, which will be much 
smaller than the S3 object.


To check if you actually have all rados objects for this 48M S3 
object, try searching for parts of the prefix or the whole prefix on a 
list of all rados objects in buckets pool.
FYI, the `rados ls` will list every rados object in the bucket, so it 
may be very large and take a long time if you have many objects.


rados -p {rgw buckets pool} ls > {tmpfile}
grep '2~NTy88SkDkXR9ifSrrRcw5WPDxqN3PO2' {tmpfile}
grep 'juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' {tmpfile}

The first grep is actually the S3 multipart ID string added to the 
prefix by rgw.


Rafael

On Tue, 10 Nov 2020 at 01:04, Janek Bevendorff 
<mailto:janek.bevendo...@uni-weimar.de>> wrote:


We are having the exact same problem (also Octopus). The object is
listed by s3cmd, but trying to download it results in a 404 error.
radosgw-admin object stat shows that the object still exists. Any
further ideas how I can restore access to this object?

(Sorry if this is a duplicate, but it seems like the mailing list
hasn't
accepted my original mail).


> Mariusz Gronczewski wrote:
>
>
>> On 2020-07-27, at 21:31:33,
>> "Robin H. Johnson" <robb...@gentoo.org> wrote:
>>
>>
>>>
On Mon, Jul 27, 2020 at 08:02:23PM +0200, Mariusz Gronczewski wrote:
>>>
>>>> Hi,
>>>> I've got a problem on Octopus (15.2.3, debian packages) install,
>>>> bucket S3 index shows a file:
>>>> s3cmd ls s3://upvid/255/38355 --recursive
>>>> 2020-07-27 17:48  50584342
>>>>

s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
>>>> radosgw-admin bi list also shows it
>>>> {
>>>> "type": "plain",
>>>> "idx":
>>>>
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
>>>> "entry": { "name":
>>>>
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
>>>> "instance": "", "ver": {
>>>> "pool": 11,
>>>> "epoch": 853842
>>>> },
>>>> "locator": "",
>>>> "exists": "true",
>>>> "meta": {
>>>> "category": 1,
>>>> "size": 50584342,
>>>> "mtime": "2020-07-27T17:48:27.203008Z",
>>>> "etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
>>>> "storage_class": "",
>>>> "owner": "filmweb-app",
>>>> "owner_display_name": "

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-09 Thread Janek Bevendorff
We are having the exact same problem (also Octopus). The object is 
listed by s3cmd, but trying to download it results in a 404 error. 
radosgw-admin object stat shows that the object still exists. Any 
further ideas how I can restore access to this object?


(Sorry if this is a duplicate, but it seems like the mailing list hasn't 
accepted my original mail).




Mariusz Gronczewski wrote:



On 2020-07-27, at 21:31:33,
"Robin H. Johnson" wrote:




On Mon, Jul 27, 2020 at 08:02:23PM +0200, Mariusz Gronczewski wrote:


Hi,
I've got a problem on Octopus (15.2.3, debian packages) install,
bucket S3 index shows a file:
s3cmd ls s3://upvid/255/38355 --recursive
2020-07-27 17:48  50584342

s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
radosgw-admin bi list also shows it
{
"type": "plain",
"idx":
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
"entry": { "name":
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
"instance": "", "ver": {
"pool": 11,
"epoch": 853842
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 50584342,
"mtime": "2020-07-27T17:48:27.203008Z",
"etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
"storage_class": "",
"owner": "filmweb-app",
"owner_display_name": "filmweb app user",
"content_type": "",
"accounted_size": 50584342,
"user_data": "",
"appendable": "false"
},
"tag": "_3ubjaztglHXfZr05wZCFCPzebQf-ZFP",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}
},
but trying to download it via curl (I've set permissions to public)


only gets me
Does the RADOS object for this still exist?

try:
radosgw-admin object stat --bucket ... --object
'255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4'

If that doesn't return, then the backing object is gone, and you have
a stale index entry that can be cleaned up in most cases with check
bucket.
For cases where that doesn't fix it, my recommended way to fix it is
write a new 0-byte object to the same name, then delete it.





it does exist:

{
  "name":
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
"size": 50584342, "policy": {
  "acl": {...},
  "owner": {...}
  },
  "etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
  "tag": "_3ubjaztglHXfZr05wZCFCPzebQf-ZFP",
  "manifest": {
  "objs": [],
  "obj_size": 50584342,
  "explicit_objs": "false",
  "head_size": 0,
  "max_head_size": 0,
  "prefix":
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4.2~NTy88SkDkXR9ifSrrRcw5WPDxqN3PO2",
"rules": [ {
  "key": 0,
  "val": {
  "start_part_num": 1,
  "start_ofs": 0,
  "part_size": 8388608,
  "stripe_max_size": 4194304,
  "override_prefix": ""
  }
  },
  {
  "key": 50331648,
  "val": {
  "start_part_num": 7,
  "start_ofs": 50331648,
  "part_size": 252694,
  "stripe_max_size": 4194304,
  "override_prefix": ""
  }
  }
  ],
  "tail_instance": "",
  "tail_placement": {
  "bucket": {
  "name": "upvid",
  "marker":
"88d4f221-0da5-444d-81a8-517771278350.665933.2", "bucket_id":
"88d4f221-0da5-444d-81a8-517771278350.665933.2", "tenant": "",
  "explicit_placement": {
  "data_pool": "",
  "data_extra_pool": "",
  "index_pool": ""
  }
  },
  "placement_rule": "default-placement"
  },
  "begin_iter": {
  "part_ofs": 0,
  "stripe_ofs": 0,
  "ofs": 0,
  "stripe_size": 4194304,
  "cur_part_id": 1,
  "cur_stripe": 0,
  "cur_override_prefix": "",
  "location": {
  "placement_rule": "default-placement",
  "obj": {
  "bucket": {
  "name": "upvid",
  "marker":
"88d4f221-0da5-444d-81a8-517771278350.665933.2", "bucket_id":
"88d4f221-0da5-444d-81a8-517771278350.665933.2", "tenant": "",
  "explicit_placement": {
  "data_pool": "",
  "data_extra_pool": "",
  "index_pool": ""
  }
  },
  "key": {
  "name":
"255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4.2~NTy88SkDkXR9ifSrrRcw5WPDxqN3PO2.1",

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-04-01 Thread Janek Bevendorff

> I’m actually very curious how well this is performing for you as I’ve 
> definitely not seen a deployment this large. How do you use it?

What exactly do you mean? Our cluster has 11PiB capacity of which about
15% are used at the moment (web-scale corpora and such). We have
deployed 5 MONs and 5 MGRs (both on the same hosts) and it works totally
fine overall. We have some MDS performance issues here and there, but
that's not too bad anymore after a few upstream patches. And then we have
this annoying Prometheus MGR problem, which kills our MGRs reliably
after a few hours.

>
>> On Mar 27, 2020, at 11:47 AM, shubjero  wrote:
>>
>> I've reported stability problems with ceph-mgr w/ prometheus plugin
>> enabled on all versions we ran in production which were several
>> versions of Luminous and Mimic. Our solution was to disable the
>> prometheus exporter. I am using Zabbix instead. Our cluster is 1404
>> OSD's in size with about 9PB raw with around 35% utilization.
>>
>> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff
>>  wrote:
>>> Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
>>> failing constantly due to the prometheus module doing something funny.
>>>
>>>
>>> On 26/03/2020 18:10, Paul Choi wrote:
>>>> I won't speculate more into the MDS's stability, but I do wonder about
>>>> the same thing.
>>>> There is one file served by the MDS that would cause the ceph-fuse
>>>> client to hang. It was a file that many people in the company relied
>>>> on for data updates, so very noticeable. The only fix was to fail over
>>>> the MDS.
>>>>
>>>> Since the free disk space dropped, I haven't heard anyone complain...
>>>> 
>>>>
>>>> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
>>>> >>> <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>>>>
>>>>If there is actually a connection, then it's no wonder our MDS
>>>>kept crashing. Our Ceph has 9.2PiB of available space at the moment.
>>>>
>>>>
>>>>On 26/03/2020 17:32, Paul Choi wrote:
>>>>>    I can't quite explain what happened, but the Prometheus endpoint
>>>>>became stable after the free disk space for the largest pool went
>>>>>substantially lower than 1PB.
>>>>>I wonder if there's some metric that exceeds the maximum size for
>>>>>some int, double, etc?
>>>>>
>>>>>-Paul
>>>>>
>>>>>On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
>>>>>>>>><mailto:janek.bevendo...@uni-weimar.de>> wrote:
>>>>>
>>>>>I haven't seen any MGR hangs so far since I disabled the
>>>>>prometheus
>>>>>module. It seems like the module is not only slow, but kills
>>>>>the whole
>>>>>MGR when the cluster is sufficiently large, so these two
>>>>>issues are most
>>>>>likely connected. The issue has become much, much worse with
>>>>>14.2.8.
>>>>>
>>>>>
>>>>>On 23/03/2020 09:00, Janek Bevendorff wrote:
>>>>>> I am running the very latest version of Nautilus. I will
>>>>>try setting up
>>>>>> an external exporter today and see if that fixes anything.
>>>>>Our cluster
>>>>>> is somewhat large-ish with 1248 OSDs, so I expect stat
>>>>>collection to
>>>>>> take "some" time, but it definitely shouldn't crush the
>>>>>MGRs all the time.
>>>>>> On 21/03/2020 02:33, Paul Choi wrote:
>>>>>>> Hi Janek,
>>>>>>>
>>>>>>> What version of Ceph are you using?
>>>>>>> We also have a much smaller cluster running Nautilus, with
>>>>>no MDS. No
>>>>>>> Prometheus issues there.
>>>>>>> I won't speculate further than this but perhaps Nautilus
>>>>>doesn't have
>>>>>>> the same issue as Mimic?
>>>>>>>
>>>>>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>>>>>> >>>><mailto:janek.bevendo...@uni-weimar.de>
>>>>>>> <mailto:janek.bevendo...@uni-weimar.de
>>>>><mailto:janek

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread Janek Bevendorff
Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
failing constantly due to the prometheus module doing something funny.


On 26/03/2020 18:10, Paul Choi wrote:
> I won't speculate more into the MDS's stability, but I do wonder about
> the same thing.
> There is one file served by the MDS that would cause the ceph-fuse
> client to hang. It was a file that many people in the company relied
> on for data updates, so very noticeable. The only fix was to fail over
> the MDS.
>
> Since the free disk space dropped, I haven't heard anyone complain...
> 
>
> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
>  <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>
> If there is actually a connection, then it's no wonder our MDS
> kept crashing. Our Ceph has 9.2PiB of available space at the moment.
>
>
> On 26/03/2020 17:32, Paul Choi wrote:
>> I can't quite explain what happened, but the Prometheus endpoint
>> became stable after the free disk space for the largest pool went
>> substantially lower than 1PB.
>> I wonder if there's some metric that exceeds the maximum size for
>> some int, double, etc?
>>
>> -Paul
>>
>> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
>> > <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>>
>> I haven't seen any MGR hangs so far since I disabled the
>> prometheus
>> module. It seems like the module is not only slow, but kills
>> the whole
>> MGR when the cluster is sufficiently large, so these two
>> issues are most
>> likely connected. The issue has become much, much worse with
>> 14.2.8.
>>
>>
>> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> > I am running the very latest version of Nautilus. I will
>> try setting up
>> > an external exporter today and see if that fixes anything.
>> Our cluster
>> > is somewhat large-ish with 1248 OSDs, so I expect stat
>> collection to
>> > take "some" time, but it definitely shouldn't crush the
>> MGRs all the time.
>> >
>> > On 21/03/2020 02:33, Paul Choi wrote:
>> >> Hi Janek,
>> >>
>> >> What version of Ceph are you using?
>>     >> We also have a much smaller cluster running Nautilus, with
>> no MDS. No
>> >> Prometheus issues there.
>> >> I won't speculate further than this but perhaps Nautilus
>> doesn't have
>> >> the same issue as Mimic?
>> >>
>> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>> >> > <mailto:janek.bevendo...@uni-weimar.de>
>> >> <mailto:janek.bevendo...@uni-weimar.de
>> <mailto:janek.bevendo...@uni-weimar.de>>> wrote:
>> >>
>> >>     I think this is related to my previous post to this
>> list about MGRs
>> >>     failing regularly and being overall quite slow to
>> respond. The problem
>> >>     has existed before, but the new version has made it
>> way worse. My MGRs
>> >>     keep dyring every few hours and need to be restarted.
>> the Promtheus
>> >>     plugin works, but it's pretty slow and so is the
>> dashboard.
>> >>     Unfortunately, nobody seems to have a solution for
>> this and I
>> >>     wonder why
>> >>     not more people are complaining about this problem.
>> >>
>> >>
>> >>     On 20/03/2020 19:30, Paul Choi wrote:
>> >>     > If I "curl http://localhost:9283/metrics; and wait
>> sufficiently long
>> >>     > enough, I get this - says "No MON connection". But
>> the mons are
>> >>     health and
>> >>     > the cluster is functioning fine.
>> >>     > That said, the mons' rocksdb sizes are fairly big
>> because
>> >>     there's lots of
>> >>     > rebalancing going on. The Prometheus endpoint
>> hanging seems to
>> >>     happen
>> >>     &g

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-26 Thread Janek Bevendorff
If there is actually a connection, then it's no wonder our MDS kept
crashing. Our Ceph has 9.2PiB of available space at the moment.


On 26/03/2020 17:32, Paul Choi wrote:
> I can't quite explain what happened, but the Prometheus endpoint
> became stable after the free disk space for the largest pool went
> substantially lower than 1PB.
> I wonder if there's some metric that exceeds the maximum size for some
> int, double, etc?
>
> -Paul
>
> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
>  <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues
> are most
> likely connected. The issue has become much, much worse with 14.2.8.
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
> > I am running the very latest version of Nautilus. I will try
> setting up
> > an external exporter today and see if that fixes anything. Our
> cluster
> > is somewhat large-ish with 1248 OSDs, so I expect stat collection to
> > take "some" time, but it definitely shouldn't crush the MGRs all
> the time.
> >
> > On 21/03/2020 02:33, Paul Choi wrote:
> >> Hi Janek,
> >>
> >> What version of Ceph are you using?
> >> We also have a much smaller cluster running Nautilus, with no
> MDS. No
> >> Prometheus issues there.
>     >> I won't speculate further than this but perhaps Nautilus
> doesn't have
> >> the same issue as Mimic?
> >>
> >> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> >>  <mailto:janek.bevendo...@uni-weimar.de>
> >> <mailto:janek.bevendo...@uni-weimar.de
> <mailto:janek.bevendo...@uni-weimar.de>>> wrote:
> >>
> >>     I think this is related to my previous post to this list
> about MGRs
> >>     failing regularly and being overall quite slow to respond.
> The problem
> >>     has existed before, but the new version has made it way
> worse. My MGRs
> >>     keep dyring every few hours and need to be restarted. the
> Promtheus
> >>     plugin works, but it's pretty slow and so is the dashboard.
> >>     Unfortunately, nobody seems to have a solution for this and I
> >>     wonder why
> >>     not more people are complaining about this problem.
> >>
> >>
> >>     On 20/03/2020 19:30, Paul Choi wrote:
> >>     > If I "curl http://localhost:9283/metrics; and wait
> sufficiently long
> >>     > enough, I get this - says "No MON connection". But the
> mons are
> >>     health and
> >>     > the cluster is functioning fine.
> >>     > That said, the mons' rocksdb sizes are fairly big because
> >>     there's lots of
> >>     > rebalancing going on. The Prometheus endpoint hanging
> seems to
> >>     happen
> >>     > regardless of the mon size anyhow.
> >>     >
> >>     >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >>     >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >>     >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >>     >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >>     >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >>     >
> >>     > # fg
> >>     > curl -H "Connection: close" http://localhost:9283/metrics
> >>     >  >>     > "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >>     > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;>
> >>     > 
> >>     > 
> >>     >     
> >>     >     503 Service Unavailable
> >>     >     
> >>     >     #powered_by {
> >>     >         margin-top: 20px;
> >>     >         border-top: 2px solid black;
> >>     >         font-style: italic;
> >>     >     }
> >>     >
> >>     >     #traceback {
> >>     >         color: red;
> >>     >     }
> >>     >     
> >>     > 
> >>     &g

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I dug up this issue report, where the problem has been reported before:
https://tracker.ceph.com/issues/39264

Unfortunately, the issue hasn't got much (or any) attention yet. So
let's get this fixed; the prometheus module is unusable in its current
state.


On 23/03/2020 17:50, Janek Bevendorff wrote:
> I haven't seen any MGR hangs so far since I disabled the prometheus
> module. It seems like the module is not only slow, but kills the whole
> MGR when the cluster is sufficiently large, so these two issues are most
> likely connected. The issue has become much, much worse with 14.2.8.
>
>
> On 23/03/2020 09:00, Janek Bevendorff wrote:
>> I am running the very latest version of Nautilus. I will try setting up
>> an external exporter today and see if that fixes anything. Our cluster
>> is somewhat large-ish with 1248 OSDs, so I expect stat collection to
>> take "some" time, but it definitely shouldn't crush the MGRs all the time.
>>
>> On 21/03/2020 02:33, Paul Choi wrote:
>>> Hi Janek,
>>>
>>> What version of Ceph are you using?
>>> We also have a much smaller cluster running Nautilus, with no MDS. No
>>> Prometheus issues there.
>>> I won't speculate further than this but perhaps Nautilus doesn't have
>>> the same issue as Mimic?
>>>
>>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>>> >> <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>>>
>>> I think this is related to my previous post to this list about MGRs
>>> failing regularly and being overall quite slow to respond. The problem
>>> has existed before, but the new version has made it way worse. My MGRs
>>> keep dyring every few hours and need to be restarted. the Promtheus
>>> plugin works, but it's pretty slow and so is the dashboard.
>>> Unfortunately, nobody seems to have a solution for this and I
>>> wonder why
>>> not more people are complaining about this problem.
>>>
>>>
>>> On 20/03/2020 19:30, Paul Choi wrote:
>>> > If I "curl http://localhost:9283/metrics; and wait sufficiently long
>>> > enough, I get this - says "No MON connection". But the mons are
>>> health and
>>> > the cluster is functioning fine.
>>> > That said, the mons' rocksdb sizes are fairly big because
>>> there's lots of
>>> > rebalancing going on. The Prometheus endpoint hanging seems to
>>> happen
>>> > regardless of the mon size anyhow.
>>> >
>>> >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>>> >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>>> >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>>> >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>>> >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>>> >
>>> > # fg
>>> > curl -H "Connection: close" http://localhost:9283/metrics
>>> > >> > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;>
>>> > 
>>> > 
>>> >     
>>> >     503 Service Unavailable
>>> >     
>>> >     #powered_by {
>>> >         margin-top: 20px;
>>> >         border-top: 2px solid black;
>>> >         font-style: italic;
>>> >     }
>>> >
>>> >     #traceback {
>>> >         color: red;
>>> >     }
>>> >     
>>> > 
>>> >     
>>> >         503 Service Unavailable
>>> >         No MON connection
>>> >         Traceback (most recent call last):
>>> >   File
>>> "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670,
>>> > in respond
>>> >     response.body = self.handler()
>>> >   File
>>> "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
>>> > 217, in __call__
>>> >     self.body = self.oldhandler(*args, **kwargs)
>>> >   File
>>> "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61,
>>> > in __call__
>>> >     return self.callable(*self.args, **self.kw

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I haven't seen any MGR hangs so far since I disabled the prometheus
module. It seems like the module is not only slow, but kills the whole
MGR when the cluster is sufficiently large, so these two issues are most
likely connected. The issue has become much, much worse with 14.2.8.
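
(For anyone finding this thread later, disabling it is simply:

  ceph mgr module disable prometheus

and it can be re-enabled the same way with "enable" once this is fixed.)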


On 23/03/2020 09:00, Janek Bevendorff wrote:
> I am running the very latest version of Nautilus. I will try setting up
> an external exporter today and see if that fixes anything. Our cluster
> is somewhat large-ish with 1248 OSDs, so I expect stat collection to
> take "some" time, but it definitely shouldn't crush the MGRs all the time.
>
> On 21/03/2020 02:33, Paul Choi wrote:
>> Hi Janek,
>>
>> What version of Ceph are you using?
>> We also have a much smaller cluster running Nautilus, with no MDS. No
>> Prometheus issues there.
>> I won't speculate further than this but perhaps Nautilus doesn't have
>> the same issue as Mimic?
>>
>> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>> > <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>>
>> I think this is related to my previous post to this list about MGRs
>> failing regularly and being overall quite slow to respond. The problem
>> has existed before, but the new version has made it way worse. My MGRs
>> keep dyring every few hours and need to be restarted. the Promtheus
>> plugin works, but it's pretty slow and so is the dashboard.
>> Unfortunately, nobody seems to have a solution for this and I
>> wonder why
>> not more people are complaining about this problem.
>>
>>
>> On 20/03/2020 19:30, Paul Choi wrote:
>> > If I "curl http://localhost:9283/metrics; and wait sufficiently long
>> > enough, I get this - says "No MON connection". But the mons are
>> health and
>> > the cluster is functioning fine.
>> > That said, the mons' rocksdb sizes are fairly big because
>> there's lots of
>> > rebalancing going on. The Prometheus endpoint hanging seems to
>> happen
>> > regardless of the mon size anyhow.
>> >
>> >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>> >
>> > # fg
>> > curl -H "Connection: close" http://localhost:9283/metrics
>> > > > "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;>
>> > 
>> > 
>> >     
>> >     503 Service Unavailable
>> >     
>> >     #powered_by {
>> >         margin-top: 20px;
>> >         border-top: 2px solid black;
>> >         font-style: italic;
>> >     }
>> >
>> >     #traceback {
>> >         color: red;
>> >     }
>> >     
>> > 
>> >     
>> >         503 Service Unavailable
>> >         No MON connection
>> >         Traceback (most recent call last):
>> >   File
>> "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670,
>> > in respond
>> >     response.body = self.handler()
>> >   File
>> "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
>> > 217, in __call__
>> >     self.body = self.oldhandler(*args, **kwargs)
>> >   File
>> "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61,
>> > in __call__
>> >     return self.callable(*self.args, **self.kwargs)
>> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in
>> metrics
>> >     return self._metrics(instance)
>> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in
>> _metrics
>> >     raise cherrypy.HTTPError(503, 'No MON connection')
>> > HTTPError: (503, 'No MON connection')
>> > 
>> >     
>> >       
>> >         Powered by http://www.cherrypy.org;>CherryPy
>> 3.5.0
>> >       
>> >     
>> >     
>> > 
>> >
>>

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I am running the very latest version of Nautilus. I will try setting up
an external exporter today and see if that fixes anything. Our cluster
is somewhat large-ish with 1248 OSDs, so I expect stat collection to
take "some" time, but it definitely shouldn't crash the MGRs all the time.

On 21/03/2020 02:33, Paul Choi wrote:
> Hi Janek,
>
> What version of Ceph are you using?
> We also have a much smaller cluster running Nautilus, with no MDS. No
> Prometheus issues there.
> I won't speculate further than this but perhaps Nautilus doesn't have
> the same issue as Mimic?
>
> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
>  <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>
> I think this is related to my previous post to this list about MGRs
> failing regularly and being overall quite slow to respond. The problem
> has existed before, but the new version has made it way worse. My MGRs
> keep dyring every few hours and need to be restarted. the Promtheus
> plugin works, but it's pretty slow and so is the dashboard.
> Unfortunately, nobody seems to have a solution for this and I
> wonder why
> not more people are complaining about this problem.
>
>
> On 20/03/2020 19:30, Paul Choi wrote:
> > If I "curl http://localhost:9283/metrics; and wait sufficiently long
> > enough, I get this - says "No MON connection". But the mons are
> health and
> > the cluster is functioning fine.
> > That said, the mons' rocksdb sizes are fairly big because
> there's lots of
> > rebalancing going on. The Prometheus endpoint hanging seems to
> happen
> > regardless of the mon size anyhow.
> >
> >     mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> >     mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
> >
> > # fg
> > curl -H "Connection: close" http://localhost:9283/metrics
> >  > "-//W3C//DTD XHTML 1.0 Transitional//EN"
> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;>
> > 
> > 
> >     
> >     503 Service Unavailable
> >     
> >     #powered_by {
> >         margin-top: 20px;
> >         border-top: 2px solid black;
> >         font-style: italic;
> >     }
> >
> >     #traceback {
> >         color: red;
> >     }
> >     
> > 
> >     
> >         503 Service Unavailable
> >         No MON connection
> >         Traceback (most recent call last):
> >   File
> "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670,
> > in respond
> >     response.body = self.handler()
> >   File
> "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
> > 217, in __call__
> >     self.body = self.oldhandler(*args, **kwargs)
> >   File
> "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61,
> > in __call__
> >     return self.callable(*self.args, **self.kwargs)
> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in
> metrics
> >     return self._metrics(instance)
> >   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in
> _metrics
> >     raise cherrypy.HTTPError(503, 'No MON connection')
> > HTTPError: (503, 'No MON connection')
> > 
> >     
> >       
> >         Powered by http://www.cherrypy.org;>CherryPy
> 3.5.0
> >       
> >     
> >     
> > 
> >
> > On Fri, Mar 20, 2020 at 6:33 AM Paul Choi  <mailto:pc...@nuro.ai>> wrote:
> >
> >> Hello,
> >>
> >> We are running Mimic 13.2.8 with our cluster, and since
> upgrading to
> >> 13.2.8 the Prometheus plugin seems to hang a lot. It used to
> respond under
> >> 10s but now it often hangs. Restarting the mgr processes helps
> temporarily
> >> but within minutes it gets stuck again.
> >>
> >> The active mgr doesn't exit when doing `systemctl stop
> ceph-mgr.target"
> >> and needs to
> >>  be kill -9'ed.
> >>
> >&g

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-20 Thread Janek Bevendorff
I think this is related to my previous post to this list about MGRs
failing regularly and being overall quite slow to respond. The problem
has existed before, but the new version has made it way worse. My MGRs
keep dying every few hours and need to be restarted. The Prometheus
plugin works, but it's pretty slow, and so is the dashboard.
Unfortunately, nobody seems to have a solution for this, and I wonder
why not more people are complaining about this problem.


On 20/03/2020 19:30, Paul Choi wrote:
> If I "curl http://localhost:9283/metrics; and wait sufficiently long
> enough, I get this - says "No MON connection". But the mons are health and
> the cluster is functioning fine.
> That said, the mons' rocksdb sizes are fairly big because there's lots of
> rebalancing going on. The Prometheus endpoint hanging seems to happen
> regardless of the mon size anyhow.
>
> mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
> mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
>
> # fg
> curl -H "Connection: close" http://localhost:9283/metrics
>  "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;>
> 
> 
> 
> 503 Service Unavailable
> 
> #powered_by {
> margin-top: 20px;
> border-top: 2px solid black;
> font-style: italic;
> }
>
> #traceback {
> color: red;
> }
> 
> 
> 
> 503 Service Unavailable
> No MON connection
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670,
> in respond
> response.body = self.handler()
>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
> 217, in __call__
> self.body = self.oldhandler(*args, **kwargs)
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61,
> in __call__
> return self.callable(*self.args, **self.kwargs)
>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
> return self._metrics(instance)
>   File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
> raise cherrypy.HTTPError(503, 'No MON connection')
> HTTPError: (503, 'No MON connection')
> 
> 
>   
> Powered by http://www.cherrypy.org;>CherryPy 3.5.0
>   
> 
> 
> 
>
> On Fri, Mar 20, 2020 at 6:33 AM Paul Choi  wrote:
>
>> Hello,
>>
>> We are running Mimic 13.2.8 with our cluster, and since upgrading to
>> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond under
>> 10s but now it often hangs. Restarting the mgr processes helps temporarily
>> but within minutes it gets stuck again.
>>
>> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target"
>> and needs to
>>  be kill -9'ed.
>>
>> Is there anything I can do to address this issue, or at least get better
>> visibility into the issue?
>>
>> We only have a few plugins enabled:
>> $ ceph mgr module ls
>> {
>> "enabled_modules": [
>> "balancer",
>> "prometheus",
>> "zabbix"
>> ],
>>
>> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and it's
>> a busy one with lots of rebalancing. (I don't know if a busy cluster would
>> seriously affect the mgr's performance, but just throwing it out there)
>>
>>   services:
>> mon: 5 daemons, quorum
>> woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>> mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>> mds: cephfs-1/1/1 up  {0=woodenbox6=up:active}, 1 up:standby-replay
>> osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>> rgw: 4 daemons active
>>
>> Thanks in advance for your help,
>>
>> -Paul Choi
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGRs failing once per day and generally slow response times

2020-03-19 Thread Janek Bevendorff
Sorry for nagging, but is there a solution to this? Routinely restarting
my MGRs every few hours isn't how I want to spend my time (although I
guess I could schedule a cron job for that).
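
(If anyone wants the stopgap: a cron entry along these lines would do,
assuming systemd-managed daemons — interval and unit name to taste:

  # /etc/cron.d/restart-ceph-mgr
  0 */6 * * * root /usr/bin/systemctl restart ceph-mgr.target
)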


On 16/03/2020 09:35, Janek Bevendorff wrote:
> Over the weekend, all five MGRs failed, which means we have no more
> Prometheus monitoring data. We are obviously monitoring the MGR status
> as well, so we can detect the failure, but it's still a pretty serious
> issue. Any ideas as to why this might happen?
>
>
> On 13/03/2020 16:56, Janek Bevendorff wrote:
>> Indeed. I just had another MGR go bye-bye. I don't think host clock
>> skew is the problem.
>>
>>
>> On 13/03/2020 15:29, Anthony D'Atri wrote:
>>> Chrony does converge faster, but I doubt this will solve your
>>> problem if you don’t have quality peers. Or if it’s not really a
>>> time problem.
>>>
>>>> On Mar 13, 2020, at 6:44 AM, Janek Bevendorff
>>>>  wrote:
>>>>
>>>> I replaced ntpd with chronyd and will let you know if it changes
>>>> anything. Thanks.
>>>>
>>>>
>>>>> On 13/03/2020 06:25, Konstantin Shalygin wrote:
>>>>>> On 3/13/20 12:57 AM, Janek Bevendorff wrote:
>>>>>> NTPd is running, all the nodes have the same time to the second.
>>>>>> I don't think that is the problem.
>>>>> As always in such cases - try to switch your ntpd to default EL7
>>>>> daemon - chronyd.
>>>>>
>>>>>
>>>>>
>>>>> k
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGRs failing once per day and generally slow response times

2020-03-16 Thread Janek Bevendorff
Over the weekend, all five MGRs failed, which means we have no more 
Prometheus monitoring data. We are obviously monitoring the MGR status 
as well, so we can detect the failure, but it's still a pretty serious 
issue. Any ideas as to why this might happen?



On 13/03/2020 16:56, Janek Bevendorff wrote:
Indeed. I just had another MGR go bye-bye. I don't think host clock 
skew is the problem.



On 13/03/2020 15:29, Anthony D'Atri wrote:
Chrony does converge faster, but I doubt this will solve your problem 
if you don’t have quality peers. Or if it’s not really a time problem.


On Mar 13, 2020, at 6:44 AM, Janek Bevendorff 
 wrote:


I replaced ntpd with chronyd and will let you know if it changes 
anything. Thanks.




On 13/03/2020 06:25, Konstantin Shalygin wrote:

On 3/13/20 12:57 AM, Janek Bevendorff wrote:
NTPd is running, all the nodes have the same time to the second. I 
don't think that is the problem.
As always in such cases - try to switch your ntpd to default EL7 
daemon - chronyd.




k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
Bauhaus-Universität Weimar
Bauhausstr. 9a, Room 308
99423 Weimar, Germany

Phone: +49 (0)3643 - 58 3577
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGRs failing once per day and generally slow response times

2020-03-13 Thread Janek Bevendorff
Indeed. I just had another MGR go bye-bye. I don't think host clock skew 
is the problem.



On 13/03/2020 15:29, Anthony D'Atri wrote:

Chrony does converge faster, but I doubt this will solve your problem if you 
don’t have quality peers.  Or if it’s not really a time problem.


On Mar 13, 2020, at 6:44 AM, Janek Bevendorff  
wrote:

I replaced ntpd with chronyd and will let you know if it changes anything. 
Thanks.



On 13/03/2020 06:25, Konstantin Shalygin wrote:

On 3/13/20 12:57 AM, Janek Bevendorff wrote:
NTPd is running, all the nodes have the same time to the second. I don't think 
that is the problem.

As always in such cases - try to switch your ntpd to default EL7 daemon - 
chronyd.



k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Bauhaus-Universität Weimar
Bauhausstr. 9a, Room 308
99423 Weimar, Germany

Phone: +49 (0)3643 - 58 3577
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGRs failing once per day and generally slow response times

2020-03-13 Thread Janek Bevendorff
I replaced ntpd with chronyd and will let you know if it changes 
anything. Thanks.
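
(For checking the sync afterwards, the usual chrony tooling should do on
each node, e.g.:

  chronyc tracking
  chronyc sources -v
)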



On 13/03/2020 06:25, Konstantin Shalygin wrote:

On 3/13/20 12:57 AM, Janek Bevendorff wrote:
NTPd is running, all the nodes have the same time to the second. I 
don't think that is the problem. 


As always in such cases - try to switch your ntpd to default EL7 
daemon - chronyd.




k 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGRs failing once per day and generally slow response times

2020-03-12 Thread Janek Bevendorff

Hi Caspar,

NTPd is running, all the nodes have the same time to the second. I don't 
think that is the problem.


Janek


On 12/03/2020 12:02, Caspar Smit wrote:

Janek,

This error already should have put you in the right direction:

"possible clock skew"

Probably the date/times are too far apart on your nodes.
Make sure all your nodes are time synced using NTP

Kind regards,
Caspar

Op wo 11 mrt. 2020 om 09:47 schreef Janek Bevendorff <
janek.bevendo...@uni-weimar.de>:


Additional information: I just found this in the logs of one failed MGR:

2020-03-11 09:32:55.265 7f59dcb94700 -1 monclient: _check_auth_rotating
possible clock skew, rotating keys expired way too early (before
2020-03-11 08:32:55.268325)

It's the same message that used to appear previously when MGRs crashed,
so perhaps the overall issue is still the same, just massively accelerated.


On 11/03/2020 09:43, Janek Bevendorff wrote:

Hi,

I've always had some MGR stability issues with daemons crashing at
random times, but since the upgrade to 14.2.8 they regularly stop
responding after some time until I restart them (which I have to do at
least once a day).

I noticed right after the upgrade that the prometheus module was
entirely unresponsive and ceph fs status took about half a minute to
return. Once all the cluster chatter had settled and the PGs had been
rebalanced (auto-scale was messing with PGs after the upgarde), it
became usable again, but everything's still slower than before.
Prometheus takes several seconds to list metrics, ceph fs status takes
about 1-2 seconds.

However, after some time, MGRs stop responding and are kicked from the
list of standbys. With log level 5 all they are writing to the log
files is this:

2020-03-11 09:30:40.539 7f8f88984700  4 mgr[prometheus]
:::xxx.xxx.xxx.xxx - - [11/Mar/2020:09:30:40] "GET /metrics
HTTP/1.1" 200 - "" "Prometheus/2.15.2"
2020-03-11 09:30:41.371 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:43.392 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:45.412 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:47.436 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:49.460 7f8f9ee62700  4 mgr send_beacon standby

I have seen another email on this list complaining about slow ceph fs
status, I believe this issue is connected.

Besides the standard always-on modules I have enabled the prometheus,
dashboard, and telemetry modules.

Best
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Bauhaus-Universität Weimar
Bauhausstr. 9a, Room 308
99423 Weimar, Germany

Phone: +49 (0)3643 - 58 3577
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGRs failing once per day and generally slow response times

2020-03-11 Thread Janek Bevendorff

Additional information: I just found this in the logs of one failed MGR:

2020-03-11 09:32:55.265 7f59dcb94700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before 
2020-03-11 08:32:55.268325)


It's the same message that used to appear previously when MGRs crashed, 
so perhaps the overall issue is still the same, just massively accelerated.



On 11/03/2020 09:43, Janek Bevendorff wrote:

Hi,

I've always had some MGR stability issues with daemons crashing at 
random times, but since the upgrade to 14.2.8 they regularly stop 
responding after some time until I restart them (which I have to do at 
least once a day).


I noticed right after the upgrade that the prometheus module was 
entirely unresponsive and ceph fs status took about half a minute to 
return. Once all the cluster chatter had settled and the PGs had been 
rebalanced (auto-scale was messing with PGs after the upgarde), it 
became usable again, but everything's still slower than before. 
Prometheus takes several seconds to list metrics, ceph fs status takes 
about 1-2 seconds.


However, after some time, MGRs stop responding and are kicked from the 
list of standbys. With log level 5 all they are writing to the log 
files is this:


2020-03-11 09:30:40.539 7f8f88984700  4 mgr[prometheus] 
:::xxx.xxx.xxx.xxx - - [11/Mar/2020:09:30:40] "GET /metrics 
HTTP/1.1" 200 - "" "Prometheus/2.15.2"

2020-03-11 09:30:41.371 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:43.392 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:45.412 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:47.436 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:49.460 7f8f9ee62700  4 mgr send_beacon standby

I have seen another email on this list complaining about slow ceph fs 
status, I believe this issue is connected.


Besides the standard always-on modules I have enabled the prometheus, 
dashboard, and telemetry modules.


Best
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MGRs failing once per day and generally slow response times

2020-03-11 Thread Janek Bevendorff

Hi,

I've always had some MGR stability issues with daemons crashing at 
random times, but since the upgrade to 14.2.8 they regularly stop 
responding after some time until I restart them (which I have to do at 
least once a day).


I noticed right after the upgrade that the prometheus module was 
entirely unresponsive and ceph fs status took about half a minute to 
return. Once all the cluster chatter had settled and the PGs had been 
rebalanced (auto-scale was messing with PGs after the upgrade), it 
became usable again, but everything's still slower than before. 
Prometheus takes several seconds to list metrics, ceph fs status takes 
about 1-2 seconds.


However, after some time, MGRs stop responding and are kicked from the 
list of standbys. With log level 5 all they are writing to the log files 
is this:


2020-03-11 09:30:40.539 7f8f88984700  4 mgr[prometheus] 
:::xxx.xxx.xxx.xxx - - [11/Mar/2020:09:30:40] "GET /metrics 
HTTP/1.1" 200 - "" "Prometheus/2.15.2"

2020-03-11 09:30:41.371 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:43.392 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:45.412 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:47.436 7f8f9ee62700  4 mgr send_beacon standby
2020-03-11 09:30:49.460 7f8f9ee62700  4 mgr send_beacon standby
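
(For completeness: "log level 5" here means debug_mgr raised to 5, which
I set roughly like this, via the central config so it applies to all MGRs:

  ceph config set mgr debug_mgr 5
)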

I have seen another email on this list complaining about slow ceph fs 
status, I believe this issue is connected.


Besides the standard always-on modules I have enabled the prometheus, 
dashboard, and telemetry modules.


Best
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unexpected recovering after nautilus 14.2.7 -> 14.2.8

2020-03-05 Thread Janek Bevendorff
I also had some inadvertent recovery going on, although I think it 
started after I had restarted all MON, MGR, and MDS nodes and before I 
started restarting OSDs.
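
(Dan's noin suggestion below is a good one — next time I'll probably wrap
the restarts in something like:

  ceph osd set noout
  ceph osd set noin
  # ... restart / upgrade ...
  ceph osd unset noin
  ceph osd unset noout
)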



On 05/03/2020 09:49, Dan van der Ster wrote:

Did you have `144 total, 144 up, 144 in` also before the upgrade?
If an osd was out, then you upgraded/restarted and it went back in, it
would trigger data movement.
(I usually set noin before an upgrade).

-- dan

On Thu, Mar 5, 2020 at 9:46 AM Rainer Krienke  wrote:

I found some information in ceph.log that might help to find out what
happened. node2  was the one I rebooted:

2020-03-05 07:24:29.844953 osd.45 (osd.45) 483 : cluster [DBG] 36.323
scrub starts
2020-03-05 07:24:33.552221 osd.45 (osd.45) 484 : cluster [DBG] 36.323
scrub ok
2020-03-05 07:24:38.948404 mon.node2 (mon.0) 692706 : cluster [DBG]
osdmap e31855: 144 total, 144 up, 144 in
2020-03-05 07:24:39.969404 mon.node2 (mon.0) 692713 : cluster [DBG]
osdmap e31856: 144 total, 144 up, 144 in
2020-03-05 07:24:39.979238 mon.node2 (mon.0) 692714 : cluster [WRN]
Health check failed: 1 pools have many more objects per pg than average
(MANY_OBJECTS_PER_PG)
2020-03-05 07:24:40.533392 mon.node2 (mon.0) 692717 : cluster [DBG]
osdmap e31857: 144 total, 144 up, 144 in
2020-03-05 07:24:41.550395 mon.node2 (mon.0) 692728 : cluster [DBG]
osdmap e31858: 144 total, 144 up, 144 in
2020-03-05 07:24:41.598004 osd.127 (osd.127) 691 : cluster [DBG]
36.3eds0 starting backfill to osd.18(4) from (0'0,0'0] MAX to 31854'297918
2020-03-05 07:24:41.619293 osd.127 (osd.127) 692 : cluster [DBG]
36.3eds0 starting backfill to osd.49(5) from (0'0,0'0] MAX to 31854'297918
2020-03-05 07:24:41.631869 osd.127 (osd.127) 693 : cluster [DBG]
36.3eds0 starting backfill to osd.65(2) from (0'0,0'0] MAX to 31854'297918
2020-03-05 07:24:41.644089 osd.127 (osd.127) 694 : cluster [DBG]
36.3eds0 starting backfill to osd.97(3) from (0'0,0'0] MAX to 31854'297918
2020-03-05 07:24:41.656223 osd.127 (osd.127) 695 : cluster [DBG]
36.3eds0 starting backfill to osd.122(0) from (0'0,0'0] MAX to 31854'297918
2020-03-05 07:24:41.669265 osd.127 (osd.127) 696 : cluster [DBG]
36.3eds0 starting backfill to osd.134(1) from (0'0,0'0] MAX to 31854'297918
2020-03-05 07:24:41.582485 osd.69 (osd.69) 549 : cluster [DBG] 36.3fes0
starting backfill to osd.13(1) from (0'0,0'0] MAX to 31854'280018
2020-03-05 07:24:41.590541 osd.5 (osd.5) 349 : cluster [DBG] 36.3f2s0
starting backfill to osd.10(0) from (0'0,0'0] MAX to 31854'331157
2020-03-05 07:24:41.596496 osd.69 (osd.69) 550 : cluster [DBG] 36.3fes0
starting backfill to osd.25(5) from (0'0,0'0] MAX to 31854'280018
2020-03-05 07:24:41.601781 osd.86 (osd.86) 457 : cluster [DBG] 36.3ees0
starting backfill to osd.10(4) from (0'0,0'0] MAX to 31854'511090
2020-03-05 07:24:41.603864 osd.69 (osd.69) 551 : cluster [DBG] 36.3fes0
starting backfill to osd.58(2) from (0'0,0'0] MAX to 31854'280018
2020-03-05 07:24:41.610409 osd.69 (osd.69) 552 : cluster [DBG] 36.3fes0
starting backfill to osd.78(3) from (0'0,0'0] MAX to 31854'280018
2020-03-05 07:24:41.614494 osd.5 (osd.5) 350 : cluster [DBG] 36.3f2s0
starting backfill to osd.41(1) from (0'0,0'0] MAX to 31854'331157
2020-03-05 07:24:41.617208 osd.69 (osd.69) 553 : cluster [DBG] 36.3fes0
starting backfill to osd.99(0) from (0'0,0'0] MAX to 31854'280018
2020-03-05 07:24:41.622645 osd.86 (osd.86) 458 : cluster [DBG] 36.3ees0
starting backfill to osd.48(5) from (0'0,0'0] MAX to 31854'511090
2020-03-05 07:24:41.624049 osd.69 (osd.69) 554 : cluster [DBG] 36.3fes0
starting backfill to osd.121(4) from (0'0,0'0] MAX to 31854'280018
2020-03-05 07:24:41.625556 osd.5 (osd.5) 351 : cluster [DBG] 36.3f2s0
starting backfill to osd.61(3) from (0'0,0'0] MAX to 31854'331157
2020-03-05 07:24:41.631348 osd.86 (osd.86) 459 : cluster [DBG] 36.3ees0
starting backfill to osd.78(3) from (0'0,0'0] MAX to 31854'511090
2020-03-05 07:24:41.634572 osd.5 (osd.5) 352 : cluster [DBG] 36.3f2s0
starting backfill to osd.71(4) from (0'0,0'0] MAX to 31854'331157
2020-03-05 07:24:41.641651 osd.86 (osd.86) 460 : cluster [DBG] 36.3ees0
starting backfill to osd.90(0) from (0'0,0'0] MAX to 31854'511090
2020-03-05 07:24:41.644983 osd.5 (osd.5) 353 : cluster [DBG] 36.3f2s0
starting backfill to osd.122(5) from (0'0,0'0] MAX to 31854'331157
2020-03-05 07:24:41.649661 osd.86 (osd.86) 461 : cluster [DBG] 36.3ees0
starting backfill to osd.118(2) from (0'0,0'0] MAX to 31854'511090
2020-03-05 07:24:41.652407 osd.5 (osd.5) 354 : cluster [DBG] 36.3f2s0
starting backfill to osd.131(2) from (0'0,0'0] MAX to 31854'331157
2020-03-05 07:24:41.659823 osd.86 (osd.86) 462 : cluster [DBG] 36.3ees0
starting backfill to osd.139(1) from (0'0,0'0] MAX to 31854'511090
2020-03-05 07:24:42.055680 mon.node2 (mon.0) 692729 : cluster [INF]
osd.23 marked itself down
2020-03-05 07:24:42.055765 mon.node2 (mon.0) 692730 : cluster [INF]
osd.18 marked itself down
2020-03-05 07:24:42.055919 mon.node2 (mon.0) 692731 : cluster [INF]
osd.21 marked itself down
2020-03-05 07:24:42.056002 mon.node2 

[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.6

2020-01-30 Thread Janek Bevendorff
I can report similar results, although it's probably not just due to 
cluster size.


Our cluster has 1248 OSDs at the moment and we have three active MDSs to 
spread the metadata operations evenly. However, I noticed that it isn't 
spread evenly at all. Usually, it's just one MDS (in our case mds.1) 
which handles most of the load, slowing down the others as a result. 
What we see is a significantly higher latency curve for this one MDS 
than for the other two. All MDSs operate at 100-150% CPU utilisation 
when multiple clients (we have up to 320) are actively reading or 
writing data (note: we have quite an uneven data distribution, so 
directory pinning isn't really an option).


In the end, it turned out that some clients were running updatedb 
processes which tried to index the CephFS. After fixing that, the 
constant request load went down and with it the CPU load on the MDSs, 
but the underlying problem isn't solved of course. We just don't have 
any clients constantly operating on some of our largest directories anymore.
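
(For anyone wondering what "fixing that" amounts to: the usual approach is
to exclude the CephFS mounts from updatedb on the clients, roughly like
this in /etc/updatedb.conf — the mount path and fs type names here are
just examples for illustration:

  PRUNEFS="... ceph fuse.ceph-fuse"
  PRUNEPATHS="... /mnt/cephfs"
)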



On 29/01/2020 20:28, Neha Ojha wrote:

Hi Joe,

Can you grab a wallclock profiler dump from the mgr process and share
it with us? This was useful for us to get to the root cause of the
issue in 14.2.5.

Quoting Mark's suggestion from "[ceph-users] High CPU usage by
ceph-mgr in 14.2.5" below.

If you can get a wallclock profiler on the mgr process we might be able
to figure out specifics of what's taking so much time (ie processing
pg_summary or something else).  Assuming you have gdb with the python
bindings and the ceph debug packages installed, if you (are anyone)
could try gdbpmp on the 100% mgr process that would be fantastic.


https://github.com/markhpc/gdbpmp


gdbpmp.py -p`pidof ceph-mgr` -n 1000 -o mgr.gdbpmp


If you want to view the results:


gdbpmp.py -i mgr.gdbpmp -t 1

Thanks,
Neha



On Wed, Jan 29, 2020 at 7:35 AM  wrote:

Modules that are normally enabled:

ceph mgr module ls | jq -r '.enabled_modules'
[
   "dashboard",
   "prometheus",
   "restful"
]

We did test with all modules disabled, restarted the mgrs and saw no difference.

Joe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS randomly hangs with no useful error message

2020-01-26 Thread janek . bevendorff
The latest one. 14.2.6.

On 26 Jan 2020 4:23 pm, "Yan, Zheng"  wrote:

On Wed, Jan 22, 2020 at 6:22 PM Janek Bevendorff
 wrote:
>
> > I don't find any clue from the backtrace. please run 'ceph daemon
> > mds. dump_historic_ops' and ''ceph daemon mds.xxx perf reset; ceph
> > daemon mds.xxx perf dump'. send the outputs to us.
> >
> Hi, I assume you mean ceph daemon mds.xxx perf reset _all_?
>
> Here's the output of historic ops https://pastebin.com/yxvjJHY9
>
> and perf dump: https://pastebin.com/BfpAiYT7
>

which version your mds is?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Provide more documentation for MDS performance tuning on large file systems

2020-01-25 Thread Janek Bevendorff
Hello,

Over the last week I have tried optimising the performance of our MDS
nodes for the large number of files and concurrent clients we have. It
turns out that despite various stability fixes in recent releases, the
default configuration still doesn't appear to be optimal for keeping the
cache size under control and avoiding intermittent I/O blocks.

Unfortunately, it is very hard to tweak the configuration to something
that works, because the tuning parameters needed are largely
undocumented or only described in very technical terms in the source
code making them quite unapproachable for administrators not familiar
with all the CephFS internals. I would therefore like to ask if it were
possible to document the "advanced" MDS settings more clearly as to what
they do and in what direction they have to be tuned for more or less
aggressive cap recall, for instance (sometimes it is not clear if a
threshold is a min or a max threshold).

I am in the very (un)fortunate situation of having folders with
several 100K direct sub folders or files (and one extreme case with
almost 7 million dentries), which is a pretty good benchmark for
measuring cap growth while performing operations on them. For the time
being, I came up with this configuration, which seems to work for me,
but is still far from optimal:

mds basic    mds_cache_memory_limit  10737418240
mds advanced mds_cache_trim_threshold    131072
mds advanced mds_max_caps_per_client 50
mds advanced mds_recall_max_caps 17408
mds advanced mds_recall_max_decay_rate   2.00
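
(In case it saves someone a lookup: I set these through the central config
database, e.g.

  ceph config set mds mds_cache_memory_limit 10737418240
  ceph config set mds mds_recall_max_caps 17408

and the same pattern works for the other options listed above.)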

The parameters I am least sure about---because I understand the least
how they actually work---are mds_cache_trim_threshold and
mds_recall_max_decay_rate. Despite reading the description in
src/common/options.cc, I understand only half of what they're doing and
I am also not quite sure in which direction to tune them for optimal
results.

Another point where I am struggling is the correct configuration of
mds_recall_max_caps. The default of 5K doesn't work too well for me, but
values above 20K also don't seem to be a good choice. While high values
result in fewer blocked ops and better performance without destabilising
the MDS, they also lead to slow but unbounded cache growth, which seems
counter-intuitive. 17K was the maximum I could go. Higher values work
for most use cases, but when listing very large folders with millions of
dentries, the MDS cache size slowly starts to exceed the limit after a
few hours, since the MDSs are failing to keep clients below
mds_max_caps_per_client despite not showing any "failing to respond to
cache pressure" warnings.

With the configuration above, I do not have cache size issues any more,
but it comes at the cost of performance and slow/blocked ops. A few
hints as to how I could optimise my settings for better client
performance would be much appreciated and so would be additional
documentation for all the "advanced" MDS settings.

Thanks a lot
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2020-01-07 Thread janek . bevendorff
I had two MDS nodes. One was still active, but the other was stuck
rejoining, which already caused the FS to hang (i.e. it was down, yes).
Since at first I thought this was the old cache size bug, I deleted the
open files objects, and when that didn't seem to have an effect, I tried
restarting the MDS nodes, so then both were seemingly stuck rejoining.

The main difference was that the MDSs were still able to send beacons
fast enough, so they weren't killed and could eventually recover, but it
took a long time (for that to happen and for me to realise). In between,
I also tried failing all MDSs and spawning an entirely new one from a
clean slate, but I had the same problem there, so eventually I just
waited and it worked.

I figured it was trimming the cache, but I have no idea why and where
that cache came from. While the FS was down, I unmounted all 300+
clients, but until full recovery, "ceph fs status" would still claim
that all of them were connected, which was obviously not true.

On 7 Jan 2020 2:43 pm, Stefan Kooman  wrote:

Quoting Janek Bevendorff (janek.bevendo...@uni-weimar.de):

> Update: turns out I just had to wait for an hour. The MDSs were sending
> Beacons regularly, so the MONs didn't try to kill them and instead let
> them finish doing whatever they were doing.
> 
> Unlike the other bug where the number of open files outgrows what the
> MDS can handle, this incident allowed "self-healing", but I still
> consider this a severe bug.

Just to get this straight: was your fs offline during this time? Do you
have any idea why it was busy trimming its cache (because that was what
it was doing, right?).

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2020-01-06 Thread Janek Bevendorff
Update: turns out I just had to wait for an hour. The MDSs were sending
Beacons regularly, so the MONs didn't try to kill them and instead let
them finish doing whatever they were doing.

Unlike the other bug where the number of open files outgrows what the
MDS can handle, this incident allowed "self-healing", but I still
consider this a severe bug.


On 06/01/2020 12:05, Janek Bevendorff wrote:
> Hi, my MDS failed again, but this time I cannot recover it by deleting
> the mds*_openfiles .0 object. The startup behaviour is also different.
> Both inode count and cache size stay at zero while the MDS is replaying.
>
> When I set the MDS log level to 7, I get tons of these messages:
>
> 2020-01-06 11:59:49.303 7f30149e4700  7 mds.1.cache  current root is
> [dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
> state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
> rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
> hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
> 2020-01-06 11:59:49.323 7f30149e4700  7 mds.1.cache adjust_subtree_auth
> -1,-2 -> -2,-2 on [dir 0x1000ae4a784 /XXX/XXX/ [2,head] auth v=114
> cv=0/0 state=1073741824 f(v0 m2019-08-23 05:07:32.658490 9=9+0) n(v1
> rc2019-09-16 15:51:58.418555 b21646377 9=9+0) hs=0+0,ss=0+0 0x5608c602cd00]
> 2020-01-06 11:59:49.323 7f30149e4700  7 mds.1.cache  current root is
> [dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
> state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
> rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
> hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
> 2020-01-06 11:59:49.343 7f30149e4700  7 mds.1.cache adjust_subtree_auth
> -1,-2 -> -2,-2 on [dir 0x1000ae4a78b /XXX/XXX/ [2,head] auth v=102
> cv=0/0 state=1073741824 f(v0 m2019-08-23 05:07:35.046498 9=9+0) n(v1
> rc2019-09-16 15:51:58.478556 b1430317 9=9+0) hs=0+0,ss=0+0 0x5608c602d200]
> 2020-01-06 11:59:49.343 7f30149e4700  7 mds.1.cache  current root is
> [dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
> state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
> rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
> hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
> 2020-01-06 11:59:49.363 7f30149e4700  7 mds.1.cache adjust_subtree_auth
> -1,-2 -> -2,-2 on [dir 0x1000ae4a78e /XXX/XXX/ [2,head] auth v=91 cv=0/0
> state=1073741824 f(v0 m2019-08-23 05:07:38.986513 8=8+0) n(v1
> rc2019-09-16 15:51:58.498556 b1932614 8=8+0) hs=0+0,ss=0+0 0x5608c602d700]
> 2020-01-06 11:59:49.363 7f30149e4700  7 mds.1.cache  current root is
> [dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
> state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
> rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
> hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
>
> Is there any way I can recover the MDS? I tried wiping sessions on
> startup etc., but nothing worked.
>
> Thanks
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2020-01-06 Thread Janek Bevendorff
Hi, my MDS failed again, but this time I cannot recover it by deleting
the mds*_openfiles.0 object. The startup behaviour is also different.
Both inode count and cache size stay at zero while the MDS is replaying.
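
(By "deleting" I mean the usual workaround from the earlier thread,
roughly the following — metadata pool name assumed, and only with the
affected rank down:

  rados -p cephfs_metadata rm mds0_openfiles.0   # one object per MDS rank
)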

When I set the MDS log level to 7, I get tons of these messages:

2020-01-06 11:59:49.303 7f30149e4700  7 mds.1.cache  current root is
[dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
2020-01-06 11:59:49.323 7f30149e4700  7 mds.1.cache adjust_subtree_auth
-1,-2 -> -2,-2 on [dir 0x1000ae4a784 /XXX/XXX/ [2,head] auth v=114
cv=0/0 state=1073741824 f(v0 m2019-08-23 05:07:32.658490 9=9+0) n(v1
rc2019-09-16 15:51:58.418555 b21646377 9=9+0) hs=0+0,ss=0+0 0x5608c602cd00]
2020-01-06 11:59:49.323 7f30149e4700  7 mds.1.cache  current root is
[dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
2020-01-06 11:59:49.343 7f30149e4700  7 mds.1.cache adjust_subtree_auth
-1,-2 -> -2,-2 on [dir 0x1000ae4a78b /XXX/XXX/ [2,head] auth v=102
cv=0/0 state=1073741824 f(v0 m2019-08-23 05:07:35.046498 9=9+0) n(v1
rc2019-09-16 15:51:58.478556 b1430317 9=9+0) hs=0+0,ss=0+0 0x5608c602d200]
2020-01-06 11:59:49.343 7f30149e4700  7 mds.1.cache  current root is
[dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
2020-01-06 11:59:49.363 7f30149e4700  7 mds.1.cache adjust_subtree_auth
-1,-2 -> -2,-2 on [dir 0x1000ae4a78e /XXX/XXX/ [2,head] auth v=91 cv=0/0
state=1073741824 f(v0 m2019-08-23 05:07:38.986513 8=8+0) n(v1
rc2019-09-16 15:51:58.498556 b1932614 8=8+0) hs=0+0,ss=0+0 0x5608c602d700]
2020-01-06 11:59:49.363 7f30149e4700  7 mds.1.cache  current root is
[dir 0x1073682 /XXX/XXX/ [2,head] auth v=5527265 cv=0/0 dir_auth=1
state=1073741824 f(v0 m2019-08-14 16:39:17.790395 4=1+3) n(v84855
rc2019-09-17 08:54:57.569803 b3226894326662 5255834=4707755+548079)
hs=1+0,ss=0+0 | child=1 subtree=1 0x5608a02e7900]
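
(In case anyone wants to reproduce this: the MDS debug level can be
bumped at runtime e.g. like this, where the daemon name is a placeholder:)

ceph tell mds.<name> injectargs '--debug_mds 7'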

Is there any way I can recover the MDS? I tried wiping sessions on
startup etc., but nothing worked.

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-12-17 Thread Janek Bevendorff

Hey Patrick,

I just wanted to give you some feedback about how 14.2.5 is working for 
me. I've had the chance to test it for a day now and overall, the 
experience is much better, although not perfect (perhaps far from it).


I have two active MDS (I figured that'd spread the metadata load a 
little, and it seems to work pretty well for me). After the upgrade to 
the new release, I removed all special recall settings, so my MDS config 
is basically on defaults. The only things I set are an 
mds_max_caps_per_client of 200k, an mds_cache_reservation of 0.1 and an 
mds_cache_memory_limit of 40G.
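
(As a rough sketch, and assuming the central config database is used and 
the file system is called cephfs, that amounts to something like:)

ceph fs set cephfs max_mds 2
ceph config set mds mds_max_caps_per_client 200000
ceph config set mds mds_cache_reservation 0.1
ceph config set mds mds_cache_memory_limit 42949672960   # 40 GiB in bytes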


Right now, everything seems to be running smoothly, although I notice 
that the max cap setting isn't fully honoured. The overall cache size 
seems fairly constant at 15M (for mds.0, mds.1 a little less), but the 
client cap count can easily exceed 10M if I run something like `find` on 
a large directory.
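
(Per-client cap counts can be checked via the MDS admin socket, e.g. like 
this on the MDS host, with the daemon name being a placeholder:)

ceph daemon mds.<name> session ls | grep num_caps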


We have one particularly problematic folder containing about 400 
subfolders holding a total of about 35M files among them. My first attempts 
at running `find -type d` on those had the weird effect that after 
pretty much exactly 2M caps, mds.1 got killed and replaced by a standby. 
Fortunately, the standby managed to take over in a matter of seconds 
(sometimes up to a few minutes) resetting the cap count to about 5k. The 
same thing then happened again once the new MDS reached the magical 2M 
caps. I would suppose that this is still the same problem as before, but 
with the huge improvement that the take-over standby MDS can actually 
recover. Previously, it would just die the same way after a minute or 
two of futile recovery attempts and the FS would be down indefinitely 
until I deleted the openfiles object.


Right now, I cannot reproduce the crash any more: the caps surge to 
10-15M, but there is no crash. However, I keep seeing the dreaded "client 
failing to respond to cache pressure" message occasionally. So far, 
though, the MDS have been able to keep up and reduce the number of caps 
again after about 15M, so the message disappears after a while and the 
cap count growth isn't entirely unbounded. I ran a `find -type d` on the 
most problematic folder and attached two perf dumps for you (current cap 
count on the client: 14660568):


https://pastebin.com/W2dVJiW0
https://pastebin.com/pzQ5uQQ3
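
(For reference, such a dump can be taken on the MDS host via the admin 
socket; the daemon name is a placeholder:)

ceph daemon mds.<name> perf dump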

Cheers
Janek

P.S. Just as I was finishing this email, the rank 0 MDS actually 
crashed. Unfortunately, I didn't have increased debug levels enabled, so 
its death note is rather uninformative:


2019-12-17 09:42:12.325 7f7633dde700  1 mds.deltaweb011 Updating MDS map 
to version 103112 from mon.3
2019-12-17 09:43:27.774 7f7633dde700  1 mds.deltaweb011 Updating MDS map 
to version 103113 from mon.3
2019-12-17 09:43:40.086 7f7633dde700  1 mds.deltaweb011 Updating MDS map 
to version 103114 from mon.3

2019-12-17 09:44:46.203 7f7633dde700 -1 *** Caught signal (Aborted) **
 in thread 7f7633dde700 thread_name:ms_dispatch

Also, this time around the recovery appears to be a lot more 
problematic, so I'm afraid I'll have to apply the previous procedure of 
deleting the openfiles object again to get it back up. I don't think my 
`find` alone would have crashed the MDS, but if another client is doing 
similar things at the same time, it overloads the MDS.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

