[ceph-users] Re: Client failing to respond to capability release

2023-09-01 Thread Patrick Donnelly
Hello Frank,

On Tue, Aug 22, 2023 at 11:42 AM Frank Schilder  wrote:
>
> Hi all,
>
> I have had this warning all day (latest Octopus cluster):
>
> HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not 
> deep-scrubbed in time
> [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability 
> release
> mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 145698301
> mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 189511877
> mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 189511887
> mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 231250695
>
> If I look at the session info from mds.1 for these clients I see this:
>
> # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: 
> .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: 
> .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 
> -e 189511877 -e 189511887 -e 231250695
> {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 
> v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
> {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 
> v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
> {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 
> v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
> {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 
> v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
>
> We have mds_min_caps_per_client=4096, so it looks like the limit is well 
> satisfied. Also, the file system is pretty idle at the moment.
>
> Why and what exactly is the MDS complaining about here?

These days, you'll generally see this because the client is "quiet"
and the MDS is opportunistically recalling caps to reduce future work
for when it eventually needs to shrink its cache. This is indicated by:

* The MDS is not complaining about an oversized cache.
* The session listing shows the session is quiet (the
"session_cache_liveness" is near 0); see the example below.

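For example, to eyeball just those four sessions from your listing:

ceph tell mds.1 session ls | jq '.[] | select(.id == 145698301 or .id == 189511877 or .id == 189511887 or .id == 231250695)'
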
However, the MDS should respect mds_min_caps_per_client by (a) not
recalling caps below the mds_min_caps_per_client floor and (b) not
complaining when a quiet client already holds fewer caps than
mds_min_caps_per_client.

So, you may have found a bug. The next time this happens, a `ceph tell
mds.X config diff`, a `ceph tell mds.X perf dump`, and the relevant
portion of the `ceph tell mds.X session ls` output should help debug
this, I think.
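
A rough way to capture those when it happens again (mds.1 and the output
file names are just placeholders):

ceph tell mds.1 config diff > mds1-config-diff.json
ceph tell mds.1 perf dump   > mds1-perf-dump.json
ceph tell mds.1 session ls  > mds1-session-ls.json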

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: When to use the auth profiles simple-rados-client and profile simple-rados-client-with-blocklist?

2023-09-01 Thread Patrick Donnelly
Hello Christian,

On Tue, Aug 22, 2023 at 7:51 AM Christian Rohmann
 wrote:
>
> Hey ceph-users,
>
> 1) When configuring Gnocchi to use Ceph storage (see
> https://gnocchi.osci.io/install.html#ceph-requirements)
> I was wondering if one could use any of the auth profiles like
>   * simple-rados-client
>   * simple-rados-client-with-blocklist ?
>
> Or are those for different use cases?
>
> 2) I was also wondering why the documentation mentions "(Monitor only)"
> but then it says
> "Gives a user read-only permissions for monitor, OSD, and PG data."?
>
> 3) And are those profiles really for "read-only" users? Why don't they
> have "read-only" in their name, like the rbd profile and its
> "rbd-read-only" counterpart?

I don't know anything about Gnocchi (except the food) but to answer
the question in $SUBJECT:

https://docs.ceph.com/en/reef/rados/api/libcephsqlite/#user

You would want to use the simple-rados-client-with-blocklist profile
for a libcephsqlite application.
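
For reference, creating such a user would look roughly like the following
(the client name and pool are placeholders; the linked page has the exact
caps it recommends):

ceph auth get-or-create client.myapp \
    mon 'profile simple-rados-client-with-blocklist' \
    osd 'allow rwx pool=mypool'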

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation

2023-09-01 Thread Anthony D'Atri
Is a secure-erase suggested after the firmware update?  Sometimes manufacturers 
do that.

> On Sep 1, 2023, at 05:16, Frédéric Nass  
> wrote:
> 
> Hello, 
> 
> This message is to inform you that DELL has released new firmware for these 
> SSD drives to fix the 70,000 POH issue: 
> 
> - Toshiba A3B4 for model numbers PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160:
>   https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=69j5f=w12r2=poweredge-r730xd
> - Toshiba A4B4 for model numbers PX02SSF010, PX02SSF020, PX02SSF040 and PX02SSB080:
>   https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=31jmh=rt
> - Toshiba A5B4 for model numbers PX03SNF020, PX03SNF080 and PX03SNB160:
>   https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=tc8kc=rt
> 
> Based on our recent experience, this firmware brings dead SSD drives back to 
> life with their data intact (after the upgrade, you may need to import the 
> foreign config by pressing the 'F' key on the next start). 
> 
> Many thanks to DELL French TAMs and DELL engineering for providing this 
> firmware in a short time. 
> 
> Best regards, 
> Frédéric. 
> 
> - On 19 June 2023, at 10:46, Frédéric Nass wrote: 
> 
>> Hello,
> 
>> This message does not concern Ceph itself but a hardware vulnerability which 
>> can
>> lead to permanent loss of data on a Ceph cluster equipped with the same
>> hardware in separate fault domains.
> 
>> The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD 
>> drives
>> of the 13G generation of DELL servers are subject to a vulnerability which
>> renders them unusable after 70,000 hours of operation, i.e. approximately 7
>> years and 11 months of activity.
> 
>> This topic has been discussed here:
>> https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438
> 
>> The risk is all the greater since these disks may die at the same time in the
>> same server leading to the loss of all data in the server.
> 
>> To date, DELL has not provided any firmware fixing this vulnerability, the
>> latest firmware version being "A3B3" released on Sept. 12, 2016:
>> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
> 
>> If you have servers running these drives, check their uptime. If they are close
>> to the 70,000-hour limit, replace them immediately.
> 
>> The smartctl tool does not report the uptime for these SSDs, but if you have
>> HDDs in the server, you can query their SMART status and get their uptime,
>> which should be about the same as the SSDs'.
>> The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the
>> SCSI device ID behind the MegaRAID controller).
> 
>> We have informed DELL about this but have no information yet on the arrival 
>> of a
>> fix.
> 
>> We have lost 6 disks, in 3 different servers, in the last few weeks. Our
>> observation shows that the drives don't survive a full shutdown and restart of
>> the machine (power off then power on in iDRAC), but they may also die during a
>> single reboot (init 6) or even while the machine is running.
> 
>> Fujitsu released a corrective firmware in June 2021 but this firmware is most
>> certainly not applicable to DELL drives:
>> https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
> 
>> Regards,
>> Frederic
> 
>> Sous-direction Infrastructure and Services
>> Direction du Numérique
>> Université de Lorraine
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Permissions of the .snap directory do not inherit ACLs in 17.2.6

2023-09-01 Thread MARTEL Arnaud
Hi,

I'm facing the same situation as described in bug #57084 
(https://tracker.ceph.com/issues/57084) since I upgraded from 16.2.13 to 17.2.6.

For example:

root@faiserver:~# getfacl /mnt/ceph/default/
# file: mnt/ceph/default/
# owner: 99
# group: nogroup
# flags: -s-
user::rwx
user:s-sac-acquisition:rwx
group::rwx
group:acquisition:r-x
group:SAC_R:r-x
mask::rwx
other::---
default:user::rwx
default:user:s-sac-acquisition:rwx
default:group::rwx
default:group:acquisition:r-x
default:group:SAC_R:r-x
default:mask::rwx
default:other::---

root@faiserver:~# getfacl /mnt/ceph/default/.snap
# file: mnt/ceph/default/.snap
# owner: 99
# group: nogroup
# flags: -s-
user::rwx
group::rwx
other::r-x
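
If someone wants to compare on their side, a minimal check looks roughly like
this (the mount point and ACL entry are placeholders):

setfacl -d -m group:SAC_R:r-x /mnt/ceph/somedir
getfacl /mnt/ceph/somedir        # default ACL entries are listed here
getfacl /mnt/ceph/somedir/.snap  # in my case they are missing here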


Before creating a new bug report, could you tell me if anyone else has the same 
problem with 17.2.6?

Kind regards,
Arnaud
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs spam log with scrub starts

2023-09-01 Thread Adrien Georget

Hi,

Are there any logging parameters to mute this?
I already disabled "clog_to_monitors" on the OSDs (enabled by default), as all 
"scrub starts" logs were also being sent to the monitors (up to 60M log entries / day).
I tried setting debug_osd to 0/5 and other log/debug params, but I could 
not find anything to mute this.
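
For reference, the changes I mentioned above correspond to commands roughly
like these (osd.58 is just one example target):

ceph config set osd clog_to_monitors false
ceph tell osd.58 config set debug_osd 0/5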


Adrien

On 31/08/2023 at 19:40, David Orman wrote:

https://github.com/ceph/ceph/pull/48070 may be relevant.

I think this may have gone out in 16.2.11. I would tend to agree; personally, 
this feels quite noisy at default logging levels for production clusters.

David

On Thu, Aug 31, 2023, at 11:17, Zakhar Kirpichenko wrote:

This is happening to our 16.2.14 cluster as well. I'm not sure whether this
was happening before the upgrade to 16.2.14.

/Z

On Thu, 31 Aug 2023, 17:49 Adrien Georget, 
wrote:


Hello,

On our 16.2.14 CephFS cluster, all OSDs are spamming their logs with messages
like "log_channel(cluster) log [DBG] : xxx scrub starts".
All OSDs are affected, for different PGs. The cluster is healthy, without
any recovery ops.

For a single PG, we can see hundreds of "scrub starts" messages in less than
an hour. With 720 OSDs (8k PGs, EC 8+2), this can lead to millions of
messages per hour...
For example, with PG 3.1d57 or 3.1988:

Aug 31 16:02:09
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:09.453+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:11
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:11.446+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:12
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:12.428+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:13
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:13.456+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:14
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:14.431+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:15
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:15.475+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:21
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:21.516+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:23
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:23.555+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 scrub starts
Aug 31 16:02:24
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug
2023-08-31T14:02:24.510+ 7fdab1ec4700  0 log_channel(cluster) log
[DBG] : 3.1d57 deep-scrub starts

Aug 31 16:02:10
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug
2023-08-31T14:02:10.384+ 7f0606ce3700  0 log_channel(cluster) log
[DBG] : 3.1988 deep-scrub starts
Aug 31 16:02:11
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug
2023-08-31T14:02:11.377+ 7f0606ce3700  0 log_channel(cluster) log
[DBG] : 3.1988 scrub starts
Aug 31 16:02:13
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug
2023-08-31T14:02:13.383+ 7f0606ce3700  0 log_channel(cluster) log
[DBG] : 3.1988 scrub starts
Aug 31 16:02:15
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug
2023-08-31T14:02:15.383+ 7f0606ce3700  0 log_channel(cluster) log
[DBG] : 3.1988 deep-scrub starts
Aug 31 16:02:17
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug
2023-08-31T14:02:17.336+ 7f0606ce3700  0 log_channel(cluster) log
[DBG] : 3.1988 scrub starts
Aug 31 16:02:19
ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug
2023-08-31T14:02:19.328+ 7f0606ce3700  0 log_channel(cluster) log
[DBG] : 3.1988 scrub starts

PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED MISPLACED  UNFOUND
BYTES OMAP_BYTES*  OMAP_KEYS*  LOG DISK_LOG  STATE
STATE_STAMP  VERSION  REPORTED
UP UP_PRIMARY
ACTING ACTING_PRIMARY
LAST_SCRUB   SCRUB_STAMP  LAST_DEEP_SCRUB
DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
3.1d57 52757   0 0 00
1675960266480   0   1799 1799
active+clean 2023-08-31T14:27:24.025755+   236010'4532653
236011:8745383  [58,421,335,9,59,199,390,481,425,480] 58
[58,421,335,9,59,199,390,481,425,480]  58 231791'4531915
2023-08-29T22:41:12.266874+ 229377'4526369
2023-08-26T04:30:42.894505+ 0
3.1988 52867   0 0 00
1686038728080   0   1811 1811
active+clean 2023-08-31T14:32:13.361420+   236018'4241611
236018:9815753

[ceph-users] Re: Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation

2023-09-01 Thread Frédéric Nass
Hello, 

This message is to inform you that DELL has released new firmware for these SSD 
drives to fix the 70,000 POH issue: 

- Toshiba A3B4 for model numbers PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160:
  https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=69j5f=w12r2=poweredge-r730xd
- Toshiba A4B4 for model numbers PX02SSF010, PX02SSF020, PX02SSF040 and PX02SSB080:
  https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=31jmh=rt
- Toshiba A5B4 for model numbers PX03SNF020, PX03SNF080 and PX03SNB160:
  https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=tc8kc=rt

Based on our recent experience, this firmware brings dead SSD drives back to life 
with their data intact (after the upgrade, you may need to import the foreign 
config by pressing the 'F' key on the next start). 

Many thanks to DELL French TAMs and DELL engineering for providing this 
firmware in a short time. 

Best regards, 
Frédéric. 

- On 19 June 2023, at 10:46, Frédéric Nass wrote: 

> Hello,

> This message does not concern Ceph itself but a hardware vulnerability which 
> can
> lead to permanent loss of data on a Ceph cluster equipped with the same
> hardware in separate fault domains.

> The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD 
> drives
> of the 13G generation of DELL servers are subject to a vulnerability which
> renders them unusable after 70,000 hours of operation, i.e. approximately 7
> years and 11 months of activity.

> This topic has been discussed here:
> https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438

> The risk is all the greater since these disks may die at the same time in the
> same server leading to the loss of all data in the server.

> To date, DELL has not provided any firmware fixing this vulnerability, the
> latest firmware version being "A3B3" released on Sept. 12, 2016:
> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k

> If you have servers running these drives, check their uptime. If they are close
> to the 70,000-hour limit, replace them immediately.

> The smartctl tool does not report the uptime for these SSDs, but if you have
> HDDs in the server, you can query their SMART status and get their uptime,
> which should be about the same as the SSDs'.
> The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the
> SCSI device ID behind the MegaRAID controller).
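
A minimal sketch of what that check could look like for a few HDDs behind the
MegaRAID controller (the device IDs 0-2 and /dev/sdc are placeholders):

for i in 0 1 2; do
  smartctl -a -d megaraid,$i /dev/sdc | grep -i 'power.on'
done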

> We have informed DELL about this but have no information yet on the arrival 
> of a
> fix.

> We have lost 6 disks, in 3 different servers, in the last few weeks. Our
> observation shows that the drives don't survive a full shutdown and restart of
> the machine (power off then power on in iDRAC), but they may also die during a
> single reboot (init 6) or even while the machine is running.

> Fujitsu released a corrective firmware in June 2021 but this firmware is most
> certainly not applicable to DELL drives:
> https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf

> Regards,
> Frederic

> Sous-direction Infrastructure and Services
> Direction du Numérique
> Université de Lorraine
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io