[ceph-users] Re: Client failing to respond to capability release
Hello Frank,

On Tue, Aug 22, 2023 at 11:42 AM Frank Schilder wrote:
>
> Hi all,
>
> I have been seeing this warning all day (latest Octopus cluster):
>
> HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time
> [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release
>     mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301
>     mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877
>     mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887
>     mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695
>
> If I look at the session info from mds.1 for these clients, I see this:
>
> # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}] | sort_by(.caps) | .[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695
> {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
> {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
> {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
> {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
>
> We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment.
>
> Why and what exactly is the MDS complaining about here?

These days, you'll generally see this because the client is "quiet" and the MDS is opportunistically recalling caps to reduce future work when it needs to shrink its cache. This would be indicated by:

* The MDS is not complaining about an oversized cache.
* The session listing shows the session is quiet (its "session_cache_liveness" is near 0).

However, the MDS should respect mds_min_caps_per_client by (a) not recalling more caps than mds_min_caps_per_client and (b) not complaining that a quiet client holds fewer than mds_min_caps_per_client caps. So you may have found a bug.

The next time this happens, a `ceph tell mds.X config diff`, a `ceph tell mds.X perf dump`, and the relevant portion of the `ceph tell mds.X session ls` output will help debug this, I think.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
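For reference, a minimal capture sketch along those lines. The rank (1) and the client IDs come from Frank's output above; the output directory is arbitrary and just for illustration:

    #!/bin/bash
    # Collect MDS state useful for debugging MDS_CLIENT_LATE_RELEASE.
    MDS=1                                   # MDS rank from the warning
    OUT=/tmp/mds-late-release-$(date +%s)
    mkdir -p "$OUT"
    ceph tell mds.$MDS config diff > "$OUT/config-diff.json"
    ceph tell mds.$MDS perf dump   > "$OUT/perf-dump.json"
    # Keep only the sessions named in the health warning:
    ceph tell mds.$MDS session ls | jq '[.[] | select(
        .id == 145698301 or .id == 189511877 or
        .id == 189511887 or .id == 231250695)]' > "$OUT/sessions.json"
    echo "wrote $OUT"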
[ceph-users] Re: When to use the auth profiles simple-rados-client and profile simple-rados-client-with-blocklist?
Hello Christian,

On Tue, Aug 22, 2023 at 7:51 AM Christian Rohmann wrote:
>
> Hey ceph-users,
>
> 1) When configuring Gnocchi to use Ceph storage (see https://gnocchi.osci.io/install.html#ceph-requirements), I was wondering if one could use either of the auth profiles
> * simple-rados-client
> * simple-rados-client-with-blocklist ?
>
> Or are those for different use cases?
>
> 2) I was also wondering why the documentation says "(Monitor only)" but then states "Gives a user read-only permissions for monitor, OSD, and PG data."
>
> 3) And are those profiles really for "read-only" users? Why don't they have "read-only" in their name, like the rbd profile and its corresponding "rbd-read-only" variant?

I don't know anything about Gnocchi (except the food), but to answer the question in $SUBJECT:

https://docs.ceph.com/en/reef/rados/api/libcephsqlite/#user

You would want to use the simple-rados-client-with-blocklist profile for a libcephsqlite application.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
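In case it helps, a minimal sketch of a keyring created with that profile. The client name and pool name ("gnocchi") are illustrative assumptions; the exact OSD caps to grant are spelled out in the linked libcephsqlite docs:

    # Hypothetical client for a libcephsqlite-style application; the
    # blocklist-capable profile lets the mons blocklist a failed client.
    ceph auth get-or-create client.gnocchi \
        mon 'profile simple-rados-client-with-blocklist' \
        osd 'allow rwx pool=gnocchi'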
[ceph-users] Re: Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation
Is a secure-erase suggested after the firmware update? Sometimes manufacturers recommend that.

> On Sep 1, 2023, at 05:16, Frédéric Nass wrote:
>
> Hello,
>
> This message is to inform you that DELL has released new firmware for these SSD drives to fix the 70,000 POH issue:
>
> Toshiba A3B4 for model number(s) PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160:
> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=69j5f=w12r2=poweredge-r730xd
>
> Toshiba A4B4 for model number(s) PX02SSF010, PX02SSF020, PX02SSF040 and PX02SSB080:
> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=31jmh=rt
>
> Toshiba A5B4 for model number(s) PX03SNF020, PX03SNF080 and PX03SNB160:
> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=tc8kc=rt
>
> Based on our recent experience, this firmware brings dead SSD drives back to life with their data intact (after the upgrade, you may need to import the foreign config by pressing the 'F' key on the next start).
>
> Many thanks to the DELL French TAMs and DELL engineering for providing this firmware in a short time.
>
> Best regards,
> Frédéric.
>
> ----- On Jun 19, 2023, at 10:46, Frédéric Nass wrote:
>
>> Hello,
>>
>> This message does not concern Ceph itself, but a hardware vulnerability which can lead to permanent loss of data on a Ceph cluster equipped with the same hardware in separate fault domains.
>>
>> The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a vulnerability which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity.
>>
>> This topic has been discussed here:
>> https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438
>>
>> The risk is all the greater since these disks may die at the same time in the same server, leading to the loss of all data in the server.
>>
>> To date, DELL has not provided any firmware fixing this vulnerability, the latest firmware version being "A3B3", released on Sept. 12, 2016:
>> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
>>
>> If you have servers running these drives, check their uptime. If they are close to the 70,000 hour limit, replace them immediately.
>>
>> The smartctl tool does not report the uptime for these SSDs, but if you have HDDs in the server, you can query their SMART status and get their uptime, which should be about the same as the SSDs'.
>> The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's device number on the MegaRAID controller).
>>
>> We have informed DELL about this but have no information yet on the arrival of a fix.
>>
>> We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation shows that the drives don't survive a full shutdown and restart of the machine (power off then power on in iDRAC), but they may also die during a single reboot (init 6) or even while the machine is running.
>>
>> Fujitsu released a corrective firmware in June 2021, but this firmware is most certainly not applicable to DELL drives:
>> https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
>>
>> Regards,
>> Frederic
>>
>> Sous-direction Infrastructure and Services
>> Direction du Numérique
>> Université de Lorraine
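Along the lines of Frédéric's note above, a rough sketch for walking the physical drives behind a MegaRAID controller and printing whatever power-on-hours figure SMART exposes. The device range (0-15) and the /dev/sda pass-through device are assumptions; adjust them to your controller layout:

    #!/bin/bash
    # Print power-on time for each physical drive behind a MegaRAID
    # controller. SAS drives report "Accumulated power on time", ATA
    # drives report the Power_On_Hours attribute; the grep matches both.
    for i in $(seq 0 15); do
        out=$(smartctl -a -d megaraid,"$i" /dev/sda 2>/dev/null | grep -i 'power.on')
        [ -n "$out" ] && echo "megaraid,$i: $out"
    done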
[ceph-users] Permissions of the .snap directory do not inherit ACLs in 17.2.6
Hi,

I'm facing the same situation as described in bug #57084 (https://tracker.ceph.com/issues/57084) since I upgraded from 16.2.13 to 17.2.6. For example:

root@faiserver:~# getfacl /mnt/ceph/default/
# file: mnt/ceph/default/
# owner: 99
# group: nogroup
# flags: -s-
user::rwx
user:s-sac-acquisition:rwx
group::rwx
group:acquisition:r-x
group:SAC_R:r-x
mask::rwx
other::---
default:user::rwx
default:user:s-sac-acquisition:rwx
default:group::rwx
default:group:acquisition:r-x
default:group:SAC_R:r-x
default:mask::rwx
default:other::---

root@faiserver:~# getfacl /mnt/ceph/default/.snap
# file: mnt/ceph/default/.snap
# owner: 99
# group: nogroup
# flags: -s-
user::rwx
group::rwx
other::r-x

Before creating a new bug report, could you tell me whether anyone else sees this problem with 17.2.6?

Kind regards,
Arnaud
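For anyone who wants to check quickly, a rough reproduction sketch, assuming a CephFS mount at /mnt/ceph where default ACLs are in use; the directory and group names follow the output above:

    # Set a default ACL on a directory, then compare it with its .snap dir.
    setfacl -d -m group:acquisition:r-x /mnt/ceph/default
    getfacl /mnt/ceph/default          # default:* entries are listed
    getfacl /mnt/ceph/default/.snap    # on 17.2.6: only plain user/group/other bits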
[ceph-users] Re: OSDs spam log with scrub starts
Hi,

Are there any logging parameters to mute this? I already disabled "clog_to_monitors" on the OSDs (enabled by default), as all the "scrub starts" logs were also being sent to the monitors (up to 60M log entries per day). I tried setting debug_osd to 0/5 and other log/debug params, but I could not find anything to mute this.

Adrien

On 31/08/2023 at 19:40, David Orman wrote:
> https://github.com/ceph/ceph/pull/48070 may be relevant. I think this may have gone out in 16.2.11. I would tend to agree; personally, this feels quite noisy at default logging levels for production clusters.
>
> David
>
> On Thu, Aug 31, 2023, at 11:17, Zakhar Kirpichenko wrote:
>> This is happening to our 16.2.14 cluster as well. I'm not sure whether this was happening before the upgrade to 16.2.14.
>>
>> /Z
>>
>> On Thu, 31 Aug 2023, 17:49 Adrien Georget wrote:
>>> Hello,
>>>
>>> On our 16.2.14 CephFS cluster, all OSDs are spamming the logs with messages like "log_channel(cluster) log [DBG] : xxx scrub starts". All OSDs are affected, for different PGs. The cluster is healthy, with no recovery ops. For a single PG, we can see hundreds of "scrub starts" messages in less than an hour; with 720 OSDs (8k PGs, EC 8+2), that can add up to millions of messages per hour.
>>>
>>> For example, with PG 3.1d57 or 3.1988:
>>>
>>> Aug 31 16:02:09 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:09.453+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:11 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:11.446+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:12 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:12.428+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:13 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:13.456+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:14 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:14.431+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:15 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:15.475+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:21 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:21.516+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:23 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:23.555+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 scrub starts
>>> Aug 31 16:02:24 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-58[1310188]: debug 2023-08-31T14:02:24.510+ 7fdab1ec4700 0 log_channel(cluster) log [DBG] : 3.1d57 deep-scrub starts
>>>
>>> Aug 31 16:02:10 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug 2023-08-31T14:02:10.384+ 7f0606ce3700 0 log_channel(cluster) log [DBG] : 3.1988 deep-scrub starts
>>> Aug 31 16:02:11 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug 2023-08-31T14:02:11.377+ 7f0606ce3700 0 log_channel(cluster) log [DBG] : 3.1988 scrub starts
>>> Aug 31 16:02:13 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug 2023-08-31T14:02:13.383+ 7f0606ce3700 0 log_channel(cluster) log [DBG] : 3.1988 scrub starts
>>> Aug 31 16:02:15 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug 2023-08-31T14:02:15.383+ 7f0606ce3700 0 log_channel(cluster) log [DBG] : 3.1988 deep-scrub starts
>>> Aug 31 16:02:17 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug 2023-08-31T14:02:17.336+ 7f0606ce3700 0 log_channel(cluster) log [DBG] : 3.1988 scrub starts
>>> Aug 31 16:02:19 ceph-86cd8a68-7649-11ed-b2be-5cba2c7fdb30-osd-276[1325507]: debug 2023-08-31T14:02:19.328+ 7f0606ce3700 0 log_channel(cluster) log [DBG] : 3.1988 scrub starts
>>>
>>> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
>>> 3.1d57 52757 0 0 0 0 1675960266480 0 1799 1799 active+clean 2023-08-31T14:27:24.025755+ 236010'4532653 236011:8745383 [58,421,335,9,59,199,390,481,425,480] 58 [58,421,335,9,59,199,390,481,425,480] 58 231791'4531915 *2023-08-29T22:41:12.266874+* 229377'4526369 *2023-08-26T04:30:42.894505+* 0
>>> 3.1988 52867 0 0 0 0 1686038728080 0 1811 1811 active+clean 2023-08-31T14:32:13.361420+ 236018'4241611 236018:9815753
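For anyone looking to quiet this in the meantime, a sketch of the knobs that influence where cluster-log (clog) messages end up; whether any of them fully silences the local "scrub starts" lines likely depends on the release (see the PR David linked), so treat this as a starting point, not a definitive fix:

    # Stop OSDs forwarding clog entries to the monitors (already done in this thread):
    ceph config set osd clog_to_monitors false
    # Keep DBG-level clog entries out of the monitors' cluster log file:
    ceph config set mon mon_cluster_log_file_level info
    # Inspect what a noisy OSD currently has set:
    ceph tell osd.58 config diff | grep -i clog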
[ceph-users] Re: Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation
Hello,

This message is to inform you that DELL has released new firmware for these SSD drives to fix the 70,000 POH issue:

Toshiba A3B4 for model number(s) PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160:
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=69j5f=w12r2=poweredge-r730xd

Toshiba A4B4 for model number(s) PX02SSF010, PX02SSF020, PX02SSF040 and PX02SSB080:
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=31jmh=rt

Toshiba A5B4 for model number(s) PX03SNF020, PX03SNF080 and PX03SNB160:
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=tc8kc=rt

Based on our recent experience, this firmware brings dead SSD drives back to life with their data intact (after the upgrade, you may need to import the foreign config by pressing the 'F' key on the next start).

Many thanks to the DELL French TAMs and DELL engineering for providing this firmware in a short time.

Best regards,
Frédéric.

----- On Jun 19, 2023, at 10:46, Frédéric Nass wrote:

> Hello,
>
> This message does not concern Ceph itself, but a hardware vulnerability which can lead to permanent loss of data on a Ceph cluster equipped with the same hardware in separate fault domains.
>
> The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a vulnerability which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity.
>
> This topic has been discussed here:
> https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438
>
> The risk is all the greater since these disks may die at the same time in the same server, leading to the loss of all data in the server.
>
> To date, DELL has not provided any firmware fixing this vulnerability, the latest firmware version being "A3B3", released on Sept. 12, 2016:
> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
>
> If you have servers running these drives, check their uptime. If they are close to the 70,000 hour limit, replace them immediately.
>
> The smartctl tool does not report the uptime for these SSDs, but if you have HDDs in the server, you can query their SMART status and get their uptime, which should be about the same as the SSDs'.
> The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's device number on the MegaRAID controller).
>
> We have informed DELL about this but have no information yet on the arrival of a fix.
>
> We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation shows that the drives don't survive a full shutdown and restart of the machine (power off then power on in iDRAC), but they may also die during a single reboot (init 6) or even while the machine is running.
>
> Fujitsu released a corrective firmware in June 2021, but this firmware is most certainly not applicable to DELL drives:
> https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
>
> Regards,
> Frederic
>
> Sous-direction Infrastructure and Services
> Direction du Numérique
> Université de Lorraine