[ceph-users] Re: rgw - unable to remove some orphans
Fabio, I have manually deleted a lot of orphans using the rgw-orphan-list tool. It generates a few files; the one containing the list of orphans begins with orphan-. I used that list to generate a script which removes the objects it names. However, in my case not all objects could be removed, and I am left with around 2m objects which error out when I try to delete them. This is why I started this thread. Please note that this tool is experimental and should be used with great caution and care. Andrei

> From: "Fabio Pasetti"
> To: "Andrei Mikhailovsky" , "EDH"
> Cc: "ceph-users"
> Sent: Wednesday, 4 January, 2023 07:51:10
> Subject: Re: rgw - unable to remove some orphans
> Hi everyone,
> we've got the same issue with our Ceph cluster (release Pacific), and we saw
> it for the first time when we started to use it as offload storage for Veeam
> Backup. At the end of the offload job, when Veeam tries to delete the oldest
> files, it gives us an "unknown error" caused by the failing multi-object
> delete. At the very beginning we assumed an s3 api implementation bug in the
> multiple-delete request, but digging into the radosgw-admin commands we found
> the orphan list and saw that we had a lot (I mean hundreds of thousands) of
> orphan files. Our cluster has about 2.7TB raw capacity but is 50% full of
> orphan files.
> Is there a way to delete them in a safe way? Or is it possible to change the
> garbage collector configuration to avoid this issue with the orphan files?
> Thank you all; I was pretty scared that the issue was caused by my own fault
> during the cluster setup
> Fabio
> From: Andrei Mikhailovsky
> Date: Tuesday, 3 January 2023 at 16:35
> To: EDH
> Cc: ceph-users
> Subject: [ceph-users] Re: rgw - unable to remove some orphans
> Manuel,
> Wow, I am pretty surprised to hear that the ceph developers haven't addressed
> this issue already.
It looks like a big issue, and leaving this orphan data unresolved is costing a lot of
> money. Could someone from the developers comment on the issue and let us know
> if there is a workaround?
> Cheers
> Andrei
> - Original Message -
> > From: "EDH"
> > To: "Andrei Mikhailovsky" , "ceph-users"
> > Sent: Tuesday, 3 January, 2023 13:36:19
> > Subject: RE: rgw - unable to remove some orphans
> > The object index database gets corrupted and no one can fix it. We wiped a
> > 500TB cluster years ago and moved off ceph due to these orphan bugs.
> > After moving all our data we saw more than 100TB of on-disk data that ceph
> > was unable to delete, also known as orphans... it makes no sense.
> > We spent thousands of hours on this bug; the best solution was to replicate
> > the valid data to a new ceph cluster.
> > Some providers work around this with x4 replicas, but that makes no
> > financial sense.
> > Regards,
> > Manuel
> > CONFIDENTIALITY NOTICE:
> > This e-mail message and all attachments transmitted with it may contain
> > legally privileged, proprietary and/or confidential information intended
> > solely for the use of the addressee. If you are not the intended recipient,
> > you are hereby notified that any review, dissemination, distribution,
> > duplication or other use of this message and/or its attachments is strictly
> > prohibited. If you are not the intended recipient, please contact the
> > sender by reply e-mail and destroy all copies of the original message and
> > its attachments. Thank you.
> > Do not print unless necessary. Let's protect the environment.
> > -Original Message-
> > From: Andrei Mikhailovsky
> > Sent: Tuesday, 3 January 2023 13:46
> > To: ceph-users
> > Subject: [ceph-users] rgw - unable to remove some orphans
> > Happy New Year everyone!
> > I have a bit of an issue with removing some of the orphan objects that were
> > identified with the rgw-orphan-list tool.
Over the years rgw generated over
> > 14 million orphans, an overall waste of over 100TB, considering the data
> > stored in rgw was well under 10TB at its peak. Anyway, I have managed to
> > remove around 12m objects over the holiday season, but there are just over
> > 2m orphans which were not removed. Here is an example of one of the objects
> > taken from the orphans list file:
> > $ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL
> > Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w
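[Editor's note] The removal script Andrei describes above can be sketched in shell. This is an illustrative sketch, not the original script: the pool and list-file names are taken from the thread, while the function name and the failure log are assumptions. The loop passes each object name as a single argument, which matters because the names in this thread contain spaces.

```shell
# Hedged sketch: feed each name from an rgw-orphan-list output file to
# `rados rm`, logging failures instead of aborting the whole run.
remove_orphans() {
    pool=$1
    list=$2
    while IFS= read -r obj; do
        [ -n "$obj" ] || continue
        # quote the name as one argument: object names may contain spaces
        rados -p "$pool" rm "$obj" \
            || printf '%s\n' "$obj" >> failed-removals.log
    done < "$list"
}

# Example invocation with the file names from this thread:
# remove_orphans .rgw.buckets orphan-list-20230103105849.out
```

Names left behind in failed-removals.log are the ones to investigate further, as discussed later in this thread.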
[ceph-users] Re: rgw - unable to remove some orphans
Manuel, Wow, I am pretty surprised to hear that the ceph developers haven't addressed this issue already. It looks like a big issue, and leaving this orphan data unresolved is costing a lot of money. Could someone from the developers comment on the issue and let us know if there is a workaround? Cheers Andrei

- Original Message -
> From: "EDH"
> To: "Andrei Mikhailovsky" , "ceph-users"
> Sent: Tuesday, 3 January, 2023 13:36:19
> Subject: RE: rgw - unable to remove some orphans
> The object index database gets corrupted and no one can fix it. We wiped a
> 500TB cluster years ago and moved off ceph due to these orphan bugs.
> After moving all our data we saw more than 100TB of on-disk data that ceph
> was unable to delete, also known as orphans... it makes no sense.
>
> We spent thousands of hours on this bug; the best solution was to replicate
> the valid data to a new ceph cluster.
>
> Some providers work around this with x4 replicas, but that makes no financial
> sense.
>
> Regards,
> Manuel
>
> -Original Message-
> From: Andrei Mikhailovsky
> Sent: Tuesday, 3 January 2023 13:46
> To: ceph-users
> Subject: [ceph-users] rgw - unable to remove some orphans
>
> Happy New Year everyone!
>
> I have a bit of an issue with removing some of the orphan objects that were
> identified with the rgw-orphan-list tool.
Over the years rgw generated over 14
> million orphans, an overall waste of over 100TB, considering the data stored
> in rgw was well under 10TB at its peak. Anyway, I have managed to remove
> around 12m objects over the holiday season, but there are just over 2m
> orphans which were not removed. Here is an example of one of the objects
> taken from the orphans list file:
>
> $ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
>
> error removing .rgw.buckets>default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
> (2) No such file or directory
>
> Checking the presence of the object with the rados tool shows that the object
> is there.
>
> $ cat orphan-list-20230103105849.out | grep -a JSOaysLdFs | grep -a 92
> default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>
> $ cat rados-20230103105849.intermediate | grep -a JSOaysLdFs | grep -a 92
> default.775634629.1__shadow_SQL
> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>
> Why can't I remove it? I have around 2m objects which can't be removed. What
> can I do to remove them?
>
> Thanks
>
> Andrei
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rgw - unable to remove some orphans
Hi Boris,

The objects do exist and I can see them with ls. I can also verify that the total number of objects in the pool is over 2m more than the number of files. The total used space of all the buckets is about 10TB less than the total space used by the .rgw.buckets pool. My colleague has suggested that there are unprintable characters in the object names and that they therefore can't be removed with cli tools. Could this be the case, and if so, how do I remove them?

Cheers
Andrei

- Original Message -
> From: "Boris Behrens"
> To: "ceph-users"
> Sent: Tuesday, 3 January, 2023 12:53:29
> Subject: [ceph-users] Re: rgw - unable to remove some orphans
> Hi Andrei,
> happy new year to you too.
>
> The file might already be removed.
> You can check if the rados object is there with `rados -p ls ...`
> You can also check if the file is still in the bucket with
> `radosgw-admin bucket radoslist --bucket BUCKET`
>
> Cheers
> Boris
>
> On Tue, 3 Jan 2023 at 13:47, Andrei Mikhailovsky wrote:
>>
>> Happy New Year everyone!
>>
>> I have a bit of an issue with removing some of the orphan objects that were
>> identified with the rgw-orphan-list tool. Over the years rgw generated over
>> 14 million orphans, an overall waste of over 100TB, considering the data
>> stored in rgw was well under 10TB at its peak. Anyway, I have managed to
>> remove around 12m objects over the holiday season, but there are just over
>> 2m orphans which were not removed.
Here is an example of one of the objects
>> taken from the orphans list file:
>>
>> $ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
>>
>> error removing .rgw.buckets>default.775634629.1__shadow_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92:
>> (2) No such file or directory
>>
>> Checking the presence of the object with the rados tool shows that the
>> object is there.
>>
>> $ cat orphan-list-20230103105849.out | grep -a JSOaysLdFs | grep -a 92
>> default.775634629.1__shadow_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>>
>> $ cat rados-20230103105849.intermediate |grep -a JSOaysLdFs |grep -a 92
>> default.775634629.1__shadow_SQL
>> Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92
>>
>> Why can't I remove it? I have around 2m objects which can't be removed. What
>> can I do to remove them?
>>
>> Thanks
>>
>> Andrei
>
> --
> The "UTF-8 problems" self-help group will, as an exception, meet in the big
> hall this time.
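[Editor's note] The "unprintable characters" theory above can be checked directly against the listing files the tool produced. A hedged sketch (the function name and the temp-file workflow are illustrative): grep for bytes outside the printable class in the rados-*.intermediate file. Note that a name containing an embedded newline cannot be recovered from a line-based file at all; those cases need a script against the librados API rather than the cli tools.

```shell
# Hedged sketch: print (with line numbers) the object names in a listing
# file that contain non-printable bytes, which line-oriented shell tools
# such as xargs and plain `rados rm` invocations tend to mangle.
find_unprintable() {
    # LC_ALL=C makes [:print:] mean plain printable ASCII, so any
    # control byte or raw high-bit byte in a name is flagged.
    LC_ALL=C grep -n '[^[:print:]]' "$1"
}

# Example invocation with the file name from this thread:
# find_unprintable rados-20230103105849.intermediate
```

Names flagged this way can often still be deleted if the exact bytes are passed to `rados rm` as a single quoted argument; names with embedded newlines need the librados bindings instead.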
[ceph-users] rgw - unable to remove some orphans
Happy New Year everyone!

I have a bit of an issue with removing some of the orphan objects that were identified with the rgw-orphan-list tool. Over the years rgw generated over 14 million orphans, an overall waste of over 100TB, considering the data stored in rgw was well under 10TB at its peak. Anyway, I have managed to remove around 12m objects over the holiday season, but there are just over 2m orphans which were not removed. Here is an example of one of the objects taken from the orphans list file:

$ rados -p .rgw.buckets rm 'default.775634629.1__multipart_SQL Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92'
error removing .rgw.buckets>default.775634629.1__shadow_SQL Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92: (2) No such file or directory

Checking the presence of the object with the rados tool shows that the object is there.

$ cat orphan-list-20230103105849.out | grep -a JSOaysLdFs | grep -a 92
default.775634629.1__shadow_SQL Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92

$ cat rados-20230103105849.intermediate | grep -a JSOaysLdFs | grep -a 92
default.775634629.1__shadow_SQL Backups/ALL-POND-LIVE_backup_2021_05_26_204508_8473183.d20210526-u200953.bak.s26895803904.zip.0e6LO9b4w9H3HepY-3IW_JSOaysLdFs.1_92

Why can't I remove it? I have around 2m objects which can't be removed. What can I do to remove them?

Thanks
Andrei
[ceph-users] Re: radosgw not working after upgrade to Quincy
Thanks, Konstantin. Will try.

> From: "Konstantin Shalygin"
> To: "Andrei Mikhailovsky"
> Cc: "ceph-users"
> Sent: Thursday, 29 December, 2022 03:42:56
> Subject: Re: [ceph-users] radosgw not working after upgrade to Quincy
> Hi,
> Just try to read your logs:
>> 2022-12-29T02:07:38.953+ 7f5df868ccc0 0 WARNING: skipping unknown
>> framework: civetweb
> You are trying to use `civetweb`, which is absent from the Quincy release.
> You need to update your config and use `beast` instead.
> k
>> On 29 Dec 2022, at 09:20, Andrei Mikhailovsky wrote:
>> Please let me know how to fix the problem?
[ceph-users] radosgw not working after upgrade to Quincy
Hello everyone,

After the upgrade from Pacific to Quincy the radosgw service is no longer listening on its network port, although the process is running. I get the following in the log:

2022-12-29T02:07:35.641+ 7f5df868ccc0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process radosgw, pid 36072
2022-12-29T02:07:35.641+ 7f5df868ccc0 0 framework: civetweb
2022-12-29T02:07:35.641+ 7f5df868ccc0 0 framework conf key: port, val: 443s
2022-12-29T02:07:35.641+ 7f5df868ccc0 0 framework conf key: ssl_certificate, val: /etc/ssl/private/s3.arhont.com-bundle.pem
2022-12-29T02:07:35.641+ 7f5df868ccc0 1 radosgw_Main not setting numa affinity
2022-12-29T02:07:35.645+ 7f5df868ccc0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
2022-12-29T02:07:35.645+ 7f5df868ccc0 1 D3N datacache enabled: 0
2022-12-29T02:07:38.917+ 7f5d15ffb700 -1 sync log trim: bool {anonymous}::sanity_check_endpoints(const DoutPrefixProvider*, rgw::sal::RadosStore*):688 WARNING: Cluster is is misconfigured! Zonegroup default (default) in Realm london-ldex ( 29474c50-f1c2-4155-ac3b-a42e9d413624) has no endpoints!
2022-12-29T02:07:38.917+ 7f5d15ffb700 -1 sync log trim: bool {anonymous}::sanity_check_endpoints(const DoutPrefixProvider*, rgw::sal::RadosStore*):698 ERROR: Cluster is is misconfigured! Zone default (default) in Zonegroup default ( default) in Realm london-ldex ( 29474c50-f1c2-4155-ac3b-a42e9d413624) has no endpoints! Trimming is impossible.
2022-12-29T02:07:38.917+ 7f5d15ffb700 -1 sync log trim: RGWCoroutine* create_meta_log_trim_cr(const DoutPrefixProvider*, rgw::sal::RadosStore*, RGWHTTPManager*, int, utime_t):718 ERROR: Cluster is is misconfigured! Refusing to trim.
2022-12-29T02:07:38.917+ 7f5d15ffb700 -1 rgw rados thread: Bailing out of trim thread!
2022-12-29T02:07:38.917+ 7f5d15ffb700 0 rgw rados thread: ERROR: processor->process() returned error r=-22
2022-12-29T02:07:38.953+ 7f5df868ccc0 0 framework: beast
2022-12-29T02:07:38.953+ 7f5df868ccc0 0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
2022-12-29T02:07:38.953+ 7f5df868ccc0 0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
2022-12-29T02:07:38.953+ 7f5df868ccc0 0 WARNING: skipping unknown framework: civetweb
2022-12-29T02:07:38.977+ 7f5df868ccc0 1 mgrc service_daemon_register rgw.1371662715 metadata {arch=x86_64,ceph_release=quincy,ceph_version=ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable),ceph_version_short=17.2.5,cpu=Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz,distro=ubuntu,distro_description=Ubuntu 20.04.5 LTS,distro_version=20.04,frontend_config#0=civetweb port=443s ssl_certificate=/etc/ssl/private/s3.arhont.com-bundle.pem,frontend_type#0=civetweb,hostname=arh-ibstorage1-ib,id=radosgw1.gateway,kernel_description=#62~20.04.1-Ubuntu SMP Tue Nov 22 21:24:20 UTC 2022,kernel_version=5.15.0-56-generic,mem_swap_kb=24686688,mem_total_kb=98747048,num_handles=1,os=Linux,pid=36072,realm_id=29474c50-f1c2-4155-ac3b-a42e9d413624,realm_name=london-ldex,zone_id=default,zone_name=default,zonegroup_id=default,zonegroup_name=default}
2022-12-29T02:07:39.177+ 7f5d057fa700 0 lifecycle: RGWLC::process() failed to acquire lock on lc.29, sleep 5, try again

I had been running the radosgw service on a 15.2.x cluster previously without any issues. Last week I upgraded the cluster to 16.2.x, followed by a further upgrade to 17.2.
Here is what my configuration file looks like:

[client.radosgw1.gateway]
host = arh-ibstorage1-ib
keyring = /etc/ceph/keyring.radosgw1.gateway
log_file = /var/log/ceph/radosgw.log
rgw_dns_name = s3.arhont.com
rgw_num_rados_handles = 8
rgw_thread_pool_size = 512
rgw_cache_enabled = true
rgw cache lru size = 10
rgw enable ops log = false
rgw enable usage log = false
rgw_frontends = civetweb port=443s ssl_certificate=/etc/ssl/private/s3.arhont.com-bundle.pem

Please let me know how to fix the problem.

Many thanks
Andrei
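[Editor's note] As Konstantin points out in his reply, civetweb was removed in Quincy, so the `rgw_frontends` line above is the one that has to change. A hedged sketch of the beast equivalent, keeping the certificate path from the config above (note that civetweb's `port=443s` syntax becomes `ssl_port` plus an explicit certificate with beast; `rgw_num_rados_handles` is also deprecated in recent releases and can be dropped):

```ini
[client.radosgw1.gateway]
# beast replacement for the civetweb frontend line above
rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/ssl/private/s3.arhont.com-bundle.pem
```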
[ceph-users] cluster network change
Hello cephers,

I've got a few questions for the community to help us migrate our ceph cluster from Infiniband networking to 10G Ethernet with no or minimal downtime. Please find below the details of the cluster as well as info on what we are trying to achieve.

1. Cluster info:
Ceph version - 15.2.15
Four storage servers running mon + osd + mgr + rgw services
Ubuntu 20.04 server
Networks: Infiniband (storage network) (ipoib interface and NOT RDMA) 192.168.168.0/24; 10G Ethernet (management network) 192.168.169.0/24
Each server has an IP in each of the networks, i.e. 192.168.168.201 and 192.168.169.201 and so forth.

2. What we would like to do:
We are decommissioning our Infiniband infrastructure and moving to 10G Ethernet. We would like to move the ceph cluster from the current 192.168.168.0/24 (IB) onto 192.168.169.0/24 (eth) running on 10G Ethernet. Alternatively, we could create a new ceph vlan on 10G Ethernet and shift the 192.168.168.0/24 range to that vlan, running on 10G instead of Infiniband. We would like to make the move with no or minimal downtime, as we have critical services such as VMs running on top of ceph.

Could someone suggest the best/safest route for such a migration? Is it a plausible scenario for one server to be switched to the 192.168.169 network while the others remain on the original 192.168.168 network? From the networking viewpoint this introduces no difficulties, but would it create problems for ceph itself?

p.s. How would one go about changing the IP of a ceph server, provided the network remains the same?

Many thanks
Andrei
[ceph-users] Unable to add osds with ceph-volume
Hello everyone,

I am running ceph version 15.2.8 on Ubuntu servers. I am using bluestore osds with data on hdd and db and wal on ssd drives. Each ssd has been partitioned such that it holds 5 dbs and 5 wals. The ssds were prepared a while back, probably when I was running ceph 13.x. I have been gradually adding new osd drives as needed. Recently I tried to add more osds, which, to my surprise, failed. Previously I had no issues adding drives; however, it seems that I can no longer do that with version 15.2.x. Here is what I get:

root@arh-ibstorage4-ib /home/andrei # ceph-volume lvm prepare --bluestore --data /dev/sds --block.db /dev/ssd3/db5 --block.wal /dev/ssd3/wal5
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 6aeef34b-0724-4d20-a10b-197cab23e24d
Running command: /usr/sbin/vgcreate --force --yes ceph-1c7cef26-327a-4785-96b3-dcb1b97e8e2f /dev/sds
 stderr: WARNING: PV /dev/sdp in VG ceph-bc7587b5-0112-4097-8c9f-4442e8ea5645 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdo in VG ceph-33eda27c-53ed-493e-87a8-39e1862da809 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdn in VG ssd2 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdm in VG ssd1 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdj in VG ceph-9d8da00c-f6b9-473f-b499-fa60d74b46c5 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdi in VG ceph-1603149e-1e50-4b86-a360-1372f4243603 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdh in VG ceph-a5f4416c-8e69-4a66-a884-1d1229785acb is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sde in VG ceph-aac71121-e308-4e25-ae95-ca51bca7aaff is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdd in VG ceph-1e216580-c01b-42c5-a10f-293674a55c4c is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdc in VG ceph-630f7716-3d05-41bb-92c9-25402e9bb264 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sdb in VG ceph-a549c28d-9b06-46d5-8ba3-3bd99ff54f57 is using an old PV header, modify the VG to update.
 stderr: WARNING: PV /dev/sda in VG ceph-70943bd0-de71-4651-a73d-c61bc624755f is using an old PV header, modify the VG to update.
 stdout: Physical volume "/dev/sds" successfully created.
 stdout: Volume group "ceph-1c7cef26-327a-4785-96b3-dcb1b97e8e2f" successfully created
Running command: /usr/sbin/lvcreate --yes -l 3814911 -n osd-block-6aeef34b-0724-4d20-a10b-197cab23e24d ceph-1c7cef26-327a-4785-96b3-dcb1b97e8e2f
 stdout: Logical volume "osd-block-6aeef34b-0724-4d20-a10b-197cab23e24d" created.
--> blkid could not detect a PARTUUID for device: /dev/ssd3/wal5
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.15 --yes-i-really-mean-it
 stderr: 2021-04-28T20:05:52.290+0100 7f76bbfa9700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-04-28T20:05:52.290+0100 7f76bbfa9700 -1 AuthRegistry(0x7f76b4058e60) no keyring found at /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
 stderr: purged osd.15
--> RuntimeError: unable to use device

I have tried to find a solution but wasn't able to resolve the problem. I am sure that I've previously added new volumes using the above command.
lvdisplay shows:

  --- Logical volume ---
  LV Path                /dev/ssd3/wal5
  LV Name                wal5
  VG Name                ssd3
  LV UUID                WPQJs9-olAj-ACbU-qnEM-6ytu-aLMv-hAABYy
  LV Write Access        read/write
  LV Creation host, time arh-ibstorage4-ib, 2020-07-29 23:45:17 +0100
  LV Status              available
  # open                 0
  LV Size                1.00 GiB
  Current LE             256
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:6

  --- Logical volume ---
  LV Path                /dev/ssd3/db5
  LV Name                db5
  VG Name                ssd3
  LV UUID                FVT2Mm-a00P-eCoQ-FZAf-AulX-4q9r-PaDTC6
  LV Write Access        read/write
  LV Creation host, time arh-ibstorage4-ib, 2020-07-29 23:46:01 +0100
  LV Status              available
  # open                 0
  LV Size                177.00 GiB
  Current LE             45312
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:11

How do I resolve the errors and create the new osd?

Cheers
Andrei
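[Editor's note] The `blkid could not detect a PARTUUID` failure above is commonly reported when the db/wal logical volumes are given as `/dev/vg/lv` paths, which ceph-volume may probe as partitions; passing the `vg/lv` form instead usually avoids the PARTUUID lookup. A hedged sketch — the wrapper function and its DRY_RUN switch are illustrative, not part of ceph-volume; only the argument form is the point:

```shell
# Illustrative wrapper: note the vg/lv form (ssd3/db5, not /dev/ssd3/db5)
# for --block.db and --block.wal. DRY_RUN=1 prints the command instead of
# executing it, so the invocation can be inspected first.
prepare_osd() {
    data=$1 db=$2 wal=$3
    set -- ceph-volume lvm prepare --bluestore \
        --data "$data" --block.db "$db" --block.wal "$wal"
    if [ -n "$DRY_RUN" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# DRY_RUN=1 prepare_osd /dev/sds ssd3/db5 ssd3/wal5
```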
[ceph-users] Running ceph on multiple networks
Hello everyone,

I have a small ceph cluster consisting of 4 Ubuntu 20.04 osd servers, mainly serving rbd images to a Cloudstack kvm cluster. The ceph version is 15.2.9. The network is set up such that all storage cluster traffic runs over infiniband qdr links (ipoib). We've got the management network for our ceph servers and kvm over ethernet (192.168.1.1/24) and the ipoib storage network on 192.168.2.1/24.

We are in the process of updating our cluster with new hardware and plan to scrap the infiniband connectivity altogether, replacing it with 10gbit ethernet. We are also going to replace the kvm host servers. We were hoping to have minimal or preferably no downtime in this process.

I was wondering if we could run the ceph services (mon, osd, radosgw) concurrently over two networks after we've added the 10G ethernet? While the upgrades and migration take place, we need to have ceph running over both the current ipoib 192.168.2.1/24 and the 10G 192.168.3.1/24. Could you please help me with this?

Cheers
Andrei
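[Editor's note] The dual-network part of this question can be sketched as a ceph.conf change: `public_network` accepts a comma-separated list of subnets, so daemons can bind on whichever of the two networks their host has during the migration. This is a hedged sketch covering only the network setting (the subnets are the ones from this thread; monitor addresses are fixed in the monmap and have to be changed separately, one mon at a time):

```ini
[global]
# both the existing ipoib subnet and the new 10G subnet are accepted
# while the migration is in progress
public_network = 192.168.2.0/24, 192.168.3.0/24
```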
[ceph-users] Re: osd recommended scheduler
Thanks for your reply, Wido. Isn't CFQ deprecated in the latest kernel versions? From what I've read in the Ubuntu support pages, cfq, deadline and noop are no longer supported since 2019 / kernel version 5.3 and later. There are, however, the following schedulers: bfq, kyber, mq-deadline and none. Could someone please suggest which of these newer schedulers the ceph team recommends for HDD drives and for SSD drives? We have both drive types in use.

Many thanks
Andrei

- Original Message -
> From: "Wido den Hollander"
> To: "Andrei Mikhailovsky" , "ceph-users"
> Sent: Tuesday, 2 February, 2021 07:44:13
> Subject: Re: [ceph-users] osd recommended scheduler
> On 28/01/2021 18:09, Andrei Mikhailovsky wrote:
>>
>> Hello everyone,
>>
>> Could someone please let me know what is the recommended modern kernel disk
>> scheduler that should be used for SSD and HDD osds? The information in the
>> manuals is pretty dated and refers to schedulers which have been removed
>> from recent kernels.
>
> Afaik noop is usually the one used for flash devices.
>
> CFQ is used on HDDs most of the time as it allows for better scheduling/QoS.
>
> Wido
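[Editor's note] Translating Wido's advice to the multi-queue schedulers listed above, the common practice (not an official Ceph recommendation) is `none` for SSDs, where noop was used before, and `mq-deadline` for HDDs. As a hedged example, a udev rule keyed on the rotational flag applies this per device type; the file path and name are illustrative:

```
# /etc/udev/rules.d/60-io-scheduler.rules (illustrative path/name)
# SSDs (rotational=0): no scheduler, as with the old noop
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
# HDDs (rotational=1): mq-deadline, the multi-queue deadline successor
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
```

The current scheduler for a given disk can be checked with `cat /sys/block/sdX/queue/scheduler`.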
[ceph-users] Re: radosgw process crashes multiple times an hour
bump

- Original Message -
> From: "andrei"
> To: "Daniel Gryniewicz"
> Cc: "ceph-users"
> Sent: Thursday, 28 January, 2021 17:07:00
> Subject: [ceph-users] Re: radosgw process crashes multiple times an hour
> Hi Daniel,
>
> Thanks for your reply. I've checked the package versions on that server and
> all ceph related packages on that server are from the 15.2.8 version:
>
> ii librados2 15.2.8-1focal amd64 RADOS distributed object store client library
> ii libradosstriper1 15.2.8-1focal amd64 RADOS striping interface
> ii python3-rados 15.2.8-1focal amd64 Python 3 libraries for the Ceph librados library
> ii radosgw 15.2.8-1focal amd64 REST gateway for RADOS distributed object store
> ii librbd1 15.2.8-1focal amd64 RADOS block device client library
> ii python3-rbd 15.2.8-1focal amd64 Python 3 libraries for the Ceph librbd library
> ii ceph 15.2.8-1focal amd64 distributed storage and file system
> ii ceph-base 15.2.8-1focal amd64 common ceph daemon libraries and management tools
> ii ceph-common 15.2.8-1focal amd64 common utilities to mount and interact with a ceph storage cluster
> ii ceph-fuse 15.2.8-1focal amd64 FUSE-based client for the Ceph distributed file system
> ii ceph-mds 15.2.8-1focal amd64 metadata server for the ceph distributed file system
> ii ceph-mgr 15.2.8-1focal amd64 manager for the ceph distributed storage system
> ii ceph-mgr-cephadm 15.2.8-1focal all cephadm orchestrator module for ceph-mgr
> ii ceph-mgr-dashboard 15.2.8-1focal all dashboard module for ceph-mgr
> ii ceph-mgr-diskprediction-cloud 15.2.8-1focal all diskprediction-cloud module for ceph-mgr
> ii ceph-mgr-diskprediction-local 15.2.8-1focal all diskprediction-local module for ceph-mgr
> ii ceph-mgr-k8sevents 15.2.8-1focal all kubernetes events module for ceph-mgr
> ii ceph-mgr-modules-core 15.2.8-1focal all ceph manager modules which are always enabled
> ii ceph-mgr-rook 15.2.8-1focal all rook module for ceph-mgr
> ii ceph-mon 15.2.8-1focal amd64 monitor server for the ceph storage system
> ii ceph-osd 15.2.8-1focal amd64 OSD server for the ceph storage system
> ii cephadm 15.2.8-1focal amd64 cephadm utility to bootstrap ceph daemons with systemd and containers
> ii libcephfs2 15.2.8-1focal amd64 Ceph distributed file system client library
> ii python3-ceph 15.2.8-1focal amd64 Meta-package for python libraries for the Ceph libraries
> ii python3-ceph-argparse 15.2.8-1focal all Python 3 utility libraries for Ceph CLI
> ii python3-ceph-common 15.2.8-1focal all Python 3 utility libraries for Ceph
> ii python3-cephfs 15.2.8-1focal amd64 Python 3 libraries for the Ceph libcephfs library
>
> As this is a brand new 20.04 server, I do not see how an older version could
> have got onto it.
>
> Andrei
>
> - Original Message -
>> From: "Daniel Gryniewicz"
>> To: "ceph-users"
>> Sent: Thursday, 28 January, 2021 14:06:16
>> Subject: [ceph-users] Re: radosgw process crashes multiple times an hour
>>
>> It looks like your radosgw is using a different version of librados. In
>> the backtrace, the top useful line begins:
>>
>> librados::v14_2_0
>>
>> when it should be v15.2.0, like the ceph::buffer in the same line.
>>
>> Is there an old librados lying around that didn't get cleaned up somehow?
>>
>> Daniel
>>
>> On 1/28/21 7:27 AM, Andrei Mikhailovsky wrote:
>>> Hello,
>>>
>>> I am experiencing very frequent crashes of the radosgw service. It happens
>>> multiple times every hour. As an example, over the last 12 hours we've had
>>> 35 crashes. Has anyone experienced similar behaviour of the radosgw octopus
>>> release service? More info below:
>>>
>>> Radosgw service is running on two Ubuntu servers. I have tried upgrading the
>>> OS on one of the servers to Ubuntu 20.04 with the latest updates. The second
>>> server is still running Ubuntu 18.04. Both servic
[ceph-users] Re: osd recommended scheduler
Bump

- Original Message -
> From: "andrei"
> To: "ceph-users"
> Sent: Thursday, 28 January, 2021 17:09:23
> Subject: [ceph-users] osd recommended scheduler
> Hello everyone,
>
> Could someone please let me know what is the recommended modern kernel disk
> scheduler that should be used for SSD and HDD osds? The information in the
> manuals is pretty dated and refers to schedulers which have been removed from
> recent kernels.
>
> Thanks
>
> Andrei
[ceph-users] osd recommended scheduler
Hello everyone,

Could someone please let me know what is the recommended modern kernel disk scheduler that should be used for SSD and HDD osds? The information in the manuals is pretty dated and refers to schedulers which have been removed from recent kernels.

Thanks
Andrei
[ceph-users] Re: radosgw process crashes multiple times an hour
Hi Daniel,

Thanks for your reply. I've checked the package versions on that server and
all ceph-related packages on that server are from the 15.2.8 release:

ii librados2                      15.2.8-1focal  amd64  RADOS distributed object store client library
ii libradosstriper1               15.2.8-1focal  amd64  RADOS striping interface
ii python3-rados                  15.2.8-1focal  amd64  Python 3 libraries for the Ceph librados library
ii radosgw                        15.2.8-1focal  amd64  REST gateway for RADOS distributed object store
ii librbd1                        15.2.8-1focal  amd64  RADOS block device client library
ii python3-rbd                    15.2.8-1focal  amd64  Python 3 libraries for the Ceph librbd library
ii ceph                           15.2.8-1focal  amd64  distributed storage and file system
ii ceph-base                      15.2.8-1focal  amd64  common ceph daemon libraries and management tools
ii ceph-common                    15.2.8-1focal  amd64  common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse                      15.2.8-1focal  amd64  FUSE-based client for the Ceph distributed file system
ii ceph-mds                       15.2.8-1focal  amd64  metadata server for the ceph distributed file system
ii ceph-mgr                       15.2.8-1focal  amd64  manager for the ceph distributed storage system
ii ceph-mgr-cephadm               15.2.8-1focal  all    cephadm orchestrator module for ceph-mgr
ii ceph-mgr-dashboard             15.2.8-1focal  all    dashboard module for ceph-mgr
ii ceph-mgr-diskprediction-cloud  15.2.8-1focal  all    diskprediction-cloud module for ceph-mgr
ii ceph-mgr-diskprediction-local  15.2.8-1focal  all    diskprediction-local module for ceph-mgr
ii ceph-mgr-k8sevents             15.2.8-1focal  all    kubernetes events module for ceph-mgr
ii ceph-mgr-modules-core          15.2.8-1focal  all    ceph manager modules which are always enabled
ii ceph-mgr-rook                  15.2.8-1focal  all    rook module for ceph-mgr
ii ceph-mon                       15.2.8-1focal  amd64  monitor server for the ceph storage system
ii ceph-osd                       15.2.8-1focal  amd64  OSD server for the ceph storage system
ii cephadm                        15.2.8-1focal  amd64  cephadm utility to bootstrap ceph daemons with systemd and containers
ii libcephfs2                     15.2.8-1focal  amd64  Ceph distributed file system client library
ii python3-ceph                   15.2.8-1focal  amd64  Meta-package for python libraries for the Ceph libraries
ii python3-ceph-argparse          15.2.8-1focal  all    Python 3 utility libraries for Ceph CLI
ii python3-ceph-common            15.2.8-1focal  all    Python 3 utility libraries for Ceph
ii python3-cephfs                 15.2.8-1focal  amd64  Python 3 libraries for the Ceph libcephfs library

As this is a brand new 20.04 server I do not see how the older version could
have got onto it.

Andrei

- Original Message -
> From: "Daniel Gryniewicz"
> To: "ceph-users"
> Sent: Thursday, 28 January, 2021 14:06:16
> Subject: [ceph-users] Re: radosgw process crashes multiple times an hour
>
> It looks like your radosgw is using a different version of librados. In
> the backtrace, the top useful line begins:
>
> librados::v14_2_0
>
> when it should be v15.2.0, like the ceph::buffer in the same line.
>
> Is there an old librados lying around that didn't get cleaned up somehow?
>
> Daniel
>
> On 1/28/21 7:27 AM, Andrei Mikhailovsky wrote:
>> Hello,
>>
>> I am experiencing very frequent crashes of the radosgw service. It happens
>> multiple times every hour. As an example, over the last 12 hours we've had 35
>> crashes. Has anyone experienced similar behaviour of the radosgw octopus
>> release service? More info below:
>>
>> Radosgw service is running on two Ubuntu servers. I have tried upgrading OS on
>> one of the servers to Ubuntu 20.04 with latest updates. The second server is
>> still running Ubuntu 18.04. Both services crash occasionally, but the service
>> which is running on Ubuntu 20.04 crashes far more often it seems. The ceph
>> cluster itself is pretty old and was initially setup around 2013. The cluster
>> was updated pretty regularly with every major release. Currently, I've got
>> Octopus 15.2.8 running on all osd, mon, mgr and radosgw servers.
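Daniel's question — whether a stale librados is still being picked up — can be checked directly on the host. A hedged sketch (the binary path is the standard Ubuntu one; adjust if your packaging differs):

```shell
# Sketch: see which librados the radosgw binary resolves at runtime, and
# whether stray copies exist outside the packaged location.
check_lib() {
    bin="$1"
    if [ -x "$bin" ]; then
        ldd "$bin" | grep librados || echo "no librados in $bin"
    else
        echo "binary not found: $bin"
    fi
}
check_lib /usr/bin/radosgw

# Look for leftover copies, e.g. from an old manual install into /usr/local:
find /usr/lib /usr/local/lib -name 'librados.so*' 2>/dev/null || true
```

If `ldd` resolves `librados.so.2` to anything other than the path owned by the `librados2` package (`dpkg -S` on the resolved path tells you), that would explain the `v14_2_0` symbols in the backtrace.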
>>
>> Crash Backtrace:
>>
>> ceph crash info 2021-01-28T11:36:48.912771Z_08f80efd-c0ad-4551-88ce-905ca9cd3aa8 | less
>> {
>>     "backtrace": [
>>         "(()+0x46210) [0x7f815a49a210]",
>>         "(gsignal()+0xcb) [0x7f815a49a18b]",
>>         "(abort()+0x12b) [0x7f815a479859]",
[ceph-users] radosgw process crashes multiple times an hour
Hello,

I am experiencing very frequent crashes of the radosgw service. It happens
multiple times every hour. As an example, over the last 12 hours we've had 35
crashes. Has anyone experienced similar behaviour of the radosgw octopus
release service? More info below:

Radosgw service is running on two Ubuntu servers. I have tried upgrading OS on
one of the servers to Ubuntu 20.04 with latest updates. The second server is
still running Ubuntu 18.04. Both services crash occasionally, but the service
which is running on Ubuntu 20.04 crashes far more often it seems. The ceph
cluster itself is pretty old and was initially setup around 2013. The cluster
was updated pretty regularly with every major release. Currently, I've got
Octopus 15.2.8 running on all osd, mon, mgr and radosgw servers.

Crash Backtrace:

ceph crash info 2021-01-28T11:36:48.912771Z_08f80efd-c0ad-4551-88ce-905ca9cd3aa8 | less
{
    "backtrace": [
        "(()+0x46210) [0x7f815a49a210]",
        "(gsignal()+0xcb) [0x7f815a49a18b]",
        "(abort()+0x12b) [0x7f815a479859]",
        "(()+0x9e951) [0x7f8150ee9951]",
        "(()+0xaa47c) [0x7f8150ef547c]",
        "(()+0xaa4e7) [0x7f8150ef54e7]",
        "(()+0xaa799) [0x7f8150ef5799]",
        "(()+0x344ba) [0x7f815a1404ba]",
        "(()+0x71e04) [0x7f815a17de04]",
        "(librados::v14_2_0::IoCtx::nobjects_begin(librados::v14_2_0::ObjectCursor const&, ceph::buffer::v15_2_0::list const&)+0x5d) [0x7f815a18c7bd]",
        "(RGWSI_RADOS::Pool::List::init(std::__cxx11::basic_string, std::allocator > const&, RGWAccessListFilter*)+0x115) [0x7f815b0d9935]",
        "(RGWSI_SysObj_Core::pool_list_objects_init(rgw_pool const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, RGWSI_SysObj::Pool::ListCtx*)+0x255) [0x7f815abd7035]",
        "(RGWSI_MetaBackend_SObj::list_init(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string, std::allocator > const&)+0x206) [0x7f815b0ccfe6]",
        "(RGWMetadataHandler_GenericMetaBE::list_keys_init(std::__cxx11::basic_string, std::allocator > const&, void**)+0x41) [0x7f815ad23201]",
        "(RGWMetadataManager::list_keys_init(std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, void**)+0x71) [0x7f815ad254d1]",
        "(AsyncMetadataList::_send_request()+0x9b) [0x7f815b13c70b]",
        "(RGWAsyncRadosProcessor::handle_request(RGWAsyncRadosRequest*)+0x25) [0x7f815ae60f25]",
        "(RGWAsyncRadosProcessor::RGWWQ::_process(RGWAsyncRadosRequest*, ThreadPool::TPHandle&)+0x11) [0x7f815ae69401]",
        "(ThreadPool::worker(ThreadPool::WorkThread*)+0x5bb) [0x7f81517b072b]",
        "(ThreadPool::WorkThread::entry()+0x15) [0x7f81517b17f5]",
        "(()+0x9609) [0x7f815130d609]",
        "(clone()+0x43) [0x7f815a576293]"
    ],
    "ceph_version": "15.2.8",
    "crash_id": "2021-01-28T11:36:48.912771Z_08f80efd-c0ad-4551-88ce-905ca9cd3aa8",
    "entity_name": "client.radosgw1.gateway",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.1 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "radosgw",
    "stack_sig": "347474f09a756104ac2bb99d80e0c1fba3e9dc6f26e4ef68fe55946c103b274a",
    "timestamp": "2021-01-28T11:36:48.912771Z",
    "utsname_hostname": "arh-ibstorage1-ib",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-64-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#72-Ubuntu SMP Fri Jan 15 10:27:54 UTC 2021"
}

radosgw.log file (file names were redacted):

-25> 2021-01-28T11:36:48.794+ 7f8043fff700 1 civetweb: 0x7f814c0cf010: 176.35.173.88 - - [28/Jan/2021:11:36:48 +] "PUT /-u115134.JPG HTTP/1.1" 400 460 - -
-24> 2021-01-28T11:36:48.814+ 7f80437fe700 1 == starting new request req=0x7f80437f5780 =
-23> 2021-01-28T11:36:48.814+ 7f80437fe700 2 req 5169 0s initializing for trans_id = tx01431-006012a1d0-31197b5c-default
-22> 2021-01-28T11:36:48.814+ 7f80437fe700 2 req 5169 0s getting op 1
-21> 2021-01-28T11:36:48.814+ 7f80437fe700 2 req 5169 0s s3:put_obj verifying requester
-20> 2021-01-28T11:36:48.814+ 7f80437fe700 2 req 5169 0s s3:put_obj normalizing buckets and tenants
-19> 2021-01-28T11:36:48.814+ 7f80437fe700 2 req 5169 0s s3:put_obj init permissions
-18> 2021-01-28T11:36:48.814+ 7f80437fe700 0 req 5169 0s NOTICE: invalid dest placement: default-placement/REDUCED_REDUNDANCY
-17> 2021-01-28T11:36:48.814+ 7f80437fe700 1 op->ERRORHANDLER: err_no=-22 new_err_no=-22
-16> 2021-01-28T11:36:48.814+ 7f80437fe700 2 req 5169 0s s3:put_obj op status=0
-15> 2021-01-28T11:36:48.814+ 7f80437fe700 2 req 5169 0s s3:put_obj http status=400
-14> 2021-01-28T11:36:48.814+ 7f80437fe700 1 == req done req=0x7f80437f5780 op status=0 http_status=400 latency=0s ==
-13> 2021-01-28T11:36:48.822+ 7f80437fe700 1 civetweb: 0x7f814c0cf9e8: 176.35.173.88 - - [28/Jan/2021:11:36:48 +] "PUT /-d20201223-u115132.JPG HTTP/1.1" 400 460 - -
-12> 2021-01-28T11:36:48.878+ 7f8043fff700 1 == starting new request req=0x7f8043ff6780 =
-11>
[ceph-users] Re: frequent Monitor down
Eugen,

I've got four physical servers and I've installed a mon on all of them. I've
discussed it with Wido and a few other chaps from ceph and there is no issue
in doing it. The quorum issues would happen if you have 2 mons. If you've got
more than 2 you should be fine.

Andrei

- Original Message -
> From: "Eugen Block"
> To: "Andrei Mikhailovsky"
> Cc: "ceph-users"
> Sent: Wednesday, 28 October, 2020 20:19:15
> Subject: Re: [ceph-users] Re: frequent Monitor down
>
> Why do you have 4 MONs in the first place? That way a quorum is
> difficult to achieve, could it be related to that?
>
> Zitat von Andrei Mikhailovsky :
>
>> Yes, I have, Eugen, I see no obvious reason / error / etc. I see a
>> lot of entries relating to Compressing as well as monitor going down.
>>
>> Andrei
>>
>> - Original Message -
>>> From: "Eugen Block"
>>> To: "ceph-users"
>>> Sent: Wednesday, 28 October, 2020 11:51:20
>>> Subject: [ceph-users] Re: frequent Monitor down
>>
>>> Have you looked into syslog and mon logs?
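As background to the 3-vs-4 question: monitors elect by strict majority, so the arithmetic below (plain majority voting, which is how mon quorum works) shows why an even count buys no extra fault tolerance over the next odd number down:

```shell
# Quorum size for n monitors is floor(n/2) + 1; failures tolerated is n - quorum.
for n in 2 3 4 5; do
    q=$(( n / 2 + 1 ))
    echo "$n mons: quorum=$q, tolerates=$(( n - q )) failure(s)"
done
# prints:
# 2 mons: quorum=2, tolerates=0 failure(s)
# 3 mons: quorum=2, tolerates=1 failure(s)
# 4 mons: quorum=3, tolerates=1 failure(s)
# 5 mons: quorum=3, tolerates=2 failure(s)
```

So four mons are no less safe than three, but they survive no more failures, which is why odd counts are the usual recommendation.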
>>> >>> >>> Zitat von Andrei Mikhailovsky : >>> >>>> Hello everyone, >>>> >>>> I am having regular messages that the Monitors are going down and up: >>>> >>>> 2020-10-27T09:50:49.032431+ mon .arh-ibstorage2-ib ( mon .1) >>>> 2248 : cluster [WRN] Health check failed: 1/4 mons down, quorum >>>> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib (MON_DOWN) >>>> 2020-10-27T09:50:49.123511+ mon .arh-ibstorage2-ib ( mon .1) >>>> 2250 : cluster [WRN] overall HEALTH_WARN 23 OSD(s) experiencing >>>> BlueFS spillover; 3 large omap objects; 1/4 mons down, quorum >>>> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib; noout flag(s) >>>> set; 43 pgs not deep-scrubbed in time; 12 pgs not scrubbed in time >>>> 2020-10-27T09:50:52.735457+ mon .arh-ibstorage1-ib ( mon .0) >>>> 31287 : cluster [INF] Health check cleared: MON_DOWN (was: 1/4 mons >>>> down, quorum arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib) >>>> 2020-10-27T12:35:20.556458+ mon .arh-ibstorage2-ib ( mon .1) >>>> 2260 : cluster [WRN] Health check failed: 1/4 mons down, quorum >>>> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib (MON_DOWN) >>>> 2020-10-27T12:35:20.643282+ mon .arh-ibstorage2-ib ( mon .1) >>>> 2262 : cluster [WRN] overall HEALTH_WARN 23 OSD(s) experiencing >>>> BlueFS spillover; 3 large omap objects; 1/4 mons down, quorum >>>> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib; noout flag(s) >>>> set; 47 pgs not deep-scrubbed in time; 14 pgs not scrubbed in time >>>> >>>> >>>> This happens on a daily basis several times a day. >>>> >>>> Could you please let me know how to fix this annoying problem? >>>> >>>> I am running ceph version 15.2.4 >>>> (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable) on >>>> Ubuntu 18.04 LTS with latest updates. 
>>>> >>>> Thanks >>>> >>>> Andrei >>>> ___ >>>> ceph-users mailing list -- ceph-users@ceph.io >>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>> >>> >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: frequent Monitor down
Yes, I have, Eugen, I see no obvious reason / error / etc. I see a lot of entries relating to Compressing as well as monitor going down. Andrei - Original Message - > From: "Eugen Block" > To: "ceph-users" > Sent: Wednesday, 28 October, 2020 11:51:20 > Subject: [ceph-users] Re: frequent Monitor down > Have you looked into syslog and mon logs? > > > Zitat von Andrei Mikhailovsky : > >> Hello everyone, >> >> I am having regular messages that the Monitors are going down and up: >> >> 2020-10-27T09:50:49.032431+ mon .arh-ibstorage2-ib ( mon .1) >> 2248 : cluster [WRN] Health check failed: 1/4 mons down, quorum >> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib (MON_DOWN) >> 2020-10-27T09:50:49.123511+ mon .arh-ibstorage2-ib ( mon .1) >> 2250 : cluster [WRN] overall HEALTH_WARN 23 OSD(s) experiencing >> BlueFS spillover; 3 large omap objects; 1/4 mons down, quorum >> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib; noout flag(s) >> set; 43 pgs not deep-scrubbed in time; 12 pgs not scrubbed in time >> 2020-10-27T09:50:52.735457+ mon .arh-ibstorage1-ib ( mon .0) >> 31287 : cluster [INF] Health check cleared: MON_DOWN (was: 1/4 mons >> down, quorum arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib) >> 2020-10-27T12:35:20.556458+ mon .arh-ibstorage2-ib ( mon .1) >> 2260 : cluster [WRN] Health check failed: 1/4 mons down, quorum >> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib (MON_DOWN) >> 2020-10-27T12:35:20.643282+ mon .arh-ibstorage2-ib ( mon .1) >> 2262 : cluster [WRN] overall HEALTH_WARN 23 OSD(s) experiencing >> BlueFS spillover; 3 large omap objects; 1/4 mons down, quorum >> arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib; noout flag(s) >> set; 47 pgs not deep-scrubbed in time; 14 pgs not scrubbed in time >> >> >> This happens on a daily basis several times a day. >> >> Could you please let me know how to fix this annoying problem? 
>> >> I am running ceph version 15.2.4 >> (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable) on >> Ubuntu 18.04 LTS with latest updates. >> >> Thanks >> >> Andrei >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] frequent Monitor down
Hello everyone, I am having regular messages that the Monitors are going down and up: 2020-10-27T09:50:49.032431+ mon .arh-ibstorage2-ib ( mon .1) 2248 : cluster [WRN] Health check failed: 1/4 mons down, quorum arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib (MON_DOWN) 2020-10-27T09:50:49.123511+ mon .arh-ibstorage2-ib ( mon .1) 2250 : cluster [WRN] overall HEALTH_WARN 23 OSD(s) experiencing BlueFS spillover; 3 large omap objects; 1/4 mons down, quorum arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib; noout flag(s) set; 43 pgs not deep-scrubbed in time; 12 pgs not scrubbed in time 2020-10-27T09:50:52.735457+ mon .arh-ibstorage1-ib ( mon .0) 31287 : cluster [INF] Health check cleared: MON_DOWN (was: 1/4 mons down, quorum arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib) 2020-10-27T12:35:20.556458+ mon .arh-ibstorage2-ib ( mon .1) 2260 : cluster [WRN] Health check failed: 1/4 mons down, quorum arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib (MON_DOWN) 2020-10-27T12:35:20.643282+ mon .arh-ibstorage2-ib ( mon .1) 2262 : cluster [WRN] overall HEALTH_WARN 23 OSD(s) experiencing BlueFS spillover; 3 large omap objects; 1/4 mons down, quorum arh-ibstorage2-ib,arh-ibstorage3-ib,arh-ibstorage4-ib; noout flag(s) set; 47 pgs not deep-scrubbed in time; 14 pgs not scrubbed in time This happens on a daily basis several times a day. Could you please let me know how to fix this annoying problem? I am running ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable) on Ubuntu 18.04 LTS with latest updates. Thanks Andrei ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Infiniband support
Rafael,

We've been using ceph with IPoIB for over 7 years and it's been supported.
However, I am not too sure about native RDMA support. There have been
discussions on and off for a while now, but I've not seen much. Perhaps
others know.

Cheers

> From: "Rafael Quaglio"
> To: "ceph-users"
> Sent: Wednesday, 26 August, 2020 14:08:57
> Subject: [ceph-users] Infiniband support
>
> Hi,
> I could not see in the doc if Ceph has infiniband support. Is there someone
> using it?
> Also, is there any rdma support working natively?
> Can anyone point me where to find more information about it?
> Thanks,
> Rafael.
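On the native-RDMA question: recent releases do ship an experimental `async+rdma` messenger. A hedged ceph.conf sketch — the device name is an assumption for a Mellanox HCA (check `ibv_devices` on your hosts), and this is not production advice:

```ini
[global]
# Experimental RDMA messenger -- test thoroughly before relying on it.
ms_type = async+rdma
# Hypothetical HCA name; substitute the output of `ibv_devices`.
ms_async_rdma_device_name = mlx5_0
```

IPoIB needs none of this: the mons/OSDs simply bind to addresses on the IPoIB interface, which is why it has "just worked" for years.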
[ceph-users] Re: rgw-orphan-list
Bump - Original Message - > From: "Andrei Mikhailovsky" > To: "ceph-users" > Sent: Monday, 24 August, 2020 16:37:49 > Subject: [ceph-users] rgw-orphan-list > While continuing my saga with the rgw orphans and dozens of terabytes of > wasted > space I have used the rgw-orphan-list tool. after about 45 mins the tool has > crashed ((( > > > # time rgw-orphan-list .rgw.buckets > Pool is ".rgw.buckets". > Note: output files produced will be tagged with the current timestamp -- > 202008241403. > running 'rados ls' at Mon Aug 24 15:03:29 BST 2020 > running 'radosgw-admin bucket radoslist' at Mon Aug 24 15:26:37 BST 2020 > /usr/bin/rgw-orphan-list: line 64: 31745 Aborted (core dumped) radosgw-admin > bucket radoslist > "$rgwadmin_out" 2> "$rgwadmin_err" > An error was encountered while running 'radosgw-admin radoslist'. Aborting. > Review file './radosgw-admin-202008241403.error' for details. > *** > *** WARNING: The results are incomplete. Do not use! *** > *** > > I've got the error file with more information on the error if anyone is > interested in improving the tool. > > Cheers > > Andrei > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] rgw-orphan-list
While continuing my saga with the rgw orphans and dozens of terabytes of
wasted space I have used the rgw-orphan-list tool. After about 45 mins the
tool crashed (((

# time rgw-orphan-list .rgw.buckets
Pool is ".rgw.buckets".
Note: output files produced will be tagged with the current timestamp -- 202008241403.
running 'rados ls' at Mon Aug 24 15:03:29 BST 2020
running 'radosgw-admin bucket radoslist' at Mon Aug 24 15:26:37 BST 2020
/usr/bin/rgw-orphan-list: line 64: 31745 Aborted (core dumped) radosgw-admin bucket radoslist > "$rgwadmin_out" 2> "$rgwadmin_err"
An error was encountered while running 'radosgw-admin radoslist'. Aborting.
Review file './radosgw-admin-202008241403.error' for details.
***
*** WARNING: The results are incomplete. Do not use! ***
***

I've got the error file with more information on the error if anyone is
interested in improving the tool.

Cheers

Andrei
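For reference, the removal script mentioned at the top of this thread (feeding the `orphan-`-prefixed list back into `rados rm`) can be as simple as the sketch below. The list filename and the dry-run mechanism are illustrative assumptions, and since rgw-orphan-list is experimental, review the list before deleting anything:

```shell
# Hedged sketch: delete RADOS objects named (one per line) in an
# rgw-orphan-list output file, logging failures instead of aborting.
remove_orphans() {
    pool="$1"; list="$2"; shift 2
    # any remaining args replace the delete command (used for dry runs);
    # the default is the real, destructive "rados rm"
    [ $# -gt 0 ] || set -- rados -p "$pool" rm
    while IFS= read -r obj; do
        [ -n "$obj" ] || continue
        "$@" "$obj" || printf '%s\n' "$obj" >> failed-deletes.txt
    done < "$list"
}

# Dry run against a stand-in list: print instead of deleting.
printf 'obj-a\nobj-b\n' > orphans.txt
remove_orphans .rgw.buckets orphans.txt echo would-delete
# Real run (destructive!):
# remove_orphans .rgw.buckets orphan-list-202008241403.out
```

Objects that still fail to delete here (as described at the top of the thread) end up in `failed-deletes.txt` for further investigation.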
[ceph-users] Re: RGW unable to delete a bucket
Hi Gents, thanks for your replies. I have stopped the removal command and reinitiated it again. To my surprise it has removed the bucket pretty quickly this time around. Not too sure if it has removed all the objects or orphans. I will need to further investigate this matter. Andrei - Original Message - > From: "Matt Benjamin" > To: "EDH" > Cc: "Andrei Mikhailovsky" , "ceph-users" > , "Ivancich, Eric" > Sent: Thursday, 6 August, 2020 21:48:53 > Subject: Re: RGW unable to delete a bucket > Hi Folks, > > I don't know of a downstream issue that looks like this, and we've > upstreamed every fix for bucket listing and cleanup we have. We are > pursuing a space leak believed to arise in "radosgw-admin bucket rm > --purge-objects" but not a non-terminating listing. > > The only upstream release not planned to get a backport of orphans > list tools is Luminous. I thought backport to Octopus was already > done by the backport team? > > regards, > > Matt > > On Thu, Aug 6, 2020 at 2:40 PM EDH - Manuel Rios > wrote: >> >> You'r not the only one affected by this issue >> >> As far as i know several huge companies hitted this bug too, but private >> patches >> or tools are not public released. >> >> This is caused for the a resharding process during upload in previous >> versions. >> >> Workarround for us.: >> >> - Delete objects of the bucket at rados level. >> - Delete the index file of the bucket. >> >> Pray to god to not happen again. >> >> Still pending backporting to Nautilus of the new experimental tool to find >> orphans in RGW >> >> Maybe @Matt Benjamin can give us and ETA for get ready that tool >> backported... >> >> Regards >> >> >> >> -Mensaje original- >> De: Andrei Mikhailovsky >> Enviado el: jueves, 6 de agosto de 2020 13:55 >> Para: ceph-users >> Asunto: [ceph-users] Re: RGW unable to delete a bucket >> >> BUMP... 
>> >> >> - Original Message - >> > From: "Andrei Mikhailovsky" >> > To: "ceph-users" >> > Sent: Tuesday, 4 August, 2020 17:16:28 >> > Subject: [ceph-users] RGW unable to delete a bucket >> >> > Hi >> > >> > I am trying to delete a bucket using the following command: >> > >> > # radosgw-admin bucket rm --bucket= --purge-objects >> > >> > However, in console I get the following messages. About 100+ of those >> > messages per second. >> > >> > 2020-08-04T17:11:06.411+0100 7fe64cacf080 1 >> > RGWRados::Bucket::List::list_objects_ordered INFO ordered bucket >> > listing requires read #1 >> > >> > >> > The command has been running for about 35 days days and it still >> > hasn't finished. The size of the bucket is under 1TB for sure. Probably >> > around >> > 500GB. >> > >> > I have recently removed about a dozen of old buckets without any >> > issues. It's this particular bucket that is being very stubborn. >> > >> > Anything I can do to remove it, including it's objects and any orphans >> > it might have? >> > >> > >> > Thanks >> > >> > Andrei >> > ___ >> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an >> > email to ceph-users-le...@ceph.io >> ___ >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to >> ceph-users-le...@ceph.io >> > > > -- > > Matt Benjamin > Red Hat, Inc. > 315 West Huron Street, Suite 140A > Ann Arbor, Michigan 48103 > > http://www.redhat.com/en/technologies/storage > > tel. 734-821-5101 > fax. 734-769-8938 > cel. 734-216-5309 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW unable to delete a bucket
Hi

I am trying to delete a bucket using the following command:

# radosgw-admin bucket rm --bucket= --purge-objects

However, in console I get the following messages. About 100+ of those
messages per second.

2020-08-04T17:11:06.411+0100 7fe64cacf080 1 RGWRados::Bucket::List::list_objects_ordered INFO ordered bucket listing requires read #1

The command has been running for about 35 days and it still hasn't finished.
The size of the bucket is under 1TB for sure. Probably around 500GB.

I have recently removed about a dozen old buckets without any issues.
It's this particular bucket that is being very stubborn.

Anything I can do to remove it, including its objects and any orphans it
might have?

Thanks

Andrei
[ceph-users] Re: Module crash has failed (Octopus)
Thanks Michael. I will try it.

Cheers

Andrei

- Original Message -
> From: "Michael Fladischer"
> To: "ceph-users"
> Sent: Tuesday, 4 August, 2020 08:51:52
> Subject: [ceph-users] Re: Module crash has failed (Octopus)
>
> Hi Andrei,
>
> Am 03.08.2020 um 16:26 schrieb Andrei Mikhailovsky:
>> Module 'crash' has failed: dictionary changed size during iteration
>
> I had the same error after upgrading to Octopus and I fixed it by
> stopping all MGRs, removing /var/lib/ceph/crash/posted on all MGR nodes
> (make a backup copy on each node first, just in case). Then I restarted
> all the MGRs and the error was gone.
>
> HTH,
> Michael
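Michael's workaround, spelled out as a sketch. Only the backup-then-remove step is wrapped below; the `ceph-mgr.target` unit name is an assumption that varies by deployment, and recreating the directory empty afterwards is my addition, not part of his description:

```shell
# Assumed workflow for clearing /var/lib/ceph/crash/posted on each MGR node.
clear_posted() {
    dir="$1"
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    cp -a "$dir" "$dir.bak"   # keep a backup copy, just in case
    rm -rf "$dir"
    mkdir -p "$dir"           # recreate empty so the module can repost crashes
    echo "cleared $dir (backup: $dir.bak)"
}

# On each MGR node (stop MGRs first, restart after):
#   systemctl stop ceph-mgr.target
#   clear_posted /var/lib/ceph/crash/posted
#   systemctl start ceph-mgr.target
```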
[ceph-users] Module crash has failed (Octopus)
Hello everyone, I am running my Octopus 15.2.4 version and a couple of days ago noticed an ERROR state on the cluster with the following message: Module 'crash' has failed: dictionary changed size during iteration I couldn't find much info on this error. I've tried restarting the mon servers, which made no effect. How do I fix the error? Many thanks Andrei ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt
509] geteuid() = 0 [pid 54509] getuid()= 0 [pid 54509] geteuid() = 0 [pid 54509] socket(AF_UNIX, SOCK_STREAM, 0) = 5 [pid 54509] connect(5, {sa_family=AF_UNIX, sun_path="/var/run/libvirt/libvirt-sock"}, 110) = 0 [pid 54509] getsockname(5, {sa_family=AF_UNIX}, [128->2]) = 0 [pid 54509] futex(0x7f9137287a08, FUTEX_WAKE_PRIVATE, 2147483647) = 0 [pid 54509] fcntl(5, F_GETFD) = 0 [pid 54509] fcntl(5, F_SETFD, FD_CLOEXEC) = 0 [pid 54509] fcntl(5, F_GETFL) = 0x2 (flags O_RDWR) [pid 54509] fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 [pid 54509] futex(0x7f9137287908, FUTEX_WAKE_PRIVATE, 2147483647) = 0 [pid 54509] pipe2([6, 7], O_CLOEXEC)= 0 [pid 54509] write(4, "\0", 1) = 1 [pid 54510] <... poll resumed> )= 1 ([{fd=3, revents=POLLIN}]) [pid 54509] futex(0x7f91372879d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 [pid 54510] read(3, [pid 54509] futex(0x7f9137287920, FUTEX_WAKE_PRIVATE, 2147483647 [pid 54510] <... read resumed> "\0", 1) = 1 [pid 54509] <... futex resumed> ) = 0 [pid 54509] brk(0x55b0b2c2f000 [pid 54510] poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, -1 [pid 54509] <... brk resumed> ) = 0x55b0b2c2f000 [pid 54509] write(4, "\0", 1) = 1 [pid 54510] <... poll resumed> )= 1 ([{fd=3, revents=POLLIN}]) [pid 54509] rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [pid 54510] read(3, [pid 54509] <... rt_sigprocmask resumed> [], 8) = 0 [pid 54510] <... read resumed> "\0", 1) = 1 [pid 54509] poll([{fd=5, events=POLLOUT}, {fd=6, events=POLLIN}], 2, -1 [pid 54510] poll([{fd=3, events=POLLIN}], 1, -1 [pid 54509] <... poll resumed> )= 1 ([{fd=5, revents=POLLOUT}]) [pid 54509] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 54509] write(5, "\0\0\0\34 \0\200\206\0\0\0\1\0\0\0B\0\0\0\0\0\0\0\0\0\0\0\0", 28) = 28 [pid 54509] rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [], 8) = 0 [pid 54509] poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 2, -1 Andrei - Original Message - > From: "Alexander E. 
Patrakov" > To: "Andrei Mikhailovsky" > Cc: "dillaman" , "ceph-users" > Sent: Wednesday, 8 July, 2020 14:50:56 > Subject: Re: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt > Please strace both virsh and libvirtd (you can attach to it by pid), > and make sure that the strace command uses the "-f" switch (i.e. > traces all threads). > > On Wed, Jul 8, 2020 at 6:20 PM Andrei Mikhailovsky wrote: >> >> Jason, >> >> After adding the 1:storage to the log line of the config and restarting the >> service I do not see anything in the logs. I've started the "virsh pool-list" >> command several times and there is absolutely nothing in the logs. The >> command >> keeps hanging >> >> >> running the strace virsh pool-list shows (the last 50-100 lines or so): >> >> >> >> ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0 >> ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0 >> getuid()= 0 >> geteuid() = 0 >> openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", >> O_RDONLY) = 3 >> fstat(3, {st_mode=S_IFREG|0644, st_size=26376, ...}) = 0 >> mmap(NULL, 26376, PROT_READ, MAP_SHARED, 3, 0) = 0x7fe979933000 >> close(3)= 0 >> futex(0x7fe978505a08, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >> uname({sysname="Linux", nodename="ais-cloudhost1", ...}) = 0 >> futex(0x7fe9790bfce0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >> socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3 >> close(3)= 0 >> futex(0x7fe9790c0700, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >> pipe2([3, 4], O_NONBLOCK|O_CLOEXEC) = 0 >> mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = >> 0x7fe96ca98000 >> mprotect(0x7fe96ca99000, 8388608, PROT_READ|PROT_WRITE) = 0 >> clone(child_stack=0x7fe96d297db0, >> flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SE >> TTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fe96d2989d0, >> tls=0x7fe96d298700, child_tidptr=0x7fe96d2 >> 989d0) = 54218 >> futex(0x7fe9790bffb8, FUTEX_WAKE_PRIVATE, 
2147483647) = 0 >> futex(0x7fe9790c06f8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >> geteuid() = 0 >> access("/etc/libvirt/libvirt.conf", F_OK) = 0 >> openat(AT_FDCWD, "/etc/libvirt/libvirt.conf", O_RDONLY) = 5 >> read(5, "#\n# This can
[ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt
Jason, After adding the 1:storage to the log line of the config and restarting the service I do not see anything in the logs. I've started the "virsh pool-list" command several times and there is absolutely nothing in the logs. The command keeps hanging running the strace virsh pool-list shows (the last 50-100 lines or so): ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0 ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0 getuid()= 0 geteuid() = 0 openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=26376, ...}) = 0 mmap(NULL, 26376, PROT_READ, MAP_SHARED, 3, 0) = 0x7fe979933000 close(3)= 0 futex(0x7fe978505a08, FUTEX_WAKE_PRIVATE, 2147483647) = 0 uname({sysname="Linux", nodename="ais-cloudhost1", ...}) = 0 futex(0x7fe9790bfce0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3 close(3)= 0 futex(0x7fe9790c0700, FUTEX_WAKE_PRIVATE, 2147483647) = 0 pipe2([3, 4], O_NONBLOCK|O_CLOEXEC) = 0 mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fe96ca98000 mprotect(0x7fe96ca99000, 8388608, PROT_READ|PROT_WRITE) = 0 clone(child_stack=0x7fe96d297db0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SE TTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fe96d2989d0, tls=0x7fe96d298700, child_tidptr=0x7fe96d2 989d0) = 54218 futex(0x7fe9790bffb8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 futex(0x7fe9790c06f8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 geteuid() = 0 access("/etc/libvirt/libvirt.conf", F_OK) = 0 openat(AT_FDCWD, "/etc/libvirt/libvirt.conf", O_RDONLY) = 5 read(5, "#\n# This can be used to setup UR"..., 8192) = 547 read(5, "", 7645) = 0 close(5)= 0 getuid()= 0 geteuid() = 0 access("/proc/vz", F_OK)= -1 ENOENT (No such file or directory) geteuid() = 0 getuid()= 0 geteuid() = 0 socket(AF_UNIX, SOCK_STREAM, 0) = 5 connect(5, {sa_family=AF_UNIX, sun_path="/var/run/libvirt/libvirt-sock"}, 
110) = 0 getsockname(5, {sa_family=AF_UNIX}, [128->2]) = 0 futex(0x7fe9790c0a08, FUTEX_WAKE_PRIVATE, 2147483647) = 0 fcntl(5, F_GETFD) = 0 fcntl(5, F_SETFD, FD_CLOEXEC) = 0 fcntl(5, F_GETFL) = 0x2 (flags O_RDWR) fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK)= 0 futex(0x7fe9790c0908, FUTEX_WAKE_PRIVATE, 2147483647) = 0 pipe2([6, 7], O_CLOEXEC)= 0 write(4, "\0", 1) = 1 futex(0x7fe9790bfb60, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x7fe9790c09d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 futex(0x7fe9790c0920, FUTEX_WAKE_PRIVATE, 2147483647) = 0 brk(0x5598ffebb000) = 0x5598ffebb000 write(4, "\0", 1) = 1 futex(0x7fe9790bfb60, FUTEX_WAKE_PRIVATE, 1) = 1 rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [], 8) = 0 poll([{fd=5, events=POLLOUT}, {fd=6, events=POLLIN}], 2, -1) = 1 ([{fd=5, revents=POLLOUT}]) rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 write(5, "\0\0\0\34 \0\200\206\0\0\0\1\0\0\0B\0\0\0\0\0\0\0\0\0\0\0\0", 28) = 28 rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [], 8) = 0 poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 2, -1 It get's stuck at the last line and there is nothing happening. Andrei - Original Message - > From: "Jason Dillaman" > To: "Andrei Mikhailovsky" > Cc: "ceph-users" > Sent: Tuesday, 7 July, 2020 23:33:03 > Subject: Re: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt > On Tue, Jul 7, 2020 at 5:14 PM Andrei Mikhailovsky wrote: >> >> Hi Jason, >> The extract from the debug log file is given below in the first message. It >> just >> repeats those lines every so often. >> >> I can't find anything else. > > I would expect lots of debug logs from the storage backend. Do you > have a "1:storage" entry in your libvirtd.conf? 
> >> Cheers >> - Original Message - >> > From: "Jason Dillaman" >> > To: "Andrei Mikhailovsky" >> > Cc: "ceph-users" >> > Sent: Tuesday, 7 July, 2020 16:33:25 >> > Subject: Re: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt >> >> > On Tue, Jul 7, 2020 at 11:07 AM Andrei Mikhailovsky >> > wrote: >> >> >> >> I've left the virsh pool-list command 'hang' for a whil
[ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt
Jason, this is what I currently have: log_filters="1:libvirt 1:util 1:qemu" log_outputs="1:file:/var/log/libvirt/libvirtd.log" I will add the 1:storage and send more logs. Thanks for trying to help. Andrei - Original Message - > From: "Jason Dillaman" > To: "Andrei Mikhailovsky" > Cc: "ceph-users" > Sent: Tuesday, 7 July, 2020 23:33:03 > Subject: Re: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt > On Tue, Jul 7, 2020 at 5:14 PM Andrei Mikhailovsky wrote: >> >> Hi Jason, >> The extract from the debug log file is given below in the first message. It >> just >> repeats those lines every so often. >> >> I can't find anything else. > > I would expect lots of debug logs from the storage backend. Do you > have a "1:storage" entry in your libvirtd.conf? > >> Cheers >> - Original Message - >> > From: "Jason Dillaman" >> > To: "Andrei Mikhailovsky" >> > Cc: "ceph-users" >> > Sent: Tuesday, 7 July, 2020 16:33:25 >> > Subject: Re: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt >> >> > On Tue, Jul 7, 2020 at 11:07 AM Andrei Mikhailovsky >> > wrote: >> >> >> >> I've left the virsh pool-list command 'hang' for a while and it did >> >> eventually >> >> get the results back. In about 4 hours! >> > >> > Perhaps enable the debug logging of libvirt [1] to determine what it's >> > spending its time on? >> > >> >> root@ais-cloudhost1:/home/andrei# time virsh pool-list >> >> Name State Autostart >> >> --- >> >> 12ca033f-e673-4060-8db9-909d79650f39 active no >> >> bcc753c6-e47a-3b7c-904a-fcc1d0a594c5 active no >> >> cf771bc7-8998-354d-8e10-5564585a3c20 active no >> >> d8d5ec36-3cb0-39af-8fc6-084a4abd5d28 active no >> >> >> >> >> >> real234m23.877s >> >> user0m0.351s >> >> sys 0m0.506s >> >> >> >> >> >> >> >> The second attempt was a mere 2 hours with a bit. 
>> >> >> >> >> >> root@ais-cloudhost1:/home/andrei# time virsh pool-list >> >> Name State Autostart >> >> --- >> >> 12ca033f-e673-4060-8db9-909d79650f39 active no >> >> bcc753c6-e47a-3b7c-904a-fcc1d0a594c5 active no >> >> cf771bc7-8998-354d-8e10-5564585a3c20 active no >> >> d8d5ec36-3cb0-39af-8fc6-084a4abd5d28 active no >> >> >> >> >> >> real148m54.763s >> >> user0m0.241s >> >> sys 0m0.304s >> >> >> >> >> >> >> >> Am I the only person having these issues with libvirt and Octopus release? >> >> >> >> Cheers >> >> >> >> - Original Message - >> >> > From: "Andrei Mikhailovsky" >> >> > To: "ceph-users" >> >> > Sent: Monday, 6 July, 2020 19:27:25 >> >> > Subject: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt >> >> >> >> > A quick update. >> >> > >> >> > I have done a fresh install of the CloudStack host server running >> >> > Ubuntu 18.04 >> >> > with the latest updates. I've installed ceph 12.x and connected it to >> >> > Cloudstack which uses kvm/libvirt/ceph/rbd. The rest of the ceph >> >> > services >> >> > (mon,mgr,osd,etc) are all running 15.2.3. Works like a charm. >> >> > >> >> > As soon as I've updated the host server to version 15.2.3, Libvirt >> >> > stopped >> >> > working. It just hangs without doing much it seems. Common commands >> >> > like 'virsh >> >> > pool-list' or 'virsh list' are just hanging. I've strace the process >> >> > and it >> >> > just doesn't show any activity. >> >> > >> >> > >> >> > 2020-07-06 18:18:36.930+: 3273: info : >> >> > virEventPollUpdateTimeout:265 : >> >> > EVENT_POLL_UPDATE_TIMEOUT: timer=993 frequen >> >> > cy=5000 >> >> > 2020-07-06 18:18:36.930+: 3273: debug : >> >> > virEventPollUpdateTimeout:282 : Set >> >> > t
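For reference, the full debug-logging stanza being discussed would look roughly like this in /etc/libvirt/libvirtd.conf. This is a sketch only: the file path and the existing filters are taken from the message above, and the 1:storage entry is Jason's suggested addition.

```ini
; /etc/libvirt/libvirtd.conf -- illustrative debug-logging setup
; 1:storage added so the storage driver backends log as well
log_filters="1:libvirt 1:util 1:qemu 1:storage"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```

libvirtd needs a restart (e.g. systemctl restart libvirtd) before new filters take effect.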
[ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt
Hi Jason, The extract from the debug log file is given below in the first message. It just repeats those lines every so often. I can't find anything else. Cheers - Original Message - > From: "Jason Dillaman" > To: "Andrei Mikhailovsky" > Cc: "ceph-users" > Sent: Tuesday, 7 July, 2020 16:33:25 > Subject: Re: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt > On Tue, Jul 7, 2020 at 11:07 AM Andrei Mikhailovsky wrote: >> >> I've left the virsh pool-list command 'hang' for a while and it did >> eventually >> get the results back. In about 4 hours! > > Perhaps enable the debug logging of libvirt [1] to determine what it's > spending its time on? > >> root@ais-cloudhost1:/home/andrei# time virsh pool-list >> Name State Autostart >> --- >> 12ca033f-e673-4060-8db9-909d79650f39 active no >> bcc753c6-e47a-3b7c-904a-fcc1d0a594c5 active no >> cf771bc7-8998-354d-8e10-5564585a3c20 active no >> d8d5ec36-3cb0-39af-8fc6-084a4abd5d28 active no >> >> >> real234m23.877s >> user0m0.351s >> sys 0m0.506s >> >> >> >> The second attempt was a mere 2 hours with a bit. >> >> >> root@ais-cloudhost1:/home/andrei# time virsh pool-list >> Name State Autostart >> --- >> 12ca033f-e673-4060-8db9-909d79650f39 active no >> bcc753c6-e47a-3b7c-904a-fcc1d0a594c5 active no >> cf771bc7-8998-354d-8e10-5564585a3c20 active no >> d8d5ec36-3cb0-39af-8fc6-084a4abd5d28 active no >> >> >> real148m54.763s >> user0m0.241s >> sys 0m0.304s >> >> >> >> Am I the only person having these issues with libvirt and Octopus release? >> >> Cheers >> >> - Original Message - >> > From: "Andrei Mikhailovsky" >> > To: "ceph-users" >> > Sent: Monday, 6 July, 2020 19:27:25 >> > Subject: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt >> >> > A quick update. >> > >> > I have done a fresh install of the CloudStack host server running Ubuntu >> > 18.04 >> > with the latest updates. I've installed ceph 12.x and connected it to >> > Cloudstack which uses kvm/libvirt/ceph/rbd. 
The rest of the ceph services >> > (mon,mgr,osd,etc) are all running 15.2.3. Works like a charm. >> > >> > As soon as I've updated the host server to version 15.2.3, Libvirt stopped >> > working. It just hangs without doing much it seems. Common commands like >> > 'virsh >> > pool-list' or 'virsh list' are just hanging. I've strace the process and it >> > just doesn't show any activity. >> > >> > >> > 2020-07-06 18:18:36.930+: 3273: info : virEventPollUpdateTimeout:265 : >> > EVENT_POLL_UPDATE_TIMEOUT: timer=993 frequen >> > cy=5000 >> > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollUpdateTimeout:282 >> > : Set >> > timer freq=5000 expires=1594059521930 >> > 2020-07-06 18:18:36.930+: 3273: debug : >> > virEventPollInterruptLocked:722 : >> > Skip interrupt, 1 140123172218240 >> > 2020-07-06 18:18:36.930+: 3273: info : virEventPollUpdateHandle:152 : >> > EVENT_POLL_UPDATE_HANDLE: watch=1004 events=1 >> > 2020-07-06 18:18:36.930+: 3273: debug : >> > virEventPollInterruptLocked:722 : >> > Skip interrupt, 1 140123172218240 >> > 2020-07-06 18:18:36.930+: 3273: debug : >> > virEventPollCleanupTimeouts:525 : >> > Cleanup 8 >> > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollCleanupHandles:574 >> > : >> > Cleanup 22 >> > 2020-07-06 18:18:36.930+: 3273: debug : virEventRunDefaultImpl:324 : >> > running >> > default event implementation >> > 2020-07-06 18:18:36.930+: 3273: debug : >> > virEventPollCleanupTimeouts:525 : >> > Cleanup 8 >> > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollCleanupHandles:574 >> > : >> > Cleanup 22 >> > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : >> > Prepare n=0 w=1, f=5 e=1 d=0 >> > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : >> > Prepare n=1 w=2, f=7 e=1 d=0 >> > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : >> > Prepare n=2 w=3, f=10 e=1 d=0 >> > 2020-07-06 18:18:36.931+: 327
[ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt
I've left the virsh pool-list command 'hang' for a while and it did eventually get the results back. In about 4 hours!

root@ais-cloudhost1:/home/andrei# time virsh pool-list
 Name                                   State    Autostart
-----------------------------------------------------------
 12ca033f-e673-4060-8db9-909d79650f39   active   no
 bcc753c6-e47a-3b7c-904a-fcc1d0a594c5   active   no
 cf771bc7-8998-354d-8e10-5564585a3c20   active   no
 d8d5ec36-3cb0-39af-8fc6-084a4abd5d28   active   no

real    234m23.877s
user    0m0.351s
sys     0m0.506s

The second attempt took a bit over 2 hours.

root@ais-cloudhost1:/home/andrei# time virsh pool-list
 Name                                   State    Autostart
-----------------------------------------------------------
 12ca033f-e673-4060-8db9-909d79650f39   active   no
 bcc753c6-e47a-3b7c-904a-fcc1d0a594c5   active   no
 cf771bc7-8998-354d-8e10-5564585a3c20   active   no
 d8d5ec36-3cb0-39af-8fc6-084a4abd5d28   active   no

real    148m54.763s
user    0m0.241s
sys     0m0.304s

Am I the only person having these issues with libvirt and the Octopus release?

Cheers

- Original Message -
> From: "Andrei Mikhailovsky"
> To: "ceph-users"
> Sent: Monday, 6 July, 2020 19:27:25
> Subject: [ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt

> A quick update.
>
> I have done a fresh install of the CloudStack host server running Ubuntu 18.04
> with the latest updates. I've installed ceph 12.x and connected it to
> Cloudstack which uses kvm/libvirt/ceph/rbd. The rest of the ceph services
> (mon,mgr,osd,etc) are all running 15.2.3. Works like a charm.
>
> As soon as I've updated the host server to version 15.2.3, Libvirt stopped
> working. It just hangs without doing much it seems. Common commands like 'virsh
> pool-list' or 'virsh list' are just hanging. I've strace the process and it
> just doesn't show any activity.
> > > 2020-07-06 18:18:36.930+: 3273: info : virEventPollUpdateTimeout:265 : > EVENT_POLL_UPDATE_TIMEOUT: timer=993 frequen > cy=5000 > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollUpdateTimeout:282 : > Set > timer freq=5000 expires=1594059521930 > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollInterruptLocked:722 : > Skip interrupt, 1 140123172218240 > 2020-07-06 18:18:36.930+: 3273: info : virEventPollUpdateHandle:152 : > EVENT_POLL_UPDATE_HANDLE: watch=1004 events=1 > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollInterruptLocked:722 : > Skip interrupt, 1 140123172218240 > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollCleanupTimeouts:525 : > Cleanup 8 > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollCleanupHandles:574 : > Cleanup 22 > 2020-07-06 18:18:36.930+: 3273: debug : virEventRunDefaultImpl:324 : > running > default event implementation > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollCleanupTimeouts:525 : > Cleanup 8 > 2020-07-06 18:18:36.930+: 3273: debug : virEventPollCleanupHandles:574 : > Cleanup 22 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=0 w=1, f=5 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=1 w=2, f=7 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=2 w=3, f=10 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=3 w=4, f=11 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=4 w=5, f=12 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=5 w=6, f=13 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=6 w=7, f=14 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=7 w=8, f=15 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=8 w=9, f=16 e=1 d=0 > 
2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=9 w=10, f=17 e=1 d=0 > 2020-07-06 18:18:36.931+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=10 w=11, f=18 e=1 d=0 > > 2020-07-06 18:18:36.932+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=11 w=12, f=19 e=0 d=0 > 2020-07-06 18:18:36.932+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=12 w=13, f=19 e=1 d=0 > 2020-07-06 18:18:36.932+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=13 w=14, f=24 e=1 d=0 > 2020-07-06 18:18:36.932+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=14 w=15, f=25 e=1 d=0 > 2020-07-06 18:18:36.932+: 3273: debug : virEventPollMakePollFDs:401 : > Prepare n=15 w=19, f=26 e=1 d=0 > 2020-07-06 18:18:36.932+: 3273: debug :
[ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt
A quick update.

I have done a fresh install of the CloudStack host server running Ubuntu 18.04 with the latest updates. I've installed ceph 12.x and connected it to Cloudstack, which uses kvm/libvirt/ceph/rbd. The rest of the ceph services (mon,mgr,osd,etc) are all running 15.2.3. Works like a charm.

As soon as I've updated the host server to version 15.2.3, libvirt stopped working. It just hangs without doing much, it seems. Common commands like 'virsh pool-list' or 'virsh list' are just hanging. I've straced the process and it just doesn't show any activity.

2020-07-06 18:18:36.930+0000: 3273: info : virEventPollUpdateTimeout:265 : EVENT_POLL_UPDATE_TIMEOUT: timer=993 frequency=5000
2020-07-06 18:18:36.930+0000: 3273: debug : virEventPollUpdateTimeout:282 : Set timer freq=5000 expires=1594059521930
2020-07-06 18:18:36.930+0000: 3273: debug : virEventPollInterruptLocked:722 : Skip interrupt, 1 140123172218240
2020-07-06 18:18:36.930+0000: 3273: info : virEventPollUpdateHandle:152 : EVENT_POLL_UPDATE_HANDLE: watch=1004 events=1
2020-07-06 18:18:36.930+0000: 3273: debug : virEventPollInterruptLocked:722 : Skip interrupt, 1 140123172218240
2020-07-06 18:18:36.930+0000: 3273: debug : virEventPollCleanupTimeouts:525 : Cleanup 8
2020-07-06 18:18:36.930+0000: 3273: debug : virEventPollCleanupHandles:574 : Cleanup 22
2020-07-06 18:18:36.930+0000: 3273: debug : virEventRunDefaultImpl:324 : running default event implementation
2020-07-06 18:18:36.930+0000: 3273: debug : virEventPollCleanupTimeouts:525 : Cleanup 8
2020-07-06 18:18:36.930+0000: 3273: debug : virEventPollCleanupHandles:574 : Cleanup 22
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=0 w=1, f=5 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=1 w=2, f=7 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=2 w=3, f=10 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=3 w=4, f=11 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=4 w=5, f=12 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=5 w=6, f=13 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=6 w=7, f=14 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=7 w=8, f=15 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=8 w=9, f=16 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=9 w=10, f=17 e=1 d=0
2020-07-06 18:18:36.931+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=10 w=11, f=18 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=11 w=12, f=19 e=0 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=12 w=13, f=19 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=13 w=14, f=24 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=14 w=15, f=25 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=15 w=19, f=26 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=16 w=82, f=79 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=17 w=944, f=22 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=18 w=993, f=82 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=19 w=1001, f=30 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=20 w=1002, f=33 e=1 d=0
2020-07-06 18:18:36.932+0000: 3273: debug : virEventPollMakePollFDs:401 : Prepare n=21 w=1004, f=83 e=1 d=0
2020-07-06 18:18:36.933+0000: 3273: debug : virEventPollCalculateTimeout:338 : Calculate expiry of 8 timers
2020-07-06 18:18:36.933+0000: 3273: debug : virEventPollCalculateTimeout:346 : Got a timeout scheduled for 1594059521930
2020-07-06 18:18:36.933+0000: 3273: debug : virEventPollCalculateTimeout:359 : Schedule timeout then=1594059521930 now=1594059516933
2020-07-06 18:18:36.933+0000: 3273: debug : virEventPollCalculateTimeout:369 : Timeout at 1594059521930 due in 4997 ms
2020-07-06 18:18:36.933+0000: 3273: info : virEventPollRunOnce:640 : EVENT_POLL_RUN: nhandles=21 timeout=4997

Ceph itself seems to work, i.e. I can execute ceph -s / rbd -p ls -l, etc., and that produces output. It's just libvirt that seems to be unhappy.

The version of libvirt installed is: libvirt-bin 4.0.0-1ubuntu8.1

Any idea how I can make ceph Octopus play nicely with libvirt?

Cheers

Andrei

- Original Message -
> From: "Andrei Mikhailovsky"
> To: "ceph-users"
[ceph-users] Re: Advice on SSD choices for WAL/DB?
Thanks for the information, Burkhard.

My current setup shows a bunch of these warnings (24 osds with spillover out of the 36 which have wal/db on the ssd):

osd.36 spilled over 1.9 GiB metadata from 'db' device (7.2 GiB used of 30 GiB) to slow device
osd.37 spilled over 13 GiB metadata from 'db' device (4.2 GiB used of 30 GiB) to slow device
osd.44 spilled over 26 GiB metadata from 'db' device (13 GiB used of 30 GiB) to slow device
osd.45 spilled over 33 GiB metadata from 'db' device (10 GiB used of 30 GiB) to slow device
osd.46 spilled over 37 GiB metadata from 'db' device (8.8 GiB used of 30 GiB) to slow device

From the above, for example, osd.36 is a 3TB disk and osd.45 is a 10TB disk. I was hoping to address those spillovers with the upgrade too, if it means increasing the ssd space. Currently we've got a WAL of 1GB and the DB is 30GB. Am I right in understanding that in the case of osd.46 the DB size should be at least 67GB to stop the spillover (30 + 37)?

Cheers

Andrei

- Original Message -
> From: "Burkhard Linke"
> To: "ceph-users"
> Sent: Wednesday, 1 July, 2020 13:09:34
> Subject: [ceph-users] Re: Advice on SSD choices for WAL/DB?
> Hi,
>
> On 7/1/20 1:57 PM, Andrei Mikhailovsky wrote:
>> Hello,
>>
>> We are planning to perform a small upgrade to our cluster and slowly start
>> adding 12TB SATA HDD drives. We need to accommodate for additional SSD WAL/DB
>> requirements as well. Currently we are considering the following:
>>
>> HDD Drives - Seagate EXOS 12TB
>> SSD Drives for WAL/DB - Intel D3 S4510 960GB or Intel D3 S4610 960GB
>>
>> Our cluster isn't hosting any IO intensive DBs nor IO hungry VMs such as
>> Exchange, MSSQL, etc.
>>
>> From the documentation that I've read the recommended size for DB is between 1%
>> and 4% of the size of the osd. Would 2% figure be sufficient enough (so around
>> 240GB DB size for each 12TB osd?)
>
>
> The documentation is wrong.
Rocksdb uses different levels to store data, > and need to store each level either completely in the DB partition or on > the data partition. There have been a number of mail threads about the > correct sizing. > > > In your case the best size would be 30GB for the DB part + the WAL size > (usually 2 GB). For compaction and other actions the ideal DB size needs > to be doubled, so you end up with 62GB per OSD. Larger DB partitions are > a waste of capacity, unless it can hold the next level (300GB per OSD). > > > If you have spare capacity on the SSD (>100GB) you can either leave it > untouched or create a small SSD based OSD for small pools that require a > lower latency, e.g. a small extra fast pool for RBD or the RGW > configuration pools. > >> >> Also, from your experience, which is a better model for the SSD DB/WAL? Would >> Intel S4510 be sufficient enough for our purpose or would the S4610 be a much >> better choice? Are there any other cost effective performance to consider >> instead of the above models? > > The SSD model should support fast sync writes, similar to the known > requirements for filestore journal SSDs. If your selected model is a > good fit according to the test methods, then it is probably also a good > choice for bluestore DBs. > > > Since not all data is written to the bluestore DB (no full data journal > in contrast to filestore), the amount of data written to the SSD is > probably lower. The DWPD requirements might be lower. To be on the safe > side, use the better model (higher DWPD / "write intensive") if possible. > > Regards, > > Burkhard > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
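Burkhard's sizing rule above can be sketched numerically. The helper below is my own illustration (the function name and defaults are made up), following the rule of thumb from the reply: the RocksDB level must fit entirely (30 GiB here), doubled for compaction headroom, plus the WAL.

```python
def bluestore_db_partition_gib(db_level_gib=30, wal_gib=2, headroom_factor=2):
    """Rough DB/WAL partition size per OSD: the DB level size, multiplied
    by a headroom factor for compaction, plus the WAL size."""
    return db_level_gib * headroom_factor + wal_gib

# The 62 GiB per OSD figure from the reply above:
print(bluestore_db_partition_gib())
# And the next-level case mentioned (300 GiB per OSD DB level):
print(bluestore_db_partition_gib(db_level_gib=300))
```

Anything between these step sizes is, per the reply, effectively wasted capacity, which is why "2% of the OSD" style sizing can mislead.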
[ceph-users] Advice on SSD choices for WAL/DB?
Hello,

We are planning to perform a small upgrade to our cluster and slowly start adding 12TB SATA HDD drives. We need to accommodate for additional SSD WAL/DB requirements as well. Currently we are considering the following:

HDD Drives - Seagate EXOS 12TB
SSD Drives for WAL/DB - Intel D3 S4510 960GB or Intel D3 S4610 960GB

Our cluster isn't hosting any IO intensive DBs nor IO hungry VMs such as Exchange, MSSQL, etc.

From the documentation that I've read, the recommended size for the DB is between 1% and 4% of the size of the osd. Would a 2% figure be sufficient (so around a 240GB DB size for each 12TB osd)?

Also, from your experience, which is the better model for the SSD DB/WAL? Would the Intel S4510 be sufficient for our purpose or would the S4610 be a much better choice? Are there any other cost-effective performers to consider instead of the above models?

The same question for the HDD. Any other drives we should consider instead of the Seagate EXOS series?

Thanks for your help and suggestions.

Andrei
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Octopus upgrade breaks Ubuntu 18.04 libvirt
Hello,

I've upgraded ceph to Octopus (15.2.3 from the repo) on one of the Ubuntu 18.04 host servers. The update caused a problem with libvirtd, which hangs when it tries to access the storage pools. The problem doesn't exist on Nautilus. The libvirtd process simply hangs; nothing seems to happen. The log file for libvirtd shows:

2020-06-29 19:30:51.556+0000: 12040: debug : virNetlinkEventCallback:707 : dispatching to max 0 clients, called from event watch 11
2020-06-29 19:30:51.556+0000: 12040: debug : virNetlinkEventCallback:720 : event not handled.
2020-06-29 19:30:51.556+0000: 12040: debug : virNetlinkEventCallback:707 : dispatching to max 0 clients, called from event watch 11
2020-06-29 19:30:51.556+0000: 12040: debug : virNetlinkEventCallback:720 : event not handled.
2020-06-29 19:30:51.557+0000: 12040: debug : virNetlinkEventCallback:707 : dispatching to max 0 clients, called from event watch 11
2020-06-29 19:30:51.557+0000: 12040: debug : virNetlinkEventCallback:720 : event not handled.
2020-06-29 19:30:51.591+0000: 12040: debug : virNetlinkEventCallback:707 : dispatching to max 0 clients, called from event watch 11
2020-06-29 19:30:51.591+0000: 12040: debug : virNetlinkEventCallback:720 : event not handled.

Running strace on the libvirtd process shows:

root@ais-cloudhost1:/home/andrei# strace -p 12040
strace: Process 12040 attached
restart_syscall(<... resuming interrupted poll ...>

Nothing happens after that point. The same host server can access the ceph cluster and the pools by running ceph -s or rbd -p ls -l commands, for example.

Need some help to get the host servers working again with Octopus.

Cheers
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Octopus missing rgw-orphan-list tool
Hello,

I have been struggling a lot with radosgw bucket space wastage; currently about 2/3 of the utilised space is wasted and unaccounted for. I've tried to use the tools to find the orphan objects, but these were running in a loop for weeks without producing any results. Wido and a few others pointed out that this function is broken and was deprecated, and that rgw-orphan-list should be used instead. I have upgraded to Octopus and I have been following the documentation at https://docs.ceph.com/docs/master/radosgw/orphans/ . However, the ceph and radosgw packages for Ubuntu 18.04 do not seem to have this tool. The same applies to the bucket radoslist option of the radosgw-admin command.

root@arh-ibstorage1-ib:~# radosgw-admin bucket radoslist
ERROR: Unrecognized argument: 'radoslist'
Expected one of the following:
  check
  chown
  limit
  link
  list
  reshard
  rewrite
  rm
  stats
  sync
  unlink

root@arh-ibstorage1-ib:~# dpkg -l *rados\*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name              Version         Architecture  Description
+++-===---
un  librados                                        (no description available)
ii  librados2         15.2.3-1bionic  amd64         RADOS distributed object store client library
ii  libradosstriper1  15.2.3-1bionic  amd64         RADOS striping interface
ii  python3-rados     15.2.3-1bionic  amd64         Python 3 libraries for the Ceph librados library
ii  radosgw           15.2.3-1bionic  amd64         REST gateway for RADOS distributed object store

I am running Ubuntu 18.04 with version 15.2.3 of ceph and radosgw. Please suggest what I should do to remove the wasted space that radosgw is creating.

I've calculated the wasted space by adding up the reported usage of all the buckets and checking it against the output of the rados df command. The buckets are using around 11TB. rados df reports 68TB of usage with a replica count of 2. Rather alarming!

Thanks for your help
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
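The wasted-space estimate in the message above comes down to simple arithmetic: raw usage reported by rados df, minus bucket usage multiplied by the replica count. A small sketch with the numbers from the message (the helper name is mine, not a Ceph API):

```python
def orphaned_raw_tb(raw_used_tb, bucket_logical_tb, replicas):
    """Raw capacity not accounted for by bucket stats:
    `rados df` raw usage minus bucket usage times replica count."""
    return raw_used_tb - bucket_logical_tb * replicas

wasted = orphaned_raw_tb(raw_used_tb=68, bucket_logical_tb=11, replicas=2)
print(wasted)                 # raw TB unaccounted for
print(round(wasted / 68, 2))  # fraction of utilised space, roughly 2/3
```

This is only an estimate; bucket index, multipart and gc objects legitimately consume some raw space beyond the bucket stats totals.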
[ceph-users] Re: RGW orphans search
Hi Manuel, Thanks for the tip. Do you know if the latest code has this bug fixed? I was planning to upgrade to the latest major. Cheers - Original Message - > From: "EDH" > To: "Andrei Mikhailovsky" , "ceph-users" > > Sent: Saturday, 30 May, 2020 14:45:44 > Subject: RE: RGW orphans search > Hi Andrei, > > Orphans find code is not running. Will be deprecated in next reléase maybe > 14.2.10 > > Check: https://docs.ceph.com/docs/master/radosgw/orphans/ > > Stop progress is bugged. > > You got the same issue than us, multiparts are not being clean due a sharding > bugs. > > Or fast solution for recover 100TB , s3cmd sync to a other bucket and them > delete the old bucket. > > Not transparent at all but Works. > > Other recomendation: disable Dynamic shard and put a fixed shard number at > your > config. > > Regards > Manuel > > > -Mensaje original- > De: Andrei Mikhailovsky > Enviado el: sábado, 30 de mayo de 2020 13:12 > Para: ceph-users > Asunto: [ceph-users] RGW orphans search > > Hello, > > I am trying to clean up some wasted space (about 1/3 of used space in the > rados > pool is currently unaccounted for including the replication level). I've > started the search command 20 days ago ( radosgw-admin orphans find > --pool=.rgw.buckets --job-id=ophans_clean1 --yes-i-really-mean-it ) and it's > still showing me the same thing: > > [ > { > "orphan_search_state": { > "info": { > "orphan_search_info": { > "job_name": "ophans_clean1", > "pool": ".rgw.buckets", > "num_shards": 64, > "start_time": "2020-05-10 21:39:28.913405Z" > } > }, > "stage": { > "orphan_search_stage": { > "search_stage": "iterate_bucket_index", > "shard": 0, > "marker": "" > } > } > } > } > ] > > > The output of the command keeps showing this (hundreds of thousands of lines): > > storing 1 entries at orphan.scan.ophans_clean1.linked.60 > > The total size of the pool is around 30TB and the buckets usage is just under > 10TB. The replica is 2. 
The activity on the cluster has spiked up since I've > started the command (currently seeing between 10-20K iops compared to a > typical > 2-5k iops). > > Has anyone experienced this behaviour? It seems like the command should have > finished by now with only 30TB of used up space. I am running 13.2.10-1xenial > version of ceph. > > Cheers > > Andrei > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to > ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW orphans search
Hello,

I am trying to clean up some wasted space (about 1/3 of the used space in the rados pool is currently unaccounted for, including the replication level). I started the search command 20 days ago ( radosgw-admin orphans find --pool=.rgw.buckets --job-id=ophans_clean1 --yes-i-really-mean-it ) and it's still showing me the same thing:

[
    {
        "orphan_search_state": {
            "info": {
                "orphan_search_info": {
                    "job_name": "ophans_clean1",
                    "pool": ".rgw.buckets",
                    "num_shards": 64,
                    "start_time": "2020-05-10 21:39:28.913405Z"
                }
            },
            "stage": {
                "orphan_search_stage": {
                    "search_stage": "iterate_bucket_index",
                    "shard": 0,
                    "marker": ""
                }
            }
        }
    }
]

The output of the command keeps showing this (hundreds of thousands of lines):

storing 1 entries at orphan.scan.ophans_clean1.linked.60

The total size of the pool is around 30TB and the bucket usage is just under 10TB. The replica count is 2. The activity on the cluster has spiked since I started the command (currently seeing between 10-20K iops compared to a typical 2-5K iops).

Has anyone experienced this behaviour? It seems like the command should have finished by now with only 30TB of used space. I am running version 13.2.10-1xenial of ceph.

Cheers

Andrei
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rados buckets copy
Thanks for the suggestion! - Original Message - > From: "Szabo, Istvan (Agoda)" > To: "Andrei Mikhailovsky" , "ceph-users" > > Sent: Thursday, 7 May, 2020 03:48:04 > Subject: RE: rados buckets copy > Hi, > > You might try s3 browser app, it is quite easy to navigate and copy between > buckets. > > Istvan Szabo > Senior Infrastructure Engineer > --- > Agoda Services Co., Ltd. > e: istvan.sz...@agoda.com > ------- > > -Original Message- > From: Andrei Mikhailovsky > Sent: Thursday, April 30, 2020 6:16 PM > To: ceph-users > Subject: [ceph-users] Re: rados buckets copy > > Email received from outside the company. If in doubt don't click links nor > open > attachments! > > > Can anyone suggest of the best ways to copy the buckets? I don't see a command > line option of the radosgw admin tool for that. > > - Original Message - >> From: "Andrei Mikhailovsky" >> To: "EDH" >> Cc: "ceph-users" >> Sent: Wednesday, 29 April, 2020 00:07:18 >> Subject: [ceph-users] Re: rados buckets copy > >> Hi Manuel, >> >> My replica is 2, hence about 10TB of unaccounted usage. >> >> Andrei >> >> - Original Message - >>> From: "EDH - Manuel Rios" >>> To: "Andrei Mikhailovsky" >>> Sent: Tuesday, 28 April, 2020 23:57:20 >>> Subject: RE: rados buckets copy >> >>> Is your replica x3? 9x3 27... plus some overhead rounded >>> >>> Ceph df show including replicas , bucket stats just bucket usage no >>> replicas. >>> >>> -Mensaje original- >>> De: Andrei Mikhailovsky Enviado el: miércoles, 29 >>> de abril de 2020 0:55 >>> Para: ceph-users >>> Asunto: [ceph-users] rados buckets copy >>> >>> Hello, >>> >>> I have a problem with radosgw service where the actual disk usage >>> (ceph df shows 28TB usage) is way more than reported by the >>> radosgw-admin bucket stats (9TB usage). I have tried to get to the >>> end of the problem, but no one seems to be able to help. As a last >>> resort I will attempt to copy the buckets, rename them and remove the old >>> buckets. 
>>> >>> What is the best way of doing this (probably on a high level) so that >>> the copy process doesn't carry on the wasted space to the new buckets? >>> >>> Cheers >>> >>> Andrei >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an >>> email to ceph-users-le...@ceph.io >> ___ >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an >> email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to > ceph-users-le...@ceph.io > > > This message is confidential and is for the sole use of the intended > recipient(s). It may also be privileged or otherwise protected by copyright or > other legal rules. If you have received it by mistake please let us know by > reply email and delete it from your system. It is prohibited to copy this > message or disclose its content to anyone. Any confidentiality or privilege is > not waived or lost by any mistaken delivery or unauthorized disclosure of the > message. All messages sent to and from Agoda may be monitored to ensure > compliance with company policies, to protect the company's interests and to > remove potential malware. Electronic messages may be intercepted, amended, > lost > or deleted, or contain viruses. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rados buckets copy
Hi Manuel, My replica is 2, hence about 10TB of unaccounted usage. Andrei - Original Message - > From: "EDH - Manuel Rios" > To: "Andrei Mikhailovsky" > Sent: Tuesday, 28 April, 2020 23:57:20 > Subject: RE: rados buckets copy > Is your replica x3? 9x3 27... plus some overhead rounded > > Ceph df shows usage including replicas; bucket stats show just the bucket usage, no replicas. > > -Original Message- > From: Andrei Mikhailovsky > Sent: Wednesday, 29 April 2020 0:55 > To: ceph-users > Subject: [ceph-users] rados buckets copy > > Hello, > > I have a problem with the radosgw service where the actual disk usage (ceph df shows > 28TB) is far more than reported by radosgw-admin bucket stats (9TB). I have tried to get to the bottom of the problem, but no one seems to be > able to help. As a last resort I will attempt to copy the buckets, rename them > and remove the old buckets. > > What is the best way of doing this (probably on a high level) so that the copy > process doesn't carry on the wasted space to the new buckets? > > Cheers > > Andrei > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to > ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] rados buckets copy
Hello, I have a problem with the radosgw service where the actual disk usage (ceph df shows 28TB) is far more than reported by radosgw-admin bucket stats (9TB). I have tried to get to the bottom of the problem, but no one seems to be able to help. As a last resort I will attempt to copy the buckets, rename them and remove the old buckets. What is the best way of doing this (probably on a high level) so that the copy process doesn't carry the wasted space over to the new buckets? Cheers Andrei ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
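The arithmetic behind the figures in this thread can be made explicit: bucket stats report logical (pre-replication) usage, while ceph df reports raw usage across all replicas, so any raw usage beyond logical size times replica count is unaccounted for. A tiny sketch (the function name is mine, for illustration only):

```python
def unaccounted_raw(raw_used_tb, logical_tb, replica_size):
    """Raw space that replication alone cannot explain (in TB)."""
    return raw_used_tb - logical_tb * replica_size

# Figures from this thread: 28 TB raw vs 9 TB of bucket data at replica 2.
print(unaccounted_raw(28, 9, 2))  # -> 10 TB unaccounted
# Manuel's hypothetical: with replica 3 the same 9 TB would explain 27 TB raw.
print(unaccounted_raw(27, 9, 3))  # -> 0
```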
[ceph-users] is ceph balancer doing anything?
Hello everyone, A few weeks ago I enabled the ceph balancer on my cluster as per the instructions here: https://docs.ceph.com/docs/mimic/mgr/balancer/ I am running ceph version: ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable) The cluster has 48 osds (40 osds in hdd pools and 8 osds in an ssd pool). Currently, the balancer status shows as active: # ceph balancer status { "active": true, "plans": [], "mode": "upmap" } The health status of the cluster is: health: HEALTH_OK Previously, I used the old REWEIGHT to change the placement of data, as I had seen very uneven usage (ranging from about 60% on some OSDs to over 90% on others). So I have a number of osds with a reweight of 1 and some going down to 0.75. At the moment the osd usage ranges between about 65% and just under 90%, so still a huge variation. Since switching on the balancer, I have not actually seen any activity or data migration, so I am not sure whether the balancer is working at all. Could someone tell me how I can check whether the balancer is doing its job? The second question: now that the balancer is switched on, am I supposed to set the reweight values back to their default of 1? Many thanks ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
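One way to watch the balancer's effect is to sample the per-OSD utilization spread periodically and see whether it shrinks over time. A minimal sketch, assuming the JSON layout that `ceph osd df -f json` emits (a top-level "nodes" list whose entries carry a "utilization" percentage); the helper name is mine:

```python
import statistics

def utilization_spread(osd_df):
    """Min, max and stddev of per-OSD utilization, from the parsed
    output of `ceph osd df -f json` (assumed layout: a "nodes" list
    whose entries carry a "utilization" percentage)."""
    utils = [n["utilization"] for n in osd_df["nodes"]]
    return min(utils), max(utils), statistics.pstdev(utils)

# Example with made-up figures resembling the ones in this message:
sample = {"nodes": [{"utilization": 65.0},
                    {"utilization": 77.5},
                    {"utilization": 90.0}]}
lo, hi, sd = utilization_spread(sample)
print(f"min {lo:.1f}%  max {hi:.1f}%  stddev {sd:.1f}")
```

On a live cluster you would feed it `json.loads(subprocess.check_output(["ceph", "osd", "df", "-f", "json"]))`; a flat stddev over days suggests the balancer is not moving anything.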
[ceph-users] Re: Raw use 10 times higher than data use
Hi Mark, thanks for coming back regarding the small objects that are under the min_alloc size. I am sure there are plenty of such objects, as the rgw holds backups of Windows PCs/servers which are not compressed. Could you please confirm something for me: when I run the "radosgw-admin bucket stats" command and check the bucket usage, does the reported usage show the usage on the osds or simply the cumulative size of the files stored in the bucket? For example, if I store a single 2-byte file, will it show 2 bytes or the min_alloc size? I was judging the space usage based on the output of the bucket stats command and compared it with the ceph df output. Thanks - Original Message - > From: "Mark Nelson" > To: "ceph-users" > Sent: Thursday, 26 September, 2019 17:52:37 > Subject: [ceph-users] Re: Raw use 10 times higher than data use > Hi Andrei, > > > Probably the first thing to check is whether you have objects that are under > the min_alloc size. Those objects will result in wasted space as they > will use the full min_alloc size. I.e. by default a 1K RGW object on HDD > will take 64KB, while on NVMe it will take 16KB. We are considering > possibly setting the min_alloc size in master to 4K now that we've > improved performance of the write path, but there is a trade-off as this > will result in more rocksdb metadata and likely more overhead as the DB > grows. We still have testing to perform to see if it's a good > idea as a default value. We are also considering inlining very small > (<4K) objects in the onode itself, but that also will require > significant testing as it may put additional load on the DB as well. > > > Mark > > On 9/26/19 4:58 AM, Andrei Mikhailovsky wrote: >> Hi Georg, >> >> I am having a similar issue with the RGW pool. However, not to the extent of >> a 10x >> error rate. In my case, the error rate is about 2-3x. My real data usage is >> around 6TB, but Ceph uses over 17TB. 
I have asked this question here, but no >> one seems to know the solution and how to go about finding the wasted space >> and >> clearing it. >> >> @ceph_guys - does anyone in the company work in the area of finding the bugs >> that relate to the wasted space? Could anyone assist us in debugging and >> fixing >> our issues? >> >> Thanks >> >> Andrei >> >> - Original Message - >>> From: "Georg F" >>> To: ceph-users@ceph.io >>> Sent: Thursday, 26 September, 2019 10:50:01 >>> Subject: [ceph-users] Raw use 10 times higher than data use >>> Hi all, >>> >>> I've recently moved a 1TiB pool (3TiB raw use) from hdd osds (7) to newly >>> added >>> nvme osds (14). The hdd osds should be almost empty by now as just small >>> pools >>> reside on them. The pools on the hdd osds in sum store about 25GiB, which >>> should use about 75GiB with a pool size of 3. Wal and db are on separate >>> devices. >>> >>> However the outputs of ceph df and ceph osd df tell a different story: >>> >>> # ceph df >>> RAW STORAGE: >>> CLASS SIZE AVAIL USEDRAW USED %RAW USED >>> hdd 19 TiB 18 TiB 775 GiB 782 GiB 3.98 >>> >>> # ceph osd df | egrep "(ID|hdd)" >>> ID CLASS WEIGHT REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL >>> %USE >>> VAR PGS STATUS >>> 8 hdd 2.72392 1.0 2.8 TiB 111 GiB 10 GiB 111 KiB 1024 MiB 2.7 TiB >>> 3.85 >>> 0.60 65 up >>> 6 hdd 2.17914 1.0 2.3 TiB 112 GiB 11 GiB 83 KiB 1024 MiB 2.2 TiB >>> 4.82 >>> 0.75 58 up >>> 3 hdd 2.72392 1.0 2.8 TiB 114 GiB 13 GiB 71 KiB 1024 MiB 2.7 TiB >>> 3.94 >>> 0.62 76 up >>> 5 hdd 2.72392 1.0 2.8 TiB 109 GiB 7.6 GiB 83 KiB 1024 MiB 2.7 TiB >>> 3.76 >>> 0.59 63 up >>> 4 hdd 2.72392 1.0 2.8 TiB 112 GiB 11 GiB 55 KiB 1024 MiB 2.7 TiB >>> 3.87 >>> 0.60 59 up >>> 7 hdd 2.72392 1.0 2.8 TiB 114 GiB 13 GiB 8 KiB 1024 MiB 2.7 TiB >>> 3.93 >>> 0.61 66 up >>> 2 hdd 2.72392 1.0 2.8 TiB 111 GiB 9.9 GiB 78 KiB 1024 MiB 2.7 TiB >>> 3.84 >>> 0.60 69 up >>> >>> The sum of "DATA" is 75,5GiB which is what I am expecting to be used by the >>> pools. 
How come the sum of "RAW USE" is 783GiB? More than 10x the size of >>> the >>> stored data. On my nvme osds the "RA
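Mark's min_alloc explanation above can be put into numbers. A rough sketch of BlueStore's allocation rounding, using the default unit sizes quoted in his reply (64 KiB on HDD, 16 KiB on NVMe); this illustrates the rounding rule only and is not Ceph code:

```python
def raw_footprint(object_size, min_alloc_size):
    """On-disk bytes needed for one object's data when every object
    consumes whole allocation units of min_alloc_size bytes."""
    units = max(1, -(-object_size // min_alloc_size))  # ceiling division
    return units * min_alloc_size

KiB = 1024
# Mark's example: a 1 KiB RGW object on HDD occupies a full 64 KiB unit,
# a 64x amplification; on NVMe the same object takes a 16 KiB unit.
print(raw_footprint(1 * KiB, 64 * KiB) // KiB)  # -> 64
print(raw_footprint(1 * KiB, 16 * KiB) // KiB)  # -> 16
```

A workload of many sub-min_alloc objects (such as uncompressed backup fragments) can therefore inflate RAW USE well beyond the logical data size.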
[ceph-users] Re: Raw use 10 times higher than data use
Hi Georg, I am having a similar issue with the RGW pool. However, not to the extent of a 10x error rate. In my case, the error rate is about 2-3x. My real data usage is around 6TB, but Ceph uses over 17TB. I have asked this question here, but no one seems to know the solution and how to go about finding the wasted space and clearing it. @ceph_guys - does anyone in the company work in the area of finding the bugs that relate to the wasted space? Could anyone assist us in debugging and fixing our issues? Thanks Andrei - Original Message - > From: "Georg F" > To: ceph-users@ceph.io > Sent: Thursday, 26 September, 2019 10:50:01 > Subject: [ceph-users] Raw use 10 times higher than data use > Hi all, > > I've recently moved a 1TiB pool (3TiB raw use) from hdd osds (7) to newly > added > nvme osds (14). The hdd osds should be almost empty by now as just small pools > reside on them. The pools on the hdd osds in sum store about 25GiB, which > should use about 75GiB with a pool size of 3. Wal and db are on separate > devices. 
> > However the outputs of ceph df and ceph osd df tell a different story: > > # ceph df > RAW STORAGE: >CLASS SIZE AVAIL USEDRAW USED %RAW USED >hdd 19 TiB 18 TiB 775 GiB 782 GiB 3.98 > > # ceph osd df | egrep "(ID|hdd)" > ID CLASS WEIGHT REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL > %USE > VAR PGS STATUS > 8 hdd 2.72392 1.0 2.8 TiB 111 GiB 10 GiB 111 KiB 1024 MiB 2.7 TiB 3.85 > 0.60 65 up > 6 hdd 2.17914 1.0 2.3 TiB 112 GiB 11 GiB 83 KiB 1024 MiB 2.2 TiB 4.82 > 0.75 58 up > 3 hdd 2.72392 1.0 2.8 TiB 114 GiB 13 GiB 71 KiB 1024 MiB 2.7 TiB 3.94 > 0.62 76 up > 5 hdd 2.72392 1.0 2.8 TiB 109 GiB 7.6 GiB 83 KiB 1024 MiB 2.7 TiB 3.76 > 0.59 63 up > 4 hdd 2.72392 1.0 2.8 TiB 112 GiB 11 GiB 55 KiB 1024 MiB 2.7 TiB 3.87 > 0.60 59 up > 7 hdd 2.72392 1.0 2.8 TiB 114 GiB 13 GiB 8 KiB 1024 MiB 2.7 TiB 3.93 > 0.61 66 up > 2 hdd 2.72392 1.0 2.8 TiB 111 GiB 9.9 GiB 78 KiB 1024 MiB 2.7 TiB 3.84 > 0.60 69 up > > The sum of "DATA" is 75,5GiB which is what I am expecting to be used by the > pools. How come the sum of "RAW USE" is 783GiB? More than 10x the size of the > stored data. On my nvme osds the "RAW USE" to "DATA" overhead is <1%: > > ceph osd df|egrep "(ID|nvme)" > ID CLASS WEIGHT REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL > %USE > VAR PGS STATUS > 0 nvme 2.61989 1.0 2.6 TiB 181 GiB 180 GiB 31 KiB 1.0 GiB 2.4 TiB 6.74 > 1.05 12 up > 1 nvme 2.61989 1.0 2.6 TiB 151 GiB 150 GiB 39 KiB 1024 MiB 2.5 TiB 5.62 > 0.88 10 up > 13 nvme 2.61989 1.0 2.6 TiB 239 GiB 238 GiB 55 KiB 1.0 GiB 2.4 TiB > 8.89 > 1.39 16 up > -- truncated -- > > I am running ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) > nautilus (stable) which was upgraded recently from 13.2.1. > > Any help is appreciated. > > Best regards, > Georg > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io