[ceph-users] Understand ceph df details
Hi everyone,

I'm trying to understand the difference between the output of this command:

    ceph df detail

and the result I get when I run this script:

    total_bytes=0
    while read user; do
        echo $user
        bytes=$(radosgw-admin user stats --uid=${user} | grep total_bytes_rounded | tr -dc "0-9")
        if [ ! -z ${bytes} ]; then
            total_bytes=$((total_bytes + bytes))
            # note: dividing by 1000^4 gives decimal TB; binary TiB would be 1024^4
            pretty_bytes=$(echo "scale=2; $bytes / 1000^4" | bc)
            echo " ($bytes B) $pretty_bytes TiB"
        fi
        pretty_total_bytes=$(echo "scale=2; $total_bytes / 1000^4" | bc)
    done <<< "$(radosgw-admin user list | jq -r .[])"
    echo ""
    echo "Total : ($total_bytes B) $pretty_total_bytes TiB"

When I run df I get this:

    default.rgw.buckets.data 70 N/A N/A 226TiB 89.23 27.2TiB 61676992 61.68M 2.05GiB 726MiB 677TiB

And when I use my script I don't get the same result:

    Total : (207579728699392 B) 207.57 TiB

That means I have 20 TiB somewhere that I can't find and, most of all, can't explain. Does anyone have an explanation?

FYI:

    [root@ceph_monitor01 ~]# radosgw-admin gc list --include-all | grep oid | wc -l
    23
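One thing worth checking in the script above: dividing by 1000^4 yields decimal terabytes (TB), while the TiB column in ceph df is binary, so the two totals are not in the same unit. A quick check with bc, reusing the byte total quoted above:

    $ echo "scale=2; 207579728699392 / 1000^4" | bc    # decimal TB - what the script prints
    207.57
    $ echo "scale=2; 207579728699392 / 1024^4" | bc    # binary TiB - what ceph df reports
    188.79

So if the 226 TiB from ceph df is binary, as the suffix suggests, the gap against the per-user stats is closer to 37 TiB than 20.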
[ceph-users] ceph-ansible / block-db block-wal
Hi Everyone,

Does anyone know how to point block-db and block-wal at specific devices with ceph-ansible? With ceph-deploy it is quite easy:

    ceph-deploy osd create osd_host08 --data /dev/sdl --block-db /dev/sdm12 --block-wal /dev/sdn12 --bluestore

On my data nodes I have 12 HDDs and 2 SSDs, and I use those SSDs for block-db and block-wal. How do I indicate, for each OSD, which partition to use? And finally, how do you handle deployment when the data nodes differ - SSDs on sdm and sdn on one host, and on sda and sdb on another?

Thank you for your help. Regards,
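For what it's worth, a sketch of how this is commonly expressed with ceph-ansible's non-collocated scenario (variable names as used in the stable-3.x branches around Luminous; the device paths below are placeholders):

    # group_vars/osds.yml - a sketch, assuming stable-3.x variable names
    osd_objectstore: bluestore
    osd_scenario: non-collocated
    devices:                   # the data HDDs, one OSD each
      - /dev/sda
      - /dev/sdb
      # ... up to the 12th HDD
    dedicated_devices:         # block.db; one entry per entry in 'devices'
      - /dev/sdm               # ceph-ansible partitions the SSD itself
      - /dev/sdm
      # ...
    bluestore_wal_devices:     # block.wal; same one-to-one mapping
      - /dev/sdn
      - /dev/sdn
      # ...

For a heterogeneous fleet, the same variables can be set per host under host_vars/<hostname>.yml, so the sdm/sdn host and the sda/sdb host each carry their own device lists.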
[ceph-users] Missing field "host" in logs sent to Graylog
Hi everyone,

We are facing a problem where we cannot read logs sent to Graylog because one mandatory field is missing:

    GELF message (received from ) has empty mandatory "host" field.

Does anyone know what we are missing? I know someone else faced the same issue, but it seems they never got an answer.

We are running: Ceph 12.2.12, Graylog 3.0.

Thanks! Regards,
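For anyone comparing notes, the relevant ceph.conf options look roughly like this (a sketch; the Graylog hostname is a placeholder, and option availability is as of Luminous). Since the GELF "host" field is presumably filled from the sending daemon's hostname, it may also be worth checking that each daemon can resolve its own hostname:

    [global]
    log_to_graylog = true
    err_to_graylog = true
    log_to_graylog_host = graylog.example.com    # placeholder
    log_to_graylog_port = 12201
    mon_cluster_log_to_graylog = true
    mon_cluster_log_to_graylog_host = graylog.example.com
    mon_cluster_log_to_graylog_port = 12201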
[ceph-users] Ceph Buckets Backup
Hi everyone,

Has anyone ever backed up a Ceph bucket into Amazon Glacier? If so, did you use a script that drives the API to "migrate" the objects? And if you don't use Amazon S3, how do you make those backups?

Thanks in advance. Regards,
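One approach that comes up for this, sketched under the assumption that both ends speak S3: rclone with two remotes, the RGW endpoint as source and AWS as destination, with the Glacier storage class set on the AWS side. Endpoint, region, bucket names, and credentials below are all placeholders:

    # ~/.config/rclone/rclone.conf
    [cephrgw]
    type = s3
    provider = Ceph
    access_key_id = <rgw-access-key>
    secret_access_key = <rgw-secret-key>
    endpoint = http://rgw.example.com:7480

    [awsglacier]
    type = s3
    provider = AWS
    access_key_id = <aws-access-key>
    secret_access_key = <aws-secret-key>
    region = eu-west-1
    storage_class = GLACIER

    # then copy a bucket across:
    rclone sync cephrgw:mybucket awsglacier:mybucket-backup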
Re: [ceph-users] Weird behaviour of ceph-deploy
Things are not evolving. If I find an alternative way to add new OSD nodes in the future, I'll note it here. I'm abandoning ceph-deploy, since it seems to be buggy.

Regards,
Re: [ceph-users] Weird behaviour of ceph-deploy
1. All nodes are on 12.2.12 (Luminous stable).
2. Forward and reverse DNS and SSH are working fine.
3. Things go wrong at the install step: Ceph is installed correctly, but the node is not present in the cluster. That seems to be the step that fails.

Regards,

From: Brian Topping <brian.topp...@gmail.com>
Sent: 17 June 2019 16:39
To: CUZA Frédéric <frederic.c...@sib.fr>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Weird behaviour of ceph-deploy

I don't have an answer for you, but it's going to help others to have shown:
1. Versions of all nodes involved and multi-master configuration
2. Confirmation of forward and reverse DNS and SSH / remote sudo, since you are using ceph-deploy
3. The specific steps that did not behave properly
Re: [ceph-users] Weird behaviour of ceph-deploy
I'll keep updating this thread until I find a solution, so that anyone who faces the same problem might find one.

At the moment: I install the new OSD node with ceph-deploy and nothing changes; the node is still not present in the cluster or in the crushmap. I decided to add it to the crush map manually:

    ceph osd crush add-bucket sd0051 host

and move it to where it should be:

    ceph osd crush move sd0051 room=roomA

Then I added an OSD to that node:

    ceph-deploy osd create sd0051 --data /dev/sde --block-db /dev/sda1 --block-wal /dev/sdb1 --bluestore

Once created, the OSD is still not linked to the host it was created on, and I can't move it to this host right now.

Regards,
Re: [ceph-users] Weird behaviour of ceph-deploy
Little update: I checked one OSD I installed (even though its host is not present in the crushmap, or in the cluster I guess) and found this in its log:

    monclient: wait_auth_rotating timed out after 30
    osd.xxx 0 unable to obtain rotating service keys; retrying

I also added the host to the admin hosts (ceph-deploy admin sd0051) and nothing changed. When I run the install, no ceph.conf is pushed to the new node.

Regards,
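Not an answer, but "unable to obtain rotating service keys" is often a symptom of clock skew between the new node and the mons, or of missing config and keyrings; a few checks worth running on sd0051 (a sketch, the mon hostname is a placeholder):

    date; ssh <mon-host> date      # rotating keys are time-sensitive; compare clocks
    ls -l /etc/ceph/               # confirm ceph.conf and keyrings were actually pushed
    ceph -s                        # confirm the node can reach the monitors at all
    ceph time-sync-status          # the monitors' view of time synchronisation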
[ceph-users] Weird behaviour of ceph-deploy
Hi everyone,

I am facing some strange behaviour from ceph-deploy. I am trying to add a new node to our cluster:

    ceph-deploy install --no-adjust-repos sd0051

Everything seems to work fine, but the new bucket (host) is not created in the crushmap, and when I try to add a new OSD to that host, the OSD is created but is not linked to any host (normal behaviour, since the host is not present). Has anyone faced this before?

FYI: we have already added new nodes this way, and this is the first time we've hit this.

Thanks!
Re: [ceph-users] Multiple rbd images from different clusters
Hi,

Thank you all for your quick answers. I think that will solve our problem. This is what we came up with:

    rbd -c /etc/ceph/Oceph.conf --keyring /etc/ceph/Oceph.client.admin.keyring export rbd/disk_test - | rbd -c /etc/ceph/Nceph.conf --keyring /etc/ceph/Nceph.client.admin.keyring import - rbd/disk_test

This rbd image is a test with only 5 GB of data inside it. Unfortunately the command seems to be stuck and nothing happens, even though ports 7800 / 6789 / 22 are all reachable. We can't find any logs on the monitors.

Thanks!

-----Original Message-----
From: ceph-users, on behalf of Jason Dillaman
Sent: 04 June 2019 14:11
To: Burkhard Linke
Cc: ceph-users
Subject: Re: [ceph-users] Multiple rbd images from different clusters

On Tue, Jun 4, 2019 at 8:07 AM Jason Dillaman wrote:
> On Tue, Jun 4, 2019 at 4:45 AM Burkhard Linke wrote:
> > Hi,
> >
> > On 6/4/19 10:12 AM, CUZA Frédéric wrote:
> > > We can't find a "proper" way to mount two rbd images from two different
> > > clusters on the same host. Does anyone know what is the "good" procedure
> > > to achieve this?
>
> Copy your "/etc/ceph/ceph.conf" and associated keyrings for both clusters to a
> single machine (preferably running a Mimic "rbd" client) under
> "/etc/ceph/<cluster-name>.conf" and "/etc/ceph/<cluster-name>.client.<user>.keyring".
>
> You can then use "rbd -c <source-cluster> export <image-spec> --export-format 2 - |
> rbd -c <destination-cluster> import --export-format=2 - <image-spec>". The
> "--export-format=2" option will also copy all associated snapshots with the images.
> If you don't want/need the snapshots, just drop that option.

That "-c" should be "--cluster" if specifying by name; with "-c" it's the full path to the two different conf files.

> > Just my 2 ct:
> >
> > the 'rbd' command allows specifying a configuration file (-c). You need to set up
> > two configuration files, one for each cluster. You can also use two different
> > cluster names (--cluster option). AFAIK the name is only used to locate the
> > configuration file. I'm not sure how well the kernel works with mapping RBDs from
> > two different clusters.
> >
> > If you only want to transfer RBDs from one cluster to another, you do not need to
> > map and mount them; the 'rbd' command has the sub-commands 'export' and 'import'.
> > You can pipe them to avoid writing data to a local disk. This should be the
> > fastest way to transfer the RBDs.
> >
> > Regards,
> > Burkhard
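In case it helps debug a stuck pipe like the one above, it may be worth checking that each side works on its own before combining them (a sketch, reusing the conf and keyring paths from the message):

    # can each cluster be reached at all?
    rbd -c /etc/ceph/Oceph.conf --keyring /etc/ceph/Oceph.client.admin.keyring ls rbd
    rbd -c /etc/ceph/Nceph.conf --keyring /etc/ceph/Nceph.client.admin.keyring ls rbd

    # does the export produce data by itself? (writes to a local file instead of the pipe)
    rbd -c /etc/ceph/Oceph.conf --keyring /etc/ceph/Oceph.client.admin.keyring export rbd/disk_test /tmp/disk_test.img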
Re: [ceph-users] rbd.ReadOnlyImage: [errno 30]
Thank you all for your quick answers. I think that will solve our problem. This is what we came up with:

    rbd -c /etc/ceph/Oceph.conf --keyring /etc/ceph/Oceph.client.admin.keyring export rbd/disk_test - | rbd -c /etc/ceph/Nceph.conf --keyring /etc/ceph/Nceph.client.admin.keyring import - rbd/disk_test

This rbd image is a test with only 5 GB of data inside it. Unfortunately the command seems to be stuck and nothing happens, even though ports 7800 / 6789 / 22 are all reachable. We can't find any logs on the monitors.

Thanks!

-----Original Message-----
From: ceph-users, on behalf of Jason Dillaman
Sent: 04 June 2019 14:14
To: 解决
Cc: ceph-users
Subject: Re: [ceph-users] rbd.ReadOnlyImage: [errno 30]

On Tue, Jun 4, 2019 at 4:55 AM 解决 wrote:
>
> Hi all,
> We use Ceph (Luminous) + OpenStack (Queens) in my test environment. The virtual
> machine does not start properly after a disaster test, and the image of the virtual
> machine cannot create a snapshot. The procedure is as follows:
>
> #!/usr/bin/env python
>
> import rados
> import rbd
>
> with rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id='nova') as cluster:
>     with cluster.open_ioctx('vms') as ioctx:
>         rbd_inst = rbd.RBD()
>         print "start open rbd image"
>         with rbd.Image(ioctx, '10df4634-4401-45ca-9c57-f349b78da475_disk') as image:
>             print "start create snapshot"
>             image.create_snap('myimage_snap1')
>
> When I run it, it shows ReadOnlyImage, as follows:
>
> start open rbd image
> start create snapshot
> Traceback (most recent call last):
>   File "testpool.py", line 17, in <module>
>     image.create_snap('myimage_snap1')
>   File "rbd.pyx", line 1790, in rbd.Image.create_snap (/builddir/build/BUILD/ceph-12.2.5/build/src/pybind/rbd/pyrex/rbd.c:15682)
> rbd.ReadOnlyImage: [errno 30] error creating snapshot myimage_snap1 from 10df4634-4401-45ca-9c57-f349b78da475_disk
>
> But when I run it as admin instead of nova, it is OK.
>
> "ceph auth list" shows the following:
>
> installed auth entries:
>
> osd.1
>     key: AQBL7uRcfuyxEBAAoK8JrQWMU6EEf/g83zKJjg==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> osd.10
>     key: AQCV7uRcdsB9IBAAHbHHCaylVUZIPKFX20polQ==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> osd.11
>     key: AQCW7uRcRIMRIhAAbXfLbQwijEO5ZQFWFZaO5w==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> osd.2
>     key: AQBL7uRcfFMWDBAAo7kjQobGBbIHYfZkx45pOw==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> osd.4
>     key: AQBk7uRc97CPOBAAK9IBJICvchZPc5p80bISsg==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> osd.5
>     key: AQBk7uRcOdqaORAAkQeEtYsE6rLWLPhYuCTdHA==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> osd.7
>     key: AQB97uRc+1eRJxAA34DImQIMFjzHSXZ25djp0Q==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> osd.8
>     key: AQB97uRcFilBJhAAXzSzNJsgwpobC8654Xo7Sw==
>     caps: [mon] allow profile osd
>     caps: [osd] allow *
> client.admin
>     key: AQAU7uRcNia+BBAA09mOYdX+yJWbLCjcuMih0A==
>     auid: 0
>     caps: [mds] allow
>     caps: [mgr] allow *
>     caps: [mon] allow *
>     caps: [osd] allow *
> client.cinder
>     key: AQBp7+RcOzPHGxAA7azgyayVu2RRNWJ7JxSJEg==
>     caps: [mon] allow r
>     caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=volumes-cache, allow rwx pool=vms, allow rwx pool=vms-cache, allow rx pool=images, allow rx pool=images-cache
> client.cinder-backup
>     key: AQBq7+RcVOwGNRAAiwJ59ZvAUc0H4QkVeN82vA==
>     caps: [mon] allow r
>     caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=backups, allow rwx pool=backups-cache
> client.glance
>     key: AQDf7uRc32hDBBAAkGucQEVTWqnIpNvihXf/Ng==
>     caps: [mon] allow r
>     caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=images, allow rwx pool=images-cache
> client.nova
>     key: AQDN7+RcqDABIxAAXnFcVjBp/S5GkgOy0wqB1Q==
>     caps: [mon] allow r
>     caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=volumes-cache, allow rwx pool=vms, allow rwx pool=vms-cache, allow rwx pool=images, allow rwx pool=images-cache
> client.radosgw.gateway
>     key: AQAU7uRccP06CBAA6zLFtDQoTstl8CNclYRugQ==
>     auid: 0
>     caps: [mon] allow rwx
>     caps: [osd] allow rwx
> mgr.172.30.126.26
>     key: AQAr7uRclc52MhAA+GWCQEVnAHB01tMFpgJtTQ==
>     caps: [mds] allow *
>     caps: [mon] allow profile mgr
>     caps: [osd] allow *
> mgr.172.30.126.27
>     key: AQAs7uRclkD2OBAAW/cUhcZEebZnQulqVodiXQ==
>     caps: [mds] allow *
>     caps: [mon] allow profile mgr
>     caps: [osd] allow *
> mgr.172.30.126.28
>     key: AQAu7uRcT9OLBBAAZbEjb/N1NnZpIgfaAcThyQ==
>     caps: [mds] allow *
>     caps: [mon] allow profile mgr
>     caps: [osd] allow *
>
> Can someone explain it to me?

Your clients don't have the correct caps. See [1] or [2].

> thanks!!

[1] http://docs.ceph.com/docs/mimic/releases/luminous/#upgrade-from-jewel-or-kraken
[2]
[ceph-users] Multiple rbd images from different clusters
Hi everyone,

We want to migrate data from one cluster (Hammer) to a new one (Mimic). We do not wish to upgrade the existing cluster, as all its hardware is EOS, and we are upgrading the server configuration at the same time. We can't find a "proper" way to mount two rbd images from two different clusters on the same host. Does anyone know the "good" procedure to achieve this?

Cheers and thanks,
Fred.
Re: [ceph-users] Whole cluster flapping
Hi,

Just to report that I finally resolved my problem a few weeks ago; I wanted to make sure it was solved permanently before posting. I set the OSD timeout to a larger number of seconds and set the noout and nodown flags on the cluster. Basically I just waited for the "clean" to finish, but I noticed that some OSDs were timing out more than the others, so (since the data was no longer important to me) I started to take those OSDs out and purge them one by one. I did this for about 1/8 of my OSDs, and now there is no more flapping and no OSDs going down, although I still see some slow requests popping up from time to time. It seems the purge of the deleted pool never made it to "the end"; purging the remaining OSDs where its data appeared to live helped bring the cluster back to a stable state.

Thank you all for your help. Regards,

From: Will Marley
Sent: 08 August 2018 16:14
To: Webert de Souza Lima; CUZA Frédéric
Cc: ceph-users
Subject: RE: [ceph-users] Whole cluster flapping

Hi again Frederic,

It may be worth looking at a recovery sleep:

    osd recovery sleep
        Time in seconds to sleep before the next recovery or backfill op. Increasing
        this value will slow down recovery while client operations are less impacted.
        Type: Float; Default: 0
    osd recovery sleep hdd
        Time in seconds to sleep before the next recovery or backfill op for HDDs.
        Type: Float; Default: 0.1
    osd recovery sleep ssd
        Time in seconds to sleep before the next recovery or backfill op for SSDs.
        Type: Float; Default: 0
    osd recovery sleep hybrid
        Time in seconds to sleep before the next recovery or backfill op when OSD data
        is on HDD and the OSD journal is on SSD.
        Type: Float; Default: 0.025

(Pulled from http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/)

When we faced similar issues, running

    ceph tell osd.* injectargs '--osd-recovery-sleep 2'

allowed the OSDs to respond with a heartbeat whilst taking a break between recovery operations. I'd suggest tweaking the sleep time to find a sweet spot. This may be worth a try, so let us know how you get on.

Regards, Will

From: ceph-users, on behalf of Webert de Souza Lima
Sent: 08 August 2018 15:06
To: frederic.c...@sib.fr
Cc: ceph-users
Subject: Re: [ceph-users] Whole cluster flapping

So your OSDs are really too busy to respond to heartbeats. You'll be facing this for some time, until the cluster load gets lower. I would set `ceph osd set nodeep-scrub` until the heavy disk I/O stops; maybe you can schedule it to be enabled during the night and disabled in the morning.

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
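For anyone landing here later, the resolution described above maps to roughly the following commands (a sketch; the grace value matches the one mentioned elsewhere in the thread, and the OSD ids are placeholders):

    ceph osd set noout
    ceph osd set nodown
    ceph tell osd.* injectargs '--osd_heartbeat_grace 90'

    # once stable, retire the worst-offending OSDs one at a time,
    # letting the cluster rebalance in between:
    ceph osd out <id>
    ceph osd purge <id> --yes-i-really-mean-it      # Luminous and later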
Re: [ceph-users] Whole cluster flapping
Thanks for the command lines. I did take a look at them, but I don't really know what to look for, my bad... All this flapping is due to deep-scrub: when it starts on an OSD, things start to go bad. I set out the OSDs that were flapping the most (one by one, after rebalancing) and it looks better, even if some OSDs keep going down/up with the same message in the logs:

    1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had timed out after 90

(I raised the timeout to 90s instead of 15s.)

Regards,

From: ceph-users, on behalf of Webert de Souza Lima
Sent: 07 August 2018 16:28
To: ceph-users
Subject: Re: [ceph-users] Whole cluster flapping

Oops, my bad, you're right. I don't know how much you can see, but maybe you can dig around the performance counters and see what's happening on those OSDs. Try these:

    ~# ceph daemonperf osd.XX
    ~# ceph daemon osd.XX perf dump

Change XX to your OSD numbers.

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
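Since deep scrubs seem to be the trigger here, one option (in the spirit of Webert's nodeep-scrub suggestion earlier in the thread) is to confine scrubbing to off-peak hours; a sketch using options that exist in Luminous, with illustrative values:

    # only start scrubs between 22:00 and 06:00
    ceph tell osd.* injectargs '--osd_scrub_begin_hour 22 --osd_scrub_end_hour 6'

    # or gate deep scrubs by hand / from cron:
    ceph osd set nodeep-scrub      # pause deep scrubs while I/O is heavy
    ceph osd unset nodeep-scrub    # allow them again off-peak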
Re: [ceph-users] Whole cluster flapping
Pool is already deleted and no longer present in the stats.

Regards,

From: ceph-users, on behalf of Webert de Souza Lima
Sent: 07 August 2018 15:08
To: ceph-users
Subject: Re: [ceph-users] Whole cluster flapping

Frédéric, see if the number of objects is decreasing in the pool with `ceph df [detail]`.

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
Re: [ceph-users] Whole cluster flapping
It's been over a week now and the whole cluster keeps flapping; it is never the same OSDs that go down. Is there a way to get the progress of this recovery? (The pool that I deleted has been gone for a while now.) In fact, there is a lot of I/O activity on the servers where OSDs go down.

Regards,

From: ceph-users, on behalf of Webert de Souza Lima
Sent: 31 July 2018 16:25
To: ceph-users
Subject: Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of I/O operations on the disks, and the processes might be too busy to respond to heartbeats, so the mons mark them as down due to no response. Also check the OSD logs to see if they are actually crashing and restarting, and disk I/O usage (i.e. iostat).

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
Re: [ceph-users] Whole cluster flapping
Hi,

At the moment, following some advice and what I could read here and on the Internet, I have set the nodown flag and increased the heartbeat grace. Since it was 120 TB of data, I think it's going to take a little while to fully recover. I'll keep you all updated on the status. Thanks for everything.

Regards,

From: Brent Kennedy
Sent: 31 July 2018 23:36
To: CUZA Frédéric; 'ceph-users'
Subject: RE: [ceph-users] Whole cluster flapping

I have had this happen during large data movements. It stopped happening after I went to 10Gb networking (from 1Gb). What I had done is injected a setting (and adjusted the configs) to give more time before an OSD was marked down:

    osd heartbeat grace = 200
    mon osd down out interval = 900

For injecting runtime values/settings (under "runtime changes"): http://docs.ceph.com/docs/luminous/rados/configuration/ceph-conf/

You should probably check the logs before doing anything, to ensure the OSDs or the host are not actually failing.

-Brent
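Brent's two settings can be applied at runtime roughly like this (a sketch; spaces in option names become underscores when injecting):

    ceph tell osd.* injectargs '--osd_heartbeat_grace 200'
    ceph tell mon.* injectargs '--mon_osd_down_out_interval 900'

    # and persisted in ceph.conf so they survive daemon restarts:
    [global]
    osd heartbeat grace = 200
    mon osd down out interval = 900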
[ceph-users] Whole cluster flapping
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool that we had (120 TB). Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD); we have SSDs for the journals. After I deleted the large pool, the whole cluster started flapping on all OSDs. OSDs are marked down and then marked up again, as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed (root=default,room=,host=) (8 reporters from different host after 54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 172.29.228.5:6812/14996 boot
2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 138553/5846235 objects
Re: [ceph-users] Be careful with orphans find (was Re: Lost TB for Object storage)
Hi Matthew,

Thanks for the advice, but we are no longer using orphans find, since the problem does not seem to be solved by it.

Regards,

-----Original Message-----
From: Matthew Vernon
Sent: 20 July 2018 11:03
To: CUZA Frédéric; ceph-users@lists.ceph.com
Subject: Be careful with orphans find (was Re: [ceph-users] Lost TB for Object storage)

Hi,

On 19/07/18 17:19, CUZA Frédéric wrote:
> After that we tried to remove the orphans:
>
> radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=ophans_clean
> radosgw-admin orphans finish --job-id=ophans_clean
>
> It finds some orphans: 85. But the finish command does not seem to work, so we
> decided to delete those orphans manually, by piping the output of find into a log
> file.

I would advise caution with using the "orphans find" code in radosgw-admin. On the advice of our vendor, we ran this and automatically removed the resulting objects. Unfortunately, a small proportion of the objects found and removed thus were not in fact orphans, meaning we ended up with some damaged S3 objects; they appeared in bucket listings, but you'd get a 404 if you tried to download them.

We have asked our vendor to make the wider community aware of the issue, but they have not (yet) done so.

Regards,
Matthew
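Given Matthew's warning, anyone who does act on an orphans-find list may want a second opinion per object before deleting anything. For example, shadow-object names embed the bucket marker, which can be checked against live bucket metadata; a rough sketch with hypothetical names, not a guarantee of safety:

    # does the marker in the rados object name still belong to a live bucket instance?
    radosgw-admin metadata get bucket.instance:<bucket>:<marker>

    # does the object itself still exist in the data pool?
    rados -p default.rgw.buckets.data stat '<marker>__shadow_<suffix>'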
[ceph-users] Lost TB for Object storage
Hi Guys,

We are running a Ceph Luminous 12.2.6 cluster. The cluster is used both for RBD storage and Ceph Object Storage, and has about 742 TB of raw space. We have an application that pushes snapshots of our VMs through RGW. All seems to be fine, except that there is a discrepancy between what the S3 API shows and what the command "ceph df detail" reports.

S3 API (python script):

    Total : 44325.84438523278 GB

ceph df detail:

    NAME                      ID    USED
    default.rgw.buckets.data  59    104T

So the difference is about 60 TB. We tried to clean the gc, but nothing is shown:

    # radosgw-admin gc list --include-all
    []
    #

After that we tried to remove the orphans:

    radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=ophans_clean
    radosgw-admin orphans finish --job-id=ophans_clean

It found some orphans: 85. But the finish command does not seem to work, so we decided to delete those orphans manually, by piping the output of find into a log file. Even after that, we still have a huge discrepancy between what the S3 API shows and what Ceph reports. When we list objects with the S3 API, we find exactly the information that the application returns (which is normal, since the application uses this API). When we list objects with the rados CLI, we find more objects than we can see via the S3 API.

We are out of ideas and can't figure out what's wrong. Has anyone already faced this problem?

Regards,
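One way to see which side the extra objects live on, sketched with a hypothetical bucket name: compare the raw RADOS object count against RGW's own per-bucket accounting, keeping in mind that RGW stripes large S3 objects across several RADOS objects, so the raw count being higher is normal by itself:

    # raw objects in the data pool
    rados -p default.rgw.buckets.data ls | wc -l

    # what RGW thinks a bucket holds
    radosgw-admin bucket stats --bucket=<bucket> | grep -E 'num_objects|size_kb'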