[ceph-users] Understand ceph df details

2020-01-21 Thread CUZA Frédéric
Hi everyone,

I'm trying to understand the difference between the output of this command:
ceph df detail

And the result I'm getting when I run this script :
total_bytes=0
while read user; do
  echo $user
  bytes=$(radosgw-admin user stats --uid=${user} | grep total_bytes_rounded | tr -dc "0-9")
  if [ ! -z ${bytes} ]; then
total_bytes=$((total_bytes + bytes))
pretty_bytes=$(echo "scale=2; $bytes / 1000^4" | bc)
echo "  ($bytes B) $pretty_bytes TiB"
  fi
  pretty_total_bytes=$(echo "scale=2; $total_bytes / 1000^4" | bc)
done <<< "$(radosgw-admin user list | jq -r .[])"
echo ""
echo "Total : ($total_bytes B) $pretty_total_bytes TiB"


When I run ceph df detail I get this:
default.rgw.buckets.data   70   N/A   N/A   226TiB   89.23   27.2TiB   61676992   61.68M   2.05GiB   726MiB   677TiB

And when I use my script I don't get the same result:
Total : (207579728699392 B) 207.57 TiB

That means there are about 20 TiB somewhere that I can't find, and most of all I
can't understand where this 20 TiB comes from.
Does anyone have an explanation?


FYI:
[root@ceph_monitor01 ~]# radosgw-admin gc list --include-all | grep oid | wc -l
23
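
As a side note: dividing by 1000^4 in the script yields decimal TB rather than TiB,
so the two totals above are not in the same unit. A per-bucket cross-check of the
same total can be done along these lines (a rough sketch; it assumes jq is installed
and that size_kb_actual is present in this release's bucket stats output):

# Sum the actual size of every bucket as RGW accounts it, in bytes,
# comparable with the byte total printed by the script above.
radosgw-admin bucket stats | jq '[.[].usage["rgw.main"].size_kb_actual // 0] | add * 1024'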
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-ansible / block-db block-wal

2019-10-30 Thread CUZA Frédéric
Hi Everyone,

Does anyone know how to specify the block-db and block-wal devices with ceph-ansible?
In ceph-deploy it is quite easy:
ceph-deploy osd create osd_host08 --data /dev/sdl --block-db /dev/sdm12 --block-wal /dev/sdn12 --bluestore

On my data nodes I have 12 HDDs and 2 SSDs, and I use those SSDs for block-db and
block-wal.
How do I indicate, for each OSD, which partition to use?

And finally, how do you handle the deployment if you have multiple data nodes with
different setups, e.g. SSDs on sdm and sdn on one host and SSDs on sda and sdb on another?
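
For reference, with ceph-ansible's lvm scenario this kind of layout is usually
expressed per host, so hosts with SSDs on sdm/sdn and hosts with SSDs on sda/sdb
each get their own host_vars file. A rough, hedged sketch only; the LV/VG names,
the file path and the exact variable set are assumptions, so compare with the
osds.yml.sample shipped with your ceph-ansible release:

# Sketch of a per-host OSD layout for ceph-ansible (lvm scenario).
cat > host_vars/osd_host08.yml <<'EOF'
lvm_volumes:
  - data: /dev/sdl
    db: db-sdl           # logical volume carved out of the first SSD
    db_vg: ssd-vg-sdm
    wal: wal-sdl         # logical volume carved out of the second SSD
    wal_vg: ssd-vg-sdn
  # ...one entry per HDD-backed OSD on this host
EOF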

Thank you for your help.

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Missing field "host" in logs sent to Graylog

2019-09-30 Thread CUZA Frédéric
Hi everyone,

We are facing a problem where we cannot read logs sent to Graylog because one
mandatory field is missing:

GELF message  (received from ) has empty mandatory "host" field.

Does anyone know what we are missing?
I know there was someone facing the same issue, but it seems that he didn't get
an answer.

We are running :
Ceph : 12.2.12
Graylog : 3.0

Thanks !

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Buckets Backup

2019-09-26 Thread CUZA Frédéric
Hi everyone,
Has anyone ever made a backup of a Ceph bucket into Amazon Glacier?
If so, did you use a script that uses the API to "migrate" the objects?

If no one uses Amazon S3, how did you make those backups?
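
For reference, one common pattern is to sync the RGW bucket to a regular S3 bucket
on AWS and let an S3 lifecycle rule transition the objects to Glacier. A rough
sketch with the aws CLI, where the profile names, endpoint, bucket names and
staging directory are all made-up examples:

# Pull the bucket out of the local RGW (custom endpoint), then push it to an
# AWS bucket whose lifecycle policy transitions objects to Glacier.
aws --profile ceph --endpoint-url http://rgw.example.com:7480 \
    s3 sync s3://my-bucket /var/backup/my-bucket
aws --profile aws-backup \
    s3 sync /var/backup/my-bucket s3://my-bucket-backup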

Thanks in advance.

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of ceph-deploy

2019-06-18 Thread CUZA Frédéric
Things are not evolving. If I find an alternative way to add new OSD nodes in
the future I'll note it here.
I'm abandoning ceph-deploy since it seems to be buggy.

Regards,

De : ceph-users  De la part de CUZA Frédéric
Envoyé : 18 June 2019 10:40
À : Brian Topping 
Cc : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] Weird behaviour of ceph-deploy


1.   All nodes are running 12.2.12 (Luminous stable)

2.   Forward and reverse DNS/SSH are working fine.

3.   Things go bad at the install step: Ceph is correctly installed but the
node is not present in the cluster. It seems that this is the step that goes
wrong.


Regards,

De : Brian Topping mailto:brian.topp...@gmail.com>>
Envoyé : 17 June 2019 16:39
À : CUZA Frédéric mailto:frederic.c...@sib.fr>>
Cc : ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Objet : Re: [ceph-users] Weird behaviour of ceph-deploy

I don’t have an answer for you, but it’s going to help others to have shown:

  1.  Versions of all nodes involved and multi-master configuration
  2.  Confirm forward and reverse DNS and SSH / remote sudo since you are using 
deploy
  3.  Specific steps that did not behave properly
On Jun 17, 2019, at 6:29 AM, CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:

I'll keep updating this until I find a solution, so anyone who faces the same
problem might have a solution.

ATM: I install the new OSD node with ceph-deploy and nothing changes; the node
is still not present in the cluster nor in the crushmap.
I decided to manually add it to the crush map:
ceph osd crush add-bucket sd0051 host
and move it to where it should be:
ceph osd crush move sd0051 room=roomA
Then I added an OSD to that node:
ceph-deploy osd create sd0051 --data /dev/sde --block-db /dev/sda1 --block-wal /dev/sdb1 --bluestore
Once finally created, the OSD is still not linked to the host where it was
created, and I can't move it to this host right now.


Regards,


De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de CUZA Frédéric
Envoyé : 15 June 2019 00:34
À : ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Objet : Re: [ceph-users] Weird behaviour of ceph-deploy

Little update:
I checked one OSD I've installed, even though the host isn't present in the
crushmap (or in the cluster, I guess), and I found this:

monclient: wait_auth_rotating timed out after 30
osd.xxx 0 unable to obtain rotating service keys; retrying

I also added the host to the admin hosts:
ceph-deploy admin sd0051
and nothing changed.

When I do the install, no ceph.conf is pushed to the new node.

Regards,

De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de CUZA Frédéric
Envoyé : 14 June 2019 18:28
À : ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Objet : [ceph-users] Weird behaviour of ceph-deploy

Hi everyone,

I am facing some strange behaviour from ceph-deploy.
I am trying to add a new node to our cluster:
ceph-deploy install --no-adjust-repos sd0051

Everything seems to work fine, but the new bucket (host) is not created in the
crushmap, and when I try to add a new OSD to that host, the OSD is created but
is not linked to any host (normal behaviour, since the host is not present).
Has anyone already faced this?

FYI: We have already added new nodes this way, and this is the first time we
have faced it.

Thanks !

___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of ceph-deploy

2019-06-18 Thread CUZA Frédéric
1.   All nodes are running 12.2.12 (Luminous stable)

2.   Forward and reverse DNS/SSH are working fine.

3.   Things go bad at the install step: Ceph is correctly installed but the
node is not present in the cluster. It seems that this is the step that goes
wrong.


Regards,

De : Brian Topping 
Envoyé : 17 June 2019 16:39
À : CUZA Frédéric 
Cc : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] Weird behaviour of ceph-deploy

I don’t have an answer for you, but it’s going to help others to have shown:

  1.  Versions of all nodes involved and multi-master configuration
  2.  Confirm forward and reverse DNS and SSH / remote sudo since you are using 
deploy
  3.  Specific steps that did not behave properly
On Jun 17, 2019, at 6:29 AM, CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:

I'll keep updating this until I find a solution, so anyone who faces the same
problem might have a solution.

ATM: I install the new OSD node with ceph-deploy and nothing changes; the node
is still not present in the cluster nor in the crushmap.
I decided to manually add it to the crush map:
ceph osd crush add-bucket sd0051 host
and move it to where it should be:
ceph osd crush move sd0051 room=roomA
Then I added an OSD to that node:
ceph-deploy osd create sd0051 --data /dev/sde --block-db /dev/sda1 --block-wal /dev/sdb1 --bluestore
Once finally created, the OSD is still not linked to the host where it was
created, and I can't move it to this host right now.


Regards,


De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de CUZA Frédéric
Envoyé : 15 June 2019 00:34
À : ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Objet : Re: [ceph-users] Weird behaviour of ceph-deploy

Little update:
I checked one OSD I've installed, even though the host isn't present in the
crushmap (or in the cluster, I guess), and I found this:

monclient: wait_auth_rotating timed out after 30
osd.xxx 0 unable to obtain rotating service keys; retrying

I also added the host to the admin hosts:
ceph-deploy admin sd0051
and nothing changed.

When I do the install, no ceph.conf is pushed to the new node.

Regards,

De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de CUZA Frédéric
Envoyé : 14 June 2019 18:28
À : ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Objet : [ceph-users] Weird behaviour of ceph-deploy

Hi everyone,

I am facing some strange behaviour from ceph-deploy.
I am trying to add a new node to our cluster:
ceph-deploy install --no-adjust-repos sd0051

Everything seems to work fine, but the new bucket (host) is not created in the
crushmap, and when I try to add a new OSD to that host, the OSD is created but
is not linked to any host (normal behaviour, since the host is not present).
Has anyone already faced this?

FYI: We have already added new nodes this way, and this is the first time we
have faced it.

Thanks !

___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of ceph-deploy

2019-06-17 Thread CUZA Frédéric
I'll keep updating this until I find a solution, so anyone who faces the same
problem might have a solution.

ATM: I install the new OSD node with ceph-deploy and nothing changes; the node
is still not present in the cluster nor in the crushmap.
I decided to manually add it to the crush map:
ceph osd crush add-bucket sd0051 host
and move it to where it should be:
ceph osd crush move sd0051 room=roomA
Then I added an OSD to that node:
ceph-deploy osd create sd0051 --data /dev/sde --block-db /dev/sda1 --block-wal /dev/sdb1 --bluestore
Once finally created, the OSD is still not linked to the host where it was
created, and I can't move it to this host right now.


Regards,


De : ceph-users  De la part de CUZA Frédéric
Envoyé : 15 June 2019 00:34
À : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] Weird behaviour of ceph-deploy

Little update:
I checked one OSD I've installed, even though the host isn't present in the
crushmap (or in the cluster, I guess), and I found this:

monclient: wait_auth_rotating timed out after 30
osd.xxx 0 unable to obtain rotating service keys; retrying

I also added the host to the admin hosts:
ceph-deploy admin sd0051
and nothing changed.

When I do the install, no ceph.conf is pushed to the new node.

Regards,

De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de CUZA Frédéric
Envoyé : 14 June 2019 18:28
À : ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Objet : [ceph-users] Weird behaviour of ceph-deploy

Hi everyone,

I am facing some strange behaviour from ceph-deploy.
I am trying to add a new node to our cluster:
ceph-deploy install --no-adjust-repos sd0051

Everything seems to work fine, but the new bucket (host) is not created in the
crushmap, and when I try to add a new OSD to that host, the OSD is created but
is not linked to any host (normal behaviour, since the host is not present).
Has anyone already faced this?

FYI: We have already added new nodes this way, and this is the first time we
have faced it.

Thanks !

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of ceph-deploy

2019-06-14 Thread CUZA Frédéric
Little update:
I checked one OSD I've installed, even though the host isn't present in the
crushmap (or in the cluster, I guess), and I found this:

monclient: wait_auth_rotating timed out after 30
osd.xxx 0 unable to obtain rotating service keys; retrying

I also added the host to the admin hosts:
ceph-deploy admin sd0051
and nothing changed.

When I do the install, no ceph.conf is pushed to the new node.
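
For what it's worth, that "rotating service keys" error is often a symptom of clock
skew between the new node and the monitors, so NTP/chrony is worth checking. A
hedged workaround sketch for the missing configuration, not a confirmed fix:

# Push ceph.conf and the admin keyring to the new node explicitly,
# then retry the OSD creation.
ceph-deploy config push sd0051
ceph-deploy admin sd0051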

Regards,

De : ceph-users  De la part de CUZA Frédéric
Envoyé : 14 June 2019 18:28
À : ceph-users@lists.ceph.com
Objet : [ceph-users] Weird behaviour of ceph-deploy

Hi everyone,

I am facing some strange behaviour from ceph-deploy.
I am trying to add a new node to our cluster:
ceph-deploy install --no-adjust-repos sd0051

Everything seems to work fine, but the new bucket (host) is not created in the
crushmap, and when I try to add a new OSD to that host, the OSD is created but
is not linked to any host (normal behaviour, since the host is not present).
Has anyone already faced this?

FYI: We have already added new nodes this way, and this is the first time we
have faced it.

Thanks !

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Weird behaviour of ceph-deploy

2019-06-14 Thread CUZA Frédéric
Hi everyone,

I am facing some strange behaviour from ceph-deploy.
I am trying to add a new node to our cluster:
ceph-deploy install --no-adjust-repos sd0051

Everything seems to work fine, but the new bucket (host) is not created in the
crushmap, and when I try to add a new OSD to that host, the OSD is created but
is not linked to any host (normal behaviour, since the host is not present).
Has anyone already faced this?

FYI: We have already added new nodes this way, and this is the first time we
have faced it.

Thanks !

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple rbd images from different clusters

2019-06-05 Thread CUZA Frédéric
Hi,

Thank you all for your quick answers.
I think that will solve our problem.

This is what we came up with:
rbd -c /etc/ceph/Oceph.conf --keyring /etc/ceph/Oceph.client.admin.keyring export rbd/disk_test - | rbd -c /etc/ceph/Nceph.conf --keyring /etc/ceph/Nceph.client.admin.keyring import - rbd/disk_test

This RBD image is a test with only 5 GB of data inside of it.

Unfortunately the command seems to be stuck and nothing happens on either side
(ports 7800 / 6789 / 22).

We can't find any logs on any of the monitors.

Thanks !

-Message d'origine-
De : ceph-users  De la part de Jason Dillaman
Envoyé : 04 June 2019 14:11
À : Burkhard Linke 
Cc : ceph-users 
Objet : Re: [ceph-users] Multiple rbd images from different clusters

On Tue, Jun 4, 2019 at 8:07 AM Jason Dillaman  wrote:
>
> On Tue, Jun 4, 2019 at 4:45 AM Burkhard Linke 
>  wrote:
> >
> > Hi,
> >
> > On 6/4/19 10:12 AM, CUZA Frédéric wrote:
> >
> > Hi everyone,
> >
> >
> >
> > We want to migrate datas from one cluster (Hammer) to a new one (Mimic). We 
> > do not wish to upgrade the actual cluster as all the hardware is EOS and we 
> > upgrade the configuration of the servers.
> >
> > We can’t find a “proper” way to mount two rbd images from two different 
> > cluster on the same host.
> >
> > Does anyone know what is the “good” procedure to achieve this ?
>
> Copy your "/etc/ceph/ceph.conf" and associated keyrings for both
> clusters to a single machine (preferably running a Mimic "rbd" client)
> under "/etc/ceph/<cluster>.conf" and
> "/etc/ceph/<cluster>.client.<user>.keyring".
>
> You can then use "rbd -c <source-cluster-conf> export --export-format 2
> <image-spec> - | rbd -c <destination-cluster-conf> import --export-format=2 -
> <image-spec>". The "--export-format=2" option will also copy all
> associated snapshots with the images. If you don't want/need the
> snapshots, just drop that option.

That "-c" should be "--cluster" if specifying by name, otherwise with "-c" it's 
the full path to the two different conf files.
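
Spelled out with the --cluster form, the pipeline would look roughly like this; the
cluster names "old" and "new" are placeholders and it assumes /etc/ceph/old.conf,
/etc/ceph/new.conf and the matching <cluster>.client.admin.keyring files are in place:

# Sketch of the export/import pipe described above, snapshots included.
rbd --cluster old export --export-format 2 rbd/disk_test - \
  | rbd --cluster new import --export-format 2 - rbd/disk_test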

> >
> > Just my 2 ct:
> >
> > the 'rbd' commands allows specifying a configuration file (-c). You need to 
> > setup two configuration files, one for each cluster. You can also use two 
> > different cluster names (--cluster option). AFAIK the name is only used to 
> > locate the configuration file. I'm not sure how well the kernel works with 
> > mapping RBDs from two different cluster.
> >
> >
> > If you only want to transfer RBDs from one cluster to another, you do not 
> > need to map and mount them; the 'rbd' command has the sub commands 'export' 
> > and 'import'. You can pipe them to avoid writing data to a local disk. This 
> > should be the fastest way to transfer the RBDs.
> >
> >
> > Regards,
> >
> > Burkhard
> >
> > --
> > Dr. rer. nat. Burkhard Linke
> > Bioinformatics and Systems Biology
> > Justus-Liebig-University Giessen
> > 35392 Giessen, Germany
> > Phone: (+49) (0)641 9935810
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason



--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd.ReadOnlyImage: [errno 30]

2019-06-05 Thread CUZA Frédéric
Thank you all for your quick answers.
I think that will solve our problem.

This is what we came up with:
rbd -c /etc/ceph/Oceph.conf --keyring /etc/ceph/Oceph.client.admin.keyring export rbd/disk_test - | rbd -c /etc/ceph/Nceph.conf --keyring /etc/ceph/Nceph.client.admin.keyring import - rbd/disk_test

This RBD image is a test with only 5 GB of data inside of it.

Unfortunately the command seems to be stuck and nothing happens on either side
(ports 7800 / 6789 / 22).

We can't find any logs on any of the monitors.

Thanks !

-Message d'origine-
De : ceph-users  De la part de Jason Dillaman
Envoyé : 04 June 2019 14:14
À : 解决 
Cc : ceph-users 
Objet : Re: [ceph-users] rbd.ReadOnlyImage: [errno 30]

On Tue, Jun 4, 2019 at 4:55 AM 解决  wrote:
>
> Hi all,
> We use Ceph (Luminous) + OpenStack (Queens) in my test environment. The
> virtual machine does not start properly after the disaster test, and we
> cannot create a snapshot of the virtual machine's image. The procedure is
> as follows:
> #!/usr/bin/env python
>
> import rados
> import rbd
> with rados.Rados(conffile='/etc/ceph/ceph.conf',rados_id='nova') as cluster:
> with cluster.open_ioctx('vms') as ioctx:
> rbd_inst = rbd.RBD()
> print "start open rbd image"
> with rbd.Image(ioctx, '10df4634-4401-45ca-9c57-f349b78da475_disk') as 
> image:
> print "start create snapshot"
> image.create_snap('myimage_snap1')
>
> When I run it, it shows ReadOnlyImage, as follows:
>
> start open rbd image
> start create snapshot
> Traceback (most recent call last):
>   File "testpool.py", line 17, in 
> image.create_snap('myimage_snap1')
>   File "rbd.pyx", line 1790, in rbd.Image.create_snap 
> (/builddir/build/BUILD/ceph-12.2.5/build/src/pybind/rbd/pyrex/rbd.c:15
> 682)
> rbd.ReadOnlyImage: [errno 30] error creating snapshot myimage_snap1 
> from 10df4634-4401-45ca-9c57-f349b78da475_disk
>
> But when I run it with admin instead of nova, it is OK.
>
> "ceph auth list" shows the following:
>
> installed auth entries:
>
> osd.1
> key: AQBL7uRcfuyxEBAAoK8JrQWMU6EEf/g83zKJjg==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> osd.10
> key: AQCV7uRcdsB9IBAAHbHHCaylVUZIPKFX20polQ==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> osd.11
> key: AQCW7uRcRIMRIhAAbXfLbQwijEO5ZQFWFZaO5w==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> osd.2
> key: AQBL7uRcfFMWDBAAo7kjQobGBbIHYfZkx45pOw==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> osd.4
> key: AQBk7uRc97CPOBAAK9IBJICvchZPc5p80bISsg==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> osd.5
> key: AQBk7uRcOdqaORAAkQeEtYsE6rLWLPhYuCTdHA==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> osd.7
> key: AQB97uRc+1eRJxAA34DImQIMFjzHSXZ25djp0Q==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> osd.8
> key: AQB97uRcFilBJhAAXzSzNJsgwpobC8654Xo7Sw==
> caps: [mon] allow profile osd
> caps: [osd] allow *
> client.admin
> key: AQAU7uRcNia+BBAA09mOYdX+yJWbLCjcuMih0A==
> auid: 0
> caps: [mds] allow
> caps: [mgr] allow *
> caps: [mon] allow *
> caps: [osd] allow *
> client.cinder
> key: AQBp7+RcOzPHGxAA7azgyayVu2RRNWJ7JxSJEg==
> caps: [mon] allow r
> caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
> pool=volumes, allow rwx pool=volumes-cache, allow rwx pool=vms, allow 
> rwx pool=vms-cache, allow rx pool=images, allow rx pool=images-cache 
> client.cinder-backup
> key: AQBq7+RcVOwGNRAAiwJ59ZvAUc0H4QkVeN82vA==
> caps: [mon] allow r
> caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
> pool=backups, allow rwx pool=backups-cache client.glance
> key: AQDf7uRc32hDBBAAkGucQEVTWqnIpNvihXf/Ng==
> caps: [mon] allow r
> caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
> pool=images, allow rwx pool=images-cache client.nova
> key: AQDN7+RcqDABIxAAXnFcVjBp/S5GkgOy0wqB1Q==
> caps: [mon] allow r
> caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
> pool=volumes, allow rwx pool=volumes-cache, allow rwx pool=vms, allow 
> rwx pool=vms-cache, allow rwx pool=images, allow rwx pool=images-cache 
> client.radosgw.gateway
> key: AQAU7uRccP06CBAA6zLFtDQoTstl8CNclYRugQ==
> auid: 0
> caps: [mon] allow rwx
> caps: [osd] allow rwx
> mgr.172.30.126.26
> key: AQAr7uRclc52MhAA+GWCQEVnAHB01tMFpgJtTQ==
> caps: [mds] allow *
> caps: [mon] allow profile mgr
> caps: [osd] allow *
> mgr.172.30.126.27
> key: AQAs7uRclkD2OBAAW/cUhcZEebZnQulqVodiXQ==
> caps: [mds] allow *
> caps: [mon] allow profile mgr
> caps: [osd] allow *
> mgr.172.30.126.28
> key: AQAu7uRcT9OLBBAAZbEjb/N1NnZpIgfaAcThyQ==
> caps: [mds] allow *
> caps: [mon] allow profile mgr
> caps: [osd] allow *
>
>
> Can someone explain it to me?

Your clients don't have the correct caps. See [1] or [2].
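
For reference, the kind of change described in [1] is to switch the OpenStack users
over to the rbd profiles. Very roughly, and only as a hedged sketch (the pool list
is copied from the caps listing above and must be adapted to your setup):

ceph auth caps client.nova \
    mon 'profile rbd' \
    osd 'profile rbd pool=vms, profile rbd pool=volumes, profile rbd-read-only pool=images'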


> thanks!!
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[1] 
http://docs.ceph.com/docs/mimic/releases/luminous/#upgrade-from-jewel-or-kraken
[2] 

[ceph-users] Multiple rbd images from different clusters

2019-06-04 Thread CUZA Frédéric
Hi everyone,

We want to migrate data from one cluster (Hammer) to a new one (Mimic). We do
not wish to upgrade the current cluster, as all the hardware is EOS and we are
upgrading the configuration of the servers.
We can't find a "proper" way to mount two RBD images from two different
clusters on the same host.
Does anyone know what the "good" procedure is to achieve this?


Cheers and thanks,

Fred.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Whole cluster flapping

2018-08-28 Thread CUZA Frédéric
Hi,

Just to let you know: I finally resolved my problem a few weeks ago, but I
wanted to make sure it was solved permanently before reporting back.
I set the OSD timeouts to a larger number of seconds and set the noout and
nodown flags on the cluster.
Basically I just waited for the "cleanup" to end, but I noticed that some OSDs
were timing out more than the others, and since the data was no longer
important to me I started to take those OSDs out and purge them one by one.
I did this for about 1/8 of my OSDs and now there is no more flapping or OSDs
going down, but I still have some slow requests popping up from time to time.

It seems that the purge of the deleted pool never made it all the way to "the
end"; purging the remaining OSDs where its data seemed to live helped bring the
cluster back to a stable state.
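
For anyone finding this thread later, the mitigation above boils down to something
like the following; the grace value is the one Brent suggests further down in the
thread and is only an example:

# Stop the mons from marking busy OSDs out/down while the cleanup runs.
ceph osd set noout
ceph osd set nodown
# Give OSDs more time before being reported dead (runtime change only).
ceph tell osd.* injectargs '--osd_heartbeat_grace 200'
# ...and once the cluster has settled:
ceph osd unset nodown
ceph osd unset noout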

Thank you all for your help.

Regards,

De : Will Marley 
Envoyé : 08 August 2018 16:14
À : Webert de Souza Lima ; CUZA Frédéric 

Cc : ceph-users 
Objet : RE: [ceph-users] Whole cluster flapping

Hi again Frederic,

It may be worth looking at a recovery sleep.
osd recovery sleep
    Time in seconds to sleep before the next recovery or backfill op. Increasing
    this value will slow down recovery operations while client operations will be
    less impacted. (Type: Float, Default: 0)

osd recovery sleep hdd
    Time in seconds to sleep before the next recovery or backfill op for HDDs.
    (Type: Float, Default: 0.1)

osd recovery sleep ssd
    Time in seconds to sleep before the next recovery or backfill op for SSDs.
    (Type: Float, Default: 0)

osd recovery sleep hybrid
    Time in seconds to sleep before the next recovery or backfill op when OSD data
    is on HDD and the OSD journal is on SSD. (Type: Float, Default: 0.025)

(Pulled from http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/)

When we faced similar issues, using the command ceph tell osd.* injectargs
'--osd-recovery-sleep 2' allowed the OSDs to respond with a heartbeat whilst
taking a break between recovery operations. I'd suggest tweaking the sleep wait
time to find a sweet spot.

This may be worth a try, so let us know how you get on.

Regards,
Will

From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
On Behalf Of Webert de Souza Lima
Sent: 08 August 2018 15:06
To: frederic.c...@sib.fr<mailto:frederic.c...@sib.fr>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Whole cluster flapping

So your OSDs are really too busy to respond to heartbeats.
You'll be facing this for some time, until the cluster load gets lower.

I would set `ceph osd set nodeep-scrub` until the heavy disk IO stops.
Maybe you can schedule it so that deep scrubs are allowed during the night and
blocked again in the morning.
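
If you go the scheduling route, a crude sketch would be a pair of cron entries on
an admin node (the times are arbitrary): allow deep scrubs overnight and block them
again in the morning.

# root crontab sketch; assumes the admin keyring is available to cron.
0 20 * * * /usr/bin/ceph osd unset nodeep-scrub   # evening: allow deep scrubs
0 6 * * *  /usr/bin/ceph osd set nodeep-scrub     # morning: block them for the day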

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Wed, Aug 8, 2018 at 9:18 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
Thanks for the commands. I did take a look, but I don't really know what to
search for, my bad...
All this flapping is due to deep-scrub: when it starts on an OSD, things start
to go bad.

I set out all the OSDs that were flapping the most (one by one, after
rebalancing) and it looks better, even if some OSDs keep going down/up with the
same message in the logs:

1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had timed out 
after 90

(I updated it to 90 s instead of 15 s.)

Regards,



De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de Webert de Souza Lima
Envoyé : 07 August 2018 16:28
À : ceph-users mailto:ceph-users@lists.ceph.com>>
Objet : Re: [ceph-users] Whole cluster flapping

oops, my bad, you're right.

I don't know how much you can see, but maybe you can dig around the performance
counters and see what's happening on those OSDs; try these:

~# ceph daemonperf osd.XX
~# ceph daemon osd.XX perf dump

change XX to your OSD numbers.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
Pool is already deleted and no longer present in stats.

Regards,

De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de Webert de Souza Lima
Envoyé : 07 August 2018 15:08
À : ceph-users mailto:ceph-users@lists.ceph.com>>
Objet : Re: [ceph-users] Whole cluster flapping

Frédéric,

see if the number of objects is decreasing in the pool with `ceph df [detail]`

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
It's been over a week now and the whole cluster keeps flapping; it is never the
same OSDs that go down.
Is there a way to get the progress of this recovery? (The pool that I deleted is
no longer present, and has not been for a while now.)
In fact, there is a lot of I/O activity on the servers where OSDs go down.

Re: [ceph-users] Whole cluster flapping

2018-08-08 Thread CUZA Frédéric
Thanks for the commands. I did take a look, but I don't really know what to
search for, my bad...
All this flapping is due to deep-scrub: when it starts on an OSD, things start
to go bad.

I set out all the OSDs that were flapping the most (one by one, after
rebalancing) and it looks better, even if some OSDs keep going down/up with the
same message in the logs:

1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had timed out 
after 90

(I updated it to 90 s instead of 15 s.)

Regards,



De : ceph-users  De la part de Webert de 
Souza Lima
Envoyé : 07 August 2018 16:28
À : ceph-users 
Objet : Re: [ceph-users] Whole cluster flapping

oops, my bad, you're right.

I don't know how much you can see, but maybe you can dig around the performance
counters and see what's happening on those OSDs; try these:

~# ceph daemonperf osd.XX
~# ceph daemon osd.XX perf dump

change XX to your OSD numbers.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
Pool is already deleted and no longer present in stats.

Regards,

De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de Webert de Souza Lima
Envoyé : 07 August 2018 15:08
À : ceph-users mailto:ceph-users@lists.ceph.com>>
Objet : Re: [ceph-users] Whole cluster flapping

Frédéric,

see if the number of objects is decreasing in the pool with `ceph df [detail]`

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
It's been over a week now and the whole cluster keeps flapping; it is never the
same OSDs that go down.
Is there a way to get the progress of this recovery? (The pool that I deleted is
no longer present, and has not been for a while now.)
In fact, there is a lot of I/O activity on the servers where OSDs go down.

Regards,

De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de Webert de Souza Lima
Envoyé : 31 July 2018 16:25
À : ceph-users mailto:ceph-users@lists.ceph.com>>
Objet : Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of IO operations on the disks and 
the process might be too busy to respond to heartbeats, so the mons mark them as 
down due to no response.
Check also the OSD logs to see if they are actually crashing and restarting, 
and disk IO usage (i.e. iostat).

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool
that we had (120 TB).
Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD), and
we have SSDs for the journals.

After I deleted the large pool, my cluster started flapping on all OSDs.
OSDs are marked down and then marked up as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 
172.29.228.72:6800/95783<http://172.29.228.72:6800/95783> boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 
172.29.228.72:6803/95830<http://172.29.228.72:6803/95830> boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 
172.29.228.246:6812/3144542<http://172.29.228.246:6812/3144542> boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_

Re: [ceph-users] Whole cluster flapping

2018-08-07 Thread CUZA Frédéric
Pool is already deleted and no longer present in stats.

Regards,

De : ceph-users  De la part de Webert de 
Souza Lima
Envoyé : 07 August 2018 15:08
À : ceph-users 
Objet : Re: [ceph-users] Whole cluster flapping

Frédéric,

see if the number of objects is decreasing in the pool with `ceph df [detail]`

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
It's been over a week now and the whole cluster keeps flapping; it is never the
same OSDs that go down.
Is there a way to get the progress of this recovery? (The pool that I deleted is
no longer present, and has not been for a while now.)
In fact, there is a lot of I/O activity on the servers where OSDs go down.

Regards,

De : ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
De la part de Webert de Souza Lima
Envoyé : 31 July 2018 16:25
À : ceph-users mailto:ceph-users@lists.ceph.com>>
Objet : Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of IO operations on the disks and 
the process might be too busy to respond to heartbeats, so the mons mark them as 
down due to no response.
Check also the OSD logs to see if they are actually crashing and restarting, 
and disk IO usage (i.e. iostat).

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool
that we had (120 TB).
Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD), and
we have SSDs for the journals.

After I deleted the large pool, my cluster started flapping on all OSDs.
OSDs are marked down and then marked up as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 
172.29.228.72:6800/95783<http://172.29.228.72:6800/95783> boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 
172.29.228.72:6803/95830<http://172.29.228.72:6803/95830> boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 
172.29.228.246:6812/3144542<http://172.29.228.246:6812/3144542> boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs 
degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107807/5845923 object

Re: [ceph-users] Whole cluster flapping

2018-08-07 Thread CUZA Frédéric
It's been over a week now and the whole cluster keeps flapping; it is never the
same OSDs that go down.
Is there a way to get the progress of this recovery? (The pool that I deleted is
no longer present, and has not been for a while now.)
In fact, there is a lot of I/O activity on the servers where OSDs go down.

Regards,

De : ceph-users  De la part de Webert de 
Souza Lima
Envoyé : 31 July 2018 16:25
À : ceph-users 
Objet : Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of IO operations on the disks and 
the process might be too busy to respond to heartbeats, so the mons mark them as 
down due to no response.
Check also the OSD logs to see if they are actually crashing and restarting, 
and disk IO usage (i.e. iostat).

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
mailto:frederic.c...@sib.fr>> wrote:
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool
that we had (120 TB).
Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD), and
we have SSDs for the journals.

After I deleted the large pool, my cluster started flapping on all OSDs.
OSDs are marked down and then marked up as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 
172.29.228.72:6800/95783<http://172.29.228.72:6800/95783> boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 
172.29.228.72:6803/95830<http://172.29.228.72:6803/95830> boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 
172.29.228.246:6812/3144542<http://172.29.228.246:6812/3144542> boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs 
degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed 
(root=default,room=,host=) (8 reporters from different host after 
54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inac

Re: [ceph-users] Whole cluster flapping

2018-08-02 Thread CUZA Frédéric
Hi,
At the moment, following some advice and what I could read here and on the
Internet, I have set the nodown flag and increased the heartbeat grace.
Since it was 120 TB of data, I think it's going to take a little bit of time to
fully recover.
I'll keep you all updated on the status.

Thanks for all.

Regards,

De : Brent Kennedy 
Envoyé : 31 July 2018 23:36
À : CUZA Frédéric ; 'ceph-users' 

Objet : RE: [ceph-users] Whole cluster flapping

I have had this happen during large data movements.  It stopped happening after
I went to 10Gb though (from 1Gb).  What I had done is inject a setting (and
adjust the configs) to give more time before an OSD was marked down.

osd heartbeat grace = 200
mon osd down out interval = 900

For injecting runtime values/settings( under runtime changes ):
http://docs.ceph.com/docs/luminous/rados/configuration/ceph-conf/

Probably should check the logs before doing anything to ensure the OSDs or host 
is not failing.

-Brent

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of CUZA 
Frédéric
Sent: Tuesday, July 31, 2018 5:06 AM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Whole cluster flapping

Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool
that we had (120 TB).
Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD), and
we have SSDs for the journals.

After I deleted the large pool, my cluster started flapping on all OSDs.
OSDs are marked down and then marked up as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 
172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 
172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 
172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs 
degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed 
(root=default,room=,host=) (8 reporters from different host after 
54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health

[ceph-users] Whole cluster flapping

2018-07-31 Thread CUZA Frédéric
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool
that we had (120 TB).
Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD), and
we have SSDs for the journals.

After I deleted the large pool, my cluster started flapping on all OSDs.
OSDs are marked down and then marked up as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 
172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 
172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 
172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs 
degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed 
(root=default,room=,host=) (8 reporters from different host after 
54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 
5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs 
degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 
172.29.228.5:6812/14996 boot
2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 
5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 138553/5846235 objects 

Re: [ceph-users] Be careful with orphans find (was Re: Lost TB for Object storage)

2018-07-20 Thread CUZA Frédéric
Hi Matthew,
Thanks for the advice, but we are no longer using orphans find since it does not
seem to solve the problem.

Regards,

-Message d'origine-
De : Matthew Vernon  
Envoyé : 20 July 2018 11:03
À : CUZA Frédéric ; ceph-users@lists.ceph.com
Objet : Be careful with orphans find (was Re: [ceph-users] Lost TB for Object 
storage)

Hi,

On 19/07/18 17:19, CUZA Frédéric wrote:

> After that we tried to remove the orphans :
> 
> radosgw-admin orphans find -pool= default.rgw.buckets.data 
> --job-id=ophans_clean
> 
> radosgw-admin orphans finish --job-id=ophans_clean
> 
> It finds some orphans : 85, but the command finish seems not to work, 
> so we decided to manually delete those ophans by piping the output of 
> find in a log file.

I would advise caution with using the "orphans find" code in radosgw-admin. On 
the advice of our vendor, we ran this and automatically removed the resulting 
objects. Unfortunately, a small proportion of the objects found and removed 
thus were not in fact orphans - meaning we ended up with some damaged S3 
objects; they appeared in bucket listings, but you'd get 404 if you tried to 
download them.

We have asked our vendor to make the wider community aware of the issue, but 
they have not (yet) done so.

Regards,

Matthew


--
 The Wellcome Sanger Institute is operated by Genome Research  Limited, a 
charity registered in England with number 1021457 and a  company registered in 
England with number 2742969, whose registered  office is 215 Euston Road, 
London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lost TB for Object storage

2018-07-19 Thread CUZA Frédéric
Hi Guys,

We are running a Ceph Luminous 12.2.6 cluster.
The cluster is used both for RBD storage and Ceph Object Storage, and has about
742 TB of raw space.

We have an application that pushes snapshots of our VMs through RGW. All seems
to be fine, except that we have a discrepancy between what the S3 API shows and
what the command "ceph df detail" reports.
S3 API (python script) :
Total : 44325.84438523278GB
Ceph df detail :
NAME ID   USED
default.rgw.buckets.data   59  104T

So the difference is about 60 TB...


We tried to clean the gc but nothing is shown :
# radosgw-admin gc list --include-all
[]
#

After that we tried to remove the orphans:
radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=ophans_clean
radosgw-admin orphans finish --job-id=ophans_clean
It finds some orphans (85), but the finish command does not seem to work, so we
decided to manually delete those orphans by piping the output of find into a
log file.

Even after that we still have a huge discrepancy between what the S3 API shows
and what Ceph reports.

When we list objects with the S3 API, we find exactly the information that the
application is returning (which is normal, since the application uses this API).
When we list objects with the rados CLI, we find more objects than we can see
through the S3 API.
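
For completeness, that comparison can be scripted roughly as follows (it assumes
jq; note that the RADOS count is expected to be higher, since large S3 objects are
striped across several RADOS objects):

# Count RADOS objects in the data pool...
rados -p default.rgw.buckets.data ls | wc -l
# ...and compare with the number of S3 objects RGW accounts for.
radosgw-admin bucket stats | jq '[.[].usage["rgw.main"].num_objects // 0] | add'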

We are currently out of ideas and we can't figure out what's wrong.

Have any of you already faced this problem?

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com