Re: [ceph-users] Importance of Stable Mon and OSD IPs

2018-01-23 Thread Mayank Kumar
Thanks Burkhard for the detailed explanation. Regarding the following:

>>>The ceph client (librbd accessing a volume in this case) gets
asynchronous notification from the ceph mons in case of relevant changes,
e.g. updates to the osd map reflecting the failure of an OSD.

I have some more questions:
1: Does the asynchronous notification for both the osdmap and the monmap
come from the mons?
2: Are these asynchronous notifications retried on failure?
3: Is it possible for these asynchronous notifications to be lost?
4: Do the monmap and osdmap reside in kernel or user space? The reason I
ask is: for an rbd volume that is already mounted on a host, will it
continue to receive those asynchronous notifications for changes to both
OSD and mon IPs? If all mon IPs change, but the mon configuration file is
updated to reflect the new mon IPs, should an already-mounted rbd volume
still be able to contact the OSDs and mons, or is there some form of
caching in kernel space for an already mapped rbd volume?
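On question 4, a note on where the maps live for a kernel-mapped (krbd) volume: both the monmap and osdmap are held in kernel space by the libceph module for as long as the mapping exists, independent of any user-space configuration file. If debugfs is available, the kernel client's cached view can be inspected directly; a sketch, assuming a standard debugfs mount point and root privileges:

```shell
# Inspect the kernel ceph client's cached cluster maps via debugfs.
# Directories under /sys/kernel/debug/ceph (named <fsid>.client<id>)
# appear only while a krbd or CephFS kernel mount is active.
mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug 2>/dev/null
for d in /sys/kernel/debug/ceph/*/; do
    [ -d "$d" ] || { echo "no active kernel ceph clients"; break; }
    echo "client session: $d"
    cat "$d/monc"     # mon session and map subscriptions
    cat "$d/osdmap"   # osdmap epoch and per-OSD addresses the kernel holds
done
```

This is only an inspection recipe; the file names (`monc`, `osdmap`) reflect the debugfs layout of kernels of this era and may differ on other versions.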


Some more context for why I have all these doubts:
We internally had a Ceph cluster with rbd volumes being provisioned by
Kubernetes. With existing rbd volumes still mounted, we wiped out the
old Ceph cluster and created a brand new one, but the rbd volumes from
the old cluster remained. Any Kubernetes pod that landed on the same
host as an old rbd volume would fail to start because the volume could
not be attached and mounted. Looking at the kernel messages we saw the
following:

-- Logs begin at Fri 2018-01-19 02:05:38 GMT, end at Fri 2018-01-19
19:23:14 GMT. --

Jan 19 19:20:39 host1.com kernel: libceph: osd2 10.231.171.131:6808 socket closed (con state CONNECTING)

Jan 19 19:18:30 host1.com kernel: libceph: osd28 10.231.171.52:6808 socket closed (con state CONNECTING)

Jan 19 19:18:30 host1.com kernel: libceph: osd0 10.231.171.131:6800 socket closed (con state CONNECTING)

Jan 19 19:15:40 host1.com kernel: libceph: osd21 10.231.171.99:6808 wrong peer at address

Jan 19 19:15:40 host1.com kernel: libceph: wrong peer, want 10.231.171.99:6808/42661, got 10.231.171.99:6808/73168

Jan 19 19:15:34 host1.com kernel: libceph: osd11 10.231.171.114:6816 wrong peer at address

Jan 19 19:15:34 host1.com kernel: libceph: wrong peer, want 10.231.171.114:6816/130908, got 10.231.171.114:6816/85562

The new Ceph cluster came up with new OSD and mon IPs.
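As a side note on reading these messages: the peer address in the "wrong peer" lines is ip:port/nonce, and the nonce changes whenever a daemon restarts. A small self-contained sketch pulling one of the lines above apart:

```shell
# "wrong peer" means the client reached the expected ip:port but found a
# different daemon instance there: the nonce (the /NNN suffix) differs.
# That is exactly what a rebuilt cluster looks like to a stale client.
line='libceph: wrong peer, want 10.231.171.99:6808/42661, got 10.231.171.99:6808/73168'
want=$(echo "$line" | sed -n 's/.*want \([^,]*\),.*/\1/p')
got=$(echo "$line" | sed -n 's/.*got \(.*\)/\1/p')
echo "want: addr=${want%/*} nonce=${want#*/}"
echo "got:  addr=${got%/*} nonce=${got#*/}"
```

Same address, different nonce: the old kernel client is talking to a brand-new daemon it has never heard of, and rejects it.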

So my question: since these messages are coming from the kernel module,
why can't the kernel module figure out that the mon and OSD IPs have
changed? Is there some caching in the kernel? When rbd create/attach is
called on that host it is passed the new mon IPs, so doesn't that update
the already-mapped old rbd volumes?

I hope I have made my doubts clear; yes, I am a beginner in Ceph with
very limited knowledge.

Thanks for your help again
Mayank


On Tue, Jan 23, 2018 at 1:24 AM, Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
>
> On 01/23/2018 09:53 AM, Mayank Kumar wrote:
>
>> Hi Ceph Experts
>>
>> I am a new user of Ceph and currently using Kubernetes to deploy Ceph RBD
>> volumes. We are doing some initial work rolling it out to internal
>> customers, and in doing that we are using the IP of the host as the IP of
>> the OSDs and mons. This means that if a host goes down, we lose that IP.
>> While we are still experimenting with these behaviors, I wanted to see
>> what the community thinks about the following scenario:
>>
>> 1: an rbd volume is already attached and mounted on host A
>> 2: the OSD on which this rbd volume resides dies and never comes back up
>> 3: another OSD is put in its place. I don't know the intricacies
>> here, but I am assuming the data for this rbd volume either moves to
>> different OSDs or goes back to the newly installed OSD
>> 4: the new OSD has a completely new IP
>> 5: will the rbd volume attached to host A learn the new OSD IP on which
>> its data resides, so that everything just continues to work?
>>
>> What if all the mons have also changed IP?
>>
> A volume does not reside "on an OSD". The volume is striped, and each
> stripe is stored in a placement group; the placement group in turn is
> distributed across several OSDs depending on the crush rules and the
> number of replicas.
>
> If an OSD dies, ceph will backfill the now-missing replicas to another
> OSD, given that another OSD satisfying the crush rules is available. The
> same process is also triggered if an OSD is added.
>
> This process is transparent to the ceph client, as long as enough
> replicas are present. The ceph client (librbd accessing a volume in this
> case) gets asynchronous notification from the ceph 
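As a toy illustration of the mapping described above (image, to objects, to placement group, to OSDs): hash the object name and take it modulo the pool's pg_num. Here `cksum` merely stands in for Ceph's real rjenkins hash, and pg_num=128 is an assumed value; on a live cluster the authoritative mapping comes from `ceph osd map <pool> <object>`:

```shell
# Toy model of object -> placement group mapping: hash the object name,
# take it modulo pg_num. Real Ceph uses rjenkins plus a "stable mod";
# cksum here is only a stand-in to show the idea.
pg_num=128
for obj in rbd_data.1234.0000000000000000 \
           rbd_data.1234.0000000000000001 \
           rbd_data.1234.0000000000000002; do
    hash=$(printf '%s' "$obj" | cksum | cut -d ' ' -f 1)
    echo "$obj -> pg $((hash % pg_num))"
done
```

The point of the indirection: clients never track "which OSD holds my volume"; they track the osdmap, and recompute object-to-OSD placement from it on every change.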

Re: [ceph-users] OSD servers swapping despite having free memory capacity

2018-01-23 Thread Blair Bethwaite
+1 to Warren's advice on checking for memory fragmentation. Are you
seeing kmem allocation failures in dmesg on these hosts?

On 24 January 2018 at 10:44, Warren Wang  wrote:
> Check /proc/buddyinfo for memory fragmentation. We have some pretty severe 
> memory frag issues with Ceph to the point where we keep excessive 
> min_free_kbytes configured (8GB), and are starting to order more memory than 
> we actually need. If you have a lot of objects, you may find that you need to 
> increase vfs_cache_pressure as well, to something like the default of 100.
>
> In your buddyinfo, the columns represent the quantity of each page size 
> available. So if you only see numbers in the first 2 columns, you only have 
> 4K and 8K pages available, and will fail any allocations larger than that. 
> The problem is so severe for us that we have stopped using jumbo frames due 
> to dropped packets as a result of not being able to DMA map pages that will 
> fit 9K frames.
>
> In short, you might have enough memory, but not contiguous. It's even worse 
> on RGW nodes.
>
> Warren Wang
>
> On 1/23/18, 2:56 PM, "ceph-users on behalf of Samuel Taylor Liston" 
>  wrote:
>
> We have a 9 - node (16 - 8TB OSDs per node) running jewel on centos 7.4.  
> The OSDs are configured with encryption.  The cluster is accessed via two - 
> RGWs  and there are 3 - mon servers.  The data pool is using 6+3 erasure 
> coding.
>
> About 2 weeks ago I found two of the nine servers wedged and had to hard 
> power cycle them to get them back.  In this hard reboot 22 - OSDs came back 
> with either a corrupted encryption or data partitions.  These OSDs were 
> removed and recreated, and the resultant rebalance moved along just fine for 
> about a week.  At the end of that week two different nodes were unresponsive 
> complaining of page allocation failures.  This is when I realized the nodes 
> were heavy into swap.  These nodes were configured with 64GB of RAM as a cost 
> saving going against the 1GB per 1TB recommendation.  We have since then 
> doubled the RAM in each of the nodes giving each of them more than the 1GB 
> per 1TB ratio.
>
> The issue I am running into is that these nodes are still swapping; a 
> lot, and over time becoming unresponsive, or throwing page allocation 
> failures.  As an example, “free” will show 15GB of RAM usage (out of 128GB) 
> and 32GB of swap.  I have configured swappiness to 0 and and also turned up 
> the vm.min_free_kbytes to 4GB to try to keep the kernel happy, and yet I am 
> still filling up swap.  It only occurs when the OSDs have mounted partitions 
> and ceph-osd daemons active.
>
> Anyone have an idea where this swap usage might be coming from?
> Thanks for any insight,
>
> Sam Liston (sam.lis...@utah.edu)
> 
> Center for High Performance Computing
> 155 S. 1452 E. Rm 405
> Salt Lake City, Utah 84112 (801)232-6932
> 
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD servers swapping despite having free memory capacity

2018-01-23 Thread Warren Wang
Check /proc/buddyinfo for memory fragmentation. We have some pretty severe 
memory fragmentation issues with Ceph, to the point where we keep an 
excessive min_free_kbytes configured (8GB), and are starting to order more 
memory than we actually need. If you have a lot of objects, you may find 
that you need to increase vfs_cache_pressure as well, to something like the 
default of 100.

In your buddyinfo, the columns represent the quantity of free blocks of each 
block size. So if you only see numbers in the first 2 columns, you only have 
4K and 8K blocks available, and any larger allocation will fail. The problem 
is so severe for us that we have stopped using jumbo frames, due to dropped 
packets resulting from not being able to DMA-map pages that will fit 9K 
frames.

In short: you might have enough memory, but not contiguous memory. It's even 
worse on RGW nodes.
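Warren's column description can be turned into a quick summary; a sketch assuming a 4 KiB base page (as on x86_64), reporting the largest contiguous block each zone can still hand out:

```shell
# For each zone in /proc/buddyinfo, report the largest order that still
# has free blocks. The columns after the zone name count free blocks of
# size 4K * 2^order; an empty tail means larger allocations will fail.
awk '/zone/ {
    sub(/,$/, "", $2)                   # strip trailing comma from node id
    largest = -1
    for (i = 5; i <= NF; i++)           # free-block counts start at field 5
        if ($i > 0) largest = i - 5
    printf "node %s zone %-8s largest free block: %d KiB\n",
           $2, $4, (largest < 0 ? 0 : 4 * 2 ^ largest)
}' /proc/buddyinfo
```

If the "largest free block" never exceeds 8 KiB, you are in exactly the situation Warren describes: plenty of free memory, none of it contiguous.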

Warren Wang

On 1/23/18, 2:56 PM, "ceph-users on behalf of Samuel Taylor Liston" 
 wrote:

We have a 9 - node (16 - 8TB OSDs per node) running jewel on centos 7.4.  
The OSDs are configured with encryption.  The cluster is accessed via two - 
RGWs  and there are 3 - mon servers.  The data pool is using 6+3 erasure coding.

About 2 weeks ago I found two of the nine servers wedged and had to hard 
power cycle them to get them back.  In this hard reboot 22 - OSDs came back 
with either a corrupted encryption or data partitions.  These OSDs were removed 
and recreated, and the resultant rebalance moved along just fine for about a 
week.  At the end of that week two different nodes were unresponsive 
complaining of page allocation failures.  This is when I realized the nodes 
were heavy into swap.  These nodes were configured with 64GB of RAM as a cost 
saving going against the 1GB per 1TB recommendation.  We have since then 
doubled the RAM in each of the nodes giving each of them more than the 1GB per 
1TB ratio.  

The issue I am running into is that these nodes are still swapping; a lot, 
and over time becoming unresponsive, or throwing page allocation failures.  As 
an example, “free” will show 15GB of RAM usage (out of 128GB) and 32GB of swap. 
 I have configured swappiness to 0 and and also turned up the 
vm.min_free_kbytes to 4GB to try to keep the kernel happy, and yet I am still 
filling up swap.  It only occurs when the OSDs have mounted partitions and 
ceph-osd daemons active. 

Anyone have an idea where this swap usage might be coming from? 
Thanks for any insight,

Sam Liston (sam.lis...@utah.edu)

Center for High Performance Computing
155 S. 1452 E. Rm 405
Salt Lake City, Utah 84112 (801)232-6932




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] OSD servers swapping despite having free memory capacity

2018-01-23 Thread Marc Roos
 

Maybe first check what is using the swap:

./swap-use.sh | sort -k 5,5 -n


#!/bin/bash
# Sum swap usage per process from /proc/<pid>/smaps (values are in kB).
# Match "^Swap:" only; a bare "Swap" pattern would also count SwapPss
# lines on newer kernels and double-count.

OVERALL=0

for DIR in /proc/[0-9]*; do
  PID=${DIR#/proc/}
  PROGNAME=$(ps -p "$PID" -o comm --no-headers 2>/dev/null)
  [ -z "$PROGNAME" ] && continue      # process exited in the meantime

  SUM=0
  for SWAP in $(awk '/^Swap:/ { print $2 }' "$DIR/smaps" 2>/dev/null); do
    SUM=$((SUM + SWAP))
  done
  echo "PID=$PID - Swap used: $SUM kB - ($PROGNAME)"
  OVERALL=$((OVERALL + SUM))
done
echo "Overall swap used: $OVERALL kB"





-Original Message-
From: Lincoln Bryant [mailto:linco...@uchicago.edu] 
Sent: dinsdag 23 januari 2018 21:13
To: Samuel Taylor Liston; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD servers swapping despite having free 
memory capacity

Hi Sam,

What happens if you just disable swap altogether? i.e., with `swapoff 
-a`

--Lincoln

On Tue, 2018-01-23 at 19:54 +, Samuel Taylor Liston wrote:
> We have a 9 - node (16 - 8TB OSDs per node) running jewel on centos 
> 7.4.  The OSDs are configured with encryption.  The cluster is 
> accessed via two - RGWs  and there are 3 - mon servers.  The data pool 

> is using 6+3 erasure coding.
> 
> About 2 weeks ago I found two of the nine servers wedged and had to 
> hard power cycle them to get them back.  In this hard reboot 22 - OSDs 

> came back with either a corrupted encryption or data partitions.  
> These OSDs were removed and recreated, and the resultant rebalance 
> moved along just fine for about a week.  At the end of that week two 
> different nodes were unresponsive complaining of page allocation 
> failures.  This is when I realized the nodes were heavy into swap.  
> These nodes were configured with 64GB of RAM as a cost saving going 
> against the 1GB per 1TB recommendation.  We have since then doubled 
> the RAM in each of the nodes giving each of them more than the 1GB per 

> 1TB ratio.
> 
> The issue I am running into is that these nodes are still swapping; a 
> lot, and over time becoming unresponsive, or throwing page allocation 
> failures.  As an example, “free” will show 15GB of RAM usage (out of
> 128GB) and 32GB of swap.  I have configured swappiness to 0 and and 
> also turned up the vm.min_free_kbytes to 4GB to try to keep the kernel 

> happy, and yet I am still filling up swap.  It only occurs when the 
> OSDs have mounted partitions and ceph-osd daemons active.
> 
> Anyone have an idea where this swap usage might be coming from? Thanks 

> for any insight,
> 
> Sam Liston (sam.lis...@utah.edu)
> 
> Center for High Performance Computing
> 155 S. 1452 E. Rm 405
> Salt Lake City, Utah 84112 (801)232-6932 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] OSD servers swapping despite having free memory capacity

2018-01-23 Thread Lincoln Bryant
Hi Sam,

What happens if you just disable swap altogether? i.e., with `swapoff
-a`

--Lincoln

On Tue, 2018-01-23 at 19:54 +, Samuel Taylor Liston wrote:
> We have a 9 - node (16 - 8TB OSDs per node) running jewel on centos
> 7.4.  The OSDs are configured with encryption.  The cluster is
> accessed via two - RGWs  and there are 3 - mon servers.  The data
> pool is using 6+3 erasure coding.
> 
> About 2 weeks ago I found two of the nine servers wedged and had to
> hard power cycle them to get them back.  In this hard reboot 22 -
> OSDs came back with either a corrupted encryption or data
> partitions.  These OSDs were removed and recreated, and the resultant
> rebalance moved along just fine for about a week.  At the end of that
> week two different nodes were unresponsive complaining of page
> allocation failures.  This is when I realized the nodes were heavy
> into swap.  These nodes were configured with 64GB of RAM as a cost
> saving going against the 1GB per 1TB recommendation.  We have since
> then doubled the RAM in each of the nodes giving each of them more
> than the 1GB per 1TB ratio.  
> 
> The issue I am running into is that these nodes are still swapping; a
> lot, and over time becoming unresponsive, or throwing page allocation
> failures.  As an example, “free” will show 15GB of RAM usage (out of
> 128GB) and 32GB of swap.  I have configured swappiness to 0 and and
> also turned up the vm.min_free_kbytes to 4GB to try to keep the
> kernel happy, and yet I am still filling up swap.  It only occurs
> when the OSDs have mounted partitions and ceph-osd daemons active. 
> 
> Anyone have an idea where this swap usage might be coming from? 
> Thanks for any insight,
> 
> Sam Liston (sam.lis...@utah.edu)
> 
> Center for High Performance Computing
> 155 S. 1452 E. Rm 405
> Salt Lake City, Utah 84112 (801)232-6932
> 
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD servers swapping despite having free memory capacity

2018-01-23 Thread Samuel Taylor Liston
We have a 9-node cluster (16 x 8TB OSDs per node) running Jewel on CentOS 
7.4. The OSDs are configured with encryption. The cluster is accessed via 
two RGWs, and there are 3 mon servers. The data pool is using 6+3 erasure 
coding.

About 2 weeks ago I found two of the nine servers wedged and had to hard 
power cycle them to get them back. In this hard reboot, 22 OSDs came back 
with either corrupted encryption or corrupted data partitions. These OSDs 
were removed and recreated, and the resulting rebalance moved along just 
fine for about a week. At the end of that week, two different nodes became 
unresponsive, complaining of page allocation failures. This is when I 
realized the nodes were heavy into swap. These nodes were configured with 
64GB of RAM as a cost saving, going against the 1GB per 1TB recommendation. 
We have since doubled the RAM in each of the nodes, giving each of them 
more than the 1GB per 1TB ratio.

The issue I am running into is that these nodes are still swapping, a lot, 
and over time becoming unresponsive or throwing page allocation failures. 
As an example, "free" will show 15GB of RAM usage (out of 128GB) and 32GB 
of swap. I have set swappiness to 0 and also turned up vm.min_free_kbytes 
to 4GB to try to keep the kernel happy, and yet I am still filling up swap. 
It only occurs when the OSDs have mounted partitions and ceph-osd daemons 
active.
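For reference, the knobs discussed in this thread (swappiness and min_free_kbytes here, vfs_cache_pressure in Warren's reply) can be persisted in one sysctl fragment. The values below are the ones from the thread, shown only as an example, not as general recommendations:

```ini
# /etc/sysctl.d/90-ceph-osd.conf
# Example values taken from this thread; tune before applying.

# Strongly prefer reclaiming cache over swapping:
vm.swappiness = 0

# Keep ~4 GB free so atomic/higher-order allocations have headroom:
vm.min_free_kbytes = 4194304

# Default dentry/inode reclaim pressure (do not lower it on object-heavy nodes):
vm.vfs_cache_pressure = 100
```

Apply with `sysctl --system` (or `sysctl -p <file>`) and verify with `sysctl vm.min_free_kbytes`.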

Anyone have an idea where this swap usage might be coming from? 
Thanks for any insight,

Sam Liston (sam.lis...@utah.edu)

Center for High Performance Computing
155 S. 1452 E. Rm 405
Salt Lake City, Utah 84112 (801)232-6932




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread Massimiliano Cuttini



On 23/01/2018 16:49, c...@jack.fr.eu.org wrote:

On 01/23/2018 04:33 PM, Massimiliano Cuttini wrote:
With Ceph you have to install a 3rd-party orchestrator in order to 
have a clear picture of what is going on.

Which can be ok, but is not always feasible.

Just as with everything.
As Wikipedia says, for instance: "Proxmox VE supports local storage 
with LVM group, directory and ZFS, as well as network storage types 
with iSCSI, Fibre Channel, NFS, GlusterFS, CEPH and DRBD.[14]"


Maybe Fibre Channel shall provide a web interface. Maybe iSCSI shall 
too. Maybe DRBD & GlusterFS will provide another one.


Well, you are mixing different technologies:

1) iSCSI and Fibre Channel are *network communication protocols*.
They just allow a hypervisor to communicate with a SAN/NAS; by 
themselves they don't provide any kind of storage.


2) ZFS, GlusterFS and NFS are "network-ready" filesystems, not a 
software-defined SAN/NAS.


3) Ceph, ScaleIO, FreeNAS, HP virtualstore... these are all *software 
defined* storage.
This means that they set up disks, filesystems and network connections 
in order to be ready to use from the client.

They can be thought of as a kind of "storage orchestrator" by themselves.

So only group 3 contains comparable technologies.
In this competition I think Ceph is the only one that can win in the 
long run.
It's open, it works, it's easy, it's free, and it's improving faster 
than the others.
However, right now, it is the only one that lacks a decent management 
dashboard.
This is incomprehensible to me; Ceph is by far a killer app in this 
market.

So why not just remove its last barriers and get mass adoption?




Or maybe this is not their job.

As you said, "Xen is just a hypervisor", thus you are using a 
bare-metal low-level tool, just like sane folks would use qemu. And 
yes, low-level tools are .. low level.


XenServer is a hypervisor, but it has a truly great management 
dashboard, XenCenter.

I guess VMware has its own, and I guess it is also good.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread ceph

On 01/23/2018 04:33 PM, Massimiliano Cuttini wrote:
With Ceph you have to install a 3rd-party orchestrator in order to have 
a clear picture of what is going on.

Which can be ok, but is not always feasible.


Just as with everything.
As Wikipedia says, for instance: "Proxmox VE supports local storage with 
LVM group, directory and ZFS, as well as network storage types with 
iSCSI, Fibre Channel, NFS, GlusterFS, CEPH and DRBD.[14]"


Maybe Fibre Channel shall provide a web interface. Maybe iSCSI shall 
too. Maybe DRBD & GlusterFS will provide another one.


Or maybe this is not their job.

As you said, "Xen is just a hypervisor", thus you are using a bare-metal 
low-level tool, just like sane folks would use qemu. And yes, low-level 
tools are .. low level.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread Massimiliano Cuttini

On 23/01/2018 14:32, c...@jack.fr.eu.org wrote:


I think I was not clear.

There are VM management systems; look at 
https://fr.wikipedia.org/wiki/Proxmox_VE, 
https://en.wikipedia.org/wiki/Ganeti, and probably 
https://en.wikipedia.org/wiki/OpenStack too


These systems interact with Ceph.
When you create a VM, an rbd volume is created.
When you delete a VM, the associated volumes are deleted.
When you resize a disk, the volume is resized.

There is no need for manual interaction at the Ceph level in any way.

If I really understood the end of your email, you're stuck with a 
deficient VM management system, based on XenServer.

Your issues are not Ceph's issues, but Xen's.


Half and half.

Xen is just a hypervisor, while OpenStack is an orchestrator.
An orchestrator manages your nodes (both hypervisors and storage, if 
you want) by API.


The fact is that Ceph doesn't have its own web interface, while many 
other storage services have their own (FreeNAS, or proprietary services 
like LeftHand/VirtualStorage).
With Ceph you have to install a 3rd-party orchestrator in order to have 
a clear picture of what is going on.

Which can be ok, but is not always feasible.

Coming back to my case: Xen is just a hypervisor, not an orchestrator.
So this means that many tasks must be accomplished manually.
A simple web interface that wraps a few basic shell commands could save 
hours (and could probably be built within a few months starting from 
the actual deployment).
I really think Ceph is the future, but it has to become a service ready 
to use in every kind of scenario (with or without an orchestrator).

Right now it seems not ready to me.

I'm taking a look at openATTIC right now.
Probably this can be the missing piece.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread Volker Theile
Hello Massimiliano,

>
>> You're more than welcome - we have a lot of work ahead of us...
>> Feel free to join our Freenode IRC channel #openattic to get in touch!
>
> A curiosity!
> as far as I understood this software was created to manage only Ceph.
> Is it right?
> so... why such a "far away" name for a software dedicated to Ceph?

openATTIC comes from local storage management and was switched over to
Ceph in the recent past.

> I read some months ago about openattic but I was thinking it was
> something completly different before you wrote me.
>  :)
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Volker

-- 
Volker Theile
Software Engineer | openATTIC
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
Phone: +49 173 5876879
E-Mail: vthe...@suse.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread ceph

I think I was not clear.

There are VM management systems; look at 
https://fr.wikipedia.org/wiki/Proxmox_VE, 
https://en.wikipedia.org/wiki/Ganeti, and probably 
https://en.wikipedia.org/wiki/OpenStack too


These systems interact with Ceph.
When you create a VM, an rbd volume is created.
When you delete a VM, the associated volumes are deleted.
When you resize a disk, the volume is resized.

There is no need for manual interaction at the Ceph level in any way.

If I really understood the end of your email, you're stuck with a 
deficient VM management system, based on XenServer.

Your issues are not Ceph's issues, but Xen's.


On 01/23/2018 01:58 PM, Massimiliano Cuttini wrote:


On 23/01/2018 13:20, c...@jack.fr.eu.org wrote:
- USER taks: create new images, increase images size, sink images 
size, check daily status and change broken disks whenever is needed.

Who does that ?
For instance, Ceph can be used for VMs. Your VMs system create images, 
resizes images, whatever, not the Ceph's admin.


I would like to have a single big remote storage, but as a best practice 
you should not.

Hypervisor can create images, resize and so on... you right.
However sometimes hypervisor mess up your LVM partitions and this means 
corruption of all VDI in the same disk.


So... the best practice is to setup a remote storage for each VM (you 
can group few if really don't want to have 200connections).
This reduce the risk with VDI corruption (it'll accidentally corrupt one 
not all at once, you can easily restore a snapshoot).
Xenserver as hypervisor doesn't support ceph client and need to go by 
ISCSI.

You need to map RBD on ISCSI, so you need to create a RBD for each LUN.
So at the end... you need to:
-create rbd,
-map iscsi,
-map hypervisor to iscsi,
-drink a coffee,
-create hypervisor virtualization layer (cause every HV want to use it's 
own snapshoot),

-copy the template of the VM request by customer,
-drink a second coffee
and finally run the VM

This is just a nightmare... of course just one of the many that a 
sysadmin have.
if you have 1000 VMs you need a GUI in order to scroll and see the 
panorama.

I don't think that you read your email by command line.
You should neither take a look to your VMs by a command line.

Probably one day I'll quit with XenServer, and all it's constrains 
however right now, i can't and still seems to be the more stable and 
safer way to virtualize.










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Ceph Future

2018-01-23 Thread Massimiliano Cuttini



You're more than welcome - we have a lot of work ahead of us...
Feel free to join our Freenode IRC channel #openattic to get in touch!


A curiosity!
As far as I understood, this software was created to manage only Ceph.
Is that right?

So... why such a "far away" name for a software dedicated to Ceph?
I read about openATTIC some months ago, but I thought it was something 
completely different before you wrote me.

 :)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread Massimiliano Cuttini


On 23/01/2018 13:20, c...@jack.fr.eu.org wrote:
- USER tasks: create new images, increase image size, shrink image 
size, check daily status and change broken disks whenever needed.

Who does that?
For instance, Ceph can be used for VMs. Your VM system creates images, 
resizes images, whatever; not the Ceph admin.


I would like to have a single big remote storage, but as a best practice 
you should not.

A hypervisor can create images, resize them and so on... you're right.
However, sometimes the hypervisor messes up your LVM partitions, and 
this means corruption of all the VDIs on the same disk.


So... the best practice is to set up a remote storage for each VM (you 
can group a few if you really don't want to have 200 connections).
This reduces the risk of VDI corruption (it will accidentally corrupt 
one, not all at once, and you can easily restore a snapshot).

XenServer as a hypervisor doesn't support the Ceph client and needs to 
go over iSCSI.
You need to map RBD onto iSCSI, so you need to create an RBD image for 
each LUN.
So in the end... you need to:
- create the rbd image,
- map it to iSCSI,
- map the hypervisor to iSCSI,
- drink a coffee,
- create the hypervisor virtualization layer (because every HV wants to 
use its own snapshots),
- copy the template of the VM requested by the customer,
- drink a second coffee,
and finally run the VM.

This is just a nightmare... of course just one of the many that a 
sysadmin has.

If you have 1000 VMs you need a GUI in order to scroll and see the 
whole panorama.
I don't think you read your email from a command line.
Neither should you have to inspect your VMs from a command line.

Probably one day I'll quit XenServer and all its constraints; however, 
right now I can't, and it still seems to be the most stable and safest 
way to virtualize.
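For concreteness, the per-image RBD-to-iSCSI chain described above might be sketched with krbd plus a LIO/targetcli gateway. All names here (pool `rbd`, image `vm1`, the IQNs) are hypothetical placeholders, and exact targetcli paths vary between versions; treat this as an outline of the manual steps being complained about, not a tested recipe:

```shell
# Hypothetical outline: export one RBD image as one iSCSI LUN.
# Assumes a gateway host with both the rbd CLI and targetcli (LIO).
rbd create rbd/vm1 --size 20480                    # 20 GiB image
rbd map rbd/vm1                                    # e.g. -> /dev/rbd0

# Publish the mapped block device via LIO:
targetcli /backstores/block create name=vm1 dev=/dev/rbd0
targetcli /iscsi create iqn.2018-01.com.example:vm1
targetcli /iscsi/iqn.2018-01.com.example:vm1/tpg1/luns \
    create /backstores/block/vm1
targetcli /iscsi/iqn.2018-01.com.example:vm1/tpg1/acls \
    create iqn.2018-01.com.example:xenserver1      # initiator's IQN
targetcli saveconfig
```

Multiply this by every LUN and it is easy to see why the thread asks for a management layer to wrap these shell commands.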










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread Lenz Grimmer
Ciao Massimiliano,

On 01/23/2018 01:29 PM, Massimiliano Cuttini wrote:

>>   https://www.openattic.org/features.html
>
> Oh god THIS is the answer!

:)

> Lenz, if you need help I can join also development.

You're more than welcome - we have a lot of work ahead of us...
Feel free to join our Freenode IRC channel #openattic to get in touch!

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread Massimiliano Cuttini



   https://www.openattic.org/features.html

Oh god, THIS is the answer!
Lenz, if you need help I can also join development.


Lenz





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Future

2018-01-23 Thread Massimiliano Cuttini

Hey Lenz,

openATTIC seems to implement several good features and to be more or 
less what I was asking for.

I'll go through the whole website. :)


THANKS!


On 16/01/2018 09:04, Lenz Grimmer wrote:

Hi Massimiliano,

On 01/11/2018 12:15 PM, Massimiliano Cuttini wrote:


_*3) Management complexity*_
Ceph is amazing, but it is just too big to have everything under control
(too many services).
Now there is a management console, but as far as I read, this management
console just shows basic data about performance.
So it doesn't really manage anything... it's just a monitor...

In the end you have to manage everything from your command line.

[...]


The management complexity could be completely overcome with a great web
manager.
A web manager, in the end, is just a wrapper for shell commands from the
Ceph admin node to the others.
If you think about it, such a wrapper is far easier to develop than what
has already been developed.
I really do see that Ceph is the future of storage. But there is some
easily avoidable complexity that needs to be reduced.

If there are already plans for these issues, I would really like to know.

FWIW, there is openATTIC, which provides additional functionality beyond
what the current dashboard provides. It's a web application that
utilizes various existing APIs (e.g. librados, RGW Admin Ops API):

   https://www.openattic.org/features.html

Lenz





Re: [ceph-users] Ceph Future

2018-01-23 Thread ceph



On 01/23/2018 11:04 AM, Massimiliano Cuttini wrote:

On 22/01/2018 21:55, Jack wrote:

On 01/22/2018 08:38 PM, Massimiliano Cuttini wrote:

The web interface is needed because: *cmd-lines are prone to typos.*

And you never misclick, indeed;
Do you really mean: 1) misclick once on an option list, 2) misclick 
once on the form, 3) mistype the input and 4) misclick again on the 
confirmation dialog box?

Nope, just select an entry, and then click "delete", not "edit".

Well, if you misclick that much, you'd better not tell people you are a 
system engineer ;)

Please welcome CLI interfaces.

- USER tasks: create new images, increase image size, shrink image size, 
check daily status and replace broken disks whenever needed.

Who does that?
For instance, Ceph can be used for VMs. Your VM system creates images, 
resizes images, and so on; not the Ceph admin.



Re: [ceph-users] How to set mon-clock-drift-allowed tunable

2018-01-23 Thread Hüseyin Atatür YILDIRIM

Hello everyone,

I fixed my NTP server as you said, and did not change the default value of the
mon-clock-drift-allowed tunable.

Thank you,
Atatur


Hüseyin Atatür YILDIRIM
SYSTEMS ENGINEER
Üniversiteler Mah. İhsan Doğramacı Bul. ODTÜ Teknokent Havelsan A.Ş. 23/B 
Çankaya Ankara TÜRKİYE
+90 312 292 74 00 / +90 312 219 57 97

LEGAL NOTICE: This e-mail is subject to the Terms and Conditions document which 
can be accessed with this link. 

Please consider the environment before printing this email.




Re: [ceph-users] udev rule or script to auto add bcache devices?

2018-01-23 Thread Jens-U. Mozdzen

Hi Stefan,

Zitat von Stefan Priebe - Profihost AG:

Hello,

bcache didn't support partitions in the past, so a lot of our OSDs
have their data directly on:
/dev/bcache[0-9]

But that means I can't give them the needed partition type GUID
4fbd7e29-9d25-41b8-afd0-062c0ceff05d, which means that activation
with udev and ceph-disk does not work.

Has anybody already fixed this or hacked something together?


we had this running for filestore OSDs for quite some time (on 
Luminous and before), but have recently moved on to BlueStore, 
omitting bcache and instead putting block.db on partitions of the SSD 
devices (or rather on partitions of an MD-RAID1 made out of two Toshiba 
PX02SMF020).


We simply mounted the OSD file systems by label at boot time via fstab 
entries, and had the OSDs started via systemd. In case this matters: 
for historic reasons, the actual mount point wasn't in 
/var/lib/ceph/osd, but a different directory, with corresponding symlinks 
set up under /var/lib/ceph/osd/.
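
A minimal sketch of that arrangement (the label, mount point and OSD id here are hypothetical, not the actual values used):

```
# /etc/fstab entry mounting the OSD file system by label
LABEL=osd-12  /srv/ceph/osd-12  xfs  defaults,noatime  0 0
```

with a symlink such as `ln -s /srv/ceph/osd-12 /var/lib/ceph/osd/ceph-12` so the ceph tooling finds the data directory in the expected place.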


How many OSDs do you run per bcache SSD caching device? Even at just 
4:1, we ran into I/O bottlenecks (using the above MD-RAID1 as the caching 
device), hence the move to BlueStore. The same hardware now provides 
a much more responsive storage subsystem, which of course may be very 
specific to our workload and setup.


Regards
Jens



[ceph-users] Ruleset for optimized Ceph hybrid storage

2018-01-23 Thread Niklas

My question is:
Is it possible to create a 3-copy ruleset where the first copy is stored 
on class nvme and all other copies are stored on class hdd, while at the same 
time making sure that the hdd copies are not placed in 
the same datacenter as the first copy on class nvme?



Below is a simplified Ceph setup of a hybrid solution where one copy is 
stored on an NVMe drive and two copies on HDD drives. The advantage is great 
read performance and cost savings; the disadvantage is lower write performance. 
Still, write performance is acceptable thanks to RocksDB on Intel Optane 
disks in the HDD servers.


I have six servers in this example.
Only NVMe drives on storage101, storage102 and storage103.
Only HDD drives on storage201, storage202 and storage203.
All servers are connected to a 40 Gbit public network and a 40 Gbit cluster 
network. The backbone is 100 Gbit.


root default
├── datacenter Alfa/
│   ├── host Storage101
│   │   ├── OSD 1TB NVMe
│   │   ├── OSD 1TB NVMe
│   │   └── OSD 1TB NVMe
│   └── host Storage201
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       └── OSD 10TB HDD
│
├── datacenter Bravo/
│   ├── host Storage102
│   │   ├── OSD 1TB NVMe
│   │   ├── OSD 1TB NVMe
│   │   └── OSD 1TB NVMe
│   └── host Storage202
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       ├── OSD 10TB HDD
│       └── OSD 10TB HDD
│
└── datacenter Charlie/
    ├── host Storage103
    │   ├── OSD 1TB NVMe
    │   ├── OSD 1TB NVMe
    │   └── OSD 1TB NVMe
    └── host Storage203
        ├── OSD 10TB HDD
        ├── OSD 10TB HDD
        ├── OSD 10TB HDD
        ├── OSD 10TB HDD
        ├── OSD 10TB HDD
        └── OSD 10TB HDD


rule hybrid {
    id 1
    type replicated
    min_size 1
    max_size 10

    step take default class nvme
    step chooseleaf firstn 1 type datacenter
    step emit

    step take default class hdd
    step chooseleaf firstn -1 type datacenter
    step emit
}
The above rule works, but has the problem that the hdd step can still 
choose the datacenter where the first copy is stored on class nvme. With 
this setup, the failure of one datacenter can make the Ceph cluster lose 
quorum.
Is it possible to create a rule so that all firstn -1 copies are stored in 
another OSD class AND another datacenter?
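
One way to sanity-check a candidate rule without touching live data is crushtool's placement test. A hedged sketch (the file name is an example; rule id 1 matches the rule above):

```
# Extract the compiled CRUSH map and simulate placements for rule id 1
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings

# --show-bad-mappings lists inputs where the rule could not place all copies
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings
```

Inspecting the emitted OSD sets shows directly whether any mapping puts an hdd copy in the same datacenter as the nvme copy.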



Regards,
Niklas



Re: [ceph-users] Ceph Future

2018-01-23 Thread Massimiliano Cuttini

On 22/01/2018 21:55, Jack wrote:

On 01/22/2018 08:38 PM, Massimiliano Cuttini wrote:

The web interface is needed because: *cmd-lines are prone to typos.*

And you never misclick, indeed;
Do you really mean: 1) misclick once on an option list, 2) misclick 
once on the form, 3) mistype the input and 4) misclick again on the 
confirmation dialog box?

No... I can brag that I never misclick that many times in a row! :)
Well, if you misclick that much, you'd better not tell people you are a 
system engineer ;)


However, I think that everybody can have a different opinion.
But rejecting the evidence is just flaming.


Yeah, well, whatever, most system engineers know how to handle Ceph.
Most non-system engineers do not.
A task is a job; I don't master others' jobs, hence it feels natural that
others do not master mine.

Sorry if this sounds so strange to you.

Oh, this doesn't sound strange to me.

You simply don't see the big picture.
Ceph was born to simplify redundancy.
But why do we build highly available architectures?
I guess to live in peace while hardware breaks: replace a broken disk 
within days instead of hours (or minutes).
This is all meant to set us free and increase our comfort by reducing 
stressful issues.

Focus on big issues and tuning instead of ordinary issues.

My proposal is EXACTLY in the same direction, and I'll explain it to you. 
There are 2 kinds of tasks:
- USER tasks: create new images, increase image size, shrink image size, 
check daily status and replace broken disks whenever needed.
- SYSTEM tasks: install, update, repair, improve, increase pool size, 
tune performance (this should be done by command line).


If you think your job is just being a servant of the customer care & sales 
folks, well... be happy with that.
If you think your job is to be the /broken-disk replacer boy/ of the 
office, then... be that man.
But don't tell me you need to be a system engineer to do 
these menial jobs.
I prefer to focus on maintaining and tuning instead of being the puppet of 
customer care.


You should consider, instead of flaming around, that there are 
people who think differently, not because they are not good enough to 
do your job but because they see things differently.
Creating a separation between /User tasks/ (moving them to a web 
interface designed to be foolproof) and /Admin tasks/ is just good.
Of course all admin tasks will always be on the command line, but user 
tasks should not.


I really want to know if you'll keep flaming, or if you'll finally 
try to give me a real answer with a good reason not to have a 
web interface that gets rid of these menial jobs.

But I suppose I already know the answer.






Re: [ceph-users] Missing udev rule for FC disks (Re: mkjournal error creating journal ... : (13) Permission denied)

2018-01-23 Thread Fulvio Galeazzi

Thanks a lot, Tom, glad this was already taken care of!
  I will keep the patch around until the official one makes it into 
my distribution.


  Ciao ciao

Fulvio

 Original Message 
Subject: Re: [ceph-users] Missing udev rule for FC disks (Re: mkjournal 
error creating journal ... : (13) Permission denied)

From: 
To: , 
Date: 1/22/2018 10:34 AM


I believe I've recently spent some time with this issue, so I hope this is 
helpful. Apologies if it's an unrelated dm/udev/ceph-disk problem.

https://lists.freedesktop.org/archives/systemd-devel/2017-July/039222.html

The above email from last July explains the situation somewhat, with the 
outcome (as I understand it) being that future versions of lvm/dm will have 
rules to create the necessary partuuid symlinks for dm devices.

I'm unsure when that will make its way into various distribution lvm packages 
(I haven't checked up on this for a month or two actually). For now I've tested 
running with the new dm-disk.rules on the storage nodes that need it, which 
allowed ceph-disk to work as expected.

Cheers
Tom

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Fulvio 
Galeazzi
Sent: 19 January 2018 15:46
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Missing udev rule for FC disks (Re: mkjournal error 
creating journal ... : (13) Permission denied)

Hallo,
   apologies for reviving an old thread, but I just wasted another full 
day because I had forgotten about this issue...

 To recap: udev rules nowadays do not (at least in my case; I am using 
disks served via FibreChannel) create the /dev/disk/by-partuuid links that 
ceph-disk expects.

I see the "culprit" is this line (I am on CentOS, but Ubuntu has the same 
issue) in /usr/lib/udev/rules.d/60-persistent-storage.rules:

.
# skip rules for inappropriate block devices 
KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*|zram*|mmcblk[0-9]*rpmb",
GOTO="persistent_storage_end"
.

stating that multipath'ed devices (called dm-*) should be skipped.
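
For reference, the upstream fix Tom mentions adds a device-mapper rule roughly along these lines. This is a sketch from memory, not the exact shipped rule, so check your distribution's dm-disk.rules before relying on it:

```
# dm-disk.rules style fragment: create by-partuuid symlinks for
# device-mapper partitions, using the partition UUID that the udev
# blkid builtin exports as ID_PART_ENTRY_UUID.
ENV{DM_UUID}=="part?-*", ENV{ID_PART_ENTRY_UUID}=="?*", \
    SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}"
```

With such a rule in place, multipath'ed (dm-*) partitions get the /dev/disk/by-partuuid links that ceph-disk expects, even though 60-persistent-storage.rules skips them.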


I can happily live with the file mentioned below, but was wondering:

- is there any hope that newer kernels may handle multipath devices
properly?

- as an alternative, could it be possible to update ceph-disk
such that symlinks for journal use some other
/dev/disk/by-?

 Thanks!

Fulvio







Re: [ceph-users] Luminous: example of a single down osd taking out a cluster

2018-01-23 Thread Stefan Kooman
Quoting Dan van der Ster (d...@vanderster.com):
> 
> So, first question is: why didn't that OSD get detected as failing
> much earlier?

We have noticed that "mon osd adjust heartbeat grace" made the cluster
"realize" OSDs going down _much_ later than the MONs / OSDs themselves did.
Setting this parameter to "false" makes it deterministic and the cluster
reacts more quickly. At least that's our experience.
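
For reference, that tunable can be pinned in ceph.conf. A hedged fragment (the option name is as documented for Luminous; placing it under [global] is our assumption):

```
# ceph.conf fragment (sketch): disable the adaptive heartbeat grace so
# failure detection uses the fixed osd_heartbeat_grace value instead of
# a grace period stretched by past laggy behaviour.
[global]
mon osd adjust heartbeat grace = false
```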

This might not be _the_ reason things worked out differently than
expected (I guess not), but it does have an impact.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Importance of Stable Mon and OSD IPs

2018-01-23 Thread Burkhard Linke

Hi,


On 01/23/2018 09:53 AM, Mayank Kumar wrote:

Hi Ceph Experts

I am a new user of Ceph and currently using Kubernetes to deploy Ceph 
RBD volumes. We are doing some initial work rolling it out to internal 
customers, and in doing that we are using the IP of the host as the IP 
of the OSDs and mons. This means if a host goes down, we lose that 
IP. While we are still experimenting with these behaviors, I wanted to 
see what the community thinks for the following scenario:


1: an RBD volume is already attached and mounted on host A
2: the OSD on which this RBD volume resides dies and never comes back up
3: another OSD is put in its place. I don't know the intricacies 
here, but I am assuming the data for this RBD volume either moves to 
different OSDs or goes back to the newly installed OSD

4: the new OSD has a completely new IP
5: will the RBD volume attached to host A learn the new OSD IP on 
which its data resides, and everything just continues to work?


What if all the mons have also changed IP?

A volume does not reside "on an OSD". The volume is striped, and each 
stripe unit is stored as an object in a placement group; the placement 
group on the other hand is distributed to several OSDs depending on the 
CRUSH rules and the number of replicas.
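
A toy illustration of that striping point (the names here are ours for illustration, not the librbd API). With the default 4 MiB object size, a byte offset in an RBD volume maps to a RADOS object index like this:

```python
# Default RBD object size: 4 MiB (object order 22).
OBJECT_SIZE = 4 * 1024 * 1024

def object_for_offset(offset: int, object_size: int = OBJECT_SIZE) -> int:
    """Index of the RADOS object holding this volume byte offset."""
    return offset // object_size

# A 10 GiB volume therefore spans 2560 such objects, each of which CRUSH
# maps to a placement group and then to a set of OSDs.
print(object_for_offset(0))                      # 0
print(object_for_offset(5 * OBJECT_SIZE + 123))  # 5
```

So "the OSD holding the volume" is not a single daemon: losing one OSD affects only the placement groups it carried, and those are re-replicated elsewhere.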


If an OSD dies, Ceph will backfill the now-missing replicas to another 
OSD, given that another OSD satisfying the CRUSH rules is available. The 
same process is also triggered if an OSD is added.


This process is transparent to the Ceph client, as long as 
enough replicas are present. The Ceph client (librbd accessing a volume 
in this case) gets asynchronous notifications from the Ceph mons in case 
of relevant changes, e.g. updates to the OSD map reflecting the failure 
of an OSD. Traffic to the OSDs will be automatically rerouted depending 
on the CRUSH rules as explained above. The OSD map also contains the IP 
addresses of all OSDs, so changes to an IP address are just another 
update to the map.


The only problem you might run into is changing the IP addresses of the 
mons. There's also a mon map listing all active mons; if the mon a Ceph 
client is using dies or is removed, the client will switch to another 
active mon from the map. This works fine in a running system; you can 
change the IP addresses of the mons one by one without any interruption to 
the client (theoretically).


The problem is starting the Ceph client. In this case the client uses 
the list of mons from the Ceph configuration file to contact one mon and 
receive the initial mon map. If you change the hostnames/IP addresses of 
the mons, you also need to update the Ceph configuration file.
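
A hedged sketch of the moving parts involved (these commands assume a live cluster and an admin keyring; output formats vary by release):

```
ceph mon dump          # prints the current monmap: mon IDs and addresses
ceph quorum_status     # shows which mons are currently in quorum

# After a mon IP change, update mon_host (and mon initial members, if set)
# in /etc/ceph/ceph.conf on every client host, so that freshly started
# clients can still bootstrap the initial monmap.
```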


The above outline is how it should work, given a valid ceph and network 
setup. YMMV.


Regards,
Burkhard


Re: [ceph-users] Luminous: example of a single down osd taking out a cluster

2018-01-23 Thread Gregory Farnum
On Mon, Jan 22, 2018 at 8:46 PM, Dan van der Ster  wrote:
> Here's a bit more info as I read the logs. Firstly, these are in fact
> Filestore OSDs... I was confused, but I don't think it makes a big
> difference.
>
> Next, all the other OSDs had indeed noticed that osd.2 had failed:
>
> 2018-01-22 18:37:20.456535 7f831728e700 -1 osd.0 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:36:59.514902 front 2018-01-22 18:36:59.514902 (cutoff 2018-01-22
> 18:37:00.456532)
>
> 2018-01-22 18:37:21.085178 7fc911169700 -1 osd.1 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:37:00.518067 front 2018-01-22 18:37:00.518067 (cutoff 2018-01-22
> 18:37:01.085175)
>
> 2018-01-22 18:37:21.408881 7f78b8ea4700 -1 osd.4 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:37:00.873298 front 2018-01-22 18:37:00.873298 (cutoff 2018-01-22
> 18:37:01.408880)
>
> 2018-01-22 18:37:21.117301 7f4ac8138700 -1 osd.3 598 heartbeat_check:
> no reply from 137.138.121.224:6803 osd.2 since back 2018-01-22
> 18:37:01.092182 front 2018-01-22 18:37:01.092182 (cutoff 2018-01-22
> 18:37:01.117298)
>
>
>
> The only "reported failed" came from osd.0, who BTW was the only OSD
> who hadn't been marked down for not sending beacons:

And presumably osd.0 was the only one with a functioning connection to
the monitors. Why it was the only one, I'm not sure.

Things to consider:
1) hard killing an OSD and a monitor on the same host tends to cause
trouble. Firstly because if the dead OSD was connected locally, you
have to go through the OSD heartbeat mark down process, and that can
be a bit delayed by other OSDs themselves having to timeout their
monitor and reconnect.
2) Which monitor were the OSDs connected to, and how quickly did they
notice if they were connected to the dead one?
3) How are the constraints on marking down daemons set up on this
cluster? A single OSD per server, but with only 5 servers in a flat
host-only crush map, is a bit outside normal testing and design
patterns and may have tripped a bug.
-Greg
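
On a 5-OSD, one-OSD-per-host cluster like this, the default mon_osd_min_down_reporters value of 2 can itself block the mark-down when only one peer still reaches the mons. A hedged config sketch (the value of 1 is an illustration for this topology, not a general recommendation):

```
# ceph.conf fragment (sketch): accept a single failure reporter on a
# tiny cluster with one OSD per host.
[mon]
mon osd min down reporters = 1
```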

>
> 2018-01-22 18:37:20.457400 7fc1b51ce700  1
> mon.cephcta-mon-658cb618c9@0(leader).osd e598 prepare_failure osd.2
> 137.138.121.224:6800/1377 from osd.0 137.138.156.51:6800/1286 is
> reporting failure:1
> 2018-01-22 18:37:20.457457 7fc1b51ce700  0 log_channel(cluster) log
> [DBG] : osd.2 137.138.121.224:6800/1377 reported failed by osd.0
> 137.138.156.51:6800/1286
>
>
> So presumably it's because only 1 reporter showed up that osd.2 was
> never marked down. (1 being less than "mon_osd_min_down_reporters":
> "2")
>
>
> And BTW, I didn't mention before that the cluster came fully back to
> HEALTH_OK after I hard rebooted the osd.2 machine -- the other OSDs
> were unblocked and recovery healed everything:
>
> 2018-01-22 19:31:12.381762 7fc907956700  0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.1 down, but it is still running
> 2018-01-22 19:31:12.381774 7fc907956700  0 log_channel(cluster) log
> [DBG] : map e602 wrongly marked me down at e601
>
> 2018-01-22 19:31:12.515178 7f78af691700  0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.4 down, but it is still running
> 2018-01-22 19:31:12.515186 7f78af691700  0 log_channel(cluster) log
> [DBG] : map e602 wrongly marked me down at e601
>
> 2018-01-22 19:31:12.586532 7f4abe925700  0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.3 down, but it is still running
> 2018-01-22 19:31:12.586544 7f4abe925700  0 log_channel(cluster) log
> [DBG] : map e602 wrongly marked me down at e601
>
>
> Thanks for the help solving this puzzle,
>
> Dan
>
>
> On Mon, Jan 22, 2018 at 8:07 PM, Dan van der Ster  wrote:
>> Hi all,
>>
>> We just saw an example of one single down OSD taking down a whole
>> (small) luminous 12.2.2 cluster.
>>
>> The cluster has only 5 OSDs, on 5 different servers. Three of those
>> servers also run a mon/mgr combo.
>>
>> First, we had one server (mon+osd) go down legitimately [1] -- I can
>> tell when it went down because the mon quorum broke:
>>
>> 2018-01-22 18:26:31.521695 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121277 : cluster [WRN] Health check failed: 1/3
>> mons down, quorum cephcta-mon-658cb618c9,cephcta-mon-3e0d524825
>> (MON_DOWN)
>>
>> Then there's a long pileup of slow requests until the OSD is finally
>> marked down due to no beacon:
>>
>> 2018-01-22 18:47:31.549791 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121447 : cluster [WRN] Health check update: 372
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-01-22 18:47:56.671360 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121448 : cluster [INF] osd.2 marked down after no
>> beacon for 903.538932 seconds
>> 2018-01-22 18:47:56.672315 mon.cephcta-mon-658cb618c9 mon.0
>> 137.138.62.69:6789/0 121449 : cluster [WRN] Health check failed: 1
>> osds down (OSD_DOWN)
>>
>>
>> So, first question is: why didn't that O

[ceph-users] Importance of Stable Mon and OSD IPs

2018-01-23 Thread Mayank Kumar
Hi Ceph Experts

I am a new user of Ceph and currently using Kubernetes to deploy Ceph RBD
volumes. We are doing some initial work rolling it out to internal
customers, and in doing that we are using the IP of the host as the IP of
the OSDs and mons. This means if a host goes down, we lose that IP. While
we are still experimenting with these behaviors, I wanted to see what the
community thinks for the following scenario:

1: an RBD volume is already attached and mounted on host A
2: the OSD on which this RBD volume resides dies and never comes back up
3: another OSD is put in its place. I don't know the intricacies here,
but I am assuming the data for this RBD volume either moves to different
OSDs or goes back to the newly installed OSD
4: the new OSD has a completely new IP
5: will the RBD volume attached to host A learn the new OSD IP on which its
data resides, and everything just continues to work?

What if all the mons have also changed IP?

We are using libceph.

Thanks for your help. Any recommendations,best practices or documentation
in this area is also helpful.

thanks
Mayank


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Nico Schottelius

Hey Burkhard,

we did actually restart osd.61, which led to the current status.

Best,

Nico


Burkhard Linke writes:
> On 01/23/2018 08:54 AM, Nico Schottelius wrote:
>> Good morning,
>>
>> the osd.61 actually just crashed and the disk is still intact. However,
>> after 8 hours of rebuilding, the unfound objects are still missing:
>
> *snipsnap*
>>
>>
>> Is there any chance to recover those pgs or did we actually lose data
>> with a 2 disk failure?
>>
>> And is there any way out  of this besides going with
>>
>>  ceph pg {pg-id} mark_unfound_lost revert|delete
>>
>> ?
>
> Just my 2 cents:
>
> If the disk is still intact and the data is still readable, you can try
> to export the pg content with ceph-objectstore-tool, and import it into
> another OSD.
>
> On the other hand: if the disk is still intact, just restart the OSD?

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Burkhard Linke

Hi,


On 01/23/2018 08:54 AM, Nico Schottelius wrote:

Good morning,

the osd.61 actually just crashed and the disk is still intact. However,
after 8 hours of rebuilding, the unfound objects are still missing:


*snipsnap*



Is there any chance to recover those pgs or did we actually lose data
with a 2 disk failure?

And is there any way out  of this besides going with

 ceph pg {pg-id} mark_unfound_lost revert|delete

?


Just my 2 cents:

If the disk is still intact and the data is still readable, you can try 
to export the pg content with ceph-objectstore-tool, and import it into 
another OSD.
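
A hedged sketch of that export/import path (the paths, OSD ids and pgid are examples taken from this thread; both OSDs must be stopped while the tool runs):

```
# On the node holding the intact disk (osd.61 stopped):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-61 \
    --pgid 4.2a --op export --file /tmp/pg-4.2a.export

# On a target node (its OSD also stopped), import the PG:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
    --op import --file /tmp/pg-4.2a.export
```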


On the other hand: if the disk is still intact, just restart the OSD?

Regards,
Burkhard


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Nico Schottelius

... while trying to locate which VMs are potentially affected by a
revert/delete, we noticed that

root@server1:~# rados -p one-hdd ls

hangs. Where does Ceph store the index of block devices found in a pool?
And is it possible that this information is in one of the damaged PGs?
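
(For what it's worth: for format-2 images, RBD keeps the per-pool image index in a RADOS object named rbd_directory, so listing can indeed hang if the PG holding that object is inactive. A hedged sketch for locating it, assuming a live cluster:)

```
rbd ls one-hdd                       # reads the rbd_directory object
ceph osd map one-hdd rbd_directory   # shows the PG and acting OSDs holding it
```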

Nico


Nico Schottelius  writes:

> Good morning,
>
> the osd.61 actually just crashed and the disk is still intact. However,
> after 8 hours of rebuilding, the unfound objects are still missing:
>
> root@server1:~# ceph -s
>   cluster:
> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 111436/3017766 objects misplaced (3.693%)
> 9377/1005922 objects unfound (0.932%)
> Reduced data availability: 84 pgs inactive
> Degraded data redundancy: 277034/3017766 objects degraded 
> (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
> mon server2 is low on available space
>
>   services:
> mon: 3 daemons, quorum server5,server3,server2
> mgr: server5(active), standbys: server2, 2, 0, server3
> osd: 54 osds: 54 up, 54 in; 84 remapped pgs
>  flags noscrub,nodeep-scrub
>
>   data:
> pools:   3 pools, 1344 pgs
> objects: 982k objects, 3837 GB
> usage:   10618 GB used, 39030 GB / 49648 GB avail
> pgs: 6.250% pgs not active
>  277034/3017766 objects degraded (9.180%)
>  111436/3017766 objects misplaced (3.693%)
>  9377/1005922 objects unfound (0.932%)
>  1260 active+clean
>  84   recovery_wait+undersized+degraded+remapped+peered
>
>   io:
> client:   68960 B/s rd, 20722 kB/s wr, 12 op/s rd, 77 op/s wr
>
> We tried restarting osd.61, but ceph health detail does not change
> anymore:
>
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 111436/3017886 objects 
> misplaced (3.69
> 3%); 9377/1005962 objects unfound (0.932%); Reduced data availability: 84 pgs 
> inacti
> ve; Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 
> pgs uncle
> an, 84 pgs degraded, 84 pgs undersized; mon server2 is low on available space
> OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> OBJECT_MISPLACED 111436/3017886 objects misplaced (3.693%)
> OBJECT_UNFOUND 9377/1005962 objects unfound (0.932%)
> pg 4.fa has 117 unfound objects
> pg 4.ff has 107 unfound objects
> pg 4.fd has 113 unfound objects
> ...
> pg 4.2a has 108 unfound objects
>
> PG_AVAILABILITY Reduced data availability: 84 pgs inactive
> pg 4.2a is stuck inactive for 64117.189552, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> pg 4.31 is stuck inactive for 64117.147636, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> pg 4.32 is stuck inactive for 64117.178461, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> pg 4.34 is stuck inactive for 64117.150475, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> ...
>
>
> PG_DEGRADED Degraded data redundancy: 277034/3017886 objects degraded 
> (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
> pg 4.2a is stuck unclean for 131612.984555, current state 
> recovery_wait+undersized+degraded+remapped+peered, last acting [61]
> pg 4.31 is stuck undersized for 221.568468, current state 
> recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>
>
> Is there any chance to recover those pgs or did we actually lose data
> with a 2 disk failure?
>
> And is there any way out  of this besides going with
>
> ceph pg {pg-id} mark_unfound_lost revert|delete
>
> ?
>
> Best,
>
> Nico
>
> p.s.: the ceph 4.2a query:
>
> {
> "state": "recovery_wait+undersized+degraded+remapped+peered",
> "snap_trimq": "[]",
> "epoch": 17879,
> "up": [
> 17,
> 13,
> 25
> ],
> "acting": [
> 61
> ],
> "backfill_targets": [
> "13",
> "17",
> "25"
> ],
> "actingbackfill": [
> "13",
> "17",
> "25",
> "61"
> ],
> "info": {
> "pgid": "4.2a",
> "last_update": "17529'53875",
> "last_complete": "17217'45447",
> "log_tail": "17090'43812",
> "last_user_version": 53875,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 0,
> "purged_snaps": [
> {
> "start": "1",
> "length": "3"
> },
> {
> "start": "6",
> "length": "8"
> },
> {
> "start": "10",
> "length": "2"
> }
> ],
> "history": {
> "epoch_created": 9134,
> "epoch_pool_created": 9134,
> "last_epoch_started": 17528,
> "last_interval_started": 17527,
>