[ceph-users] Ceph MDS remove

2015-02-24 Thread ceph-users

Hi all,

I've set up a ceph cluster using this playbook:
https://github.com/ceph/ceph-ansible

I've configured in my hosts list
[mdss]
hostname1
hostname2


I now need to remove this MDS from the cluster.
The only document I found is this:
http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/

# service ceph -a stop mds
=== mds.z-srv-m-cph02 ===
Stopping Ceph mds.z-srv-m-cph02 on z-srv-m-cph02...done
=== mds.r-srv-m-cph02 ===
Stopping Ceph mds.r-srv-m-cph02 on r-srv-m-cph02...done
=== mds.r-srv-m-cph01 ===
Stopping Ceph mds.r-srv-m-cph01 on r-srv-m-cph01...done
=== mds.0 ===
Stopping Ceph mds.0 on zrh-srv-m-cph01...done
=== mds.192.168.0.1 ===
Stopping Ceph mds.192.168.0.1 on z-srv-m-cph01...done
=== mds.z-srv-m-cph01 ===
Stopping Ceph mds.z-srv-m-cph01 on z-srv-m-cph01...done

[root@z-srv-m-cph01 ceph]# ceph mds stat
e1: 0/0/0 up

1. Why are the MDS daemons not stopped?
2. When I try to remove them:

# ceph mds rm mds.z-srv-m-cph01 z-srv-m-cph01
Invalid command: mds.z-srv-m-cph01 doesn't represent an int
mds rm <int[0-]> <name (type.id)> : remove nonactive mds
Error EINVAL: invalid command

The ansible playbook created a conf like this in ceph.conf:
[mds]

[mds.z-srv-m-cph01]
host = z-srv-m-cph01

Can someone please help on this or at least give some hints?

Thank you very much
Gian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck ceph-deploy mon create-initial / giant

2015-02-24 Thread Stephan Seitz
Hi Loic,

this is the content of our ceph.conf

[global]
fsid = 719f14b2-7475-4b25-8c5f-3ffbcf594d13
mon_initial_members = ceph1, ceph2, ceph3
mon_host = 192.168.10.107,192.168.10.108,192.168.10.109
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default size = 2
public network = 192.168.10.0/24
cluster networt = 192.168.108.0/24

where the public network is native 10GbE and the cluster network is a tagged
VLAN on an LACP/802.3ad bond.




Am Montag, den 23.02.2015, 23:24 +0100 schrieb Loic Dachary:
> Hi Stephan,
> 
> Could you share the /etc/ceph/ceph.conf content ? Maybe ceph-create-keys 
> cannot reach the monitor ?
> 
> Cheers
> 
> On 23/02/2015 22:53, Stephan Seitz wrote:
> > Hi all,
> > 
> > I'm currently facing a strange problem when deploying giant on Ubuntu 
> > 14.04. Following the docs, and reaching to
> > ceph-deploy mon create-initial
> > on the three mon hosts leads to every single mon stuck with a companion 
> > ceph-create-keys waiting forever.
> > ceph-deploy quits after waiting for mon-qorum without success. Say:
> > with 
> > error: for each mon.
> > An strace shows repetively the very same bulk, with one line stating the
> > mon state in "pending".
> > Network has been double-checked. netstat / lsof shows "established" TCP
> > connection between the mons.
> > 
> > Trying to fiddle around the quorum and starting the cluster with only 
> > one mon, didn't help either. It behaves in the very same way: Having
> > the 
> > mon stuck and the ceph-create-keys waiting forever.
> > 
> > Could someone please shed some light, what this problem could be caused 
> > by?
> > Except for deploying the first time giant, I didn't face that problem
> > on 
> > former installations.
> > 
> > Thanks in advance,
> > 
> > - Stephan
> > 
> 

-- 

Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-44
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht
Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS remove

2015-02-24 Thread Xavier Villaneau

Hello,

I also had to remove the MDSs on a Giant test cluster a few days ago, 
and stumbled upon the same problems.


Le 24/02/2015 09:58, ceph-users a écrit :

Hi all,

I've set up a ceph cluster using this playbook:
https://github.com/ceph/ceph-ansible

I've configured in my hosts list
[mdss]
hostname1
hostname2


I now need to remove this MDS from the cluster.
The only document I found is this:
http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/ 



# service ceph -a stop mds
=== mds.z-srv-m-cph02 ===
Stopping Ceph mds.z-srv-m-cph02 on z-srv-m-cph02...done
=== mds.r-srv-m-cph02 ===
Stopping Ceph mds.r-srv-m-cph02 on r-srv-m-cph02...done
=== mds.r-srv-m-cph01 ===
Stopping Ceph mds.r-srv-m-cph01 on r-srv-m-cph01...done
=== mds.0 ===
Stopping Ceph mds.0 on zrh-srv-m-cph01...done
=== mds.192.168.0.1 ===
Stopping Ceph mds.192.168.0.1 on z-srv-m-cph01...done
=== mds.z-srv-m-cph01 ===
Stopping Ceph mds.z-srv-m-cph01 on z-srv-m-cph01...done

[root@z-srv-m-cph01 ceph]# ceph mds stat
e1: 0/0/0 up

1. question: why the MDS are not stopped?


I also had trouble stopping my MDSs. They would start up again even if I 
killed the processes… I suggest you try:

sudo stop ceph-mds-all


2. When I try to remove them:

# ceph mds rm mds.z-srv-m-cph01 z-srv-m-cph01
Invalid command: mds.z-srv-m-cph01 doesn't represent an int
mds rm   : remove nonactive mds
Error EINVAL: invalid command


In the mds rm command, the <int[0-]> refers to the ID of the metadata 
pool used by CephFS (since there can only be one right now). And the 
<name (type.id)> is simply mds.n where n is 0, 1, etc. Maybe there are 
other possible values for type.id, but it worked for me.



The ansible playbook created me a conf like this in ceph.conf:
[mds]

[mds.z-srv-m-cph01]
host = z-srv-m-cph01


I believe you'll also need to delete the [mds] section in ceph.conf, but 
since I do not know much about ansible I can't give you more advice on this.


Finally, as described in the blog post you linked, you need to reset 
CephFS afterwards (or the cluster health will keep complaining):

ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it
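
For illustration only, using the pool IDs that show up later in this thread
(metadata pool 6 and data pool 7; that pairing is my assumption, not something
stated here), that would be:

ceph mds newfs 6 7 --yes-i-really-mean-it

Note that this resets the filesystem map, so only run it once you really
intend to discard the old CephFS metadata mapping.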

Regards,
--
Xavier


Can someone please help on this or at least give some hints?

Thank you very much
Gian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster never reaching clean after osd out

2015-02-24 Thread Stéphane DUGRAVOT
- Mail original -

> I have a Cluster of 3 hosts, running Debian wheezy and Backports Kernel
> 3.16.0-0.bpo.4-amd64.
> For testing I did a
> ~# ceph osd out 20
> from a clean state.
> Ceph starts rebalancing, watching ceph -w one sees changing pgs stuck unclean
> to get up and then go down to about 11.

> Short after that the cluster keeps stuck forever in this state:
> health HEALTH_WARN 68 pgs stuck unclean; recovery 450/169647 objects degraded
> (0.265%); 3691/169647 objects misplaced (2.176%)

> According to the documentation at
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ the Cluster
> should reach a clean state after an osd out.

> What am I doing wrong?
Hi Yves and Cephers, 

I have a cluster with 6 nodes and 36 OSDs, and I have the same problem: 

cluster 1d0503fb-36d0-4dbc-aabe-a2a0709163cd 
health HEALTH_WARN 76 pgs stuck unclean; recovery 1/624 objects degraded 
(0.160%); 7/624 objects misplaced (1.122%) 
monmap e6: 6 mons 
osdmap e616: 36 osds: 36 up, 35 in 
pgmap v16344: 2048 pgs, 1 pools, 689 MB data, 208 objects 
178 GB used, 127 TB / 127 TB avail 
1/624 objects degraded (0.160%); 7/624 objects misplaced (1.122%) 
76 active+remapped 
1972 active+clean 

After taking osd.15 'out', ceph didn't return to HEALTH_OK and still reports 
misplaced objects... :-/ 
I noticed that this happens when I use a replica-3 pool. When the pool uses 
replica 2, ceph returned to HEALTH_OK... Have you tried with a replica-2 
pool? 

Likewise, I wonder why it does not return to the OK status. 
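
One way to dig into why those PGs stay unclean (the pg id below is just a
placeholder, not from this cluster):

ceph pg dump_stuck unclean
ceph pg 3.1f query

The query output lists the up and acting OSD sets and the recovery state,
which is usually enough to see why a PG stays remapped.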

CEPH OSD TREE 

# id weight type name up/down reweight 
-1000 144 root default 
-200 48 datacenter mo 
-133 48 rack mom02 
-4 24 host mom02h01 
12 4 osd.12 up 1 
13 4 osd.13 up 1 
14 4 osd.14 up 1 
16 4 osd.16 up 1 
17 4 osd.17 up 1 
15 4 osd.15 up 0 
-5 24 host mom02h02 
18 4 osd.18 up 1 
19 4 osd.19 up 1 
20 4 osd.20 up 1 
21 4 osd.21 up 1 
22 4 osd.22 up 1 
23 4 osd.23 up 1 
-202 48 datacenter me 
-135 48 rack mem04 
-6 24 host mem04h01 
24 4 osd.24 up 1 
25 4 osd.25 up 1 
26 4 osd.26 up 1 
27 4 osd.27 up 1 
28 4 osd.28 up 1 
29 4 osd.29 up 1 
-7 24 host mem04h02 
30 4 osd.30 up 1 
31 4 osd.31 up 1 
32 4 osd.32 up 1 
33 4 osd.33 up 1 
34 4 osd.34 up 1 
35 4 osd.35 up 1 
-201 48 datacenter li 
-134 48 rack lis04 
-2 24 host lis04h01 
0 4 osd.0 up 1 
2 4 osd.2 up 1 
3 4 osd.3 up 1 
4 4 osd.4 up 1 
5 4 osd.5 up 1 
1 4 osd.1 up 1 
-3 24 host lis04h02 
6 4 osd.6 up 1 
7 4 osd.7 up 1 
8 4 osd.8 up 1 
9 4 osd.9 up 1 
10 4 osd.10 up 1 
11 4 osd.11 up 1 

Crushmap 

# begin crush map 
tunable choose_local_tries 0 
tunable choose_local_fallback_tries 0 
tunable choose_total_tries 50 
tunable chooseleaf_descend_once 1 

# devices 
device 0 osd.0 
device 1 osd.1 
device 2 osd.2 
device 3 osd.3 
device 4 osd.4 
device 5 osd.5 
device 6 osd.6 
device 7 osd.7 
device 8 osd.8 
device 9 osd.9 
device 10 osd.10 
device 11 osd.11 
device 12 osd.12 
device 13 osd.13 
device 14 osd.14 
device 15 osd.15 
device 16 osd.16 
device 17 osd.17 
device 18 osd.18 
device 19 osd.19 
device 20 osd.20 
device 21 osd.21 
device 22 osd.22 
device 23 osd.23 
device 24 osd.24 
device 25 osd.25 
device 26 osd.26 
device 27 osd.27 
device 28 osd.28 
device 29 osd.29 
device 30 osd.30 
device 31 osd.31 
device 32 osd.32 
device 33 osd.33 
device 34 osd.34 
device 35 osd.35 

# types 
type 0 osd 
type 1 host 
type 2 chassis 
type 3 rack 
type 4 row 
type 5 pdu 
type 6 pod 
type 7 room 
type 8 datacenter 
type 9 region 
type 10 root 

# buckets 
host lis04h01 { 
id -2 # do not change unnecessarily 
# weight 24.000 
alg straw 
hash 0 # rjenkins1 
item osd.0 weight 4.000 
item osd.2 weight 4.000 
item osd.3 weight 4.000 
item osd.4 weight 4.000 
item osd.5 weight 4.000 
item osd.1 weight 4.000 
} 
host lis04h02 { 
id -3 # do not change unnecessarily 
# weight 24.000 
alg straw 
hash 0 # rjenkins1 
item osd.6 weight 4.000 
item osd.7 weight 4.000 
item osd.8 weight 4.000 
item osd.9 weight 4.000 
item osd.10 weight 4.000 
item osd.11 weight 4.000 
} 
host mom02h01 { 
id -4 # do not change unnecessarily 
# weight 24.000 
alg straw 
hash 0 # rjenkins1 
item osd.12 weight 4.000 
item osd.13 weight 4.000 
item osd.14 weight 4.000 
item osd.16 weight 4.000 
item osd.17 weight 4.000 
item osd.15 weight 4.000 
} 
host mom02h02 { 
id -5 # do not change unnecessarily 
# weight 24.000 
alg straw 
hash 0 # rjenkins1 
item osd.18 weight 4.000 
item osd.19 weight 4.000 
item osd.20 weight 4.000 
item osd.21 weight 4.000 
item osd.22 weight 4.000 
item osd.23 weight 4.000 
} 
host mem04h01 { 
id -6 # do not change unnecessarily 
# weight 24.000 
alg straw 
hash 0 # rjenkins1 
item osd.24 weight 4.000 
item osd.25 weight 4.000 
item osd.26 weight 4.000 
item osd.27 weight 4.000 
item osd.28 weight 4.000 
item osd.29 weight 4.000 
} 
host mem04h02 { 
id -7 # do not change unnecessarily 
# weight 24.000 
alg straw 
hash 0 # rjenkins1 
item osd.30 weight 4.000 
item osd.31 weight 4.000 
item osd.32 weight 4.000 
item osd.33 weight 4.000 
item osd.34 weight 4.000 
it

Re: [ceph-users] stuck ceph-deploy mon create-initial / giant

2015-02-24 Thread Loic Dachary


On 24/02/2015 09:58, Stephan Seitz wrote:
> Hi Loic,
> 
> this is the content of our ceph.conf
> 
> [global]
> fsid = 719f14b2-7475-4b25-8c5f-3ffbcf594d13
> mon_initial_members = ceph1, ceph2, ceph3
> mon_host = 192.168.10.107,192.168.10.108,192.168.10.109
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd pool default size = 2
> public network = 192.168.10.0/24
> cluster networt = 192.168.108.0/24

s/networt/network/ ?

> 
> where public network is 10GBE native and cluster network a tagged VLAN
> on a lacp/802.3ad bond.
> 
> 
> 
> 
> Am Montag, den 23.02.2015, 23:24 +0100 schrieb Loic Dachary:
>> Hi Stephan,
>>
>> Could you share the /etc/ceph/ceph.conf content ? Maybe ceph-create-keys 
>> cannot reach the monitor ?
>>
>> Cheers
>>
>> On 23/02/2015 22:53, Stephan Seitz wrote:
>>> Hi all,
>>>
>>> I'm currently facing a strange problem when deploying giant on Ubuntu 
>>> 14.04. Following the docs, and reaching to
>>> ceph-deploy mon create-initial
>>> on the three mon hosts leads to every single mon stuck with a companion 
>>> ceph-create-keys waiting forever.
>>> ceph-deploy quits after waiting for mon-qorum without success. Say:
>>> with 
>>> error: for each mon.
>>> An strace shows repetively the very same bulk, with one line stating the
>>> mon state in "pending".
>>> Network has been double-checked. netstat / lsof shows "established" TCP
>>> connection between the mons.
>>>
>>> Trying to fiddle around the quorum and starting the cluster with only 
>>> one mon, didn't help either. It behaves in the very same way: Having
>>> the 
>>> mon stuck and the ceph-create-keys waiting forever.
>>>
>>> Could someone please shed some light, what this problem could be caused 
>>> by?
>>> Except for deploying the first time giant, I didn't face that problem
>>> on 
>>> former installations.
>>>
>>> Thanks in advance,
>>>
>>> - Stephan
>>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS remove

2015-02-24 Thread ceph-users

Sorry,
forgot to mention that I'm running Ceph 0.87 on Centos 7.

On 24/02/2015 10:20, Xavier Villaneau wrote:

Hello,

I also had to remove the MDSs on a Giant test cluster a few days ago,
and stumbled upon the same problems.

Le 24/02/2015 09:58, ceph-users a écrit :

Hi all,

I've set up a ceph cluster using this playbook:
https://github.com/ceph/ceph-ansible

I've configured in my hosts list
[mdss]
hostname1
hostname2


I now need to remove this MDS from the cluster.
The only document I found is this:
http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/


# service ceph -a stop mds
=== mds.z-srv-m-cph02 ===
Stopping Ceph mds.z-srv-m-cph02 on z-srv-m-cph02...done
=== mds.r-srv-m-cph02 ===
Stopping Ceph mds.r-srv-m-cph02 on r-srv-m-cph02...done
=== mds.r-srv-m-cph01 ===
Stopping Ceph mds.r-srv-m-cph01 on r-srv-m-cph01...done
=== mds.0 ===
Stopping Ceph mds.0 on zrh-srv-m-cph01...done
=== mds.192.168.0.1 ===
Stopping Ceph mds.192.168.0.1 on z-srv-m-cph01...done
=== mds.z-srv-m-cph01 ===
Stopping Ceph mds.z-srv-m-cph01 on z-srv-m-cph01...done

[root@z-srv-m-cph01 ceph]# ceph mds stat
e1: 0/0/0 up

1. question: why the MDS are not stopped?


I also had trouble stopping my MDS. They would start up again even if
I killed the processes… I suggest you try :
sudo stop ceph-mds-all


2. When I try to remove them:

# ceph mds rm mds.z-srv-m-cph01 z-srv-m-cph01
Invalid command: mds.z-srv-m-cph01 doesn't represent an int
mds rm   : remove nonactive mds
Error EINVAL: invalid command


In the mds rm command, the  refers to the ID of the metadata
pool used by CephFS (since there can only be one right now). And the
 is simply mds.n where n is 0, 1, etc. Maybe there are
other possible values for type.id, but it worked for me.


The ansible playbook created me a conf like this in ceph.conf:
[mds]

[mds.z-srv-m-cph01]
host = z-srv-m-cph01


I believe you'll also need to delete the [msd] section in ceph.conf,
but since I do not know much about ansible I can't give you more
advice on this.

Finally, as described on the blog post you linked, you need to reset
cephfs after (or the health will be complaining) :
ceph mds newfs   
--yes-i-really-mean-it


Regards,
--
Xavier


Can someone please help on this or at least give some hints?

Thank you very much
Gian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck ceph-deploy mon create-initial / giant

2015-02-24 Thread Christian Balzer
On Tue, 24 Feb 2015 11:17:22 +0100 Loic Dachary wrote:

> 
> 
> On 24/02/2015 09:58, Stephan Seitz wrote:
> > Hi Loic,
> > 
> > this is the content of our ceph.conf
> > 
> > [global]
> > fsid = 719f14b2-7475-4b25-8c5f-3ffbcf594d13
> > mon_initial_members = ceph1, ceph2, ceph3
> > mon_host = 192.168.10.107,192.168.10.108,192.168.10.109
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > osd pool default size = 2
> > public network = 192.168.10.0/24
> > cluster networt = 192.168.108.0/24
> 
> s/networt/network/ ?
> 

If this really turns out to be the cause, it is another painfully
obvious reason why I proposed providing full config parser output in
the logs at the default debugging level.

Stuff like this, or configuration statements silently not being picked up,
would be immediately visible.
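
In the meantime, a quick way to catch a silently-ignored key on a running
daemon (mon name taken from the ceph.conf above) is to ask it what it
actually parsed:

ceph daemon mon.ceph1 config show | grep cluster_network

With the misspelled key, cluster_network comes back empty instead of
192.168.108.0/24.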

Christian

> > 
> > where public network is 10GBE native and cluster network a tagged VLAN
> > on a lacp/802.3ad bond.
> > 
> > 
> > 
> > 
> > Am Montag, den 23.02.2015, 23:24 +0100 schrieb Loic Dachary:
> >> Hi Stephan,
> >>
> >> Could you share the /etc/ceph/ceph.conf content ? Maybe
> >> ceph-create-keys cannot reach the monitor ?
> >>
> >> Cheers
> >>
> >> On 23/02/2015 22:53, Stephan Seitz wrote:
> >>> Hi all,
> >>>
> >>> I'm currently facing a strange problem when deploying giant on
> >>> Ubuntu 14.04. Following the docs, and reaching to
> >>> ceph-deploy mon create-initial
> >>> on the three mon hosts leads to every single mon stuck with a
> >>> companion ceph-create-keys waiting forever.
> >>> ceph-deploy quits after waiting for mon-qorum without success. Say:
> >>> with 
> >>> error: for each mon.
> >>> An strace shows repetively the very same bulk, with one line stating
> >>> the mon state in "pending".
> >>> Network has been double-checked. netstat / lsof shows "established"
> >>> TCP connection between the mons.
> >>>
> >>> Trying to fiddle around the quorum and starting the cluster with
> >>> only one mon, didn't help either. It behaves in the very same way:
> >>> Having the 
> >>> mon stuck and the ceph-create-keys waiting forever.
> >>>
> >>> Could someone please shed some light, what this problem could be
> >>> caused by?
> >>> Except for deploying the first time giant, I didn't face that problem
> >>> on 
> >>> former installations.
> >>>
> >>> Thanks in advance,
> >>>
> >>> - Stephan
> >>>
> >>
> > 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck ceph-deploy mon create-initial / giant

2015-02-24 Thread Loic Dachary


On 24/02/2015 12:00, Christian Balzer wrote:
> On Tue, 24 Feb 2015 11:17:22 +0100 Loic Dachary wrote:
> 
>>
>>
>> On 24/02/2015 09:58, Stephan Seitz wrote:
>>> Hi Loic,
>>>
>>> this is the content of our ceph.conf
>>>
>>> [global]
>>> fsid = 719f14b2-7475-4b25-8c5f-3ffbcf594d13
>>> mon_initial_members = ceph1, ceph2, ceph3
>>> mon_host = 192.168.10.107,192.168.10.108,192.168.10.109
>>> auth_cluster_required = cephx
>>> auth_service_required = cephx
>>> auth_client_required = cephx
>>> filestore_xattr_use_omap = true
>>> osd pool default size = 2
>>> public network = 192.168.10.0/24
>>> cluster networt = 192.168.108.0/24
>>
>> s/networt/network/ ?
>>
> 
> If this really should turn out to be the case, it is another painfully
> obvious reason why I proposed to provide full config parser output in
> the logs when in default debugging level.

I agree. However, it is non-trivial to implement because there is no central 
place where all valid values are defined. It is also likely that third-party 
scripts rely on the fact that arbitrary key/values can be stored in the 
configuration file.

> Stuff like this or the lack of picking up configuration statements would
> be immediately visible. 

That's probably not the reason why it fails. Can you confirm?

> 
> Christian
> 
>>>
>>> where public network is 10GBE native and cluster network a tagged VLAN
>>> on a lacp/802.3ad bond.
>>>
>>>
>>>
>>>
>>> Am Montag, den 23.02.2015, 23:24 +0100 schrieb Loic Dachary:
 Hi Stephan,

 Could you share the /etc/ceph/ceph.conf content ? Maybe
 ceph-create-keys cannot reach the monitor ?

 Cheers

 On 23/02/2015 22:53, Stephan Seitz wrote:
> Hi all,
>
> I'm currently facing a strange problem when deploying giant on
> Ubuntu 14.04. Following the docs, and reaching to
> ceph-deploy mon create-initial
> on the three mon hosts leads to every single mon stuck with a
> companion ceph-create-keys waiting forever.
> ceph-deploy quits after waiting for mon-qorum without success. Say:
> with 
> error: for each mon.
> An strace shows repetively the very same bulk, with one line stating
> the mon state in "pending".
> Network has been double-checked. netstat / lsof shows "established"
> TCP connection between the mons.
>
> Trying to fiddle around the quorum and starting the cluster with
> only one mon, didn't help either. It behaves in the very same way:
> Having the 
> mon stuck and the ceph-create-keys waiting forever.
>
> Could someone please shed some light, what this problem could be
> caused by?
> Except for deploying the first time giant, I didn't face that problem
> on 
> former installations.
>
> Thanks in advance,
>
> - Stephan
>

>>>
>>
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD on LVM volume

2015-02-24 Thread Joerg Henne
Hi all,

installing an OSD on an LVM volume does not seem to be supported by the current
'ceph-deploy osd' or 'ceph-disk prepare' tools. Therefore I tried to do it
manually as suggested here:
http://eturnerx.blogspot.de/2014/08/how-i-added-my-lvm-volumes-as-osds-in.html

TL;DR, the process is (a rough shell sketch follows the list): 
- create volume
- format volume
- mount volume to /var/lib/ceph/ceph-$OSD_ID
- create symlink from /var/lib/ceph/ceph-$OSD_ID/journal to journal volume
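
A rough sketch of those steps as shell commands (VG/LV names, sizes and the
OSD id are illustrative, not taken from the blog post; the usual data path is
/var/lib/ceph/osd/ceph-$OSD_ID):

lvcreate -L 100G -n osd-0 vg_ceph
lvcreate -L 10G -n journal-0 vg_ceph
mkfs.xfs /dev/vg_ceph/osd-0
mkdir -p /var/lib/ceph/osd/ceph-0
mount /dev/vg_ceph/osd-0 /var/lib/ceph/osd/ceph-0
ln -s /dev/vg_ceph/journal-0 /var/lib/ceph/osd/ceph-0/journal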

This seems to work, however, the disks are not listed by 'ceph-disk list'.

There seem to be other approaches which involve creating a partition table
on the LVM volumes, as described here, for example:
http://dachary.org/?p=2548, which seems to be a bit more involved and which I
haven't tried.

Is there a recommended way of running an OSD on top of a LVM volume? What
are the pros and cons of the approaches? Is there a downside to the disks
not being listed by 'ceph-disk list' as per the first approach?

Thanks in advance for any enlightenment!

Joerg Henne


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Wrong object and used space count in cache tier pool

2015-02-24 Thread Xavier Villaneau

Hello ceph-users,

I am currently making tests on a small cluster, and Cache Tiering is one 
of those tests. The cluster runs Ceph 0.87 Giant on three Ubuntu 14.04 
servers with the 3.16.0 kernel, for a total of 8 OSD and 1 MON.


Since there are no SSDs in those servers, I am testing Cache Tiering by 
using an erasure-coded pool as storage and a replicated pool as cache. 
The cache settings are the "default" ones you'll find in the 
documentation, and I'm using writeback mode. Also, to simulate the small 
size of cache data, the hot storage pool has a 1024MB space quota. Then 
I write 4MB chunks of data to the storage pool using 'rados bench' (with 
--no-cleanup).
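
For context, a cache-tier setup along these lines is typically built with
commands like the following (a sketch; pool names match the ones below, pg
counts and the quota are illustrative):

ceph osd pool create test1_ec-data 512 512 erasure
ceph osd pool create test1_ct-cache 512 512 replicated
ceph osd tier add test1_ec-data test1_ct-cache
ceph osd tier cache-mode test1_ct-cache writeback
ceph osd tier set-overlay test1_ec-data test1_ct-cache
ceph osd pool set-quota test1_ct-cache max_bytes 1073741824
rados bench -p test1_ec-data 60 write --no-cleanup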


Here are my cache pool settings according to InkScope:
pool                            15
pool name                       test1_ct-cache
auid                            0
type                            1 (replicated)
size                            2
min size                        1
crush ruleset                   0 (replicated_ruleset)
pg num                          512
pg placement_num                512
quota max_bytes                 1 GB
quota max_objects               0
flags names                     hashpspool,incomplete_clones
tiers                           none
tier of                         14 (test1_ec-data)
read tier                       -1
write tier                      -1
cache mode                      writeback
cache target_dirty_ratio_micro  40 %
cache target_full_ratio_micro   80 %
cache min_flush_age             0 s
cache min_evict_age             0 s
target max_objects              0
target max_bytes                960 MB
hit set_count                   1
hit set_period                  3600 s
hit set_params                  target_size : 0
                                seed : 0
                                type : bloom
                                false_positive_probability : 0.05

I believe the tiering itself works well; I do see objects and bytes 
being transferred from the cache to the storage when I write data. I 
checked with 'rados ls', and the object count in the cold storage is 
always spot on. But it isn't in the cache: when I do 'ceph df' or 
'rados df', the space and object counts do not match 'rados ls', and 
are usually much larger:


% ceph df
…
POOLS:
NAME   ID USED   %USED MAX AVAIL OBJECTS
…
test1_ec-data  14  5576M  0.045G 1394
test1_ct-cache 15   772M 0 7410G 250
% rados -p test1_ec-data ls | wc -l
1394
% rados -p test1_ct-cache ls | wc -l
56
# And this corresponds to 220M of data in test1_ct-cache

Not only does this prevent me from knowing exactly what the cache is doing, 
but it is also this value that is applied for the quota. And I've seen 
writing operations fail because the space count had reached 1G, although 
I was quite sure there was enough space. The count does not correct 
itself over time, even by waiting overnight. The count only changes when 
I "poke" the pool by changing a setting or writing data, but remains 
wrong (and not by the same number of objects). The changes in object 
counts given by 'rados ls' in both pools match with the number of 
objects written by 'rados bench'.


Does anybody know where this mismatch might come from? Is there a way 
to see more details about what's going on? Or is it the normal behavior 
of a cache pool when 'rados bench' is used?


Thank you in advance for any help.
Regards,
--
Xavier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados - Atomic Write

2015-02-24 Thread Noah Watkins
I'll take a shot at answering this:

Operations are atomic in the sense that there are no partial failures. 
Additionally, access to an object should appear to be serialized. So, two 
in-flight operations A and B will be applied in either A,B or B,A order. If 
ordering is important (e.g. the operations are dependent) then the application 
should enforce ordering.
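
As a contrived illustration of that with the rados CLI (pool, object and file
names are hypothetical; the payloads are assumed small enough to go in a
single write op):

rados -p mypool put myobject payload-A.bin &
rados -p mypool put myobject payload-B.bin &
wait
rados -p mypool get myobject result.bin

The two full-object writes race, but result.bin ends up as exactly one of the
two payloads, never an interleaving of both; which one wins is what the
application has to control itself if it matters.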

- Original Message -
From: "Italo Santos" 
To: ceph-users@lists.ceph.com
Sent: Monday, February 23, 2015 12:01:15 PM
Subject: [ceph-users] librados - Atomic Write

Hello, 

The librados write ops are atomic? I mean what happens if two different clients 
try to write the same object with the same content? 

Regards. 

Italo Santos 
http://italosantos.com.br/ 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on LVM volume

2015-02-24 Thread John Spray


On 24/02/2015 12:49, Joerg Henne wrote:

This seems to work, however, the disks are not listed by 'ceph-disk list'.

Right.  ceph-disk uses GPT partition labels to identify the disks.


Is there a recommended way of running an OSD on top of a LVM volume? What
are the pros and cons of the approaches? Is there a downside to the disks
not being listed by 'ceph-disk list' as per the first approach?

I imagine that without proper partition labels you'll also not get the 
benefit of e.g. the udev magic
that allows plugging OSDs in/out of different hosts.  More generally 
you'll just be in a rather non standard configuration that will confuse 
anyone working on the host.


Can I ask why you want to use LVM?  It is not generally necessary or 
useful with Ceph: Ceph expects to be fed raw drives.


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: OSD on LVM volume

2015-02-24 Thread Jörg Henne
2015-02-24 14:05 GMT+01:00 John Spray :

>
> I imagine that without proper partition labels you'll also not get the
> benefit of e.g. the udev magic
> that allows plugging OSDs in/out of different hosts.  More generally
> you'll just be in a rather non standard configuration that will confuse
> anyone working on the host.
>
Ok, thanks for the heads up!


> Can I ask why you want to use LVM?  It is not generally necessary or
> useful with Ceph: Ceph expects to be fed raw drives.
>
I am currently just experimenting with ceph. Although I have a reasonable
number of "lab" nodes, those nodes are shared with other experimentation
and thus it would be rather inconvenient to dedicate the raw disks
exclusively to ceph.

Joerg Henne
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS remove

2015-02-24 Thread John Spray


On 24/02/2015 09:20, Xavier Villaneau wrote:



[root@z-srv-m-cph01 ceph]# ceph mds stat
e1: 0/0/0 up

1. question: why the MDS are not stopped?
This is just confusing formatting.  0/0/0 means 0 up, 0 in, max_mds=0. 
This status indicates that you have no filesystem at all.





2. When I try to remove them:

# ceph mds rm mds.z-srv-m-cph01 z-srv-m-cph01
Invalid command: mds.z-srv-m-cph01 doesn't represent an int
mds rm   : remove nonactive mds
Error EINVAL: invalid command


In the mds rm command, the <int[0-]> refers to the ID of the metadata 
pool used by CephFS (since there can only be one right now). And the 
<name (type.id)> is simply mds.n where n is 0, 1, etc. Maybe there are 
other possible values for type.id, but it worked for me.
This does not refer to the metadata pool ID.  It's an MDS "gid", which 
is the unique ID assigned to a running daemon (changes if you restart 
the daemon).  You can only "rm" an MDS if it is not currently holding an 
MDS rank.


When you do "ceph mds dump", if there are any up daemons, you can see 
them at the end of the output like this:

4113:172.16.79.251:6818/46415 'a' mds.0.1 up:active seq 2

The <name (type.id)> part is a red herring, it's in the syntax 
definition of the command but not actually used (presumably historical, 
possibly just a bug).  Leave it out.
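
Putting that together, a minimal sketch (the gid value is taken from the
example line above; in this release the usage string still advertises the
unused name argument):

ceph mds dump | tail -n 5
ceph mds rm 4113

The daemon lines at the end of the dump carry the gid as their first field;
pass that number, and nothing else, to "mds rm" once the daemon no longer
holds a rank.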


Anyway, "mds rm" doesn't have anything to do with actually stopping a 
daemon, which is an operation purely local to the host it's running on.  
You'd only want to do "mds rm" as a cleaning-up task afterwards to 
remove any ghost of that daemon that was lingering in the MDS map.


I don't know anything about the ansible scripts you're using, so can't 
say much more than that about what you should expect them to be doing in 
this situation.


Cheers,
John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Wrong object and used space count in cache tier pool

2015-02-24 Thread Gregory Farnum
On Tue, Feb 24, 2015 at 6:21 AM, Xavier Villaneau
 wrote:
> Hello ceph-users,
>
> I am currently making tests on a small cluster, and Cache Tiering is one of
> those tests. The cluster runs Ceph 0.87 Giant on three Ubuntu 14.04 servers
> with the 3.16.0 kernel, for a total of 8 OSD and 1 MON.
>
> Since there are no SSDs in those servers, I am testing Cache Tiering by
> using an erasure-coded pool as storage and a replicated pool as cache. The
> cache settings are the "defaults" ones you'll find in the documentation, and
> I'm using writeback mode. Also, to simulate the small size of cache data,
> the hot storage pool has a 1024MB space quota. Then I write 4MB chunks of
> data to the storage pool using 'rados bench' (with --no-cleanup).
>
> Here are my cache pool settings according to InkScope :
> pool15
> pool name   test1_ct-cache
> auid0
> type1 (replicated)
> size2
> min size1
> crush ruleset   0 (replicated_ruleset)
> pg num  512
> pg placement_num512
> quota max_bytes 1 GB
> quota max_objects   0
> flags names hashpspool,incomplete_clones
> tiers   none
> tier of 14 (test1_ec-data)
> read tier   -1
> write tier  -1
> cache mode  writeback
> cache target_dirty_ratio_micro  40 %
> cache target_full_ratio_micro   80 %
> cache min_flush_age 0 s
> cache min_evict_age 0 s
> target max_objects  0
> target max_bytes960 MB
> hit set_count   1
> hit set_period  3600 s
> hit set_params  target_size :0
> seed :   0
> type :   bloom
> false_positive_probability : 0.05
>
> I believe the tiering itself works well, I do see objects and bytes being
> transfered from the cache to the storage when I write data. I checked with
> 'rados ls', and the object count in the cold storage is always right on
> spot. But it isn't in the cache, when I do 'ceph df' or 'rados df' the space
> and object counts do not match with 'rados ls', and are usually much larger
> :
>
> % ceph df
> …
> POOLS:
> NAME   ID USED   %USED MAX AVAIL OBJECTS
> …
> test1_ec-data  14  5576M  0.045G 1394
> test1_ct-cache 15   772M 0 7410G 250
> % rados -p test1_ec-data ls | wc -l
> 1394
> % rados -p test1_ct-cache ls | wc -l
> 56
> # And this corresponds to 220M of data in test1_ct-cache
>
> Not only it prevents me from knowing exactly what the cache is doing, but it
> is also this value that is applied for the quota. And I've seen writing
> operations fail because the space count had reached 1G, although I was quite
> sure there was enough space. The count does not correct itself over time,
> even by waiting overnight. The count only changes when I "poke" the pool by
> changing a setting or writing data, but remains wrong (and not by the same
> number of objects). The changes in object counts given by 'rados ls' in both
> pools match with the number of objects written by 'rados bench'.
>
> Does anybody know where this mismatch might come from ? Is there a way to
> see more details about what's going on ? Or is it the normal behavior of a
> cache pool when 'rados bench' is used ?

Well, I don't think the quota stuff is going to interact well with
caching pools; the size limits are implemented at different places in
the cache.

Similarly, rados ls definitely doesn't work properly on cache pools;
you shouldn't expect anything sensible to come out of it. Among other
things, there are "whiteout" objects in the cache pool (recording that
an object is known not to exist in the base pool) that won't be listed
in "rados ls", and I'm sure there's other stuff too.

If you're trying to limit the cache pool size you want to do that with
the target size and dirty targets/limits.
-Greg
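
Concretely, the knobs Greg names are pool properties; with the values from
Xavier's listing that would look something like this (a sketch), instead of
relying on a pool quota:

ceph osd pool set test1_ct-cache target_max_bytes 1006632960
ceph osd pool set test1_ct-cache cache_target_dirty_ratio 0.4
ceph osd pool set test1_ct-cache cache_target_full_ratio 0.8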
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados - Atomic Write

2015-02-24 Thread Italo Santos
Hello Noah,  

In my case the ordering is important, and I've seen that librados has a lock 
implementation, which I'll use in my implementation. Thanks for your help.  

Regards.

Italo Santos
http://italosantos.com.br/


On Tuesday, February 24, 2015 at 12:52, Noah Watkins wrote:

> I'll take a shot at answering this:
>  
> Operations are atomic in the sense that there are no partial failures. 
> Additionally, access to an object should appear to be serialized. So, two 
> in-flight operations A and B will be applied in either A,B or B,A order. If 
> ordering is important (e.g. the operations are dependent) then the 
> application should enforce ordering.
>  
> - Original Message -
> From: "Italo Santos" mailto:okd...@gmail.com)>
> To: ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> Sent: Monday, February 23, 2015 12:01:15 PM
> Subject: [ceph-users] librados - Atomic Write
>  
> Hello,  
>  
> The librados write ops are atomic? I mean what happens if two different 
> clients try to write the same object with the same content?  
>  
> Regards.  
>  
> Italo Santos  
> http://italosantos.com.br/  
>  
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  
>  


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Startup Best Practice: gpt/udev or SysVInit/systemd ?

2015-02-24 Thread Robert LeBlanc
We have had good luck with letting udev do its thing on CentOS 7.
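
For what it's worth, the moving parts of that route can be sanity-checked with
something like this (a sketch; the rules file name is from memory, not from
this thread):

ceph-disk list
udevadm trigger --subsystem-match=block --action=add

The first command shows the partitions carrying the Ceph GPT type codes, and
the second re-fires the udev rules (95-ceph-osd.rules) that end up calling
ceph-disk activate.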

On Wed, Feb 18, 2015 at 7:46 PM, Anthony Alba  wrote:
> Hi Cephers,
>
> What is your "best practice" for starting up OSDs?
>
> I am trying to determine the most robust technique on CentOS 7 where I
> have too much choice:
>
> udev/gpt/uuid or /etc/init.d/ceph or /etc/systemd/system/ceph-osd@X
>
> 1. Use udev/gpt/UUID: no OSD  sections in  /etc/ceph/mycluster.conf or
> premounts in /etc/fstab.
> Let udev + ceph-disk-activate do its magic.
>
> 2. Use /etc/init.d/ceph start osd or systemctl start ceph-osd@N
> a. do you change partition UUID so no udev kicks in?
> b. do you keep  [osd.N] sections in /etc/ceph/mycluster.conf
> c. premount all journals/OSDs in /etc/fstab?
>
> The problem with this approach, though very explicit and robust, is
> that it is is hard to maintain
> /etc/fstab on the OSD hosts.
>
> - Anthony
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on LVM volume

2015-02-24 Thread Sebastien Han
A while ago, I managed to have this working but this was really tricky.
See my comment here: 
https://github.com/ceph/ceph-ansible/issues/9#issuecomment-37127128

One use case I had was a system with 2 SSDs for the OS and a couple of OSDs.
Both SSDs were in RAID1 and the system was already configured with LVM.
So we had to create LVs for each journal.

> On 24 Feb 2015, at 14:41, Jörg Henne  wrote:
> 
> 2015-02-24 14:05 GMT+01:00 John Spray :
> 
> I imagine that without proper partition labels you'll also not get the 
> benefit of e.g. the udev magic
> that allows plugging OSDs in/out of different hosts.  More generally you'll 
> just be in a rather non standard configuration that will confuse anyone 
> working on the host.
> Ok, thanks for the heads up!
>  
> Can I ask why you want to use LVM?  It is not generally necessary or useful 
> with Ceph: Ceph expects to be fed raw drives.
> I am currently just experimenting with ceph. Although I have a reasonable 
> number of "lab" nodes, those nodes are shared with other experimentation and 
> thus it would be rather inconvenient to dedicate the raw disks exclusively to 
> ceph.
> 
> Joerg Henne
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Cheers.
 
Sébastien Han 
Cloud Architect 

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72 
Mail: sebastien@enovance.com 
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Performance

2015-02-24 Thread Kevin Walker
Hi All

Just recently joined the list and have been reading/learning about ceph for
the past few months. Overall it looks to be well suited to our cloud
platform but I have stumbled across a few worrying items that hopefully you
guys can clarify the status of.

Reading through various mailing list archives, it would seem an OSD caps
out at about 3k IOPS. Dieter Kasper from Fujitsu made an interesting
observation about the size of the OSD code (20k-plus lines at that time); is
this being optimized further, and has this IOPS limit been improved in Giant?

Is there a way to overcome the XFS fragmentation problems other users have
experienced?

Kind regards

Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS remove

2015-02-24 Thread ceph-users

How can I remove the 2nd MDS:
# ceph mds dump
dumped mdsmap epoch 72
epoch   72
flags   0
created 2015-02-24 15:55:10.631958
modified2015-02-24 17:58:49.400841
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
last_failure62
last_failure_osd_epoch  1656
compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate 
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no 
anchor table}

max_mds 1
in  0
up  {0=23376}
failed
stopped
data_pools  7,8
metadata_pool   6
inline_data disabled
23376:  192.168.0.1:6927/713332 'z-srv-m-cph01' mds.0.9 up:active seq 7
23380:  192.168.0.1:6814/713728 '192.168.0.1' mds.-1.0 up:standby seq 2


# ceph mds rm mds.-1.0
Invalid command:  mds.-1.0 doesn't represent an int
mds rm <int[0-]> <name (type.id)> :  remove nonactive mds
Error EINVAL: invalid command

Any clue?

Thanks
Gian

On 24/02/2015 09:58, ceph-users wrote:

Hi all,

I've set up a ceph cluster using this playbook:
https://github.com/ceph/ceph-ansible

I've configured in my hosts list
[mdss]
hostname1
hostname2


I now need to remove this MDS from the cluster.
The only document I found is this:
http://www.sebastien-han.fr/blog/2012/07/04/remove-a-mds-server-from-a-ceph-cluster/

# service ceph -a stop mds
=== mds.z-srv-m-cph02 ===
Stopping Ceph mds.z-srv-m-cph02 on z-srv-m-cph02...done
=== mds.r-srv-m-cph02 ===
Stopping Ceph mds.r-srv-m-cph02 on r-srv-m-cph02...done
=== mds.r-srv-m-cph01 ===
Stopping Ceph mds.r-srv-m-cph01 on r-srv-m-cph01...done
=== mds.0 ===
Stopping Ceph mds.0 on zrh-srv-m-cph01...done
=== mds.192.168.0.1 ===
Stopping Ceph mds.192.168.0.1 on z-srv-m-cph01...done
=== mds.z-srv-m-cph01 ===
Stopping Ceph mds.z-srv-m-cph01 on z-srv-m-cph01...done

[root@z-srv-m-cph01 ceph]# ceph mds stat
e1: 0/0/0 up

1. question: why the MDS are not stopped?
2. When I try to remove them:

# ceph mds rm mds.z-srv-m-cph01 z-srv-m-cph01
Invalid command: mds.z-srv-m-cph01 doesn't represent an int
mds rm   : remove nonactive mds
Error EINVAL: invalid command

The ansible playbook created me a conf like this in ceph.conf:
[mds]

[mds.z-srv-m-cph01]
host = z-srv-m-cph01

Can someone please help on this or at least give some hints?

Thank you very much
Gian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Performance

2015-02-24 Thread Mark Nelson

On 02/24/2015 04:21 PM, Kevin Walker wrote:

Hi All

Just recently joined the list and have been reading/learning about ceph
for the past few months. Overall it looks to be well suited to our cloud
platform but I have stumbled across a few worrying items that hopefully
you guys can clarify the status of.

Reading through various mailing list archives, it would seem an OSD caps
out at about 3k IOPS. Dieter Kasper from Fujistu made an interesting
observation about the size of the OSD code(20k plus lines at that time),
is this being optimized further and has this IOPS limit been improved in
Giant?


In recent tests under fairly optimal conditions, I'm seeing performance 
topping out at about 4K object writes/s and 22K object reads/s against 
an OSD with a very fast PCIe SSD.  There are several reasons writes are 
slower than reads, but this is something we are working on improving in 
a variety of ways.


I believe others may have achieved even higher results.



Is there a way to over come the XFS fragmentation problems other users
have experienced?


Setting the newish filestore_xfs_extsize parameter to true appears to 
help in testing we did a couple months ago.  We filled up a cluster to 
near capacity (~70%) and then did 12 hours of random writes.  After the 
test completed, with filestore_xfs_extsize disabled we were seeing 
something like 13% fragmentation, while with it enabled we were seeing 
around 0.02% fragmentation.
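
For reference, a minimal ceph.conf fragment to turn that on (my assumption:
set it in the [osd] section and restart the OSDs for it to take effect):

[osd]
filestore xfs extsize = true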




Kind regards

Kevin


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Performance

2015-02-24 Thread Kevin Walker
Hi Mark

Thanks for the info, 22k is not bad, but still massively below what a PCIe SSD 
can achieve. Care to expand on why the write IOPS are so low? Was this with a 
separate RAM-disk PCIe device or an SLC SSD for the journal?

That fragmentation percentage looks good. We are considering using just SSDs 
for OSDs and RAM-disk PCIe devices for the journals, so this would be OK.

Kind regards

Kevin Walker
+968 9765 1742

On 25 Feb 2015, at 02:35, Mark Nelson  wrote:

> On 02/24/2015 04:21 PM, Kevin Walker wrote:
> Hi All
> 
> Just recently joined the list and have been reading/learning about ceph
> for the past few months. Overall it looks to be well suited to our cloud
> platform but I have stumbled across a few worrying items that hopefully
> you guys can clarify the status of.
> 
> Reading through various mailing list archives, it would seem an OSD caps
> out at about 3k IOPS. Dieter Kasper from Fujistu made an interesting
> observation about the size of the OSD code(20k plus lines at that time),
> is this being optimized further and has this IOPS limit been improved in
> Giant?

In recent tests under fairly optimal conditions, I'm seeing performance topping 
out at about 4K object writes/s and 22K object reads/s against an OSD with a 
very fast PCIe SSD.  There are several reasons writes are slower than reads, 
but this is something we are working on improving in a variety of ways.

I believe others may have achieved even higher results.

> 
> Is there a way to over come the XFS fragmentation problems other users
> have experienced?

Setting the newish filestore_xfs_extsize parameter to true appears to help in 
testing we did a couple months ago.  We filled up a cluster to near capacity 
(~70%) and then did 12 hours of random writes.  After the test completed, with 
filestore_xfs_extsize disabled we were seeing something like 13% fragmentation, 
while with it enabled we were seeing around 0.02% fragmentation.

> 
> Kind regards
> 
> Kevin
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Performance

2015-02-24 Thread Mark Nelson

Hi Kevin,

Writes are probably limited by a combination of locks, concurrent 
O_DSYNC journal writes, fsyncs, etc.  The tests I mentioned were with 
both the OSD and the OSD journal on the same PCIe SSD.  Others have 
looked into this in more detail than I have, so they might be able to chime 
in with more specifics.


Mark

On 02/24/2015 04:50 PM, Kevin Walker wrote:

Hi Mark

Thanks for the info, 22k is not bad, but still massively below what a pcie ssd 
can achieve. Care to expand on why the write IOPS are so low? Was this with a 
separate RAM disk pcie device or SLC SSD for the journal?

That fragmentation percentage looks good. We are considering using just SSD's 
for OSD's and RAM disk pcie devices for the Journals so this would be ok.

Kind regards

Kevin Walker
+968 9765 1742

On 25 Feb 2015, at 02:35, Mark Nelson  wrote:


On 02/24/2015 04:21 PM, Kevin Walker wrote:
Hi All

Just recently joined the list and have been reading/learning about ceph
for the past few months. Overall it looks to be well suited to our cloud
platform but I have stumbled across a few worrying items that hopefully
you guys can clarify the status of.

Reading through various mailing list archives, it would seem an OSD caps
out at about 3k IOPS. Dieter Kasper from Fujistu made an interesting
observation about the size of the OSD code(20k plus lines at that time),
is this being optimized further and has this IOPS limit been improved in
Giant?


In recent tests under fairly optimal conditions, I'm seeing performance topping 
out at about 4K object writes/s and 22K object reads/s against an OSD with a 
very fast PCIe SSD.  There are several reasons writes are slower than reads, 
but this is something we are working on improving in a variety of ways.

I believe others may have achieved even higher results.



Is there a way to over come the XFS fragmentation problems other users
have experienced?


Setting the newish filestore_xfs_extsize parameter to true appears to help in 
testing we did a couple months ago.  We filled up a cluster to near capacity 
(~70%) and then did 12 hours of random writes.  After the test completed, with 
filestore_xfs_extsize disabled we were seeing something like 13% fragmentation, 
while with it enabled we were seeing around 0.02% fragmentation.



Kind regards

Kevin


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-02-24 Thread Gregory Farnum
On Mon, Feb 23, 2015 at 8:59 AM, Chris Murray  wrote:
> ... Trying to send again after reporting bounce backs to dreamhost ...
> ... Trying to send one more time after seeing mails come through the
> list today ...
>
> Hi all,
>
> First off, I should point out that this is a 'small cluster' issue and
> may well be due to the stretched resources. If I'm doomed to destroying
> and starting again, fair be it, but I'm interested to see if things can
> get up and running again.
>
> My experimental ceph cluster now has 5 nodes with 3 osds each. Some
> drives are big, some drives are small. Most are formatted with BTRFS and
> two are still formatted with XFS, which I intend to remove and recreate
> with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but
> compression suits my use-case, so I'm prepared to stick with it while it
> matures. I had to set the following, to avoid osds dying as the IO was
> consumed by the snapshot creation and deletion process (as I understand
> it):
>
> filestore btrfs snap = false
>
> and the mount options look like this:
>
> osd mount options btrfs =
> rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo
>
> Each node is a HP Microserver n36l or n54l, with 8GB of memory, so CPU
> horsepower is lacking somewhat. Ceph is version 0.80.8, and each node is
> also a mon.
>
> My issue is: After adding the 15th osd, the cluster went into a spiral
> of destruction, with osds going down one after another. One might go
> down on occasion, and usually a start of the osd in question will remedy
> things. This time, though, it hasn't, and the problem appears to have
> become worse and worse. I've tried starting osds, restarting whole
> hosts, to no avail. I've brought all osds back 'in' and set noup, nodown
> and noout. I've ceased rbd activity since it was getting blocked anyway.
> The cluster appears to now be 'stuck' in this state:
>
> cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
>  health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969
> pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445
> pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck
> unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full
> osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set
>  monmap e5: 5 mons at
> {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3=
> 192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618,
> quorum 0,1,2,3,4 0,1,2,3,4
>  osdmap e63276: 15 osds: 7 up, 15 in
> flags noup,nodown,noout
>   pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects
> 4611 GB used, 871 GB / 5563 GB avail
> 838948/2578420 objects degraded (32.537%)
>3 down+remapped+peering
>8 stale+active+degraded+remapped
>   85 active+clean
>1 stale+incomplete
> 1088 stale+down+peering
>  642 active+degraded+remapped
>1 incomplete
>   33 stale+remapped+peering
>  135 down+peering
>1 stale+degraded
>1
> stale+active+degraded+remapped+wait_backfill+backfill_toofull
>  854 active+remapped
>  234 stale+active+degraded
>4 active+degraded+remapped+backfill_toofull
>   40 active+remapped+backfill_toofull
> 1079 active+degraded
>5 stale+active+clean
>   74 stale+peering
>
> Take one of the nodes. It holds osds 12 (down & in), 13 (up & in) and 14
> (down & in).
>
> # ceph osd stat
>  osdmap e63276: 15 osds: 7 up, 15 in
> flags noup,nodown,noout
>
> # ceph daemon osd.12 status
> no valid command found; 10 closest matches:
> config show
> help
> log dump
> get_command_descriptions
> git_version
> config set   [...]
> version
> 2
> config get 
> 0
> admin_socket: invalid command
>
> # ceph daemon osd.13 status
> { "cluster_fsid": "e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a",
>   "osd_fsid": "d7794b10-2366-4c4f-bb4d-5f11098429b6",
>   "whoami": 13,
>   "state": "active",
>   "oldest_map": 48214,
>   "newest_map": 63276,
>   "num_pgs": 790}
>
> # ceph daemon osd.14 status
> admin_socket: exception getting command descriptions: [Errno 111]
> Connection refused
>
> I'm assuming osds 12 and 14 are acting that way because they're not up,
> but why are they different?

Well, below you indicate that osd.14's log says it crashed on an
internal heartbeat timeout (usually that means it got stuck waiting for disk
IO, or the kernel/btrfs hung), so that's why they differ. The osd.12 process
exists but isn't "up"; osd.14 doesn't even have a socket to connect to.
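
A quick way to confirm the heartbeat-timeout theory is to look for the
suicide/heartbeat messages in the OSD log and for hung btrfs tasks in the
kernel log; a hedged sketch, assuming the default log locations:

# grep -i 'heartbeat_map\|suicide' /var/log/ceph/ceph-osd.14.log | tail
# dmesg | grep -iE 'hung task|btrfs'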

>
> In terms of logs, ceph-osd.12.log doesn't go beyond this:
> 2015-02-22 10:38:29.629407 7fd24952c780  0 ceph version 0.80.8
> (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813
> 2

Re: [ceph-users] OSD Performance

2015-02-24 Thread Christian Balzer
On Wed, 25 Feb 2015 02:50:59 +0400 Kevin Walker wrote:

> Hi Mark
> 
> Thanks for the info, 22k is not bad, but still massively below what a
> pcie ssd can achieve. Care to expand on why the write IOPS are so low?

Aside from what Mark mentioned in his reply there's also latency to be
considered in the overall picture.

But my (and other people's) tests, including Mark's recent PDF posted here,
clearly indicate where the problem with small (4k) write IOPS lies: CPU
utilization, mostly by Ceph code (but with significant OS time, too).

To quote myself:
I did some brief tests with a machine having 8 DC S3700 100GB for OSDs
(replica 1) under 0.80.6 and the right (make that wrong) type of load
(small, 4k I/Os) did melt all of the 8 3.5GHz cores in that box.
While never exceeding 15% utilization of the SSDs.

Even with further optimizations I predict the CPU(s) will remain the limiting
factor for small write IOPS.
So with that in mind, a pure SSD storage node design will have to consider
that and spend money where it actually improves things.
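
For reference, a load of that shape can be reproduced with fio's rbd
engine; a hedged sketch, where the pool, image and client names are
placeholders:

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
    --name=4k-randwrite --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting

Watching the node with top or vmstat while this runs makes the CPU vs SSD
utilization gap quite obvious.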

> Was this with a separate RAM disk pcie device or SLC SSD for the journal?
> 
> That fragmentation percentage looks good. We are considering using just
> SSD's for OSD's and RAM disk pcie devices for the Journals so this would
> be ok.
> 
For starters, you clearly have too much money.
You're not going to see a good return on investment, as per what I wrote
above. Even faster journals are pointless, having the journal on the
actual OSD SSDs is a non-issue performance wise and makes things a lot
more straightforward. 
I could totally see a much more primitive (HDD OSDs, journal SSDs) but
more balanced and parallelized cluster outperform your design at the same
cost (but admittedly more space usage). 

Secondly, why would you even care one iota about file system fragmentation
when using SSDs for all your storage?

Regards,

Christian

> Kind regards
> 
> Kevin Walker
> +968 9765 1742
> 
> On 25 Feb 2015, at 02:35, Mark Nelson  wrote:
> 
> > On 02/24/2015 04:21 PM, Kevin Walker wrote:
> > Hi All
> > 
> > Just recently joined the list and have been reading/learning about ceph
> > for the past few months. Overall it looks to be well suited to our
> > cloud platform but I have stumbled across a few worrying items that
> > hopefully you guys can clarify the status of.
> > 
> > Reading through various mailing list archives, it would seem an OSD
> > caps out at about 3k IOPS. Dieter Kasper from Fujistu made an
> > interesting observation about the size of the OSD code(20k plus lines
> > at that time), is this being optimized further and has this IOPS limit
> > been improved in Giant?
> 
> In recent tests under fairly optimal conditions, I'm seeing performance
> topping out at about 4K object writes/s and 22K object reads/s against
> an OSD with a very fast PCIe SSD.  There are several reasons writes are
> slower than reads, but this is something we are working on improving in
> a variety of ways.
> 
> I believe others may have achieved even higher results.
> 
> > 
> > Is there a way to over come the XFS fragmentation problems other users
> > have experienced?
> 
> Setting the newish filestore_xfs_extsize parameter to true appears to
> help in testing we did a couple months ago.  We filled up a cluster to
> near capacity (~70%) and then did 12 hours of random writes.  After the
> test completed, with filestore_xfs_extsize disabled we were seeing
> something like 13% fragmentation, while with it enabled we were seeing
> around 0.02% fragmentation.
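
For anyone wanting to try that, a hedged sketch of the ceph.conf setting and
of one way to measure XFS extent fragmentation afterwards (the device path
is a placeholder):

[osd]
filestore xfs extsize = true

# check fragmentation read-only on the OSD's data partition
xfs_db -r -c frag /dev/sdb1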
> 
> > 
> > Kind regards
> > 
> > Kevin
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Centos 7 OSD silently fail to start

2015-02-24 Thread Barclay Jameson
I have tried to install ceph using ceph-deploy, but sgdisk seems to
have too many issues, so I did a manual install. After running mkfs.btrfs on
the disks and journals and mounting them, I tried to start the OSDs,
which failed. The first error was:
#/etc/init.d/ceph start osd.0
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )

I then manually added the osds to the conf file with the following as
an example:
[osd.0]
osd_host = node01

Now when I run the command :
# /etc/init.d/ceph start osd.0

There is no error or output from the command, and in fact when I do a
ceph -s no OSDs are listed as being up.
Running ps aux | grep -i ceph or ps aux | grep -i osd shows there are
no OSDs running.
I have also checked with htop to see if any processes are running; none are shown.

I had this working on SL6.5 with Firefly but Giant on Centos 7 has
been nothing but a giant pain.
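
One way to get more information out of a silent failure like this is to run
the OSD in the foreground and watch its log; a hedged sketch, assuming the
default paths and the osd.0 from above:

# ceph-osd -i 0 -f --debug-osd 10
# tail -n 50 /var/log/ceph/ceph-osd.0.log

It may also be worth checking whether the sysvinit script matches on
"host = node01" rather than "osd_host = node01" in the [osd.0] section; if
memory serves, "host" is the key it compares against the local hostname.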
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck ceph-deploy mon create-initial / giant

2015-02-24 Thread Brad Hubbard

On 02/24/2015 09:06 PM, Loic Dachary wrote:



On 24/02/2015 12:00, Christian Balzer wrote:

On Tue, 24 Feb 2015 11:17:22 +0100 Loic Dachary wrote:




On 24/02/2015 09:58, Stephan Seitz wrote:

Hi Loic,

this is the content of our ceph.conf

[global]
fsid = 719f14b2-7475-4b25-8c5f-3ffbcf594d13
mon_initial_members = ceph1, ceph2, ceph3
mon_host = 192.168.10.107,192.168.10.108,192.168.10.109
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default size = 2
public network = 192.168.10.0/24
cluster networt = 192.168.108.0/24


s/networt/network/ ?



If this really should turn out to be the case, it is another painfully
obvious reason why I proposed to provide full config parser output in
the logs when in default debugging level.


I agree. However, it is non trivial to implement because there is not a central 
place where all valid values are defined. It is also likely that third party 
scripts rely on the fact that arbitrary key/values can be stored in the 
configuration file.



This could be implemented as a warning then?

"Possible invalid or 3rd party entry found: X. Please check your ceph.conf 
file."

People can then report false positives and they can be added to the list of known
values?
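
In the meantime, a quick manual check can catch typos like the one above; a
hedged sketch, assuming a running mon with its admin socket at the default
path:

# ceph daemon mon.$(hostname -s) config show | grep cluster_network

If a key you set in ceph.conf still shows its default value here, suspect a
misspelled option name.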

Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Performance

2015-02-24 Thread Kevin Walker
Hi Christian

We are just looking at options at this stage. 

Using a hardware RAM disk for the journal is the same concept as the SolidFire 
guys, who are also using XFS (at least they were last time I crossed paths with 
a customer using SolidFire) and from my experiences with ZFS, using a RAM based 
log device is a far safer option than enterprise slc ssd's for write log data. 
But, I am guessing with the performance of an SSD being a lot higher than a 
spindle, the need for a separate journal is negated. Each OSD has a journal, so 
if it fails and journal fails with it, it is not such a big problem as it would 
be with ZFS?

For the OSD's we are actually thinking of using low cost Samsung 1TB DC SSD's, 
but based on what you are saying even that level of performance will be 
unreachable due to the cpu overhead. 

Does this improve with RDMA?
Is anyone on the list using alternative high core count non x86 architectures  
(Tilera/ThunderX)? 
Would more threads help with this problem?

As mentioned at the beginning, we are looking at options, spindles might end up 
being a better option, with an SSD tier, hence my question about fragmentation, 
but the problem for us is power consumption. Having say 16 OSD nodes (24 
spindles each), plus 3 monitor nodes and 38 xeons consuming 100W each is a huge 
opex bill to factor against ROI. 

We are running VMware vSphere and testing vCloud with OnApp, so are expecting 
we will have to build a couple of nodes to provide FC targets, which adds 
further power consumption. 


Kind regards

Kevin Walker
+968 9765 1742

On 25 Feb 2015, at 04:40, Christian Balzer  wrote:

On Wed, 25 Feb 2015 02:50:59 +0400 Kevin Walker wrote:

> Hi Mark
> 
> Thanks for the info, 22k is not bad, but still massively below what a
> pcie ssd can achieve. Care to expand on why the write IOPS are so low?

Aside from what Mark mentioned in his reply there's also latency to be
considered in the overall picture.

But my (and other people's) tests, including Mark's recent PDF posted here,
clearly indicate where the problem with small (4k) write IOPS lies: CPU
utilization, mostly by Ceph code (but with significant OS time, too).

To quote myself:
I did some brief tests with a machine having 8 DC S3700 100GB for OSDs
(replica 1) under 0.80.6 and the right (make that wrong) type of load
(small, 4k I/Os) did melt all of the 8 3.5GHz cores in that box.
While never exceeding 15% utilization of the SSDs.

Even with further optimizations I predict the CPU(s) will remain the limiting
factor for small write IOPS.
So with that in mind, a pure SSD storage node design will have to consider
that and spend money where it actually improves things.

> Was this with a separate RAM disk pcie device or SLC SSD for the journal?
> 
> That fragmentation percentage looks good. We are considering using just
> SSD's for OSD's and RAM disk pcie devices for the Journals so this would
> be ok.
For starters, you clearly have too much money.
You're not going to see a good return on investment, as per what I wrote
above. Even faster journals are pointless, having the journal on the
actual OSD SSDs is a non-issue performance wise and makes things a lot
more straightforward. 
I could totally see a much more primitive (HDD OSDs, journal SSDs) but
more balanced and parallelized cluster outperform your design at the same
cost (but admittedly more space usage). 

Secondly, why would you even care one iota about file system fragmentation
when using SSDs for all your storage?

Regards,

Christian

> Kind regards
> 
> Kevin Walker
> +968 9765 1742
> 
>> On 25 Feb 2015, at 02:35, Mark Nelson  wrote:
>> 
>> On 02/24/2015 04:21 PM, Kevin Walker wrote:
>> Hi All
>> 
>> Just recently joined the list and have been reading/learning about ceph
>> for the past few months. Overall it looks to be well suited to our
>> cloud platform but I have stumbled across a few worrying items that
>> hopefully you guys can clarify the status of.
>> 
>> Reading through various mailing list archives, it would seem an OSD
>> caps out at about 3k IOPS. Dieter Kasper from Fujitsu made an
>> interesting observation about the size of the OSD code (20k plus lines
>> at that time), is this being optimized further and has this IOPS limit
>> been improved in Giant?
> 
> In recent tests under fairly optimal conditions, I'm seeing performance
> topping out at about 4K object writes/s and 22K object reads/s against
> an OSD with a very fast PCIe SSD.  There are several reasons writes are
> slower than reads, but this is something we are working on improving in
> a variety of ways.
> 
> I believe others may have achieved even higher results.
> 
>> 
>> Is there a way to over come the XFS fragmentation problems other users
>> have experienced?
> 
> Setting the newish filestore_xfs_extsize parameter to true appears to
> help in testing we did a couple months ago.  We filled up a cluster to
> near capacity (~70%) and then did 12 hours of random writes.  After the
> test completed, with fi

Re: [ceph-users] re: Upgrade 0.80.5 to 0.80.8 --the VM's read requestbecome too slow

2015-02-24 Thread 杨万元
I compared the applied commits between 0.80.7 and 0.80.8 and then focused on
this one:
https://github.com/ceph/ceph/commit/711a7e6f81983ff2091caa0f232af914a04a041c?diff=unified#diff-9bcd2f7647a2bd574b6ebe6baf8e61b3

This commit seems to move the flushing of waitfor_read out of the while loop;
maybe that is what makes the read request performance worse?
Sorry if I am wrong, I am not familiar with the osdc source code.
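
To narrow that down further, the commits touching the client-side read path
can be listed directly from a local clone; a hedged sketch:

# git log --oneline v0.80.7..v0.80.8 -- src/librbd src/osdc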


2015-02-21 23:14 GMT+08:00 Alexandre DERUMIER :

> >> so we can be fairly sure this problem comes from 0.80.8
>
> That's good news if you are sure that it comes from 0.80.8
>
> The applied commits between 0.80.7 and 0.80.8 is here:
>
> https://github.com/ceph/ceph/compare/v0.80.7...v0.80.8
>
>
> Now, we need to find which of them is for librados/librbd.
>
>
> - Original Message -
> From: "杨万元" 
> To: "aderumier" 
> Cc: "ceph-users" 
> Sent: Friday, 13 February 2015 04:39:05
> Objet: Re: [ceph-users] re: Upgrade 0.80.5 to 0.80.8 --the VM's read
> requestbecome too slow
>
> Thanks very much for your advice.
> Yes, as you said, disabling rbd_cache improves the read requests, but
> if I disable rbd_cache, the randwrite requests get worse. So this
> method may not solve my problem, will it?
> In addition, I also tested the 0.80.6 and 0.80.7 librbd; they perform as
> well as 0.80.5, so we can be fairly sure this problem comes from 0.80.8
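
For an A/B test, the cache only needs a client-side override; a minimal
sketch of the relevant ceph.conf section, which takes effect the next time
the guest's librbd client is started:

[client]
rbd cache = false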
>
> 2015-02-12 19:33 GMT+08:00 Alexandre DERUMIER < aderum...@odiso.com > :
>
>
> >>Hi,
> >>Can you test with disabling rbd_cache ?
>
> >>I remember a bug detected in giant, not sure it's also the case for
> firefly
>
> This was this tracker:
>
> http://tracker.ceph.com/issues/9513
>
> But It has been solved and backported to firefly.
>
> Also, can you test 0.80.6 and 0.80.7 ?
>
>
>
>
>
>
>
> - Original Message -
> From: "killingwolf" < killingw...@qq.com >
> To: "ceph-users" < ceph-users@lists.ceph.com >
> Sent: Thursday, 12 February 2015 12:16:32
> Objet: [ceph-users] re: Upgrade 0.80.5 to 0.80.8 --the VM's read
> requestbecome too slow
>
> I have this problems too , Help!
>
> -- Original Message --
> From: "杨万元" < yangwanyuan8...@gmail.com >;
> Sent: Thursday, 12 February 2015, 11:14 AM
> To: " ceph-users@lists.ceph.com " < ceph-users@lists.ceph.com >;
> Subject: [ceph-users] Upgrade 0.80.5 to 0.80.8 --the VM's read requestbecome
> too slow
>
> Hello!
> We use Ceph + OpenStack in our private cloud. Recently we upgraded our
> CentOS 6.5 based cluster from Ceph Emperor to Ceph Firefly.
> At first we used the Red Hat EPEL yum repo to upgrade; that Ceph version is
> 0.80.5. We upgraded the monitors first, then the OSDs, and the clients
> last. When we completed this upgrade, we booted a VM on the cluster and
> used fio to test the IO performance. The IO performance was as good as
> before. Everything was OK!
> Then we upgraded the cluster from 0.80.5 to 0.80.8. When that was complete,
> we rebooted the VM to load the newest librbd. After that we again used fio
> to test the IO performance and found that randwrite and write were as good
> as before, but randread and read became worse: the random read IOPS dropped
> from 4000-5000 to 300-400, and the latency got worse; the read bandwidth
> dropped from 400MB/s to 115MB/s. When I downgraded the ceph client version
> from 0.80.8 to 0.80.5, the result became normal again.
> So I think it may be something in librbd. I compared the 0.80.8
> release notes with 0.80.5 (
> http://ceph.com/docs/master/release-notes/#v0-80-8-firefly ) and the only
> change related to read requests I found in 0.80.8 is: librbd: cap memory
> utilization for read requests (Jason Dillaman). Who can explain this?
>
>
> My ceph cluster is 400osd,5mons :
> ceph -s
> health HEALTH_OK
> monmap e11: 5 mons at {BJ-M1-Cloud71=
> 172.28.2.71:6789/0,BJ-M1-Cloud73=172.28.2.73:6789/0,BJ-M2-Cloud80=172.28.2.80:6789/0,BJ-M2-Cloud81=172.28.2.81:6789/0,BJ-M3-Cloud85=172.28.2.85:6789/0
> }, election epoch 198, quorum 0,1,2,3,4
> BJ-M1-Cloud71,BJ-M1-Cloud73,BJ-M2-Cloud80,BJ-M2-Cloud81,BJ-M3-Cloud85
> osdmap e120157: 400 osds: 400 up, 400 in
> pgmap v26161895: 29288 pgs, 6 pools, 20862 GB data, 3014 kobjects
> 41084 GB used, 323 TB / 363 TB avail
> 29288 active+clean
> client io 52640 kB/s rd, 32419 kB/s wr, 5193 op/s
>
>
> The follwing is my ceph client conf :
> [global]
> auth_service_required = cephx
> filestore_xattr_use_omap = true
> auth_client_required = cephx
> auth_cluster_required = cephx
> mon_host =
> 172.29.204.24,172.29.204.48,172.29.204.55,172.29.204.58,172.29.204.73
> mon_initial_members = ZR-F5-Cloud24, ZR-F6-Cloud48, ZR-F7-Cloud55,
> ZR-F8-Cloud58, ZR-F9-Cloud73
> fsid = c01c8e28-304e-47a4-b876-cb93acc2e980
> mon osd full ratio = .85
> mon osd nearfull ratio = .75
> public network = 172.29.204.0/24
> mon warn on legacy crush tunables = false
>
> [osd]
> osd op threads = 12
> filestore journal writeahead = true
> filestore merge threshold = 40
> filestore split multiple = 8
>
> [client]
> rbd cache = true
> rbd cache writethrough until flush = false
> rbd cache size = 67108864
> rbd cache max 

Re: [ceph-users] OSD Performance

2015-02-24 Thread Christian Balzer

Hello Kevin,

On Wed, 25 Feb 2015 07:55:34 +0400 Kevin Walker wrote:

> Hi Christian
> 
> We are just looking at options at this stage. 
>
Never a bad thing to do.
 
> Using a hardware RAM disk for the journal is the same concept as the
> SolidFire guys, who are also using XFS (at least they were last time I
> crossed paths with a customer using SolidFire) 

Ah, SolidFire. 
Didn't know that; from looking at what they're doing I was expecting them
to use ZFS (or something totally self-made) and not XFS.
Journals in the Ceph sense and journals in the file system sense perform
similar functions, but still a bit of an apples and oranges case.

As for SolidFire itself, if you look at their numbers (I consider the
overall compression ratio to be vastly optimistic) you'll see that they
aren't leveraging the full SSD potentials either. 
Which is of course going to be pretty much impossible for anything
distributed.

> and from my experiences
> with ZFS, using a RAM based log device is a far safer option than
> enterprise slc ssd's for write log data. 

No argument here, from where I'm standing the only enterprise SSDs
fully worth the name are Intel DC 3700s.

> But, I am guessing with the
> performance of an SSD being a lot higher than a spindle, the need for a
> separate journal is negated. Each OSD has a journal, so if it fails and
> journal fails with it, it is not such a big problem as it would be with
> ZFS?
> 
Precisely. The protection here comes from the replication, not the journal
in and by itself. 
This 1:1 OSD/journal setup also prevents you from losing multiple OSDs if
a super fast journal fails.
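
With the ceph-disk of that era, the colocated layout is also what you get by
default when no separate journal device is passed; a hedged sketch, with the
device path as a placeholder:

# ceph-disk prepare /dev/sdb    # journal partition is carved from the same SSD
# ceph-disk activate /dev/sdb1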

> For the OSD's we are actually thinking of using low cost Samsung 1TB DC
> SSD's, but based on what you are saying even that level of performance
> will be unreachable due to the cpu overhead. 
> 
Which exact Samsung model? 
If you throw enough CPU at few enough OSDs you might get closer to pushing
things over 50%. ^o^

> Does this improve with RDMA?

You (or the people developing this part) tell me.
Things overall will improve of course if this is done right, but it won't
solve the CPU contention which isn't really related to data movement.
Lower latency by using native IB will be another bonus, but again it won't
make the Ceph code (and other bits) magically more efficient.

> Is anyone on the list using alternative high core count non x86
> architectures  (Tilera/ThunderX)? Would more threads help with this
> problem?
> 
As a gut feeling, I'd say no, but somebody correct me if I'm wrong,
especially with the recent sharding improvements/additions. 
My feeling is that more cores will hit a point of diminishing returns,
while faster cores won't, at least not as quickly.

> As mentioned at the beginning, we are looking at options, spindles might
> end up being a better option, with an SSD tier, hence my question about
> fragmentation, but the problem for us is power consumption. Having say
> 16 OSD nodes (24 spindles each), plus 3 monitor nodes and 38 xeons
> consuming 100W each is a huge opex bill to factor against ROI. 
> 
See recent discussions about SSD tiers here. 
Quite a bit of things that would make Ceph more attractive or suitable for
you are in the pipeline, but in some cases probably a year or more away.

At 16 nodes you're probably OK to go with such dense servers (1 node going
down kills 24 OSDs and the resulting data storm won't be pretty).
You might find something with 12 HDDs and 2 SSDs easier to balance (CPU
power/RAM to OSDs) and not much more expensive.

Dedicated monitor nodes are only needed if your other servers have no fast
local storage and/or are totally overloaded.  I'd go for 1 or 2 dedicated
monitors (the primary and thus busiest monitor is picked based on the
lowest IP!) and use the saved money to beef up some other nodes for a
total of 5 monitors.
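
For reference, which mon is currently the leader can be checked at any time;
a hedged sketch:

# ceph quorum_status --format json-pretty | grep quorum_leader_name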

> We are running VMware vSphere and testing vCloud with OnApp, so are
> expecting we will have to build a couple of nodes to provide FC targets,
> which adds further power consumption. 
> 
Not to mention complication.
I have seen many people talking about iSCSI and other heads for Ceph
(which of course isn't exactly efficient compared to native RBD) but can't
recall a single "this is how you do it, guaranteed to work 100% in all use
cases" solution or guide.

Another group in-house here runs XenServer, they would love to use a Ceph
cluster made by me for cheaper storage than NetAPP or 3PAR, but since
XenServer only supports NFS or iSCSI, I don't see that happening any time
soon. 

Christian
> 
> Kind regards
> 
> Kevin Walker
> +968 9765 1742
> 
> On 25 Feb 2015, at 04:40, Christian Balzer  wrote:
> 
> On Wed, 25 Feb 2015 02:50:59 +0400 Kevin Walker wrote:
> 
> > Hi Mark
> > 
> > Thanks for the info, 22k is not bad, but still massively below what a
> > pcie ssd can achieve. Care to expand on why the write IOPS are so low?
> 
> Aside from what Mark mentioned in his reply there's also latency to be
> considered in the overall picture.
> 
> But my (and ot