Re: [ceph-users] Scaling RBD module

2013-09-19 Thread Somnath Roy
Thanks Josh !
I am now able to successfully add the noshare option in the image mapping. 
Looking at the dmesg output, I found that it was indeed the secret key problem. Block 
performance is scaling now.

Regards
Somnath

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Josh Durgin
Sent: Thursday, September 19, 2013 12:24 PM
To: Somnath Roy
Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module

On 09/19/2013 12:04 PM, Somnath Roy wrote:
> Hi Josh,
> Thanks for the information. I am trying to add the following but hitting some 
> permission issue.
>
> root@emsclient:/etc# echo :6789,:6789,:6789 
> name=admin,key=client.admin,noshare test_rbd ceph_block_test' > 
> /sys/bus/rbd/add
> -bash: echo: write error: Operation not permitted

If you check dmesg, it will probably show an error trying to authenticate to 
the cluster.

Instead of key=client.admin, you can pass the base64 secret value as shown in 
'ceph auth list' with the secret=X option.
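
For example, the add line would look something like this (the monitor addresses, pool, and image names here are just placeholders, and the secret is the base64 key shown for client.admin in 'ceph auth list'):

echo '192.168.0.1:6789,192.168.0.2:6789,192.168.0.3:6789 name=admin,secret=<base64 key>,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add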

BTW, there's a ticket for adding the noshare option to rbd map so using the 
sysfs interface like this is never necessary:

http://tracker.ceph.com/issues/6264

Josh

> Here is the contents of rbd directory..
>
> root@emsclient:/sys/bus/rbd# ll
> total 0
> drwxr-xr-x  4 root root0 Sep 19 11:59 ./
> drwxr-xr-x 30 root root0 Sep 13 11:41 ../
> --w---  1 root root 4096 Sep 19 11:59 add
> drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
> drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
> -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
> --w---  1 root root 4096 Sep 19 12:03 drivers_probe
> --w---  1 root root 4096 Sep 19 12:03 remove
> --w---  1 root root 4096 Sep 19 11:59 uevent
>
>
> I checked even if I am logged in as root , I can't write anything on /sys.
>
> Here is the Ubuntu version I am using..
>
> root@emsclient:/etc# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 13.04
> Release:13.04
> Codename:   raring
>
> Here is the mount information
>
> root@emsclient:/etc# mount
> /dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
> proc on /proc type proc (rw,noexec,nosuid,nodev)
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> none on /sys/fs/cgroup type tmpfs (rw)
> none on /sys/fs/fuse/connections type fusectl (rw)
> none on /sys/kernel/debug type debugfs (rw)
> none on /sys/kernel/security type securityfs (rw)
> udev on /dev type devtmpfs (rw,mode=0755)
> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
> tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
> none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
> none on /run/shm type tmpfs (rw,nosuid,nodev)
> none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
> /dev/sda1 on /boot type ext2 (rw)
> /dev/mapper/emsclient--vg-home on /home type ext4 (rw)
>
>
> Any idea what went wrong here ?
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Josh Durgin [mailto:josh.dur...@inktank.com]
> Sent: Wednesday, September 18, 2013 6:10 PM
> To: Somnath Roy
> Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Scaling RBD module
>
> On 09/17/2013 03:30 PM, Somnath Roy wrote:
>> Hi,
>> I am running Ceph on a 3 node cluster and each of my server node is running 
>> 10 OSDs, one for each disk. I have one admin node and all the nodes are 
>> connected with 2 X 10G network. One network is for cluster and other one 
>> configured as public network.
>>
>> Here is the status of my cluster.
>>
>> ~/fio_test# ceph -s
>>
>> cluster b2e0b4db-6342-490e-9c28-0aadf0188023
>>  health HEALTH_WARN clock skew detected on mon. , mon. 
>> 
>>  monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
>> =xxx.xxx.xxx.xxx:6789/0, 
>> =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 
>> ,,
>>  osdmap e391: 30 osds: 30 up, 30 in
>>   pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB 
>> used, 11145 GB / 11172 GB avail
>>  mdsmap e1: 0/0/1 up
>>
>>
>> I started with rados bench command to benchmark the read performance of this 
>> Cluster on a large pool (~10K PGs) and found that each rados client has a 
>> limitation. Each client can only drive up to a certain mark. Each server  
>> node cpu utilization shows it is  around 85-90% idle and the admin node 
>> (from where rados client is running) is around ~80-85% idle. I am trying 
>> with 4K object size.
>
> Note that rados bench with 4k objects is different from rbd with 4k-sized 
> I/Os - rados bench sends each request to a new object, while rbd objects are 
> 4M by default.
>
>> Now, I started running more clients on the admin node and the performance is 
>> scaling till it hits the client cpu limit. Server still has the cpu of

Re: [ceph-users] 10/100 network for Mons?

2013-09-19 Thread David Zafman

I believe that, given the nature of the monitor network traffic, 10/100 network 
ports should be fine.

David Zafman
Senior Developer
http://www.inktank.com

On Sep 18, 2013, at 1:24 PM, Gandalf Corvotempesta 
 wrote:

> Hi to all.
> Actually I'm building a test cluster with 3 OSD servers connected with
> IPoIB for cluster networks and 10GbE for public network.
> 
> I have to connect these OSDs to some MONs servers located in another
> rack with no gigabit or 10Gb connection.
> 
> Could I use some 10/100 network ports? Which kind of traffic is
> managed by mons?
> AFAIK, clients will directly connect to the right OSDs (and there I
> have a 10GbE network), so Mons shouldn't require that speed.
> 
> Is this right?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scaling RBD module

2013-09-19 Thread Somnath Roy
Hi Josh,
Thanks for the information. I am trying to add the following but hitting some 
permission issue.

root@emsclient:/etc# echo :6789,:6789,:6789 
name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
-bash: echo: write error: Operation not permitted

Here is the contents of rbd directory..

root@emsclient:/sys/bus/rbd# ll
total 0
drwxr-xr-x  4 root root0 Sep 19 11:59 ./
drwxr-xr-x 30 root root0 Sep 13 11:41 ../
--w---  1 root root 4096 Sep 19 11:59 add
drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
-rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
--w---  1 root root 4096 Sep 19 12:03 drivers_probe
--w---  1 root root 4096 Sep 19 12:03 remove
--w---  1 root root 4096 Sep 19 11:59 uevent


I checked even if I am logged in as root , I can't write anything on /sys.

Here is the Ubuntu version I am using..

root@emsclient:/etc# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 13.04
Release:13.04
Codename:   raring

Here is the mount information

root@emsclient:/etc# mount
/dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/emsclient--vg-home on /home type ext4 (rw)


Any idea what went wrong here ?

Thanks & Regards
Somnath

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, September 18, 2013 6:10 PM
To: Somnath Roy
Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module

On 09/17/2013 03:30 PM, Somnath Roy wrote:
> Hi,
> I am running Ceph on a 3 node cluster and each of my server node is running 
> 10 OSDs, one for each disk. I have one admin node and all the nodes are 
> connected with 2 X 10G network. One network is for cluster and other one 
> configured as public network.
>
> Here is the status of my cluster.
>
> ~/fio_test# ceph -s
>
>cluster b2e0b4db-6342-490e-9c28-0aadf0188023
> health HEALTH_WARN clock skew detected on mon. , mon. 
> 
> monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
> =xxx.xxx.xxx.xxx:6789/0, 
> =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 
> ,,
> osdmap e391: 30 osds: 30 up, 30 in
>  pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 
> 11145 GB / 11172 GB avail
> mdsmap e1: 0/0/1 up
>
>
> I started with rados bench command to benchmark the read performance of this 
> Cluster on a large pool (~10K PGs) and found that each rados client has a 
> limitation. Each client can only drive up to a certain mark. Each server  
> node cpu utilization shows it is  around 85-90% idle and the admin node (from 
> where rados client is running) is around ~80-85% idle. I am trying with 4K 
> object size.

Note that rados bench with 4k objects is different from rbd with 4k-sized I/Os 
- rados bench sends each request to a new object, while rbd objects are 4M by 
default.

> Now, I started running more clients on the admin node and the performance is 
> scaling till it hits the client cpu limit. Server still has the cpu of 30-35% 
> idle. With small object size I must say that the ceph per osd cpu utilization 
> is not promising!
>
> After this, I started testing the rados block interface with kernel rbd 
> module from my admin node.
> I have created 8 images mapped on the pool having around 10K PGs and I am not 
> able to scale up the performance by running fio (either by creating a 
> software raid or running on individual /dev/rbd* instances). For example, 
> running multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2)  
> the performance I am getting is half of what I am getting if running one 
> instance. Here is my fio job script.
>
> [random-reads]
> ioengine=libaio
> iodepth=32
> filename=/dev/rbd1
> rw=randread
> bs=4k
> direct=1
> size=2G
> numjobs=64
>
> Let me know if I am following the proper procedure or not.
>
> But, If my understanding is correct, kernel rbd module is acting as a client 
> to the cluster and in one admin node I can run only one of such kernel 
> instance.
> If so, I am then limited to the client bottleneck that I stated earlier. The 
> cpu utilization of the server side is around 85-90% idle, so, it is clear 

Re: [ceph-users] Scaling RBD module

2013-09-19 Thread Josh Durgin

On 09/19/2013 12:04 PM, Somnath Roy wrote:

Hi Josh,
Thanks for the information. I am trying to add the following but hitting some 
permission issue.

root@emsclient:/etc# echo :6789,:6789,:6789 
name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
-bash: echo: write error: Operation not permitted


If you check dmesg, it will probably show an error trying to
authenticate to the cluster.

Instead of key=client.admin, you can pass the base64 secret value as
shown in 'ceph auth list' with the secret=X option.

BTW, there's a ticket for adding the noshare option to rbd map so using
the sysfs interface like this is never necessary:

http://tracker.ceph.com/issues/6264

Josh


Here is the contents of rbd directory..

root@emsclient:/sys/bus/rbd# ll
total 0
drwxr-xr-x  4 root root0 Sep 19 11:59 ./
drwxr-xr-x 30 root root0 Sep 13 11:41 ../
--w---  1 root root 4096 Sep 19 11:59 add
drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
-rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
--w---  1 root root 4096 Sep 19 12:03 drivers_probe
--w---  1 root root 4096 Sep 19 12:03 remove
--w---  1 root root 4096 Sep 19 11:59 uevent


I checked even if I am logged in as root , I can't write anything on /sys.

Here is the Ubuntu version I am using..

root@emsclient:/etc# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 13.04
Release:13.04
Codename:   raring

Here is the mount information

root@emsclient:/etc# mount
/dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/emsclient--vg-home on /home type ext4 (rw)


Any idea what went wrong here ?

Thanks & Regards
Somnath

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, September 18, 2013 6:10 PM
To: Somnath Roy
Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module

On 09/17/2013 03:30 PM, Somnath Roy wrote:

Hi,
I am running Ceph on a 3 node cluster and each of my server node is running 10 
OSDs, one for each disk. I have one admin node and all the nodes are connected 
with 2 X 10G network. One network is for cluster and other one configured as 
public network.

Here is the status of my cluster.

~/fio_test# ceph -s

cluster b2e0b4db-6342-490e-9c28-0aadf0188023
 health HEALTH_WARN clock skew detected on mon. , mon. 

 monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
=xxx.xxx.xxx.xxx:6789/0, =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, 
quorum 0,1,2 ,,
 osdmap e391: 30 osds: 30 up, 30 in
  pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 
11145 GB / 11172 GB avail
 mdsmap e1: 0/0/1 up


I started with rados bench command to benchmark the read performance of this 
Cluster on a large pool (~10K PGs) and found that each rados client has a 
limitation. Each client can only drive up to a certain mark. Each server  node 
cpu utilization shows it is  around 85-90% idle and the admin node (from where 
rados client is running) is around ~80-85% idle. I am trying with 4K object 
size.


Note that rados bench with 4k objects is different from rbd with 4k-sized I/Os 
- rados bench sends each request to a new object, while rbd objects are 4M by 
default.
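
For a concrete comparison, the two workloads would be invoked roughly like this (pool name, run time, and rbd device below are only examples): the rados bench run creates a brand-new 4k object for every request, while the fio run issues 4k reads inside existing 4M rbd objects.

rados -p test_rbd bench 60 write -b 4096 -t 32
fio --name=randread --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --size=2G --filename=/dev/rbd1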


Now, I started running more clients on the admin node and the performance is 
scaling till it hits the client cpu limit. Server still has the cpu of 30-35% 
idle. With small object size I must say that the ceph per osd cpu utilization 
is not promising!

After this, I started testing the rados block interface with kernel rbd module 
from my admin node.
I have created 8 images mapped on the pool having around 10K PGs and I am not 
able to scale up the performance by running fio (either by creating a software 
raid or running on individual /dev/rbd* instances). For example, running 
multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2)  the 
performance I am getting is half of what I am getting if running one instance. 
Here is my fio job script.

[random-reads]
ioengine=libaio
iodepth=32
filename=/dev/rbd1
rw=randread
bs=4k
direct=1
size=2G
numjobs=64

Let me know if I am following

[ceph-users] monitor deployment during quick start

2013-09-19 Thread Gruher, Joseph R
Could someone give me a quick clarification on the quick start guide?  On 
this page: http://ceph.com/docs/next/start/quick-ceph-deploy/.  After I run 
"ceph-deploy new" against a system, is that system then a monitor from that point 
forward?  Or do I then have to run "ceph-deploy mon create" against that same system 
before it is really a monitor?

Regardless of the combination of systems I try, I seem to get a failure at the 
add-a-monitor step.  Is this the correct sequence?
ceph@cephtest01:~$ ceph-deploy new cephtest02
ceph@cephtest01:~$ ceph-deploy install --no-adjust-repos cephtest02 
cephtest03 cephtest04
ceph@cephtest01:~$ ceph-deploy mon create cephtest02

Here is the failure I get:

ceph@cephtest01:~$ ceph-deploy mon create cephtest02
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts cephtest02
[ceph_deploy.mon][DEBUG ] detecting platform for host cephtest02 ...
[ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 12.04 precise
[cephtest02][DEBUG ] determining if provided host has same hostname in remote
[cephtest02][DEBUG ] deploying mon to cephtest02
[cephtest02][DEBUG ] remote hostname: cephtest02
[cephtest02][INFO  ] write cluster configuration to /etc/ceph/{cluster}.conf
[cephtest02][DEBUG ] checking for done path: 
/var/lib/ceph/mon/ceph-cephtest02/done
[cephtest02][DEBUG ] done path does not exist: 
/var/lib/ceph/mon/ceph-cephtest02/done
[cephtest02][INFO  ] creating keyring file: 
/var/lib/ceph/tmp/ceph-cephtest02.mon.keyring
[cephtest02][INFO  ] create the monitor keyring file
[cephtest02][INFO  ] Running command: ceph-mon --cluster ceph --mkfs -i 
cephtest02 --keyring /var/lib/ceph/tmp/ceph-cephtest02.mon.keyring
[cephtest02][ERROR ] Traceback (most recent call last):
[cephtest02][ERROR ]   File 
"/usr/lib/python2.7/dist-packages/ceph_deploy/hosts/common.py", line 72, in 
mon_create
[cephtest02][ERROR ]   File 
"/usr/lib/python2.7/dist-packages/ceph_deploy/util/decorators.py", line 10, in 
inner
[cephtest02][ERROR ]   File 
"/usr/lib/python2.7/dist-packages/ceph_deploy/util/wrappers.py", line 6, in 
remote_call
[cephtest02][ERROR ]   File "/usr/lib/python2.7/subprocess.py", line 511, in 
check_call
[cephtest02][ERROR ] raise CalledProcessError(retcode, cmd)
[cephtest02][ERROR ] CalledProcessError: Command '['ceph-mon', '--cluster', 
'ceph', '--mkfs', '-i', 'cephtest02', '--keyring', 
'/var/lib/ceph/tmp/ceph-cephtest02.mon.keyring']' returned non-zero exit status 
1
[cephtest02][INFO  ] --conf/-c  Read configuration from the given configuration file
[cephtest02][INFO  ] -d   Run in foreground, log to stderr.
[cephtest02][INFO  ] -f   Run in foreground, log to usual location.
[cephtest02][INFO  ] --id/-i  set ID portion of my name
[cephtest02][INFO  ] --name/-n  set name (TYPE.ID)
[cephtest02][INFO  ] --version  show version and quit
[cephtest02][INFO  ]    --debug_ms N
[cephtest02][INFO  ] set message debug level (e.g. 1)
[cephtest02][ERROR ] too many arguments: [--cluster,ceph]
[cephtest02][ERROR ] usage: ceph-mon -i monid [--mon-data=pathtodata] [flags]
[cephtest02][ERROR ]   --debug_mon n
[cephtest02][ERROR ] debug monitor level (e.g. 10)
[cephtest02][ERROR ]   --mkfs
[cephtest02][ERROR ] build fresh monitor fs
[ceph_deploy.mon][ERROR ] Failed to execute command: ceph-mon --cluster ceph 
--mkfs -i cephtest02 --keyring /var/lib/ceph/tmp/ceph-cephtest02.mon.keyring
[ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors


Trying to run the failing command myself:

ceph@cephtest01:~$ ssh cephtest02 "sudo ceph-mon --cluster ceph --mkfs -i 
cephtest02 --keyring /var/lib/ceph/tmp/ceph-cephtest02.mon.keyring"
--conf/-c  Read configuration from the given configuration file
-d   Run in foreground, log to stderr.
-f   Run in foreground, log to usual location.
--id/-i  set ID portion of my name
--name/-n  set name (TYPE.ID)
--version  show version and quit

   --debug_ms N
set message debug level (e.g. 1)
too many arguments: [--cluster,ceph]
usage: ceph-mon -i monid [--mon-data=pathtodata] [flags]
  --debug_mon n
debug monitor level (e.g. 10)
  --mkfs
build fresh monitor fs


It's not clear whether I should be using the same system from "ceph-deploy new" for 
"ceph-deploy mon", but the same thing happens either way:

ceph@cephtest01:~$ ssh cephtest03 "sudo ceph-mon --cluster ceph --mkfs -i 
cephtest02 --keyring /var/lib/ceph/tmp/ceph-cephtest02.mon.keyring"
--conf/-c  Read configuration from the given configuration file
-d   Run in foreground, log to stderr.
-f   Run in foreground, log to usual location.
--id/-i  set ID portion of my name
--name/-n  set name (TYPE.ID)
--version  show version and quit

   --debug_ms N
set message debug level (e.g. 1)
too many arguments: 

Re: [ceph-users] OSDMap problem: osd does not exist.

2013-09-19 Thread Yasuhiro Ohara

Hi Sage,

Thank you for the response.

So, it seems that the mon data can be removed and recovered later
only if the osdmap is saved (in binary) and incorporated at the time
of the initial creation of the mon data (i.e., mon --mkfs)?

I created the new osdmap with osdmaptool --createsimple, which produced
a different PG count for the pools, and that in turn made me think I needed
to re-create the pools (to fix the osdmap). Another critical misunderstanding
of mine was that I thought the osdmap could be re-created
and set easily, as in the case of the crushmap.

I think it would be helpful for users if there were documented
instructions for what to do when you lose the mon data accidentally.
I had saved my old mon data (it was not removed on one of my 5 mons)
but couldn't retrieve the previous osdmap from it,
which I guess should be possible, theoretically.

But anyway, I'll start from scratch.
Thank you very much for the help.
Yes I'll be careful not to do that again :)

regards,
Yasu

From: Sage Weil 
Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
Date: Thu, 19 Sep 2013 05:48:35 -0700 (PDT)
Message-ID: 

> On Thu, 19 Sep 2013, Yasuhiro Ohara wrote:
>> 
>> Hi Sage,
>> 
>> Thanks, after thrashing it became a little bit better,
>> but not yet healthy.
>> 
>> ceph -s: http://pastebin.com/vD28FJ4A
>> ceph osd dump: http://pastebin.com/37FLNxd7
>> ceph pg dump: http://pastebin.com/pccdg20j
>> 
>> (osd.0 and 1 are not running. I issued some "osd in" commands.
>> osd.4 are running but marked down/out: what is the "autoout" ?)
>> 
>> After thrashing some times (maybe I thrash it too much ?),
>> the osd clusters really thrashed much,
>> like in ceph -w: http://pastebin.com/fjeqrhxp
>> 
>> I thought osd's osdmap epoch was around 4900 (by seeing data/current/meta),
>> but it needed 6 or 7 osd thrash command execs until it seemed to work
>> on something, and epoch reached over 1.
>> Now I see "deep scrub ok" some time in ceph -w.
>> But still the PGs are 'creating' state, and it does not seem to be
>> creating anything really.
>> 
>> I removed and re-creted pools, because the number of PGs are incorrect,
>> and it changed pool id 0,1,2 to 3,4,5. Is this causing the problem ?
> 
> If you deleted and recreated the pools, you may as well just wipe the 
> cluster and start over from scratch... the data is gone.  The MDS is 
> crashing because the pool referenced by the MDSMap is gone and it has no 
> fs (meta)data.
> 
> I suggest just starting from scratch.  And next time, don't delete all of 
> the monitor data!  :)
> 
> sage
> 
> 
>> 
>> By the way, MDS crashes on this cluster status.
>> ceph-mds.2.log: http://pastebin.com/Ruf5YB8d
>> 
>> Any suggestion is really appreciated.
>> Thanks.
>> 
>> regards,
>> Yasu
>> 
>> From: Sage Weil 
>> Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
>> Date: Wed, 18 Sep 2013 19:58:16 -0700 (PDT)
>> Message-ID: 
>> 
>> > Hey,
>> > 
>> > On Wed, 18 Sep 2013, Yasuhiro Ohara wrote:
>> >> 
>> >> Hi,
>> >> 
>> >> My OSDs are not joining the cluster correctly,
>> >> because the nonce they assume and receive from the peer are different.
>> >> It says "wrong node" because of the entity_id_t peer_addr (i.e., the
>> >> combination of the IP address, port number, and the nonce) is different.
>> >> 
>> >> Now, my questions are:
>> >> 1, Are the nonces of OSD peer addrs are kept in the osdmap ?
>> >> 2, (If so) can I modify the nonce value ?
>> >> 
>> >> More generally, how can I fix the cluster if I blew away the mon data ?
>> >> 
>> >> Below I'd like to summarize what I did.
>> >> - I tried upgrade from 0.57 to 0.67.3
>> >> - the mon protocol is different, and the mon data format seemed also
>> >>   different (changed to use leveldb ?). So restarting all mons.
>> >> - The mon data upgrade did not go well because of the full disk,
>> >>   but I didn't notice the cause and stupidly tried to start mon from 
>> >> scratch,
>> >>   building the mon data (mon --mkfs). (I solved the full disk problem
>> >>   later.)
>> >> - Now there's no OSD exising in the cluster (i.e., in osdmap).
>> >> - I added OSD configurations using "ceph osd create".
>> >> - Still OSDs do not recognize each other; they do not become peers.
>> >> - (The OSDs seem to hold the previous PG data still, and loading them
>> >>   is working fine. So I assume I still can recover the data.)
>> >> 
>> >> Does anyone have any advice on this ?
>> >> I'm planning to try to modify the source code because of no other choice,
>> >> so that they ignore nonce values :(
>> > 
>> > The nonce value is important; you can't just ignore it.  If they addr in 
>> > the osdmap isn't changing, it si probably because the mon thinks the 
>> > latest osdmap is N and the osd's think the latest is >> N.  I would look 
>> > in the osd data/current/meta directory and see what the newest osdmap 
>> > epoch is, compare that to 'ceph osd dump', and then do 'ceph osd thrash N' 
>> > to make it churn though a bunch of maps to get to an epoch that is > than 
>> > what he OSDs see.  Once that happ

Re: [ceph-users] ceph-deploy not including sudo?

2013-09-19 Thread Gruher, Joseph R
>-Original Message-
>From: Alfredo Deza [mailto:alfredo.d...@inktank.com]
>
>Can you try running ceph-deploy *without* sudo ?
>

Ah, OK, sure.  Without sudo I end up hung here again:

ceph@cephtest01:~$ ceph-deploy install cephtest03 cephtest04 cephtest05 
cephtest06

[cephtest03][INFO  ] Running command: wget -q -O- 
'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key 
add -

BUT if I then add the --no-adjust-repos switch that was suggested, we finally 
run to completion!

Thanks for the help!  On to the next step...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ulimit max user processes (-u) and non-root ceph clients

2013-09-19 Thread Gregory Farnum
On Wed, Sep 18, 2013 at 11:43 PM, Dan Van Der Ster
 wrote:
>
> On Sep 18, 2013, at 11:50 PM, Gregory Farnum 
>  wrote:
>
>> On Wed, Sep 18, 2013 at 6:33 AM, Dan Van Der Ster
>>  wrote:
>>> Hi,
>>> We just finished debugging a problem with RBD-backed Glance image creation 
>>> failures, and thought our workaround would be useful for others. Basically, 
>>> we found that during an image upload, librbd on the glance api server was 
>>> consuming many many processes, eventually hitting the 1024 nproc limit of 
>>> non-root users in RHEL. The failure occurred when uploading to pools with 
>>> 2048 PGs, but didn't fail when uploading to pools with 512 PGs (we're 
>>> guessing that librbd is opening one thread per accessed-PG, and not closing 
>>> those threads until the whole process completes.)
>>>
>>> If you hit this same problem (and you run RHEL like us), you'll need to 
>>> modify at least /etc/security/limits.d/90-nproc.conf (adding your non-root 
>>> user that should be allowed > 1024 procs), and then also possibly run 
>>> ulimit -u in the init script of your client process. Ubuntu should have 
>>> some similar limits.
>>
>> Did your pools with 2048 PGs have a significantly larger number of
>> OSDs in them? Or are both pools on a pool with a lot of OSDs relative
>> to the PG counts?
>
> 1056 OSDs at the moment.
>
> Uploading a 14GB image we observed up to ~1500 threads.
>
> We set the glance client to allow 4096 processes for now.
>
>
>> The PG count shouldn't matter for this directly, but RBD (and other
>> clients) will create a couple messenger threads for each OSD it talks
>> to, and while they'll eventually shut down on idle it doesn't
>> proactively close them. I'd expect this to be a problem around 500
>> OSDs.
>
> A couple, is that the upper limit? Should we be safe with ulimit -u 2*nOSDs 
> +1 ??

The messenger currently generates 2 threads per daemon it communicates
with (although they will go away after a long enough idle period).
2*nOSD+1 won't quite be enough as there's the monitor connection and a
handful of internal threads (I don't remember the exact numbers
off-hand).
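
(A rough worked number under those assumptions: with the 1056 OSDs mentioned above, 2 * 1056 + 1 = 2113 threads, plus the monitor connection and a handful of internal threads, so the ulimit of 4096 you set should leave comfortable headroom.)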

So far this hasn't been a problem for anybody and I doubt you'll see
issues, but at some point we will need to switch the messenger to use
epoll instead of a thread per socket. :)

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG distribution scattered

2013-09-19 Thread Gregory Farnum
It will not lose any of your data. But it will try and move pretty much all
of it, which will probably send performance down the toilet.
-Greg

On Thursday, September 19, 2013, Mark Nelson wrote:

> Honestly I don't remember, but I would be wary if it's not a test system.
> :)
>
> Mark
>
> On 09/19/2013 11:28 AM, Warren Wang wrote:
>
>> Is this safe to enable on a running cluster?
>>
>> --
>> Warren
>>
>> On Sep 19, 2013, at 9:43 AM, Mark Nelson  wrote:
>>
>>  On 09/19/2013 08:36 AM, Niklas Goerke wrote:
>>>
 Hi there

 I'm currently evaluating ceph and started filling my cluster for the
 first time. After filling it up to about 75%, it reported some OSDs
 being "near-full".
 After some evaluation I found that the PGs are not distributed evenly
 over all the osds.

 My Setup:
 * Two Hosts with 45 Disks each --> 90 OSDs
 * Only one newly created pool with 4500 PGs and a Replica Size of 2 -->
 should be about 100 PGs per OSD

 What I found was that one OSD only had 72 PGs, while another had 123 PGs
 [1]. That means that - if I did the math correctly - I can only fill the
 cluster to about 81%, because thats when the first OSD is completely
 full[2].

>>>
>>> Does distribution improve if you make a pool with significantly more PGs?
>>>
>>>
 I did some experimenting and found, that if I add another pool with 4500
 PGs, each OSD will have exacly doubled the amount of PGs as with one
 pool. So this is not an accident (tried it multiple times). On another
 test-cluster with 4 Hosts and 15 Disks each, the Distribution was
 similarly worse.

>>>
>>> This is a bug that causes each pool to more or less be distributed the
>>> same way on the same hosts.  We have a fix, but it impacts backwards
>>> compatibility so it's off by default.  If you set:
>>>
>>> osd pool default flag hashpspool = true
>>>
>>> Theoretically that will cause different pools to be distributed more
>>> randomly.
>>>
>>>
 To me it looks like the rjenkins algorithm is not working as it - in my
 opinion - should be.

 Am I doing anything wrong?
 Is this behaviour to be expected?
 Can I don something about it?


 Thank you very much in advance
 Niklas


 [1] I built a small script that will parse pgdump and output the amount
 of pgs on each osd: http://pastebin.com/5ZVqhy5M
 [2] I know I should not fill my cluster completely but I'm talking about
 theory and adding a margin only makes it worse.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG distribution scattered

2013-09-19 Thread Warren Wang
Good timing then. I just fired up the cluster 2 days ago. Thanks. 

--
Warren

On Sep 19, 2013, at 12:34 PM, Gregory Farnum  wrote:

> It will not lose any of your data. But it will try and move pretty much all 
> of it, which will probably send performance down the toilet.
> -Greg
> 
> On Thursday, September 19, 2013, Mark Nelson wrote:
>> Honestly I don't remember, but I would be wary if it's not a test system. :)
>> 
>> Mark
>> 
>> On 09/19/2013 11:28 AM, Warren Wang wrote:
>>> Is this safe to enable on a running cluster?
>>> 
>>> --
>>> Warren
>>> 
>>> On Sep 19, 2013, at 9:43 AM, Mark Nelson  wrote:
>>> 
 On 09/19/2013 08:36 AM, Niklas Goerke wrote:
> Hi there
> 
> I'm currently evaluating ceph and started filling my cluster for the
> first time. After filling it up to about 75%, it reported some OSDs
> being "near-full".
> After some evaluation I found that the PGs are not distributed evenly
> over all the osds.
> 
> My Setup:
> * Two Hosts with 45 Disks each --> 90 OSDs
> * Only one newly created pool with 4500 PGs and a Replica Size of 2 -->
> should be about 100 PGs per OSD
> 
> What I found was that one OSD only had 72 PGs, while another had 123 PGs
> [1]. That means that - if I did the math correctly - I can only fill the
> cluster to about 81%, because thats when the first OSD is completely
> full[2].
 
 Does distribution improve if you make a pool with significantly more PGs?
 
> 
> I did some experimenting and found, that if I add another pool with 4500
> PGs, each OSD will have exacly doubled the amount of PGs as with one
> pool. So this is not an accident (tried it multiple times). On another
> test-cluster with 4 Hosts and 15 Disks each, the Distribution was
> similarly worse.
 
 This is a bug that causes each pool to more or less be distributed the 
 same way on the same hosts.  We have a fix, but it impacts backwards 
 compatibility so it's off by default.  If you set:
 
 osd pool default flag hashpspool = true
 
 Theoretically that will cause different pools to be distributed more 
 randomly.
 
> 
> To me it looks like the rjenkins algorithm is not working as it - in my
> opinion - should be.
> 
> Am I doing anything wrong?
> Is this behaviour to be expected?
> Can I don something about it?
> 
> 
> Thank you very much in advance
> Niklas
> 
> 
> [1] I built a small script that will parse pgdump and output the amount
> of pgs on each osd: http://pastebin.com/5ZVqhy5M
> [2] I know I should not fill my cluster completely but I'm talking about
> theory and adding a margin only makes it worse.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> -- 
> Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor radosgw performance

2013-09-19 Thread Yehuda Sadeh
On Thu, Sep 19, 2013 at 8:52 AM, Matt Thompson  wrote:
> Hi All,
>
> We're trying to test swift API performance of swift itself (1.9.0) and
> ceph's radosgw (0.67.3) using the following hardware configuration:
>
> Shared servers:
>
> * 1 server running keystone for authentication
> * 1 server running swift-proxy, a single MON, and radosgw + Apache / FastCGI
>
> Ceph:
>
> * 4 storage servers, 5 storage disks / 5 OSDs on each (no separate disk(s)
> for journal)
>
> Swift:
>
> * 4 storage servers, 5 storage disks on each
>
> All 10 machines have identical hardware configurations (including drive type
> & speed).
>
> We deployed ceph w/ ceph-deploy and both swift and ceph have default
> configurations w/ the exception of the following:
>
> * custom Inktank packages for apache2 / libapache2-mod-fastcgi
> * rgw_print_continue enabled
> * rgw_enable_ops_log disabled
> * rgw_ops_log_rados disabled
> * debug_rgw disabled
>
> (actually, swift was deployed w/ a chef cookbook, so configurations may be
> slightly non-standard)
>
> On the ceph storage servers, filesystem type (XFS) and filesystem mount
> options, pg_nums on pools, etc. have all been left with the defaults (8 on
> the radosgw-related pools IIRC).

8 pgs per pool, especially for the data / index pools is awfully low,
and probably where your bottleneck is.
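
For example, the relevant pools could be bumped with something along these lines (the pool name is only what I'd expect the default to be, so check 'rados lspools' first, and the counts are illustrative; pg_num can only be increased, and pgp_num should follow it):

ceph osd pool set .rgw.buckets pg_num 512
ceph osd pool set .rgw.buckets pgp_num 512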

>
> Doing a preliminary test w/ swift-bench (concurrency = 10, object_size = 1),
> we're seeing the following:
>
> Ceph:
>
> 1000 PUTS **FINAL** [0 failures], 14.8/s
> 1 GETS **FINAL** [0 failures], 40.9/s
> 1000 DEL **FINAL** [0 failures], 34.6/s
>
> Swift:
>
> 1000 PUTS **FINAL** [0 failures], 21.7/s
> 1 GETS **FINAL** [0 failures], 139.5/s
> 1000 DEL **FINAL** [0 failures], 85.5/s
>
> That's a relatively significant difference.  Would we see any real
> difference in moving the journals to an SSD per server or separate partition
> per OSD disk?  These machines are not seeing any load short of what's being

maybe, but I think at this point you're hitting the low pgs issue.

> generated by swift-bench.  Alternatively, would we see any quick wins
> standing up more MONs or moving the MON off the server running radosgw +
> Apache / FastCGI?

don't think it's going to make much of a difference right now.

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Excessive mon memory usage in cuttlefish 0.61.8

2013-09-19 Thread Joao Eduardo Luis

On 09/19/2013 04:46 PM, Andrey Korolyov wrote:

On Thu, Sep 19, 2013 at 1:00 PM, Joao Eduardo Luis
 wrote:

On 09/18/2013 11:25 PM, Andrey Korolyov wrote:


Hello,

Just restarted one of my mons after a month of uptime, memory commit
raised ten times high than before:

13206 root  10 -10 12.8g 8.8g 107m S65 14.0   0:53.97 ceph-mon

normal one looks like

   30092 root  10 -10 4411m 790m  46m S 1  1.2   1260:28 ceph-mon



Try running 'ceph heap stats', followed by 'ceph heap release', and then
recheck the memory consumption for the monitor.


It had shrunk to 350M RSS overnight, so it seems I need to restart
this mon again, or try another one, to reproduce the problem over the
next night. This monitor was a leader, so I may check against the other
ones and see their peak consumption.


Was that monitor attempting to join the quorum?

  -Joao


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor radosgw performance

2013-09-19 Thread Matt Thompson
Hi All,

We're trying to test swift API performance of swift itself (1.9.0) and
ceph's radosgw (0.67.3) using the following hardware configuration:

Shared servers:

* 1 server running keystone for authentication
* 1 server running swift-proxy, a single MON, and radosgw + Apache / FastCGI

Ceph:

* 4 storage servers, 5 storage disks / 5 OSDs on each (no separate disk(s)
for journal)

Swift:

* 4 storage servers, 5 storage disks on each

All 10 machines have identical hardware configurations (including drive
type & speed).

We deployed ceph w/ ceph-deploy and both swift and ceph have default
configurations w/ the exception of the following:

* custom Inktank packages for apache2 / libapache2-mod-fastcgi
* rgw_print_continue enabled
* rgw_enable_ops_log disabled
* rgw_ops_log_rados disabled
* debug_rgw disabled

(actually, swift was deployed w/ a chef cookbook, so configurations may be
slightly non-standard)

On the ceph storage servers, filesystem type (XFS) and filesystem mount
options, pg_nums on pools, etc. have all been left with the defaults (8 on
the radosgw-related pools IIRC).

Doing a preliminary test w/ swift-bench (concurrency = 10, object_size =
1), we're seeing the following:

Ceph:

1000 PUTS **FINAL** [0 failures], 14.8/s
1 GETS **FINAL** [0 failures], 40.9/s
1000 DEL **FINAL** [0 failures], 34.6/s

Swift:

1000 PUTS **FINAL** [0 failures], 21.7/s
1 GETS **FINAL** [0 failures], 139.5/s
1000 DEL **FINAL** [0 failures], 85.5/s

That's a relatively significant difference.  Would we see any real
difference in moving the journals to an SSD per server or separate
partition per OSD disk?  These machines are not seeing any load short of
what's being generated by swift-bench.  Alternatively, would we see any
quick wins standing up more MONs or moving the MON off the server running
radosgw + Apache / FastCGI?

Thanks in advance for the assistance.

Regards,
Matt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG distribution scattered

2013-09-19 Thread Mark Nelson

Honestly I don't remember, but I would be wary if it's not a test system. :)

Mark

On 09/19/2013 11:28 AM, Warren Wang wrote:

Is this safe to enable on a running cluster?

--
Warren

On Sep 19, 2013, at 9:43 AM, Mark Nelson  wrote:


On 09/19/2013 08:36 AM, Niklas Goerke wrote:

Hi there

I'm currently evaluating ceph and started filling my cluster for the
first time. After filling it up to about 75%, it reported some OSDs
being "near-full".
After some evaluation I found that the PGs are not distributed evenly
over all the osds.

My Setup:
* Two Hosts with 45 Disks each --> 90 OSDs
* Only one newly created pool with 4500 PGs and a Replica Size of 2 -->
should be about 100 PGs per OSD

What I found was that one OSD only had 72 PGs, while another had 123 PGs
[1]. That means that - if I did the math correctly - I can only fill the
cluster to about 81%, because thats when the first OSD is completely
full[2].


Does distribution improve if you make a pool with significantly more PGs?



I did some experimenting and found, that if I add another pool with 4500
PGs, each OSD will have exacly doubled the amount of PGs as with one
pool. So this is not an accident (tried it multiple times). On another
test-cluster with 4 Hosts and 15 Disks each, the Distribution was
similarly worse.


This is a bug that causes each pool to more or less be distributed the same way 
on the same hosts.  We have a fix, but it impacts backwards compatibility so 
it's off by default.  If you set:

osd pool default flag hashpspool = true

Theoretically that will cause different pools to be distributed more randomly.



To me it looks like the rjenkins algorithm is not working as it - in my
opinion - should be.

Am I doing anything wrong?
Is this behaviour to be expected?
Can I don something about it?


Thank you very much in advance
Niklas


[1] I built a small script that will parse pgdump and output the amount
of pgs on each osd: http://pastebin.com/5ZVqhy5M
[2] I know I should not fill my cluster completely but I'm talking about
theory and adding a margin only makes it worse.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG distribution scattered

2013-09-19 Thread Warren Wang
Is this safe to enable on a running cluster?

--
Warren

On Sep 19, 2013, at 9:43 AM, Mark Nelson  wrote:

> On 09/19/2013 08:36 AM, Niklas Goerke wrote:
>> Hi there
>> 
>> I'm currently evaluating ceph and started filling my cluster for the
>> first time. After filling it up to about 75%, it reported some OSDs
>> being "near-full".
>> After some evaluation I found that the PGs are not distributed evenly
>> over all the osds.
>> 
>> My Setup:
>> * Two Hosts with 45 Disks each --> 90 OSDs
>> * Only one newly created pool with 4500 PGs and a Replica Size of 2 -->
>> should be about 100 PGs per OSD
>> 
>> What I found was that one OSD only had 72 PGs, while another had 123 PGs
>> [1]. That means that - if I did the math correctly - I can only fill the
>> cluster to about 81%, because thats when the first OSD is completely
>> full[2].
> 
> Does distribution improve if you make a pool with significantly more PGs?
> 
>> 
>> I did some experimenting and found, that if I add another pool with 4500
>> PGs, each OSD will have exacly doubled the amount of PGs as with one
>> pool. So this is not an accident (tried it multiple times). On another
>> test-cluster with 4 Hosts and 15 Disks each, the Distribution was
>> similarly worse.
> 
> This is a bug that causes each pool to more or less be distributed the same 
> way on the same hosts.  We have a fix, but it impacts backwards compatibility 
> so it's off by default.  If you set:
> 
> osd pool default flag hashpspool = true
> 
> Theoretically that will cause different pools to be distributed more randomly.
> 
>> 
>> To me it looks like the rjenkins algorithm is not working as it - in my
>> opinion - should be.
>> 
>> Am I doing anything wrong?
>> Is this behaviour to be expected?
>> Can I don something about it?
>> 
>> 
>> Thank you very much in advance
>> Niklas
>> 
>> 
>> [1] I built a small script that will parse pgdump and output the amount
>> of pgs on each osd: http://pastebin.com/5ZVqhy5M
>> [2] I know I should not fill my cluster completely but I'm talking about
>> theory and adding a margin only makes it worse.
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Excessive mon memory usage in cuttlefish 0.61.8

2013-09-19 Thread Andrey Korolyov
On Thu, Sep 19, 2013 at 1:00 PM, Joao Eduardo Luis
 wrote:
> On 09/18/2013 11:25 PM, Andrey Korolyov wrote:
>>
>> Hello,
>>
>> Just restarted one of my mons after a month of uptime, memory commit
>> raised ten times high than before:
>>
>> 13206 root  10 -10 12.8g 8.8g 107m S65 14.0   0:53.97 ceph-mon
>>
>> normal one looks like
>>
>>   30092 root  10 -10 4411m 790m  46m S 1  1.2   1260:28 ceph-mon
>
>
> Try running 'ceph heap stats', followed by 'ceph heap release', and then
> recheck the memory consumption for the monitor.

It had shrunk to 350M RSS overnight, so it seems I need to restart
this mon again, or try another one, to reproduce the problem over the
next night. This monitor was a leader, so I may check against the other
ones and see their peak consumption.

>
>
>
>> monstore has simular size about 15G per monitor so only one problem is
>> very unusual and terrifying memory consumption. Also it is possible
>> that remaining mons running 0.61.7 but binary was updated long ago so
>> it`s hard to tell which version is running not doing dump and
>> disrupting quorum for a little - anyway I should tame current one.
>
>
> How big is the cluster?  15GB for the monitor store may not be that
> surprising if you have a bunch of OSDs and they're not completely healthy,
> as that will prevent the removal of old maps on the monitor side.

~8.5T committed and > 2.5M objects, but the cluster is completely healthy at
the moment, though recently I ran a couple of recovery procedures and
that may affect mondir data allocation over day-long intervals.

>
> This could also be an issue with store compaction.
>
> Also, you should check whether the monitors are running 0.61.7 and, if so,
> update them to 0.61.8.  You should be able to get that version using the
> admin socket if you want to.

Just checked, the same 0.61.8.


>
>
>   -Joao
>
> --
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uploading large files to swift interface on radosgw

2013-09-19 Thread Yehuda Sadeh
Now you're hitting issue #6336 (it's a regression in dumpling that
we'll fix soon). The current workaround is setting the following in
your osd:

osd max attr size = 

try a value of 10485760 (10M) which I think is large enough.
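
In ceph.conf that would be something like the following sketch (added to the [osd] section on the OSD hosts, followed by an OSD restart):

[osd]
osd max attr size = 10485760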

Yehuda



On Thu, Sep 19, 2013 at 7:30 AM, Gerd Jakobovitsch  wrote:
>
> Hello Yehuda, thank you for your help.
>
>
> On 09/17/2013 08:35 PM, Yehuda Sadeh wrote:
>>
>> On Tue, Sep 17, 2013 at 3:21 PM, Gerd Jakobovitsch  
>> wrote:
>>>
>>> Hi all,
>>>
>>> I am testing a ceph environment installed in debian wheezy, and, when
>>> testing file upload of more than 1 GB, I am getting errors. For files larger
>>> than 5 GB, I get a "400 Bad Request   EntityTooLarge" response; looking at
>>
>> The EntityTooLarge is expected, as there's a 5GB limit on objects.
>> Bigger objects need to be uploaded using the large object api.
>>
>>> the radosgw server, I notice that only the apache process is consuming cpu
>>> time, and I only have traffic on the external interface used by apache.
>>> For files between 2 GB  and 5 GB, I get stuck for a very long time, and I
>>> see relatively high processing for both apache and radosgw. Finally, I get a
>>> response "500 Internal Server Error UnknownError". The object is created on
>>> rados, but is empty.
>>>
>>> I am wondering whether there are any configuration I should change on
>>> apache, fastcgi or rgw, or if there are hardware limitations.
>>>
>>> Apache and fastCGI where installed from the distro. My ceph configuration:
>>
>> Are you by any chance using the fcgi module rather than the fastcgi
>> module? It had a problem with caching the entire object before sending
>> it to the backend, which would result in the same symptoms as you just
>> described.
>>
>> Yehuda
>
> Well, I followed the installation instructions, that explicitly refer to 
> fastcgi. Now I disabled the cgid module and repeated the test: I got the same 
> problem.
>
> Apache and fastcgi versions:
> apache2:
>   Installed: 2.2.22-13
> libapache2-mod-fastcgi:
>   Installed: 2.4.7~0910052141-1
>
> I enabled radosgw logging; please find annex the log file. There is a lot of 
> information listed, but I couldn't figure out the problem.
>
> Regards.
>
>
>
>>
>>> [global]
>>> mon_initial_members = spcsmp1, spcsmp2, spcsmp3
>>> mon_host = 10.17.0.2,10.17.0.3,10.17.0.4
>>> auth_cluster_required = cephx
>>> auth_service_required = cephx
>>> auth_client_required = cephx
>>> osd_journal_size = 1024
>>> filestore_xattr_use_omap = true
>>> public_network = 10.17.0.0/24
>>> cluster_network = 10.18.0.0/24
>>>
>>> [osd]
>>> osd_journal_size = 1024
>>>
>>> [client.radosgw.gateway]
>>> host = mss.mandic.com.br
>>> keyring = /etc/ceph/keyring.radosgw.gateway
>>> rgw_socket_path = /tmp/radosgw.sock
>>> log_file = /var/log/ceph/radosgw.log
>>> rgw_enable_ops_log = false
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uploading large files to swift interface on radosgw

2013-09-19 Thread Gerd Jakobovitsch

Thank you very much, now it worked, with the value you suggested.

Regards.

On 09/19/2013 12:10 PM, Yehuda Sadeh wrote:

Now you're hitting issue #6336 (it's a regression in dumpling that
we'll fix soon). The current workaround is setting the following in
your osd:

osd max attr size = 

try a value of 10485760 (10M) which I think is large enough.

Yehuda



On Thu, Sep 19, 2013 at 7:30 AM, Gerd Jakobovitsch  wrote:

Hello Yehuda, thank you for your help.


On 09/17/2013 08:35 PM, Yehuda Sadeh wrote:

On Tue, Sep 17, 2013 at 3:21 PM, Gerd Jakobovitsch  wrote:

Hi all,

I am testing a ceph environment installed in debian wheezy, and, when
testing file upload of more than 1 GB, I am getting errors. For files larger
than 5 GB, I get a "400 Bad Request   EntityTooLarge" response; looking at

The EntityTooLarge is expected, as there's a 5GB limit on objects.
Bigger objects need to be uploaded using the large object api.


the radosgw server, I notice that only the apache process is consuming cpu
time, and I only have traffic on the external interface used by apache.
For files between 2 GB  and 5 GB, I get stuck for a very long time, and I
see relatively high processing for both apache and radosgw. Finally, I get a
response "500 Internal Server Error UnknownError". The object is created on
rados, but is empty.

I am wondering whether there are any configuration I should change on
apache, fastcgi or rgw, or if there are hardware limitations.

Apache and fastCGI where installed from the distro. My ceph configuration:

Are you by any chance using the fcgi module rather than the fastcgi
module? It had a problem with caching the entire object before sending
it to the backend, which would result in the same symptoms as you just
described.

Yehuda

Well, I followed the installation instructions, that explicitly refer to 
fastcgi. Now I disabled the cgid module and repeated the test: I got the same 
problem.

Apache and fastcgi versions:
apache2:
   Installed: 2.2.22-13
libapache2-mod-fastcgi:
   Installed: 2.4.7~0910052141-1

I enabled radosgw logging; please find annex the log file. There is a lot of 
information listed, but I couldn't figure out the problem.

Regards.




[global]
mon_initial_members = spcsmp1, spcsmp2, spcsmp3
mon_host = 10.17.0.2,10.17.0.3,10.17.0.4
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true
public_network = 10.17.0.0/24
cluster_network = 10.18.0.0/24

[osd]
osd_journal_size = 1024

[client.radosgw.gateway]
host = mss.mandic.com.br
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/ceph/radosgw.log
rgw_enable_ops_log = false



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster stuck at 15% degraded

2013-09-19 Thread Greg Chavez
We have an 84-osd cluster with volumes and images pools for OpenStack.  I
was having trouble with full osds, so I increased the pg count from the 128
default to 2700.  This balanced out the osds but the cluster is stuck at
15% degraded

http://hastebin.com/wixarubebe.dos

That's the output of ceph health detail.  I've never seen a pg with the
state active+remapped+wait_backfill+backfill_toofull.  Clearly I should
have increased the pg count more gradually, but here I am. I'm frozen,
afraid to do anything.

Any suggestions? Thanks.

--Greg Chavez
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MONs numbers, hardware sizing and write ack

2013-09-19 Thread Joao Eduardo Luis

On 09/19/2013 10:03 AM, Gandalf Corvotempesta wrote:

2013/9/19 Joao Eduardo Luis :

We have no benchmarks on that, that I am aware of.  But the short and sweet
answer should be "not really, highly unlikely".

If anything, increasing the number of mons should increase the response
time, although for such low numbers that should also be virtually
negligible.

Think of the monitors as the guys behind the counter at a the tourist desk
giving maps for your city and providing information to people, and if you
assume that for each one of these guys there's a single line. Having more of
these guys helps with larger numbers of tourists.

[cut]

In any case, you should be safe running 3 or 5 monitors without any
noticeable decrease in performance.  7 may also be just fine. More than that
I have no idea, but you should feel free to test this out in your system and
share results; I'm sure all of us would appreciate :)


Thank you for the clarification.
What I would like to avoid is putting dedicated hardware for mons into production
(I'm running out of 10GbE ports and I don't have a 1GbE switch in the ceph
cluster), so
putting MONs on the same OSD servers will allow me to achieve this by using
the same OSD hardware and thus the same 10GbE network port.

But I would also like to avoid a performance bottleneck caused by mons on the
same osd server. If increasing the number of mons will also decrease
the hardware requirements
for each of them, I'll put one mon on each osd server (5, 7, 9, and so on).


It's perfectly fine having the monitors sharing the servers with the OSDs.

Having more monitors, however, won't decrease the hardware requirements 
for each individual monitor.  It should not increase the hardware 
requirements either.


  -Joao



--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG distribution scattered

2013-09-19 Thread Mark Nelson

On 09/19/2013 08:36 AM, Niklas Goerke wrote:

Hi there

I'm currently evaluating ceph and started filling my cluster for the
first time. After filling it up to about 75%, it reported some OSDs
being "near-full".
After some evaluation I found that the PGs are not distributed evenly
over all the osds.

My Setup:
* Two Hosts with 45 Disks each --> 90 OSDs
* Only one newly created pool with 4500 PGs and a Replica Size of 2 -->
should be about 100 PGs per OSD

What I found was that one OSD only had 72 PGs, while another had 123 PGs
[1]. That means that - if I did the math correctly - I can only fill the
cluster to about 81%, because that's when the first OSD is completely
full[2].


Does distribution improve if you make a pool with significantly more PGs?



I did some experimenting and found that if I add another pool with 4500
PGs, each OSD will have exactly double the number of PGs as with one
pool. So this is not an accident (tried it multiple times). On another
test cluster with 4 Hosts and 15 Disks each, the distribution was
similarly bad.


This is a bug that causes each pool to more or less be distributed the 
same way on the same hosts.  We have a fix, but it impacts backwards 
compatibility so it's off by default.  If you set:


osd pool default flag hashpspool = true

Theoretically that will cause different pools to be distributed more 
randomly.
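
As a sketch of where that would go (assuming it is set on the monitors before
the new pools are created; existing pools keep their current placement):

[global]
osd pool default flag hashpspool = true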




To me it looks like the rjenkins algorithm is not working as it - in my
opinion - should be.

Am I doing anything wrong?
Is this behaviour to be expected?
Can I do something about it?


Thank you very much in advance
Niklas


[1] I built a small script that will parse pgdump and output the amount
of pgs on each osd: http://pastebin.com/5ZVqhy5M
[2] I know I should not fill my cluster completely but I'm talking about
theory and adding a margin only makes it worse.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG distribution scattered

2013-09-19 Thread Niklas Goerke

Hi there

I'm currently evaluating ceph and started filling my cluster for the 
first time. After filling it up to about 75%, it reported some OSDs 
being "near-full".
After some evaluation I found that the PGs are not distributed evenly 
over all the osds.


My Setup:
* Two Hosts with 45 Disks each --> 90 OSDs
* Only one newly created pool with 4500 PGs and a Replica Size of 2 --> 
should be about 100 PGs per OSD


What I found was that one OSD only had 72 PGs, while another had 123 
PGs [1]. That means that - if I did the math correctly - I can only fill 
the cluster to about 81%, because that's when the first OSD is completely 
full[2].


I did some experimenting and found that if I add another pool with 
4500 PGs, each OSD will have exactly double the number of PGs as with 
one pool. So this is not an accident (tried it multiple times). On 
another test cluster with 4 Hosts and 15 Disks each, the distribution 
was similarly bad.


To me it looks like the rjenkins algorithm is not working as it - in my 
opinion - should be.


Am I doing anything wrong?
Is this behaviour to be expected?
Can I do something about it?


Thank you very much in advance
Niklas


[1] I built a small script that will parse pgdump and output the amount 
of pgs on each osd: http://pastebin.com/5ZVqhy5M
[2] I know I should not fill my cluster completely but I'm talking 
about theory and adding a margin only makes it worse.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impossible to Create Bucket on RadosGW?

2013-09-19 Thread Alexander Sidorenko
Georg Höllrigl  writes:

> 
> Hello,
> 
> I'm horribly failing at creating a bucket on radosgw at ceph 0.67.2 
> running on ubuntu 12.04.
> 
> Right now I feel frustrated about radosgw-admin for being inconsistent 
> in its options. It's possible to list the buckets and also to delete 
> them but not to create!
> 
> No matter what I tried - using telnet, curl, s3cmd - I'm getting back
> 
> S3Error: 405 (Method Not Allowed)
> 
> I don't see a way to configure this somewhere in apache!?
> 
> Any ideas whats going on here?
> 
> Georg
> 

I faced the same issue and had to switch to the Python API. It seems the Java 
amazon-aws libraries can only be used with a regions override file; 
otherwise the available regions are fetched from a hardcoded Amazon URL. Just 
try another language's API. 

PS: There is also a note in Ceph doc (http://ceph.com/docs/master/radosgw/
s3/java/#generate-object-download-urls-signed-and-unsigned) 

'The java library does not have a method for generating unsigned URLs, so 
the example below just generates a signed URL'

I need public URLs from Ceph and I'm using python API.
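
For the archives, a minimal boto 2.x sketch of this kind of usage against
radosgw (the endpoint, keys and bucket/object names below are placeholders):

import boto
import boto.s3.connection

# point boto at the radosgw endpoint instead of s3.amazonaws.com
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='objects.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('my-new-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello World!')

# signed URL, valid for one hour
print(key.generate_url(3600, query_auth=True))

# "public" URL: make the object world-readable and skip the signature
key.set_canned_acl('public-read')
print(key.generate_url(0, query_auth=False, force_http=True))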



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDMap problem: osd does not exist.

2013-09-19 Thread Sage Weil
On Thu, 19 Sep 2013, Yasuhiro Ohara wrote:
> 
> Hi Sage,
> 
> Thanks, after thrashing it became a little bit better,
> but not yet healthy.
> 
> ceph -s: http://pastebin.com/vD28FJ4A
> ceph osd dump: http://pastebin.com/37FLNxd7
> ceph pg dump: http://pastebin.com/pccdg20j
> 
> (osd.0 and 1 are not running. I issued some "osd in" commands.
> osd.4 are running but marked down/out: what is the "autoout" ?)
> 
> After thrashing a few times (maybe I thrashed it too much ?),
> the osd cluster really got thrashed a lot,
> like in ceph -w: http://pastebin.com/fjeqrhxp
> 
> I thought osd's osdmap epoch was around 4900 (by seeing data/current/meta),
> but it needed 6 or 7 osd thrash command execs until it seemed to work
> on something, and epoch reached over 1.
> Now I see "deep scrub ok" some time in ceph -w.
> But still the PGs are 'creating' state, and it does not seem to be
> creating anything really.
> 
> I removed and re-created pools, because the number of PGs was incorrect,
> and it changed pool id 0,1,2 to 3,4,5. Is this causing the problem ?

If you deleted and recreated the pools, you may as well just wipe the 
cluster and start over from scratch... the data is gone.  The MDS is 
crashing because the pool referenced by the MDSMap is gone and it has no 
fs (meta)data.

I suggest just starting from scratch.  And next time, don't delete all of 
the monitor data!  :)

sage


> 
> By the way, MDS crashes on this cluster status.
> ceph-mds.2.log: http://pastebin.com/Ruf5YB8d
> 
> Any suggestion is really appreciated.
> Thanks.
> 
> regards,
> Yasu
> 
> From: Sage Weil 
> Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
> Date: Wed, 18 Sep 2013 19:58:16 -0700 (PDT)
> Message-ID: 
> 
> > Hey,
> > 
> > On Wed, 18 Sep 2013, Yasuhiro Ohara wrote:
> >> 
> >> Hi,
> >> 
> >> My OSDs are not joining the cluster correctly,
> >> because the nonce they assume and receive from the peer are different.
> >> It says "wrong node" because of the entity_id_t peer_addr (i.e., the
> >> combination of the IP address, port number, and the nonce) is different.
> >> 
> >> Now, my questions are:
> >> 1, Are the nonces of OSD peer addrs are kept in the osdmap ?
> >> 2, (If so) can I modify the nonce value ?
> >> 
> >> More generally, how can I fix the cluster if I blew away the mon data ?
> >> 
> >> Below I'd like to summarize what I did.
> >> - I tried upgrade from 0.57 to 0.67.3
> >> - the mon protocol is different, and the mon data format seemed also
> >>   different (changed to use leveldb ?). So restarting all mons.
> >> - The mon data upgrade did not go well because of the full disk,
> >>   but I didn't notice the cause and stupidly tried to start mon from 
> >> scratch,
> >>   building the mon data (mon --mkfs). (I solved the full disk problem
> >>   later.)
> >> - Now there's no OSD exising in the cluster (i.e., in osdmap).
> >> - I added OSD configurations using "ceph osd create".
> >> - Still OSDs do not recognize each other; they do not become peers.
> >> - (The OSDs seem to hold the previous PG data still, and loading them
> >>   is working fine. So I assume I still can recover the data.)
> >> 
> >> Does anyone have any advice on this ?
> >> I'm planning to try to modify the source code because of no other choice,
> >> so that they ignore nonce values :(
> > 
> > The nonce value is important; you can't just ignore it.  If the addr in 
> > the osdmap isn't changing, it is probably because the mon thinks the 
> > latest osdmap is N and the OSDs think the latest is >> N.  I would look 
> > in the osd data/current/meta directory and see what the newest osdmap 
> > epoch is, compare that to 'ceph osd dump', and then do 'ceph osd thrash N' 
> > to make it churn through a bunch of maps to get to an epoch that is greater than 
> > what the OSDs see.  Once that happens, the osd boot messages will properly 
> > update the cluster osdmap with their new addr and things should start up.  
> > Until then, the osds will just sit and wait to get a map newer than what 
> > they have that will never come...
> > 
> > sage
> > 
> >> 
> >> Thanks in advance.
> >> 
> >> regards,
> >> Yasu
> >> 
> >> From: Yasuhiro Ohara 
> >> Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
> >> Date: Thu, 12 Sep 2013 09:45:51 -0700 (PDT)
> >> Message-ID: <20130912.094551.06710597.y...@soe.ucsc.edu>
> >> 
> >> > 
> >> > Hi Joao,
> >> > 
> >> > Thank you for the response.
> >> > I meant "ceph-mon -i X --mkfs".
> >> > 
> >> > Actually I did it on 3 node. On other 2 mon nodes, the original
> >> > mon data were left, but currently all 5 nodes run ceph-mon again.
> >> > That I shouldn't do that ?
> >> > 
> >> > regards,
> >> > Yasu
> >> > 
> >> > From: Joao Eduardo Luis 
> >> > Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
> >> > Date: Thu, 12 Sep 2013 11:35:40 +0100
> >> > Message-ID: <523198fc.8050...@inktank.com>
> >> > 
> >> >> On 09/12/2013 09:35 AM, Yasuhiro Ohara wrote:
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> recently I tried to upgrade from 0.57 to 0.67.3 [...]

Re: [ceph-users] ceph-deploy not including sudo?

2013-09-19 Thread Alfredo Deza
On Wed, Sep 18, 2013 at 11:54 PM, Gruher, Joseph R
 wrote:
> Using latest ceph-deploy:
>
> ceph@cephtest01:/my-cluster$ sudo ceph-deploy --version
>
> 1.2.6
>
>
>
> I get this failure:
>
>
>
> ceph@cephtest01:/my-cluster$ sudo ceph-deploy install cephtest03 cephtest04
> cephtest05 cephtest06

How you are running the command above is part of the problem.
ceph-deploy doesn't need to run
with sudo privileges because it uses the current directory to
read/write some files and then sends
them to remote hosts.

What happens next is that it will attempt to detect if you need `sudo`
on the remote end; it does this
by checking that you are *not* executing as a super user (which you are).
>
> [ceph_deploy.install][DEBUG ] Installing stable version dumpling on cluster
> ceph hosts cephtest03 cephtest04 cephtest05 cephtest06
>
> [ceph_deploy.install][DEBUG ] Detecting platform for host cephtest03 ...
>
> [ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection without sudo
>

The line above confirms this: "will use a remote connection without
sudo" appears because it detects
you are executing as a super user, so there is no need to introduce
sudo on the other end. It assumes
you are logged in as root.

> [ceph_deploy.install][INFO  ] Distro info: Ubuntu 12.04 precise
>
> [cephtest03][INFO  ] installing ceph on cephtest03
>
> [cephtest03][INFO  ] Running command: env DEBIAN_FRONTEND=noninteractive
> apt-get -q install --assume-yes ca-certificates
>
> [cephtest03][ERROR ] Traceback (most recent call last):
>
> [cephtest03][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/hosts/debian/install.py", line
> 26, in install
>
> [cephtest03][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/util/decorators.py", line 10,
> in inner
>
> [cephtest03][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/util/wrappers.py", line 6, in
> remote_call
>
> [cephtest03][ERROR ]   File "/usr/lib/python2.7/subprocess.py", line 511, in
> check_call
>
> [cephtest03][ERROR ] raise CalledProcessError(retcode, cmd)
>
> [cephtest03][ERROR ] CalledProcessError: Command '['env',
> 'DEBIAN_FRONTEND=noninteractive', 'apt-get', '-q', 'install',
> '--assume-yes', 'ca-certificates']' returned non-zero exit status 100
>
> [cephtest03][ERROR ] E: Could not open lock file /var/lib/dpkg/lock - open
> (13: Permission denied)
>
> [cephtest03][ERROR ] E: Unable to lock the administration directory
> (/var/lib/dpkg/), are you root?
>
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
> DEBIAN_FRONTEND=noninteractive apt-get -q install --assume-yes
> ca-certificates
>
>
>
> This failure seems to imply ceph-deploy is not prefacing remote (SSH)
> commands to other systems with sudo?  For example this command as shown in
> the ceph-deploy output fails:
>
>
>
> ceph@cephtest01:/my-cluster$ ssh cephtest03 env
> DEBIAN_FRONTEND=noninteractive apt-get -q install --assume-yes
> ca-certificates
>
> E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission
> denied)
>
> E: Unable to lock the administration directory (/var/lib/dpkg/), are you
> root?
>
>
>
> But with the sudo added it works:
>
>
>
> ceph@cephtest01:/my-cluster$ ssh cephtest03 sudo env
> DEBIAN_FRONTEND=noninteractive apt-get -q install --assume-yes
> ca-certificates
>
> Reading package lists...
>
> Building dependency tree...
>
> Reading state information...
>
> ca-certificates is already the newest version.
>
> 0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
>
> ceph@cephtest01:/my-cluster$
>

Can you try running ceph-deploy *without* sudo ?
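
That is, from the same working directory, simply:

ceph@cephtest01:/my-cluster$ ceph-deploy install cephtest03 cephtest04 cephtest05 cephtest06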

>
>
> Thanks,
>
> Joe
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Objects get via s3api FastCGI "incomplete headers" and hanging up

2013-09-19 Thread Mihály Árva-Tóth
Hello,

I wrote a PHP client script (using the latest AWS S3 API lib) and it did not
solve my hanging-download problem at all, so the problem is not in the
libs3/s3 tool. Changing the MTU from 1500 to 9000 and back did not help either.
Are there any apache (mpm-worker), fastcgi, rgw or librados tuning options
to handle more concurrent downloads? (File sizes are between 16K and 1024K.)

Thank you,
Mihaly


2013/9/17 Mihály Árva-Tóth 

> Hello,
>
> I'm trying to download objects from one container (which contains 3
> million objects, file sizes between 16K and 1024K) with 10 parallel threads. I'm
> using the "s3" binary that comes with libs3. I'm monitoring download times; the
> response time is below 50-80 ms for 80% of requests. But sometimes a download
> hangs, for up to 17 secs, and apache returns error code 500. apache error log (lots of):
>
> [Tue Sep 17 11:33:11 2013] [error] [client 194.38.106.67] FastCGI: comm
> with server "/var/www/radosgw.fcgi" aborted: idle timeout (30 sec)
> [Tue Sep 17 11:33:11 2013] [error] [client 194.38.106.67] FastCGI:
> incomplete headers (0 bytes) received from server "/var/www/radosgw.fcgi"
> [Tue Sep 17 11:33:11 2013] [error] [client 194.38.106.67] Handler for
> fastcgi-script returned invalid result code 1
>
> I tried both the native apache2/fastcgi ubuntu packages and the Ceph-built
> apache2/fastcgi. When I turn on "rgw print continue = true" with the
> modified build, the result is a tiny bit better (fewer hangs). "FastCgiWrapper
> Off" of course.
>
> And if I allow only 3 parallel get requests (instead of 10), the result is
> much better, the longest hang being only 1500 ms. So I think this depends on
> some resource management, but I have no idea.
>
> Using ceph-0.67.4 with Ubuntu 12.04 x8_64.
>
> I found the following issue (more than 1 year old):
> http://tracker.ceph.com/issues/2027
>
> But this was closed as unable to reproduce. I can reproduce it every time.
>
> Thank you,
> Mihaly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SOLVED] Re: Ceph bock storage and Openstack Cinder Scheduler issue

2013-09-19 Thread Darren Birkett
On 19 September 2013 11:51, Gavin  wrote:

>
> Hi,
>
> Please excuse/disregard my previous email, I just needed a
> clarification on my understanding of how this all fits together.
>
> I was kindly pointed in the right direction by a friendly gentleman
> from Rackspace. Thanks Darren. :)
>
> The reason for my confusion was due to the way that the volumes are
> displayed in the Horizon dashboard.
>
> The dashboard shows that all volumes are attached to one Compute node,
> which obviously led to my initial concerns.
>
> Now that I know that the connections come from libvirt on the compute
> node where the instances live, I have one less thing to worry about.
>
> Thanks,
> Gavin
> ___
>

yes AFAIK, cinder-volume is largely only involved in brokering the initial
volume creation in RBD.  libvirt on the compute host where the instance
lives then connects to RBD.
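
For illustration, what ends up on the compute node is an ordinary libvirt
network disk pointing straight at the cluster; a rough sketch, with the
pool/volume name, monitor address and libvirt secret uuid as placeholders:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <auth username='cinder'>
    <secret type='ceph' uuid='LIBVIRT-SECRET-UUID'/>
  </auth>
  <source protocol='rbd' name='volumes/volume-XXXX'>
    <host name='MON-IP' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>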

Whilst this would suggest that the initial cinder-volume host that brokered
the creation is therefore no longer needed after creation, I do vaguely
remember there still being some sort of thin requirement on that host
remaining there, in grizzly at least.  That may be fixed now, but I'd be
interested to see your experiences with that.

Darren
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [SOLVED] Re: Ceph bock storage and Openstack Cinder Scheduler issue

2013-09-19 Thread Gavin
On 19 September 2013 11:57, Gavin  wrote:
> Hi there,
>
> Can someone possibly shed some light on an issue we are experiencing
> with the way Cinder is scheduling Ceph volumes in our environment.
>
> We are running cinder-volume on each of our compute nodes, and they
> are all configured to make use of our Ceph cluster.
>
> As far as we can tell the Ceph cluster is working as it should,
> however the problem we are having is that each and every Ceph volume
> gets attached to only one of the Compute nodes.
>
> This is not ideal as it will create a bottleneck on the one host.
>
> From what I have read the default Cinder scheduler should pick the
> cinder-volume node with the most available space, but since all
> compute nodes should report the same, as per the space available in
> the Ceph volume pool, how is this meant to work then ?
>
> We have also tried to implement the Cinder chance scheduler in the
> hope that Cinder will randomly pick another storage node, but this did
> not make any difference.
>
> Has anyone else experienced the same issue or similar ?
>
> Is there perhaps a way that we can round-robin the volume attachments ?
>
> Openstack version: Grizzly using Ubuntu LTS and Cloud PPA.
>
> Ceph version: Cuttlefish from Ceph PPA.

Hi,

Please excuse/disregard my previous email, I just needed a
clarification on my understanding of how this all fits together.

I was kindly pointed in the right direction by a friendly gentleman
from Rackspace. Thanks Darren. :)

The reason for my confusion was due to the way that the volumes are
displayed in the Horizon dashboard.

The dashboard shows that all volumes are attached to one Compute node,
which obviously led to my initial concerns.

Now that I know that the connections come from libvirt on the compute
node where the instances live, I have one less thing to worry about.

Thanks,
Gavin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Decrease radosgw logging level

2013-09-19 Thread Mihály Árva-Tóth
2013/9/17 Joao Eduardo Luis 

> On 09/13/2013 01:02 PM, Mihály Árva-Tóth wrote:
>
>> Hello,
>>
>> How can I decrease logging level of radosgw? I uploaded 400k pieces of
>> objects and my radosgw log raises to 2 GiB. Current settings:
>>
>> rgw_enable_usage_log = true
>> rgw_usage_log_tick_interval = 30
>> rgw_usage_log_flush_threshold = 1024
>> rgw_usage_max_shards = 32
>> rgw_usage_max_user_shards = 1
>> rgw_print_continue = false
>> rgw_enable_ops_log = false
>> rgw_ops_log_rados = false
>> log_file =
>> log_to_syslog = true
>>
>
> If you mean output from rgw itself to its own log, try adjusting 'debug
> rgw'.  Default is 1, so check if you have it set to some higher value. You
> can always set it to 0 too (debug rgw = 0)
>

Hello Joao,

Thank  you, this was the solution.

Regards,
Mihaly
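
(For anyone finding this later: the setting lives in the gateway section of
ceph.conf, for example

[client.radosgw.gateway]
debug rgw = 0

and takes effect after restarting radosgw.)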
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph bock storage and Openstack Cinder Scheduler issue

2013-09-19 Thread Gavin
Hi there,

Can someone possibly shed some light on an issue we are experiencing
with the way Cinder is scheduling Ceph volumes in our environment.

We are running cinder-volume on each of our compute nodes, and they
are all configured to make use of our Ceph cluster.

As far as we can tell the Ceph cluster is working as it should,
however the problem we are having is that each and every Ceph volume
gets attached to only one of the Compute nodes.

This is not ideal as it will create a bottleneck on the one host.

From what I have read, the default Cinder scheduler should pick the
cinder-volume node with the most available space, but since all
compute nodes should report the same, as per the space available in
the Ceph volume pool, how is this meant to work then ?

We have also tried to implement the Cinder chance scheduler in the
hope that Cinder will randomly pick another storage node, but this did
not make any difference.

Has anyone else experienced the same issue or similar ?

Is there perhaps a way that we can round-robin the volume attachments ?

Openstack version: Grizzly using Ubuntu LTS and Cloud PPA.

Ceph version: Cuttlefish from Ceph PPA.

Thanks in advance,
Gavin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MONs numbers, hardware sizing and write ack

2013-09-19 Thread Gandalf Corvotempesta
2013/9/19 Joao Eduardo Luis :
> We have no benchmarks on that, that I am aware of.  But the short and sweet
> answer should be "not really, highly unlikely".
>
> If anything, increasing the number of mons should increase the response
> time, although for such low numbers that should also be virtually
> negligible.
>
> Think of the monitors as the guys behind the counter at the tourist desk
> giving maps for your city and providing information to people, and if you
> assume that for each one of these guys there's a single line. Having more of
> these guys helps with larger numbers of tourists.
[cut]
> In any case, you should be safe running 3 or 5 monitors without any
> noticeable decrease in performance.  7 may also be just fine. More than that
> I have no idea, but you should feel free to test this out in your system and
> share results; I'm sure all of us would appreciate :)

Thank you for the clarification.
What I would like to avoid is putting dedicated hardware for mons into production
(I'm running out of 10GbE ports and I don't have a 1GbE switch in the ceph
cluster), so
putting MONs on the same OSD servers will allow me to achieve this by using
the same OSD hardware and thus the same 10GbE network port.

But I would also like to avoid a performance bottleneck caused by mons on the
same osd server. If increasing the number of mons will also decrease
the hardware requirements
for each of them, I'll put one mon on each osd server (5, 7, 9, and so on).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Excessive mon memory usage in cuttlefish 0.61.8

2013-09-19 Thread Joao Eduardo Luis

On 09/18/2013 11:25 PM, Andrey Korolyov wrote:

Hello,

Just restarted one of my mons after a month of uptime; memory commit
rose to ten times higher than before:

13206 root  10 -10 12.8g 8.8g 107m S65 14.0   0:53.97 ceph-mon

normal one looks like

  30092 root  10 -10 4411m 790m  46m S 1  1.2   1260:28 ceph-mon


Try running 'ceph heap stats', followed by 'ceph heap release', and then 
recheck the memory consumption for the monitor.




The monstore has a similar size, about 15G per monitor, so the only problem is the
very unusual and terrifying memory consumption. Also it is possible
that the remaining mons are still running 0.61.7, but the binary was updated long ago, so
it's hard to tell which version is running without doing a dump and
disrupting quorum for a little - anyway I should tame the current one.


How big is the cluster?  15GB for the monitor store may not be that 
surprising if you have a bunch of OSDs and they're not completely 
healthy, as that will prevent the removal of old maps on the monitor side.


This could also be an issue with store compaction.

Also, you should check whether the monitors are running 0.61.7 and, if 
so, update them to 0.61.8.  You should be able to get that version using 
the admin socket if you want to.
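
A rough example, assuming the default admin socket path:

  ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok version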



  -Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MONs numbers, hardware sizing and write ack

2013-09-19 Thread Joao Eduardo Luis

On 09/19/2013 09:17 AM, Gandalf Corvotempesta wrote:

Hi to all,
increasing the total numbers of MONs available in a cluster, for
example growing from 3 to 5, will also decrease the hardware
requirements (i.e. RAM and CPU) for each mon instance ?


We have no benchmarks on that, that I am aware of.  But the short and 
sweet answer should be "not really, highly unlikely".


If anything, increasing the number of mons should increase the response 
time, although for such low numbers that should also be virtually 
negligible.


Think of the monitors as the guys behind the counter at the tourist 
desk giving maps for your city and providing information to people, and 
if you assume that for each one of these guys there's a single line. 
Having more of these guys helps with larger numbers of tourists.


But now imagine that your city is constantly under development, and for 
some reason those maps need to be updated rather frequently.  Each time 
there's a new change, a courier delivers the latest change to the city 
map in a post-it to the tourist office.  He gets in line with all the 
other people, and delivers said post-it to one of the guys behind the 
counter.  Now this guy will have to give that post-it to his supervisor 
(it might even be himself), and the supervisor will make the appropriate 
change in the city map and give a map for review to every guy behind the 
counter, which in turn will check the change and tell the supervisor 
whether they accept said map or not -- when a majority accepts the map, 
then the supervisor will issue an order to start providing this new 
version of the map to whomever is in the queue.


So you can see how adding more guys behind the counter can easily 
increase the complexity of getting the maps ready for public 
consumption, as there are a few more hops in the process.  Fortunately 
for us, in Ceph this is accomplished faster than in any tourist office 
I've ever been to.


I guess it is just a matter of how many clients (including OSDs, MDSs) 
you have in the cluster.  More monitors will sort of load balance reads 
across monitors; but updates are centralized and overseen by a single 
monitor.


In any case, you should be safe running 3 or 5 monitors without any 
noticeable decrease in performance.  7 may also be just fine. More than 
that I have no idea, but you should feel free to test this out in your 
system and share results; I'm sure all of us would appreciate :)


  -Joao



Another question: when a client writes something to the cluster, is the
write ack sent back to the client as soon as one OSD has written data
to the journal, or after one replica is made? For example, could I use
a 10GbE network for the public side and a bonded 1GbE for the cluster side
without a performance bottleneck? In this case, only replication would
be made slower, with no negative performance impact on clients (RGW and so on).

I'll host some VM disk images so I need the maximum speed client-side,
and replication could also be slower (our VMs have ephemeral storage).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MONs numbers, hardware sizing and write ack

2013-09-19 Thread Gandalf Corvotempesta
Hi to all,
increasing the total numbers of MONs available in a cluster, for
example growing from 3 to 5, will also decrease the hardware
requirements (i.e. RAM and CPU) for each mon instance ?

I'm asking this because our cluster will be made with 5 OSD server and
I can easily put one MON on each OSD server.

Another question: when a client writes something to the cluster, is the
write ack sent back to the client as soon as one OSD has written data
to the journal, or after one replica is made? For example, could I use
a 10GbE network for the public side and a bonded 1GbE for the cluster side
without a performance bottleneck? In this case, only replication would
be made slower, with no negative performance impact on clients (RGW and so on).

I'll host some VM disk images so I need the maximum speed client-side,
and replication could also be slower (our VMs have ephemeral storage).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDMap problem: osd does not exist.

2013-09-19 Thread Yasuhiro Ohara

Hi Sage,

Thanks, after thrashing it became a little bit better,
but not yet healthy.

ceph -s: http://pastebin.com/vD28FJ4A
ceph osd dump: http://pastebin.com/37FLNxd7
ceph pg dump: http://pastebin.com/pccdg20j

(osd.0 and 1 are not running. I issued some "osd in" commands.
osd.4 are running but marked down/out: what is the "autoout" ?)

After thrashing a few times (maybe I thrashed it too much ?),
the osd cluster really got thrashed a lot,
like in ceph -w: http://pastebin.com/fjeqrhxp

I thought osd's osdmap epoch was around 4900 (by seeing data/current/meta),
but it needed 6 or 7 osd thrash command execs until it seemed to work
on something, and epoch reached over 1.
Now I see "deep scrub ok" some time in ceph -w.
But still the PGs are 'creating' state, and it does not seem to be
creating anything really.

I removed and re-created pools, because the number of PGs was incorrect,
and it changed pool id 0,1,2 to 3,4,5. Is this causing the problem ?

By the way, MDS crashes on this cluster status.
ceph-mds.2.log: http://pastebin.com/Ruf5YB8d

Any suggestion is really appreciated.
Thanks.

regards,
Yasu

From: Sage Weil 
Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
Date: Wed, 18 Sep 2013 19:58:16 -0700 (PDT)
Message-ID: 

> Hey,
> 
> On Wed, 18 Sep 2013, Yasuhiro Ohara wrote:
>> 
>> Hi,
>> 
>> My OSDs are not joining the cluster correctly,
>> because the nonce they assume and receive from the peer are different.
>> It says "wrong node" because of the entity_id_t peer_addr (i.e., the
>> combination of the IP address, port number, and the nonce) is different.
>> 
>> Now, my questions are:
>> 1, Are the nonces of OSD peer addrs are kept in the osdmap ?
>> 2, (If so) can I modify the nonce value ?
>> 
>> More generally, how can I fix the cluster if I blew away the mon data ?
>> 
>> Below I'd like to summarize what I did.
>> - I tried upgrade from 0.57 to 0.67.3
>> - the mon protocol is different, and the mon data format seemed also
>>   different (changed to use leveldb ?). So restarting all mons.
>> - The mon data upgrade did not go well because of the full disk,
>>   but I didn't notice the cause and stupidly tried to start mon from scratch,
>>   building the mon data (mon --mkfs). (I solved the full disk problem
>>   later.)
>> - Now there's no OSD exising in the cluster (i.e., in osdmap).
>> - I added OSD configurations using "ceph osd create".
>> - Still OSDs do not recognize each other; they do not become peers.
>> - (The OSDs seem to hold the previous PG data still, and loading them
>>   is working fine. So I assume I still can recover the data.)
>> 
>> Does anyone have any advice on this ?
>> I'm planning to try to modify the source code because of no other choice,
>> so that they ignore nonce values :(
> 
> The nonce value is important; you can't just ignore it.  If the addr in 
> the osdmap isn't changing, it is probably because the mon thinks the 
> latest osdmap is N and the OSDs think the latest is >> N.  I would look 
> in the osd data/current/meta directory and see what the newest osdmap 
> epoch is, compare that to 'ceph osd dump', and then do 'ceph osd thrash N' 
> to make it churn through a bunch of maps to get to an epoch that is greater than 
> what the OSDs see.  Once that happens, the osd boot messages will properly 
> update the cluster osdmap with their new addr and things should start up.  
> Until then, the osds will just sit and wait to get a map newer than what 
> they have that will never come...
> 
> sage
> 
>> 
>> Thanks in advance.
>> 
>> regards,
>> Yasu
>> 
>> From: Yasuhiro Ohara 
>> Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
>> Date: Thu, 12 Sep 2013 09:45:51 -0700 (PDT)
>> Message-ID: <20130912.094551.06710597.y...@soe.ucsc.edu>
>> 
>> > 
>> > Hi Joao,
>> > 
>> > Thank you for the response.
>> > I meant "ceph-mon -i X --mkfs".
>> > 
>> > Actually I did it on 3 node. On other 2 mon nodes, the original
>> > mon data were left, but currently all 5 nodes run ceph-mon again.
>> > That I shouldn't do that ?
>> > 
>> > regards,
>> > Yasu
>> > 
>> > From: Joao Eduardo Luis 
>> > Subject: Re: [ceph-users] OSDMap problem: osd does not exist.
>> > Date: Thu, 12 Sep 2013 11:35:40 +0100
>> > Message-ID: <523198fc.8050...@inktank.com>
>> > 
>> >> On 09/12/2013 09:35 AM, Yasuhiro Ohara wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> recently I tried to upgrade from 0.57 to 0.67.3, hit the changes
>> >>> of mon protocol, and so I updated all of the 5 mons.
>> >>> After upgrading the mon, (and during the debugging of other problems,)
>> >>> I removed and created the mon filesystem from scratch.
>> >> 
>> >> What do you mean by this?  Did you recreate the file system on all 5 
>> >> monitors?  Did you backup any of your previous mon data directories?
>> >> 
>> >>   -Joao
>> >> 
>> >> -- 
>> >> Joao Eduardo Luis
>> >> Software Engineer | http://inktank.com | http://ceph.com
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@list