Ceph write performance

2012-07-20 Thread George Shuklin
Good day. I've started to play with Ceph... and I found some strange performance issues. I'm not sure if this is due to a Ceph limitation or my bad setup. Setup: osd - xfs on ramdisk (only one osd); mds - raid0 on 10 disks; mon - second raid0 on 10 disks. I've mounted the ceph share at localhost and
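For anyone wanting to reproduce a ramdisk-backed single-OSD test like the one described above, a minimal sketch follows; the ramdisk size, device name and mount point are my assumptions, not the poster's exact values:

    # create a ~70 GB ramdisk (rd_size is in KiB) and put an xfs filesystem on it
    modprobe brd rd_nr=1 rd_size=70000000
    mkfs.xfs -f /dev/ram0
    mkdir -p /srv/osd0
    mount /dev/ram0 /srv/osd0    # point "osd data" at this directory in ceph.conf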

Re: Ceph write performance

2012-07-20 Thread George Shuklin
On 20.07.2012 14:41, Dieter Kasper (KD) wrote: Good day. Thank you for your attention. Ramdisk size ~70 GB (modprobe brd rd_size=7000); the journal seems to be on the same device as the storage; the size of the OSD was unchanged (... meaning I created it manually and did not make any specific changes). During the test I

Re: Ceph write performance

2012-07-20 Thread Mark Nelson
Hi George, I think you may find that the limitation is in the filestore. It's one of the things I've been working on trying to track down, as I've seen low performance on SSDs with small request sizes as well. You can use test_filestore_workloadgen to specifically test the filestore

Increase number of PG

2012-07-20 Thread Sławomir Skowron
I know that this feature is disabled; are you planning to enable it in the near future? I have many drives, and my S3 installation uses only a few of them at a time, and I need to improve that. When I use it as RBD it uses all of them. Regards, Slawomir Skowron

Re: Ceph write performance

2012-07-20 Thread Matthew Richardson
On 20/07/12 11:24, George Shuklin wrote: Good day. I've started to play with Ceph... and I found some strange performance issues. I'm not sure if this is due to a Ceph limitation or my bad setup. I'm seeing a similar problem which looks like a potential bug, which someone else seems to have

Re: Poor read performance in KVM

2012-07-20 Thread Vladimir Bashkirtsev
Yes, they can hold up reads to the same object. Depending on where they're stuck, they may be blocking other requests as well if they're e.g. taking up all the filestore threads. Waiting for subops means they're waiting for replicas to acknowledge the write and commit it to disk. The real

Tuning placement group

2012-07-20 Thread François Charlier
Hello, reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/ and thinking of building a ceph cluster with potentially 1000 OSDs. Using the recommendations at the previously cited link, it would require pg_num being set between 10,000 and 30,000. Okay with that. Let's use the
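A quick back-of-envelope check of the figures quoted above (my sketch, not part of the original mail); the pool name and the exact pg_num are placeholders, and pg_num is given at pool creation time:

    # 1000 OSDs with the cited 10,000-30,000 total works out to roughly 10-30 PGs per OSD
    OSDS=1000
    echo $(( OSDS * 10 ))   # 10000, low end of the range
    echo $(( OSDS * 30 ))   # 30000, high end of the range
    # create a pool with pg_num/pgp_num chosen inside that range (a power of two here)
    ceph osd pool create mypool 16384 16384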

Re: Ceph write performance

2012-07-20 Thread Gregory Farnum
On Fri, Jul 20, 2012 at 3:24 AM, George Shuklin shuk...@selectel.ru wrote: Good day. I've started to play with Ceph... and I found some strange performance issues. I'm not sure if this is due to a Ceph limitation or my bad setup. Setup: osd - xfs on ramdisk (only one osd); mds - raid0 on 10

Re: Poor read performance in KVM

2012-07-20 Thread Tommi Virtanen
On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote: not running. So I ended up rebooting the hosts and that's where the fun began: btrfs failed to umount, and on boot-up it spat out: btrfs: free space inode generation (0) did not match free space cache generation

Re: Poor read performance in KVM

2012-07-20 Thread Mark Nelson
On 7/20/12 11:42 AM, Tommi Virtanen wrote: On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote: not running. So I ended up rebooting the hosts and that's where the fun began: btrfs failed to umount, and on boot-up it spat out: btrfs: free space inode generation (0) did

Re: Poor read performance in KVM

2012-07-20 Thread Vladimir Bashkirtsev
On 21/07/2012 2:12 AM, Tommi Virtanen wrote: On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote: not running. So I ended up rebooting the hosts and that's where the fun began: btrfs failed to umount, and on boot-up it spat out: btrfs: free space inode generation (0)
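Not a recommendation from the thread, just an editorial note: when btrfs reports that the free space cache generation does not match, one commonly cited way to force the cache to be rebuilt is the clear_cache mount option; the device and mount point below are placeholders:

    # force btrfs to discard and regenerate its free space cache on this mount
    umount /dev/sdX
    mount -o clear_cache /dev/sdX /mnt/osd0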

Re: Tuning placement group

2012-07-20 Thread Florian Haas
On Fri, Jul 20, 2012 at 9:33 AM, François Charlier francois.charl...@enovance.com wrote: Hello, reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/ and thinking of building a ceph cluster with potentially 1000 OSDs. Using the recommendations at the previously cited link, it

Re: Tuning placement group

2012-07-20 Thread Yehuda Sadeh
On Fri, Jul 20, 2012 at 11:08 AM, Florian Haas flor...@hastexo.com wrote: On Fri, Jul 20, 2012 at 9:33 AM, François Charlier francois.charl...@enovance.com wrote: Hello, reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/ and thinking of building a ceph cluster with

Re: Tuning placement group

2012-07-20 Thread Sage Weil
On Fri, 20 Jul 2012, François Charlier wrote: Hello, reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/ and thinking of building a ceph cluster with potentially 1000 OSDs. Using the recommendations at the previously cited link, it would require pg_num being set between
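For clusters built from scratch, the same targets can also be set as pool-creation defaults; a hedged ceph.conf fragment (option names as I recall them, values purely illustrative; check them against your version's documentation):

    [global]
        osd pool default pg num = 16384
        osd pool default pgp num = 16384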

Re: Increase number of PG

2012-07-20 Thread Tommi Virtanen
On Fri, Jul 20, 2012 at 8:31 AM, Sławomir Skowron szi...@gmail.com wrote: I know that this feature is disabled; are you planning to enable it in the near future? PG splitting/joining is the next major project for the OSD. It won't be backported to argonaut, but it will be in the next stable
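Since PG splitting was not yet available at the time, the usual workaround (my sketch, not Tommi's words) was to create a new pool with a higher pg_num and migrate the data into it; the pool name and numbers below are placeholders:

    # inspect current pools and their pg_num
    ceph osd dump | grep pg_num
    # create a replacement pool with a larger pg_num, then migrate data into it
    ceph osd pool create .rgw.buckets.new 2048 2048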

Re: Ceph write performance on RAM-DISK

2012-07-20 Thread Mark Nelson
On 07/20/2012 03:36 PM, Dieter Kasper wrote: Hi Mark, George, I can observe similar (poor) performance on my system with fio on /dev/rbd1. #--- seq. write RBD RX37-0:~ # dd if=/dev/zero of=/dev/rbd1 bs=1024k count=10000 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied,

Running into an error with mkcephfs

2012-07-20 Thread Joe Landman
Hi folks: Setting up a test cluster. Simple ceph.conf [global] auth supported = cephx [mon] mon data = /data/mon$id debug ms = 1 [mon.0] host = n01 mon addr = 10.202.1.142:6789 [mon.1] host = n02 mon addr = 10.202.1.141:6789 [mon.2]

Re: Running into an error with mkcephfs

2012-07-20 Thread Gregory Farnum
If I remember mkcephfs correctly, it deliberately does not create the directories for each store (you'll notice that http://ceph.com/docs/master/start/quick-start/#deploy-the-configuration includes creating the directory for each daemon) — does /data/1/osd0 exist yet? On Fri, Jul 20, 2012 at 2:45
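A hedged sketch of the fix implied above, assuming the data paths shown in the posted ceph.conf and in Greg's question; create the per-daemon directories before running mkcephfs:

    # mkcephfs does not create the data directories itself
    mkdir -p /data/mon0 /data/mon1 /data/mon2
    mkdir -p /data/1/osd0
    mkcephfs -a -c /etc/ceph/ceph.conf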

Re: Ceph write performance on RAM-DISK

2012-07-20 Thread Dieter Kasper
Hi Mark, George, I can observe similar (poor) performance on my system with fio on /dev/rbd1. #--- seq. write RBD RX37-0:~ # dd if=/dev/zero of=/dev/rbd1 bs=1024k count=10000 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 41.1819 s, 255 MB/s #--- seq. read RBD
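Since the mail mentions fio but shows dd, here is a hedged fio equivalent of the same sequential test against /dev/rbd1; block size, iodepth and size are my choices, not Dieter's:

    # sequential write, then sequential read, directly against the RBD block device
    fio --name=seqwrite --filename=/dev/rbd1 --rw=write --bs=1M --direct=1 \
        --ioengine=libaio --iodepth=16 --size=10G
    fio --name=seqread  --filename=/dev/rbd1 --rw=read  --bs=1M --direct=1 \
        --ioengine=libaio --iodepth=16 --size=10G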

Re: Running into an error with mkcephfs

2012-07-20 Thread Tommi Virtanen
On Fri, Jul 20, 2012 at 3:04 PM, Gregory Farnum g...@inktank.com wrote: If I remember mkcephfs correctly, it deliberately does not create the directories for each store (you'll notice that http://ceph.com/docs/master/start/quick-start/#deploy-the-configuration includes creating the directory

[PATCH 0/9] messenger fixups, batch #1

2012-07-20 Thread Sage Weil
This is the series I elbowed into testing today. There is the CRUSH support, and some incremental cleanups and fixes to the messenger that my testing with socket failure injection turned up. The basic strategy here is to move more things under the con-mutex and drop useless checks/calls where

[PATCH 3/9] libceph: report socket read/write error message

2012-07-20 Thread Sage Weil
We need to set error_msg to something useful before calling ceph_fault(); do so here for try_{read,write}(). This is more informative than libceph: osd0 192.168.106.220:6801 (null) Signed-off-by: Sage Weil s...@inktank.com --- net/ceph/messenger.c |8 ++-- 1 files changed, 6

[PATCH 2/9] libceph: support crush tunables

2012-07-20 Thread Sage Weil
From: caleb miles caleb.mi...@inktank.com The server side recently added support for tuning some magic crush variables. Decode these variables if they are present, or use the default values if they are not present. Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5.

[PATCH 9/9] libceph: reset connection retry on successful negotiation

2012-07-20 Thread Sage Weil
We exponentially back off when we encounter connection errors. If several errors accumulate, we will eventually wait ages before even trying to reconnect. Fix this by resetting the backoff counter after a successful negotiation/connection with the remote node. Fixes ceph issue #2802.
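The retry logic being fixed is easiest to see in pseudocode; this is only an illustration of the backoff-reset idea in the commit message, not the kernel code (try_connect and do_work are placeholders):

    delay=1
    while true; do
        if try_connect; then     # try_connect stands in for the real connection attempt
            delay=1              # successful negotiation: reset the exponential backoff
            do_work              # placeholder for normal operation on the connection
        else
            sleep "$delay"
            delay=$(( delay >= 64 ? 64 : delay * 2 ))   # back off exponentially, capped
        fi
    done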

[PATCH 7/9] ceph: close old con before reopening on mds reconnect

2012-07-20 Thread Sage Weil
When we detect an mds session reset, close the old ceph_connection before reopening it. This ensures we clean up the old socket properly and keep the ceph_connection state correct. Signed-off-by: Sage Weil s...@inktank.com --- fs/ceph/mds_client.c |1 + 1 files changed, 1 insertions(+), 0

[PATCH 1/9] libceph: move feature bits to separate header

2012-07-20 Thread Sage Weil
This is simply cleanup that will keep things more closely synced with the userland code. Signed-off-by: Sage Weil s...@inktank.com --- fs/ceph/mds_client.c | 1 + fs/ceph/super.c | 1 + include/linux/ceph/ceph_features.h | 24

[PATCH 5/9] libceph: resubmit linger ops when pg mapping changes

2012-07-20 Thread Sage Weil
The linger op registration (i.e., watch) modifies the object state. As such, the OSD will reply with success if it has already applied without doing the associated side-effects (setting up the watch session state). If we lose the ACK and resubmit, we will see success but the watch will not be

[PATCH 8/9] libceph: protect ceph_con_open() with mutex

2012-07-20 Thread Sage Weil
Take the con mutex while we are initiating a ceph open. This is necessary because the connection may have previously been in use and then closed, which could result in a racing workqueue running con_work(). Signed-off-by: Sage Weil s...@inktank.com --- net/ceph/messenger.c |2 ++ 1 files changed, 2

[PATCH 6/9] libceph: (re)initialize bio_iter on start of message receive

2012-07-20 Thread Sage Weil
Previously, we were opportunistically initializing the bio_iter if it appeared to be uninitialized in the middle of the read path. The problem is that a sequence like: - start reading message - initialize bio_iter - read half a message - messenger fault, reconnect - restart reading message

32bit ceph-osd on 64bit kernel

2012-07-20 Thread Smart Weblications GmbH - Florian Wiessner
Hi List, when running a 32-bit ceph-osd binary on a 64-bit kernel I get: [ 601.181640] ioctl32(ceph-osd:3399): Unknown cmd fd(20) cmd(9408){t:ff94;sz:0} arg(000f) on /data/ceph_backend/osd [ 612.156104] ioctl32(ceph-osd:3399): Unknown cmd fd(20) cmd(9408){t:ff94;sz:0}
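Two quick checks that make the 32-bit/64-bit mismatch described above visible (generic commands, not taken from the thread):

    uname -m                        # kernel architecture, e.g. x86_64
    file "$(command -v ceph-osd)"   # reports whether the ceph-osd binary is 32-bit or 64-bit ELF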