Re: Ceph performance improvement

2012-08-24 Thread Denis Fondras

Hello Mark,



Not sure what version of glibc Wheezy has, but try to make sure you have
one that supports syncfs (you'll also need a semi-new kernel, 3.0+
should be fine).



Wheezy has a fairly recent kernel:
# uname -a
Linux ceph-osd-0 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 
x86_64 GNU/Linux




default values are quite a bit lower for most of these.  You may want to
play with them and see if it has an effect.



I found these values on this ML. I haven't tried to tweak them yet, but 
performance is much better than with the default values. I will try 
changing them.
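
(For reference, a quick way to see which values are actually in effect is 
the same admin-socket query used for the rbd settings below; the socket 
path assumes the default location and the grep pattern is only an example:)

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'filestore|osd_op|throttle'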




RBD caching should definitely be enabled for a test like this.  I'd be
surprised if you got 42MB/s without it though...



root@ceph-osd-0:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok 
config show | grep rbd

debug_rbd = 0/5
rbd_cache = false
rbd_cache_size = 33554432
rbd_cache_max_dirty = 25165824
rbd_cache_target_dirty = 16777216
rbd_cache_max_dirty_age = 1
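
(For what it's worth, a minimal sketch of enabling the cache Mark refers to 
- assuming the option goes in the [client] section of ceph.conf on the RBD 
client, matching the rbd_cache value shown above:)

[client]
rbd cache = true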

In my opinion, performance from the RBD client is decent.
Unfortunately I need concurrent access, and CephFS is really appealing 
in that respect.




Ouch, that's taking a while!  In addition to the comments that David
made, be aware that you are also testing the metadata server with
cephFS.  Right now that's not getting a lot of attention as we are
primarily focusing on RADOS performance at the moment.  For this kind of
test though, distributed filesystems will never be as good as local
disks...



Yes, the MDS may well be the bottleneck. Perhaps I should run several 
of them...




Are you putting both journals on the SSD when you add an OSD?  If so,
what's the throughput your SSD can sustain?



Both journals are on the SSD. It seems that when I run ceph-osd -i $id 
--mkfs --mkkey, it creates the journal according to the settings in 
ceph.conf.
I did some tests and my SSD drive is underperforming... the Crucial 
C300 is a bit old and can only do about 80 MB/s of writes.
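
(For anyone wanting to reproduce that measurement, a rough sequential-write 
check could look like the line below; the output path is only an example, 
and oflag=direct keeps the page cache out of the result:)

# dd if=/dev/zero of=/path/on/ssd/testfile bs=1M count=4096 oflag=direct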




You may want to check and see how big the IOs going to disk are on the
OSD node, and how quickly you are filling up the journal vs writing out
to disk.  collectl -sD -oT will give you a nice report.  Iostat can
probably tell you all of the same stuff with the right flags.



Thank you for that tool.
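
(For reference, an iostat invocation that gives a similar per-device view 
might be the following - extended per-device statistics in kB, refreshed 
every second; the flags are an editorial guess at the "right flags", not 
taken from the thread:)

# iostat -dxk 1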

Denis


Re: Ceph performance improvement

2012-08-22 Thread Alexandre DERUMIER
Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine). 

Hi, glibc from Wheezy doesn't have syncfs support.
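
(A quick way to confirm that: syncfs() only appeared in glibc 2.14, so 
checking the installed version is enough - Wheezy shipped an older glibc 
at the time:)

# ldd --version | head -n1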

- Original Message - 

From: Mark Nelson mark.nel...@inktank.com 
To: Denis Fondras c...@ledeuns.net 
Cc: ceph-devel@vger.kernel.org 
Sent: Wednesday, 22 August 2012 14:35:28 
Subject: Re: Ceph performance improvement 

On 08/22/2012 03:54 AM, Denis Fondras wrote: 
 Hello all, 

Hello! 

David had some good comments in his reply, so I'll just add in a couple 
of extra thoughts... 

 
 I'm currently testing Ceph. So far it seems that HA and recovery are 
 very good. 
 The only point that prevents me from using it at datacenter-scale is 
 performance. 
 
 First of all, here is my setup : 
 - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 - 
 4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 

Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine). 

 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive 
 for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal 
 and 4x 3TB drive (Western Digital WD30EZRX). Everything but the boot 
 partition is BTRFS-formatted and 4K-aligned. 
 - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and 
 Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). 
 Both servers are linked over a 1Gb Ethernet switch (iperf shows about 
 960Mb/s). 
 
 Here is my ceph.conf : 
 --cut-here-- 
 [global] 
 auth supported = cephx 
 keyring = /etc/ceph/keyring 
 journal dio = true 
 osd op threads = 24 
 osd disk threads = 24 
 filestore op threads = 6 
 filestore queue max ops = 24 
 osd client message size cap = 1400 
 ms dispatch throttle bytes = 1750 
 

default values are quite a bit lower for most of these. You may want to 
play with them and see if it has an effect. 

 [mon] 
 mon data = /home/mon.$id 
 keyring = /etc/ceph/keyring.$name 
 
 [mon.a] 
 host = ceph-osd-0 
 mon addr = 192.168.0.132:6789 
 
 [mds] 
 keyring = /etc/ceph/keyring.$name 
 
 [mds.a] 
 host = ceph-osd-0 
 
 [osd] 
 osd data = /home/osd.$id 
 osd journal = /home/osd.$id.journal 
 osd journal size = 1000 
 keyring = /etc/ceph/keyring.$name 
 
 [osd.0] 
 host = ceph-osd-0 
 btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 
 btrfs options = rw,noatime 

Just fyi, we are trying to get away from btrfs devs. 

 --cut-here-- 
 
 Here are some figures : 
 * Test with dd on the OSD server (on drive 
 /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) : 
 # dd if=/dev/zero of=testdd bs=4k count=4M 
 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s 

Good job using a data file that is much bigger than main memory! That 
looks pretty accurate for a 7200rpm spinning disk. For dd benchmarks, 
you should probably throw in conv=fdatasync at the end though. 
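
(That is, something along these lines - a sketch only, same file and block 
size as above:)

# dd if=/dev/zero of=testdd bs=4k count=4M conv=fdatasync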

 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 0,00 0,00 0,52 41,99 0,00 57,48 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sdf 247,00 0,00 125520,00 0 125520 
 
 * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD 
 server (on drive 
 /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) : 
 # time tar xzf src.tar.gz 
 real 0m9.669s 
 user 0m8.405s 
 sys 0m4.736s 
 
 # time rm -rf * 
 real 0m3.647s 
 user 0m0.036s 
 sys 0m3.552s 
 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 10,83 0,00 28,72 16,62 0,00 43,83 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sdf 1369,00 0,00 9300,00 0 9300 
 
 * Test with dd from the client using RBD : 
 # dd if=/dev/zero of=testdd bs=4k count=4M 
 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s 

RBD caching should definitely be enabled for a test like this. I'd be 
surprised if you got 42MB/s without it though... 

 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 4,57 0,00 30,46 27,66 0,00 37,31 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sda 317,00 0,00 57400,00 0 57400 
 sdf 237,00 0,00 88336,00 0 88336 
 
 * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
 client using RBD : 
 # time tar xzf src.tar.gz 
 real 0m26.955s 
 user 0m9.233s 
 sys 0m11.425s 
 
 # time rm -rf * 
 real 0m8.545s 
 user 0m0.128s 
 sys 0m8.297s 
 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait %steal %idle 
 4,59 0,00 24,74 30,61 0,00 40,05 
 
 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
 sda 239,00 0,00 54772,00 0 54772 
 sdf 441,00 0,00 50836,00 0 50836 
 
 * Test with dd from the client using CephFS : 
 # dd if=/dev/zero of=testdd bs=4k count=4M 
 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s 
 
 = iostat (on the OSD server) : 
 avg-cpu: %user %nice %system %iowait

Re: Ceph performance improvement

2012-08-22 Thread Tommi Virtanen
On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras c...@ledeuns.net wrote:
 Are you sure your osd data and journal are on the disks you think? The
 /home paths look suspicious -- especially for journal, which often
 should be a block device.
 I am :)
...
 -rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Your journal is a file on a btrfs partition. That is probably a bad
idea for performance. I'd recommend partitioning the drive and using
partitions as journals directly.


Re: Ceph performance improvement / journal on block-dev

2012-08-22 Thread Dieter Kasper (KD)
On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote:
(...)
 
 Your journal is a file on a btrfs partition. That is probably a bad
 idea for performance. I'd recommend partitioning the drive and using
 partitions as journals directly.

Hi Tommi,

can you please teach me how to use the right parameter(s) to realize 'journal 
on block-dev' ?

It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf 
--mkbtrfs'
(see below)

Regards,
-Dieter


e.g.
---snip---
modprobe -v brd rd_nr=6 rd_size=1000   # 6x 10G RAM DISK

/etc/ceph/ceph.conf
--
[global]
auth supported = none

# set log file
log file = /ceph/log/$name.log
log_to_syslog = true   # uncomment this line to log to syslog

# set up pid files
pid file = /var/run/ceph/$name.pid

[mon]  
mon data = /ceph/$name
debug optracker = 0

[mon.alpha]
host = 127.0.0.1
mon addr = 127.0.0.1:6789

[mds]
debug optracker = 0

[mds.0]
host = 127.0.0.1

[osd]
osd data = /data/$name

[osd.0]
host = 127.0.0.1
btrfs devs  = /dev/ram0
osd journal = /dev/ram3

[osd.1]
host = 127.0.0.1
btrfs devs  = /dev/ram1
osd journal = /dev/ram4

[osd.2]
host = 127.0.0.1
btrfs devs  = /dev/ram2
osd journal = /dev/ram5
--

root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs
temp dir is /tmp/mkcephfs.wzARGSpFB6
preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print 
/tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de
epoch 0
fsid 40b997ea-387a-4deb-9a30-805cd076a0de
last_changed 2012-08-22 21:04:00.553972
created 2012-08-22 21:04:00.553972
0: 127.0.0.1:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 
monitors)
=== osd.0 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.0: not mounted
umount: /dev/ram0: not mounted

Btrfs v0.19.1+

ATTENTION:

mkfs.btrfs is not intended to be used directly. Please use the
YaST partitioner to create and manage btrfs filesystems to be
in a supported state on SUSE Linux Enterprise systems.

fs created label (null) on /dev/ram0
nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB
Scanning for Btrfs filesystems
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 
8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected 
ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not 
find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 
journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de
creating private key for osd.0 keyring /data/osd.0/keyring
creating /data/osd.0/keyring
collecting osd.0 key
=== osd.1 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.1: not mounted
(...)




Re: Ceph performance improvement / journal on block-dev

2012-08-22 Thread Tommi Virtanen
On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD)
d.kas...@kabelmail.de wrote:
 Your journal is a file on a btrfs partition. That is probably a bad
 idea for performance. I'd recommend partitioning the drive and using
 partitions as journals directly.
 can you please teach me how to use the right parameter(s) to realize 'journal 
 on block-dev' ?

Replacing the example paths, use sudo parted /dev/sdg or gksu
gparted /dev/sdg, create partitions, set osd journal to point to a
block device for a partition.

[osd.42]
osd journal = /dev/sdg4
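
(Spelled out, the partitioning step might look roughly like this - the 
device, the GPT label and the 10GiB journal size are examples only, and 
the exact mkpart arguments vary a bit between parted versions:)

# parted -s /dev/sdg mklabel gpt
# parted -s /dev/sdg mkpart journal0 1MiB 10GiB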

 It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf 
 --mkbtrfs'
 (see below)

Try running it with -x for any chance of extracting debuggable
information from the monster.

 Scanning for Btrfs filesystems
  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 
 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected 
 ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal

Based on that, my best guess would be that you're seeing a journal
from an old run -- perhaps you need to explicitly clear out the block
device contents..
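
(A hedged sketch of what that clearing could look like with the RAM-disk 
journals from the config above - zeroing the start of each device should 
be enough to invalidate the stale journal header:)

# dd if=/dev/zero of=/dev/ram3 bs=1M count=16
# dd if=/dev/zero of=/dev/ram4 bs=1M count=16
# dd if=/dev/zero of=/dev/ram5 bs=1M count=16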

Frankly, you should not use btrfs devs. Any convenience you may gain
is more than doubly offset by pains exactly like these.


Re: Ceph performance improvement

2012-08-22 Thread Mark Kirkwood

On 22/08/12 22:24, David McBride wrote:

On 22/08/12 09:54, Denis Fondras wrote:


* Test with dd from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s


Again, the synchronous nature of 'dd' is probably severely affecting 
apparent performance.  I'd suggest looking at some other tools, like 
fio, bonnie++, or iozone, which might generate more representative load.


(Or, if you have a specific use-case in mind, something that generates 
an IO pattern like what you'll be using in production would be ideal!)
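
(As an illustration of what that might look like with fio - the job 
parameters below are editorial guesses, not from the thread, and the 
directory is assumed to be the CephFS mount point:)

# fio --name=seqwrite --directory=/mnt/cephfs --rw=write --bs=4M --size=4G --ioengine=libaio --iodepth=16 --end_fsync=1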





Appending conv=fsync to the dd will make the comparison fair enough. 
Looking at the ceph code, it does



sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE);

which is very fast - way faster than fdatasync() and friends (I have 
tested this... see my previous posting on random write performance, 
with the file writetest.c attached).


I am not convinced that these sorts of tests are in any way 'unfair' - 
for instance, I would like to use RBD for Postgres or MySQL data 
volumes... and many database actions involve a stream of block writes 
similar enough to dd (e.g. bulk row loads, appends to transaction log 
journals).


Cheers

Mark