Re: Ceph performance improvement
Hello Mark,

> Not sure what version of glibc Wheezy has, but try to make sure you have one
> that supports syncfs (you'll also need a semi-new kernel, 3.0+ should be fine).

Wheezy has a fairly recent kernel:

# uname -a
Linux ceph-osd-0 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64 GNU/Linux

> default values are quite a bit lower for most of these. You may want to play
> with them and see if it has an effect.

I found these values on this ML. I haven't tried tweaking them, but it is already much better than with the default values. I will try changing them.

> RBD caching should definitely be enabled for a test like this. I'd be
> surprised if you got 42MB/s without it though...

root@ceph-osd-0:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep rbd
  debug_rbd = 0/5
  rbd_cache = false
  rbd_cache_size = 33554432
  rbd_cache_max_dirty = 25165824
  rbd_cache_target_dirty = 16777216
  rbd_cache_max_dirty_age = 1

In my opinion, performance from the RBD client is decent. Unfortunately I need concurrent access, and CephFS is really appealing in that respect.

> Ouch, that's taking a while! In addition to the comments that David made, be
> aware that you are also testing the metadata server with CephFS. Right now
> that's not getting a lot of attention as we are primarily focusing on RADOS
> performance. For this kind of test though, distributed filesystems will never
> be as good as local disks...

Yes, it may be the MDS that is the bottleneck. Perhaps I should have a lot of them...

> Are you putting both journals on the SSD when you add an OSD? If so, what's
> the throughput your SSD can sustain?

Both journals are on the SSD. It seems that when I run "ceph-osd -i $id --mkfs --mkkey" it creates the journal according to the settings in ceph.conf. I did some tests and my SSD drive is somewhat broken... The Crucial C300 is a bit old and can only do 80MB/s writing.

> You may want to check and see how big the IOs going to disk are on the OSD
> node, and how quickly you are filling up the journal vs writing out to disk.
> collectl -sD -oT will give you a nice report. Iostat can probably tell you
> all of the same stuff with the right flags.

Thank you for that tool.

Denis
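For reference, based on the option names in the config dump above, enabling the cache on the client side would look roughly like the sketch below. The [client] section placement and the qemu cache mode are assumptions, not something verified against 0.49, and these settings only affect librbd consumers (e.g. qemu); the kernel rbd driver ignores them.

    # /etc/ceph/ceph.conf on the client -- sketch only
    [client]
            rbd cache = true                   # enable the librbd write-back cache
            rbd cache size = 33554432          # 32 MB, the default shown above
            rbd cache max dirty = 25165824     # flush once 24 MB of dirty data accumulates
            rbd cache target dirty = 16777216  # start background flushing at 16 MB

    # if the client is qemu, the drive should probably also be attached with
    # cache=writeback so that guest flushes are honoured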
Re: Ceph performance improvement
> Not sure what version of glibc Wheezy has, but try to make sure you have one
> that supports syncfs (you'll also need a semi-new kernel, 3.0+ should be fine).

Hi, glibc from Wheezy doesn't have syncfs support.

----- Original Message -----
From: Mark Nelson mark.nel...@inktank.com
To: Denis Fondras c...@ledeuns.net
Cc: ceph-devel@vger.kernel.org
Sent: Wednesday, 22 August 2012 14:35:28
Subject: Re: Ceph performance improvement

On 08/22/2012 03:54 AM, Denis Fondras wrote:
> Hello all,

Hello!

David had some good comments in his reply, so I'll just add in a couple of extra thoughts...

> I'm currently testing Ceph. So far it seems that HA and recovery are very
> good. The only point that prevents me from using it at datacenter scale is
> performance.
>
> First of all, here is my setup:
> - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 -
>   4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49

Not sure what version of glibc Wheezy has, but try to make sure you have one that supports syncfs (you'll also need a semi-new kernel, 3.0+ should be fine).

> (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive for
> the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal and 4x 3TB
> drives (Western Digital WD30EZRX). Everything but the boot partition is
> BTRFS-formatted and 4K-aligned.
> - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and Ceph
>   version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
>
> Both servers are linked over a 1Gb Ethernet switch (iperf shows about 960Mb/s).
>
> Here is my ceph.conf:
> --cut-here--
> [global]
>         auth supported = cephx
>         keyring = /etc/ceph/keyring
>         journal dio = true
>         osd op threads = 24
>         osd disk threads = 24
>         filestore op threads = 6
>         filestore queue max ops = 24
>         osd client message size cap = 1400
>         ms dispatch throttle bytes = 1750

default values are quite a bit lower for most of these. You may want to play with them and see if it has an effect.

> [mon]
>         mon data = /home/mon.$id
>         keyring = /etc/ceph/keyring.$name
> [mon.a]
>         host = ceph-osd-0
>         mon addr = 192.168.0.132:6789
> [mds]
>         keyring = /etc/ceph/keyring.$name
> [mds.a]
>         host = ceph-osd-0
> [osd]
>         osd data = /home/osd.$id
>         osd journal = /home/osd.$id.journal
>         osd journal size = 1000
>         keyring = /etc/ceph/keyring.$name
> [osd.0]
>         host = ceph-osd-0
>         btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
>         btrfs options = rw,noatime

Just fyi, we are trying to get away from btrfs devs.

> --cut-here--
>
> Here are some figures:
>
> * Test with dd on the OSD server (on drive
>   /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201):
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

Good job using a data file that is much bigger than main memory! That looks pretty accurate for a 7200rpm spinning disk. For dd benchmarks, you should probably throw in conv=fdatasync at the end though.

> = iostat (on the OSD server):
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,00    0,00    0,52   41,99    0,00   57,48
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdf             247,00         0,00    125520,00          0     125520
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD server
>   (on drive /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201):
> # time tar xzf src.tar.gz
> real    0m9.669s
> user    0m8.405s
> sys     0m4.736s
> # time rm -rf *
> real    0m3.647s
> user    0m0.036s
> sys     0m3.552s
>
> = iostat (on the OSD server):
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           10,83    0,00   28,72   16,62    0,00   43,83
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdf            1369,00         0,00      9300,00          0       9300
>
> * Test with dd from the client using RBD:
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

RBD caching should definitely be enabled for a test like this. I'd be surprised if you got 42MB/s without it though...

> = iostat (on the OSD server):
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4,57    0,00   30,46   27,66    0,00   37,31
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda             317,00         0,00     57400,00          0      57400
> sdf             237,00         0,00     88336,00          0      88336
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the client
>   using RBD:
> # time tar xzf src.tar.gz
> real    0m26.955s
> user    0m9.233s
> sys     0m11.425s
> # time rm -rf *
> real    0m8.545s
> user    0m0.128s
> sys     0m8.297s
>
> = iostat (on the OSD server):
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4,59    0,00   24,74   30,61    0,00   40,05
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda             239,00         0,00     54772,00          0      54772
> sdf             441,00         0,00     50836,00          0      50836
>
> * Test with dd from the client using CephFS:
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> = iostat (on the OSD server):
> avg-cpu:  %user   %nice %system %iowait
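For completeness, the conv=fdatasync variant Mark is referring to would look like the following; it is the same 16 GiB streaming write as above, but the elapsed time (and therefore the MB/s figure) then includes the final flush of the page cache to disk. The larger-block variant is only a hypothetical extra to reduce per-call overhead.

    dd if=/dev/zero of=testdd bs=4k count=4M conv=fdatasync

    # hypothetical variant with bigger blocks (4M x 4096 = 16 GiB as well)
    dd if=/dev/zero of=testdd bs=4M count=4096 conv=fdatasync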
Re: Ceph performance improvement
On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras c...@ledeuns.net wrote:
>> Are you sure your osd data and journal are on the disks you think? The
>> /home paths look suspicious -- especially for journal, which often should
>> be a block device.
>
> I am :)
> ...
> -rw-r--r-- 1 root root 1048576000 Aug 22 17:22 /home/osd.0.journal

Your journal is a file on a btrfs partition. That is probably a bad idea for performance. I'd recommend partitioning the drive and using partitions as journals directly.
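Applied to the setup in this thread, that could look roughly like the sketch below; /dev/sda1 and /dev/sda2 are hypothetical partitions carved out of the C300 SSD beforehand, and the section layout mirrors the ceph.conf quoted earlier.

    # sketch only -- assumes the SSD has already been partitioned, one small
    # partition per OSD journal
    [osd.0]
            host = ceph-osd-0
            osd journal = /dev/sda1    # raw partition, no filesystem, no journal file
    [osd.1]
            host = ceph-osd-0
            osd journal = /dev/sda2

    # with "osd journal" pointing at a block device the whole partition is used,
    # so the "osd journal size" line should no longer be needed in [osd]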
Re: Ceph performance improvement / journal on block-dev
On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote:
(...)
> Your journal is a file on a btrfs partition. That is probably a bad idea
> for performance. I'd recommend partitioning the drive and using partitions
> as journals directly.

Hi Tommi,

can you please teach me how to use the right parameter(s) to realize 'journal on block-dev'?
It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs' (see below).

Regards,
-Dieter

e.g.
---snip---
modprobe -v brd rd_nr=6 rd_size=1000    # 6x 10G RAM DISK

/etc/ceph/ceph.conf
--
[global]
        auth supported = none
        # set log file
        log file = /ceph/log/$name.log
        log_to_syslog = true        # uncomment this line to log to syslog
        # set up pid files
        pid file = /var/run/ceph/$name.pid

[mon]
        mon data = /ceph/$name
        debug optracker = 0
[mon.alpha]
        host = 127.0.0.1
        mon addr = 127.0.0.1:6789

[mds]
        debug optracker = 0
[mds.0]
        host = 127.0.0.1

[osd]
        osd data = /data/$name
[osd.0]
        host = 127.0.0.1
        btrfs devs = /dev/ram0
        osd journal = /dev/ram3
[osd.1]
        host = 127.0.0.1
        btrfs devs = /dev/ram1
        osd journal = /dev/ram4
[osd.2]
        host = 127.0.0.1
        btrfs devs = /dev/ram2
        osd journal = /dev/ram5
--

root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs
temp dir is /tmp/mkcephfs.wzARGSpFB6
preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de
epoch 0
fsid 40b997ea-387a-4deb-9a30-805cd076a0de
last_changed 2012-08-22 21:04:00.553972
created 2012-08-22 21:04:00.553972
0: 127.0.0.1:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 monitors)
=== osd.0 ===
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.0: not mounted
umount: /dev/ram0: not mounted

Btrfs v0.19.1+
ATTENTION: mkfs.btrfs is not intended to be used directly.
Please use the YaST partitioner to create and manage btrfs filesystems to be in a supported state on SUSE Linux Enterprise systems.
fs created label (null) on /dev/ram0
        nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB
Scanning for Btrfs filesystems
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de
creating private key for osd.0 keyring /data/osd.0/keyring
creating /data/osd.0/keyring
collecting osd.0 key
=== osd.1 ===
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.1: not mounted
(...)
Re: Ceph performance improvement / journal on block-dev
On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD) d.kas...@kabelmail.de wrote:
>> Your journal is a file on a btrfs partition. That is probably a bad idea
>> for performance. I'd recommend partitioning the drive and using partitions
>> as journals directly.
>
> can you please teach me how to use the right parameter(s) to realize
> 'journal on block-dev'?

Replacing the example paths, use "sudo parted /dev/sdg" or "gksu gparted /dev/sdg", create partitions, and set osd journal to point to a block device for a partition:

[osd.42]
        osd journal = /dev/sdg4

> It looks like something is not OK during
> 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs' (see below)

Try running it with -x for any chance of extracting debuggable information from the monster.

> Scanning for Btrfs filesystems
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid
> 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected
> ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal

Based on that, my best guess would be that you're seeing a journal from an old run -- perhaps you need to explicitly clear out the block device contents.

Frankly, you should not use btrfs devs. Any convenience you may gain is more than doubly offset by pains exactly like these.
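Spelling that out a little more (command lines are a sketch; the device name is the same example as above, the partition sizes are arbitrary, and the wipe step is an assumption based on the "someone else's journal" error rather than a verified fix):

    # create journal partitions on the example disk
    sudo parted -s /dev/sdg mklabel gpt
    sudo parted -s /dev/sdg mkpart primary 1MiB 10GiB     # becomes /dev/sdg1
    sudo parted -s /dev/sdg mkpart primary 10GiB 20GiB    # becomes /dev/sdg2

    # then per OSD in ceph.conf:
    # [osd.42]
    #         osd journal = /dev/sdg1

    # for the RAM-disk run above, zeroing the start of each journal device
    # before re-running mkcephfs should clear the stale on-disk fsid
    for dev in /dev/ram3 /dev/ram4 /dev/ram5; do
            sudo dd if=/dev/zero of=$dev bs=1M count=16
    done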
Re: Ceph performance improvement
On 22/08/12 22:24, David McBride wrote:
> On 22/08/12 09:54, Denis Fondras wrote:
>> * Test with dd from the client using CephFS:
>> # dd if=/dev/zero of=testdd bs=4k count=4M
>> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> Again, the synchronous nature of 'dd' is probably severely affecting
> apparent performance. I'd suggest looking at some other tools, like fio,
> bonnie++, or iozone, which might generate more representative load. (Or, if
> you have a specific use-case in mind, something that generates an IO pattern
> like what you'll be using in production would be ideal!)

Appending conv=fsync to the dd will make the comparison fair enough.

Looking at the Ceph code, it does

  sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE);

which is very fast - way faster than fdatasync() and friends (I have tested this... see my previous posting on random write performance, with the file writetest.c attached).

I am not convinced that these sorts of tests are in any way 'unfair' - for instance, I would like to use rbd for postgres or mysql data volumes... and many database actions involve a stream of block writes similar enough to doing dd (e.g. bulk row loads, appends to transaction log journals).

Cheers

Mark
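As an aside, an fio job roughly equivalent to the streaming dd runs discussed in this thread might look like the line below; the parameters are assumptions to be tuned for the actual workload, not a recommendation from anyone on the thread.

    # 16 GiB sequential 4k write, async submission, one fsync at the end
    fio --name=seqwrite --filename=testdd --rw=write --bs=4k --size=16g \
        --ioengine=libaio --iodepth=16 --direct=1 --end_fsync=1

Swapping --end_fsync=1 for --fsync=1 would force a flush after every write, which is closer to the transaction-log style of workload mentioned above.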