[ceph-users] Proper procedure for osd/host removal
Hello, I've been working to upgrade the hardware on a semi-production ceph cluster, following the instructions for OSD removal from http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual. Basically, I've added the new hosts to the cluster and now I'm removing the old ones from it. What I found curious is that after the sync triggered by 'ceph osd out <id>' finishes and I stop the osd process and remove it from the crush map, another session of synchronization is triggered - sometimes this one takes longer than the first. Also, removing an empty host bucket from the crush map triggered another resynchronization. I noticed that the overall weight of the host bucket does not change in the crush map as a result of one OSD being out, so what is happening is more or less normal behavior - however, it remains time-consuming. Is there something that can be done to avoid the double resync? I'm running 0.72.2 on top of Ubuntu 12.04 on the OSD hosts. Thanks, Dinu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
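For reference, the manual removal sequence from the documentation cited above boils down to the following commands, run once per OSD being retired (<id> is the OSD number; the stop command is upstart syntax for Ubuntu - sysvinit installs use 'service ceph stop osd.<id>' instead):

  # rebalance the data off the OSD and wait for "ceph -s" to report HEALTH_OK again
  ceph osd out <id>
  # stop the daemon on its host
  stop ceph-osd id=<id>
  # remove it from the CRUSH map, delete its key and its OSD entry
  ceph osd crush remove osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>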
Re: [ceph-users] Proper procedure for osd/host removal
Thanks - I suspected as much. I was thinking of a course of action that would allow setting the weight of an entire host to zero in the crush map - thus forcing the migration of the data out of the OSDs of that host, followed by the crush and osd removal, one by one (hopefully this time without another backfill session). The problem is I don't have anywhere to test how that would work and/or what the side-effects would be (if any). On 15 Dec 2014, at 21:07, Adeel Nazir ad...@ziptel.ca wrote: I'm going through something similar, and it seems like the double backfill you're experiencing is about par for the course. According to the CERN presentation (http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern slide 19), doing a 'ceph osd crush rm osd.<id>' should save the double backfill, but I haven't experienced that in my 0.80.5 cluster. Even after I do the crush rm and finally remove it via 'ceph osd rm osd.<id>', it computes a new map and does the backfill again. As far as I can tell, there's no way around it without editing the map manually, making whatever changes you require and then pushing the new map. I personally am not experienced enough to feel comfortable making that kind of a change. Adeel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
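A hedged sketch of the approach discussed above - draining the host in a single rebalance by zeroing the CRUSH weights first, so that the later removals find the OSDs already empty. The OSD ids and host name are placeholders, and this is untested on the releases mentioned in the thread:

  # zero the CRUSH weight of every OSD on the old host; this triggers one backfill
  for id in 10 11 12; do
      ceph osd crush reweight osd.$id 0
  done
  # wait for the cluster to return to HEALTH_OK, then remove the now-empty OSDs
  for id in 10 11 12; do
      ceph osd out $id
      stop ceph-osd id=$id
      ceph osd crush remove osd.$id
      ceph auth del osd.$id
      ceph osd rm $id
  done
  # finally drop the empty host bucket from the CRUSH map
  ceph osd crush remove old-host-1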
Re: [ceph-users] Openstack Havana root fs resize don't work
There’s a known issue with Havana’s rbd driver in nova and it has nothing to do with ceph. Unfortunately, it is only fixed in icehouse. See https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1219658 for more details. I can confirm that applying the patch manually works. On 05 Aug 2014, at 11:00, Hauke Bruno Wollentin hauke-bruno.wollen...@innovo-cloud.de wrote: Hi folks, we use Ceph Dumpling as the storage backend for Openstack Havana. However, our instances are not able to resize their root filesystem. This issue only occurs for the virtual root disk. If we start instances with an attached volume, the virtual volume disk's size is correct. Our infrastructure: - 1 OpenStack Controller - 1 OpenStack Neutron Node - 1 OpenStack Cinder Node - 4 KVM Hypervisors - 4 Ceph-Storage Nodes including mons - 1 dedicated mon As OS we use Ubuntu 12.04. Our cinder.conf on the Cinder Node: volume_driver = cinder.volume.driver.RBDDriver rbd_pool = volumes rbd_secret = SECRET rbd_user = cinder rbd_ceph_conf = /etc/ceph/ceph.conf rbd_max_clone_depth = 5 glance_api_version = 2 Our nova.conf on the hypervisors: libvirt_images_type=rbd libvirt_images_rbd_pool=volumes libvirt_images_rbd_ceph_conf=/etc/ceph/ceph.conf rbd_user=admin rbd_secret_uuid=SECRET libvirt_inject_password=false libvirt_inject_key=false libvirt_inject_partition=-2 In our instances we see that the virtual disk isn't _updated_ in its size. It still uses the size specified in the image. We use growrootfs in our images as described in the documentation and have verified its functionality (we switched temporarily to LVM as the storage backend; that works). Our images are manually created according to the documentation (meaning only 1 partition, no swap, cloud-utils etc.). Does anyone have any hints on how to solve this issue? Cheers, Hauke ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
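Until a patched nova is in place, one possible one-off workaround is to grow the ephemeral RBD image by hand and let the in-guest tooling pick up the extra space on the next boot - a sketch only, with a hypothetical instance name and the pool taken from the nova.conf above:

  # check the actual size of the image backing the instance's root disk
  rbd -p volumes info instance-000004d2_disk
  # grow it to the flavor's root disk size (here 20 GB, given in MB)
  rbd -p volumes resize --size 20480 instance-000004d2_disk
  # then reboot the instance so growroot/cloud-init can extend the partition and filesystem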
[ceph-users] Move osd disks between hosts
I'm running a ceph cluster with 3 mon and 4 osd nodes (32 disks total) and I've been looking at the possibility of migrating the data to 2 new nodes. The operation should happen by relocating the disks - I'm not getting any new hard-drives. The cluster is used as a backend for an openstack cloud, so downtime should be as short as possible - preferably no more than 24 h over a weekend. I'd like a second opinion on the process - since I do not have the resources to test the move scenario. I'm running emperor (0.72.1) at the moment. All pools in the cluster have size 2. Each existing OSD node has an SSD for journals; /dev/disk/by-id paths were used. Here's what I think would work: 1 - stop ceph on the existing OSD nodes (all of them) and shut down nodes 1 & 2; 2 - take drives 1-16/SSDs 1-2 out and put them in the new node #1; start it up with ceph's upstart script set to manual and check/correct the journal paths 3 - edit the CRUSH map on the monitors to reflect the new situation 4 - start ceph on the new node #1 and old nodes 3 & 4; wait for the rebuild to happen 5 - repeat steps 1-4 for the rest of the nodes/drives; Any opinions? Or a better path to follow? Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Move osd disks between hosts
Hello Sage, Yes, the original deployment was done via ceph-deploy - and I am very happy to read this :) Thank you! Dinu On May 14, 2014, at 4:17 PM, Sage Weil s...@inktank.com wrote: Hi Dinu, If you used ceph-deploy and/or ceph-disk to set up these OSDs (that is, if they are stored on labeled GPT partitions such that upstart is automagically starting up the ceph-osd daemons for you without you putting anything in /etc/fstab to manually mount the volumes) then all of this should be plug and play for you--including step #3. By default, the startup process will 'fix' the CRUSH hierarchy position based on the hostname and (if present) other positional data configured for 'crush location' in ceph.conf. The only real requirement is that both the osd data and journal volumes get moved so that the daemon has everything it needs to start up. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
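A minimal sketch of the per-node sequence, building on the note above about GPT/udev auto-activation; the noout flag keeps the cluster from starting recovery while the disks are physically in transit (upstart job names assume Ubuntu):

  # before powering anything down
  ceph osd set noout
  # on the node being emptied
  stop ceph-osd-all          # or: stop ceph-osd id=<n> for each OSD
  # ...move the data disks and their journal SSDs to the new chassis...
  # on the new node, udev/upstart should detect the labeled GPT partitions and
  # start the ceph-osd daemons, re-registering them under the new hostname in CRUSH
  # once recovery has finished
  ceph osd unset noout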
[ceph-users] rados federated gateway - selective replication
I'm trying to figure out a way to configure selective replication of objects between 2 geographically-separated ceph clusters, via the radosgw-agent. Ideally that should happen at the bucket level - but as far as I can figure that seems impossible (running ceph emperor, 0.72.1). Is there any way to achieve this (with the current ceph stable release)? Thanks! -- Dinu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Questions about federated rados gateway configuration
Hi all, I was going through the documentation (http://ceph.com/docs/master/radosgw/federated-config/), having in mind a (future) replicated swift object store between 2 geographically separated datacenters (and 2 different Ceph clusters) and a few things caught my attention. Considering I'm planning for 3 gateways in each datacenter: 1. Concerning the list of pools that need to be pre-created: ALL pools for both zones have to exist on both clusters? 2. When using keystone integration, can I have all gateways (from both zones) authenticate using the same keystone instance? 3. Concerning the Create a keyring section: is it necessary to have the same keyring file present on all nodes (monitors, osd, gateways) from both clusters? 4. What would be the process to add another gateway to a working federated environment? I'd appreciate any input on the matters above. Thanks, -- Dinu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ephemeral RBD with Havana and Dumpling
Thank you all for the info. Any chance this may make it into mainline? Thanks, Dinu On Nov 14, 2013, at 4:27 PM, Jens-Christian Fischer jens-christian.fisc...@switch.ch wrote: On Thu, Nov 14, 2013 at 9:12 PM, Jens-Christian Fischer jens-christian.fisc...@switch.ch wrote: We have migration working partially - it works through Horizon (to a random host) and sometimes through the CLI. random host? Do you mean cold-migration? Live-migration requires a specified destination host. I have just been digging through Horizon to find out what it does: it calls migrate (reading some docs: ah, this is indeed cold migration) We are using the nova fork by Josh Durgin https://github.com/jdurgin/nova/commits/havana-ephemeral-rbd - are there more patches that need to be integrated? I hope I can release or push commits to this branch - containing live-migration, the incorrect filesystem size fix and ceph-snapshot support - in a few days. great - looking very much forward to that! cheers jc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ephemeral RBD with Havana and Dumpling
Out of curiosity - can you live-migrate instances with this setup? On Nov 12, 2013, at 10:38 PM, Dmitry Borodaenko dborodae...@mirantis.com wrote: And to answer my own question, I was missing a meaningful error message: what the ObjectNotFound exception I got from librados didn't tell me was that I didn't have the images keyring file in /etc/ceph/ on my compute node. After 'ceph auth get-or-create client.images /etc/ceph/ceph.client.images.keyring' and reverting images caps back to original state, it all works! On Tue, Nov 12, 2013 at 12:19 PM, Dmitry Borodaenko dborodae...@mirantis.com wrote: I can get ephemeral storage for Nova to work with RBD backend, but I don't understand why it only works with the admin cephx user? With a different user starting a VM fails, even if I set its caps to 'allow *'. Here's what I have in nova.conf: libvirt_images_type=rbd libvirt_images_rbd_pool=images rbd_secret_uuid=fd9a11cc-6995-10d7-feb4-d338d73a4399 rbd_user=images The secret UUID is defined following the same steps as for Cinder and Glance: http://ceph.com/docs/master/rbd/libvirt/ BTW rbd_user option doesn't seem to be documented anywhere, is that a documentation bug? And here's what 'ceph auth list' tells me about my cephx users: client.admin key: AQCoSX1SmIo0AxAAnz3NffHCMZxyvpz65vgRDg== caps: [mds] allow caps: [mon] allow * caps: [osd] allow * client.images key: AQC1hYJS0LQhDhAAn51jxI2XhMaLDSmssKjK+g== caps: [mds] allow caps: [mon] allow * caps: [osd] allow * client.volumes key: AQALSn1ScKruMhAAeSETeatPLxTOVdMIt10uRg== caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images Setting rbd_user to images or volumes doesn't work. What am I missing? Thanks, -- Dmitry Borodaenko -- Dmitry Borodaenko ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
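For comparison, the usual setup for a dedicated cephx user looks roughly like this - the caps shown are illustrative, and the secret UUID must match rbd_secret_uuid in nova.conf:

  # create the client and drop its keyring on every compute node
  ceph auth get-or-create client.images \
      mon 'allow r' \
      osd 'allow class-read object_prefix rbd_children, allow rwx pool=images' \
      -o /etc/ceph/ceph.client.images.keyring
  # register the key with libvirt so qemu can authenticate as that user
  virsh secret-define --file secret.xml        # secret.xml carries the uuid above
  virsh secret-set-value --secret fd9a11cc-6995-10d7-feb4-d338d73a4399 \
      --base64 $(ceph auth get-key client.images)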
Re: [ceph-users] ceph 0.72 with zfs
Any chance this option will be included for future emperor binaries? I don't mind compiling software, but I would like to keep things upgradable via apt-get … Thanks, Dinu On Nov 7, 2013, at 4:05 AM, Sage Weil s...@inktank.com wrote: Hi Dinu, You currently need to compile yourself, and pass --with-zfs to ./configure. Once it is built in, ceph-osd will detect whether the underlying fs is zfs on its own. sage On Wed, 6 Nov 2013, Dinu Vlad wrote: Hello, I'm testing the 0.72 release and thought to give a spin to the zfs support. While I managed to setup a cluster on top of a number of zfs datasets, the ceph-osd logs show it's using the genericfilestorebackend: 2013-11-06 09:27:59.386392 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is NOT supported 2013-11-06 09:27:59.386409 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2013-11-06 09:27:59.391026 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) I noticed however that the ceph sources include some files related to zfs: # find . | grep -i zfs ./src/os/ZFS.cc ./src/os/ZFS.h ./src/os/ZFSFileStoreBackend.cc ./src/os/ZFSFileStoreBackend.h A coupel of questions: - is 0.72-rc1 package currently in the raring repository compiled with zfs support ? - if yes - how can I inform ceph-osd to use the ZFSFileStoreBackend ? Thanks, Dinu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
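A hedged sketch of the source build described above, for an Ubuntu host - the libzfs development package name depends on how ZFS-on-Linux was installed and is only an assumption here:

  apt-get install libzfs-dev            # assumed package name for the ZoL headers
  git clone --branch v0.72.2 https://github.com/ceph/ceph.git
  cd ceph
  ./autogen.sh
  ./configure --with-zfs
  make && make install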
Re: [ceph-users] ceph cluster performance
I had great results from the older 530 series too. In this case however, the SSDs were only used for journals and I don't know if ceph-osd sends TRIM to the drive in the process of journaling over a block device. They were also under-subscribed, with just 3 x 10G partitions out of 240 GB raw capacity. I did a manual trim, but it hasn't changed anything. I'm still having fun with the configuration so I'll be able to use Mike Dawson's suggested tools to check for latencies. On Nov 6, 2013, at 11:35 PM, ja...@peacon.co.uk wrote: On 2013-11-06 20:25, Mike Dawson wrote: We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. Many SSDs show this behaviour when 100% provisioned and/or never TRIM'd, since the pool of ready erased cells is quickly depleted under steady write workload, so it has to wait for cells to charge to accommodate the write. The Intel 3700 SSDs look to have some of the best consistency ratings of any of the more reasonably priced drives at the moment, and good IOPS too: http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html Obviously the quoted IOPS numbers are dependent on quite a deep queue mind. There is a big range of performance in the market currently; some Enterprise SSDs are quoted at just 4,000 IOPS yet cost as many pounds! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph 0.72 with zfs
Looking forward to it. Tests done so far show some interesting results - so I'm considering it for future production use. On Nov 7, 2013, at 1:01 PM, Sage Weil s...@newdream.net wrote: The challenge here is that libzfs is currently a build time dependency, which means it needs to be included in the target distro already, or we need to bundle it in the Ceph.com repos. I am currently looking at the possibility of making the OSD back end dynamically linked at runtime, which would allow a separately packaged zfs back end; that may (or may not!) help. sage Dinu Vlad dinuvla...@gmail.com wrote: Any chance this option will be included for future emperor binaries? I don't mind compiling software, but I would like to keep things upgradable via apt-get … Thanks, Dinu On Nov 7, 2013, at 4:05 AM, Sage Weil s...@inktank.com wrote: Hi Dinu, You currently need to compile yourself, and pass --with-zfs to ./configure. Once it is built in, ceph-osd will detect whether the underlying fs is zfs on its own. sage On Wed, 6 Nov 2013, Dinu Vlad wrote: Hello, I'm testing the 0.72 release and thought to give a spin to the zfs support. While I managed to setup a cluster on top of a number of zfs datasets, the ceph-osd logs show it's using the genericfilestorebackend: 2013-11-06 09:27:59.386392 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is NOT supported 2013-11-06 09:27:59.386409 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2013-11-06 09:27:59.391026 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) I noticed however that the ceph sources include some files related to zfs: # find . | grep -i zfs ./src/os/ZFS.cc ./src/os/ZFS.h ./src/os/ZFSFileStoreBackend.cc ./src/os/ZFSFileStoreBackend.h A coupel of questions: - is 0.72-rc1 package currently in the raring repository compiled with zfs support ? - if yes - how can I inform ceph-osd to use the ZFSFileStoreBackend ? Thanks, Dinu ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Havana RBD - a few problems
Under grizzly we completely disabled image injection via libvirt_inject_partition = -2 in nova.conf. I'm not sure rbd images can even be mounted that way - but then again, I don't have experience with havana. We're using config disks (which break live migrations) and/or the metadata service (which does not), in combination with cloud-init, to bootstrap instances. On Nov 7, 2013, at 6:15 PM, Jens-Christian Fischer jens-christian.fisc...@switch.ch wrote: Hi all, we have installed a Havana OpenStack cluster with RBD as the backing storage for volumes, images and the ephemeral images. The code as delivered in https://github.com/openstack/nova/blob/master/nova/virt/libvirt/imagebackend.py#L498 fails, because RBD.path is not set. I have patched this to read:

@@ -419,10 +419,12 @@ class Rbd(Image):
         if path:
             try:
                 self.rbd_name = path.split('/')[1]
+                self.path = path
             except IndexError:
                 raise exception.InvalidDevicePath(path=path)
         else:
             self.rbd_name = '%s_%s' % (instance['name'], disk_name)
+            self.path = 'volumes/%s' % self.rbd_name
         self.snapshot_name = snapshot_name
         if not CONF.libvirt_images_rbd_pool:
             raise RuntimeError(_('You should specify'

but am not sure this is correct. I have the following problems: 1) can't inject data into image: 2013-11-07 16:59:25.251 24891 INFO nova.virt.libvirt.driver [req-f813ef24-de7d-4a05-ad6f-558e27292495 c66a737acf0545fdb9a0a920df0794d9 2096e25f5e814882b5907bc5db342308] [instance: 2fa02e4f-f804-4679-9507-736eeebd9b8d] Injecting key into image fc8179d4-14f3-4f21-a76d-72b03b5c1862 2013-11-07 16:59:25.269 24891 WARNING nova.virt.disk.api [req-f813ef24-de7d-4a05-ad6f-558e27292495 c66a737acf0545fdb9a0a920df0794d9 2096e25f5e814882b5907bc5db342308] Ignoring error injecting data into image (Error mounting volumes/instance-0089_disk with libguestfs (volumes/instance-0089_disk: No such file or directory)) - possibly the self.path = … is wrong, but what are the correct values? 2) Creating a new instance from an ISO image fails completely - no bootable disk found, says the KVM console. Related? 3) When creating a new instance from an image (non-ISO images work), the disk is not resized to the size specified in the flavor (but left at the size of the original image). I would be really grateful if those people who have Grizzly/Havana running with an RBD backend could pipe in here… thanks Jens-Christian -- SWITCH Jens-Christian Fischer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland phone +41 44 268 15 15, direct +41 44 268 15 71 jens-christian.fisc...@switch.ch http://www.switch.ch http://www.switch.ch/socialmedia ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
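For reference, the grizzly-era injection settings mentioned above are plain nova.conf options on the compute nodes (the same three already appear in the Havana configs quoted elsewhere in this archive):

  # /etc/nova/nova.conf
  libvirt_inject_partition = -2     # never mount the image for file injection
  libvirt_inject_password = false
  libvirt_inject_key = false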
Re: [ceph-users] ceph cluster performance
I was under the same impression - using a small portion of the SSD via partitioning (in my case - 30 gigs out of 240) would have the same effect as activating the HPA explicitly. Am I wrong? On Nov 7, 2013, at 8:16 PM, ja...@peacon.co.uk wrote: On 2013-11-07 17:47, Gruher, Joseph R wrote: I wonder how effective trim would be on a Ceph journal area. If the journal empties and is then trimmed the next write cycle should be faster, but if the journal is active all the time the benefits would be lost almost immediately, as those cells are going to receive data again almost immediately and go back to an untrimmed state until the next trim occurs. If it's under-provisioned (so the device knows there are unused cells), the device would simply write to an empty cell and flag the old cell for erasing, so there should be no change. Latency would rise when sustained write rate exceeded the devices' ability to clear cells, so eventually the stock of ready cells would be depleted. FWIW, I think there is considerable mileage in the larger-consumer grade argument. Assuming drives will be half the price in a years time, so selecting devices that can last only a year is preferable to spending 3x the price on one that can survive three. That though opens the tin of worms that is SMART reporting and moving journals at some future point mind. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
I have 2 SSDs (same model, smaller capacity) for / connected on the mainboard. Their sync write performance is also poor - less than 600 iops, 4k blocks. On Nov 7, 2013, at 9:44 PM, Kyle Bader kyle.ba...@gmail.com wrote: ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. The problem might be SATA transport protocol overhead at the expander. Have you tried directly connecting the SSDs to SATA2/3 ports on the mainboard? -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
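A sketch of the kind of test that exposes this - small synchronous writes, which is what the OSD journal issues. Run it against a scratch partition only, since it writes directly to the device (the device name is a placeholder):

  fio --name=journal-sync --filename=/dev/sdX1 \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --direct=1 --sync=1 --runtime=60 --time_based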
Re: [ceph-users] ceph cluster performance
I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph = 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson mark.nel...@inktank.com wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can some times have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like collectl -sD -oT do the writes look balanced across the spinning disks? Do any devices have much really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; foo and then grep for duration in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. 
With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs): WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s Ok
[ceph-users] ceph 0.72 with zfs
Hello, I'm testing the 0.72 release and thought I'd give the zfs support a spin. While I managed to set up a cluster on top of a number of zfs datasets, the ceph-osd logs show it's using the genericfilestorebackend: 2013-11-06 09:27:59.386392 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is NOT supported 2013-11-06 09:27:59.386409 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2013-11-06 09:27:59.391026 7fdfee0ab7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) I noticed however that the ceph sources include some files related to zfs: # find . | grep -i zfs ./src/os/ZFS.cc ./src/os/ZFS.h ./src/os/ZFSFileStoreBackend.cc ./src/os/ZFSFileStoreBackend.h A couple of questions: - is the 0.72-rc1 package currently in the raring repository compiled with zfs support? - if yes, how can I tell ceph-osd to use the ZFSFileStoreBackend? Thanks, Dinu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. By fixed - you mean replaced the SSDs? Thanks, Dinu On Nov 6, 2013, at 10:25 PM, Mike Dawson mike.daw...@cloudapt.com wrote: We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. What SSDs were you using that were so slow? Cheers, Mike On 11/6/2013 12:39 PM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 5-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph = 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson mark.nel...@inktank.com wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can some times have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like collectl -sD -oT do the writes look balanced across the spinning disks? Do any devices have much really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; foo and then grep for duration in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. 
Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have
Re: [ceph-users] ceph cluster performance
Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs): WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s Ok, that looks like what I'd expect to see given the controller being used. SSDs are probably limited by total aggregate throughput. Random: Run status group 0 (all jobs): WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101) This is on just one of the osd servers. Where the ceph tests to one OSD server or across all servers? It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens. 
Thanks, Dinu On Oct 30, 2013, at 6:38 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 09:05 AM, Dinu Vlad wrote: Hello, I've been doing some tests on a newly installed ceph cluster: # ceph osd create bench1 2048 2048 # ceph osd create bench2 2048 2048 # rbd -p bench1 create test # rbd -p bench1 bench-write test --io-pattern rand elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 # rados -p bench2 bench 300 write --show-time # (run 1) Total writes made: 20665 Write size: 4194304 Bandwidth (MB/sec): 274.923 Stddev Bandwidth: 96.3316 Max bandwidth (MB/sec): 748 Min bandwidth (MB/sec): 0 Average Latency:0.23273 Stddev Latency: 0.262043 Max latency:1.69475 Min latency:0.057293 These results seem to be quite poor for the configuration: MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller. All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals. Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side. I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested
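Roughly what those tweaks look like; the wbthrottle option names and values below are placeholders and should be verified against 'ceph daemon osd.N config show' for the release in use:

  # switch the spinning disks to the deadline elevator (per disk)
  echo deadline > /sys/block/sdb/queue/scheduler

  # [osd] section of ceph.conf - filestore writeback throttle knobs (example values)
  filestore wbthrottle xfs bytes start flusher = 41943040
  filestore wbthrottle xfs ios start flusher = 500
  filestore wbthrottle xfs inodes start flusher = 500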
Re: [ceph-users] testing ceph
Is disk sda on server1 empty or does it contain already a partition? On Nov 4, 2013, at 5:25 PM, charles L charlesboy...@hotmail.com wrote: Pls can somebody help? Im getting this error. ceph@CephAdmin:~$ ceph-deploy osd create server1:sda:/dev/sdj1 [ceph_deploy.cli][INFO ] Invoked (1.3): /usr/bin/ceph-deploy osd create server1:sda:/dev/sdj1 [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks server1:/dev/sda:/dev/sdj1 [server1][DEBUG ] connected to host: server1 [server1][DEBUG ] detect platform information from remote host [server1][DEBUG ] detect machine type [ceph_deploy.osd][INFO ] Distro info: Ubuntu 12.04 precise [ceph_deploy.osd][DEBUG ] Deploying osd to server1 [server1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf [server1][INFO ] Running command: sudo udevadm trigger --subsystem-match=block --action=add [ceph_deploy.osd][DEBUG ] Preparing host server1 disk /dev/sda journal /dev/sdj1 activate True [server1][INFO ] Running command: sudo ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sda /dev/sdj1 [server1][ERROR ] WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data [server1][ERROR ] Could not create partition 1 from 34 to 2047 [server1][ERROR ] Error encountered; not saving changes. [server1][ERROR ] ceph-disk: Error: Command '['sgdisk', '--largest-new=1', '--change-name=1:ceph data', '--partition-guid=1:d3ca8a92-7ba5-412e-abf5-06af958b788d', '--typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be', '--', '/dev/sda']' returned non-zero exit status 4 [server1][ERROR ] Traceback (most recent call last): [server1][ERROR ] File /usr/lib/python2.7/dist-packages/ceph_deploy/lib/remoto/process.py, line 68, in run [server1][ERROR ] reporting(conn, result, timeout) [server1][ERROR ] File /usr/lib/python2.7/dist-packages/ceph_deploy/lib/remoto/log.py, line 13, in reporting [server1][ERROR ] received = result.receive(timeout) [server1][ERROR ] File /usr/lib/python2.7/dist-packages/ceph_deploy/lib/remoto/lib/execnet/gateway_base.py, line 455, in receive [server1][ERROR ] raise self._getremoteerror() or EOFError() [server1][ERROR ] RemoteError: Traceback (most recent call last): [server1][ERROR ] File string, line 806, in executetask [server1][ERROR ] File , line 35, in _remote_run [server1][ERROR ] RuntimeError: command returned non-zero exit status: 1 [server1][ERROR ] [server1][ERROR ] [ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sda /dev/sdj1 [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs Date: Thu, 31 Oct 2013 10:55:56 + From: joao.l...@inktank.com To: charlesboy...@hotmail.com; ceph-de...@vger.kernel.org Subject: Re: testing ceph On 10/31/2013 04:54 AM, charles L wrote: Hi, Pls is this a good setup for a production environment test of ceph? My focus is on the SSD ... should it be partitioned(sdf1,2 ,3,4) and shared by the four OSDs on a host? or is this a better configuration for the SSD to be just one partition(sdf1) while all osd uses that one partition? my setup: - 6 Servers with one 250gb boot disk for OS(sda), four-2Tb Disks each for the OSDs i.e Total disks = 6x4 = 24 disks (sdb -sde) and one-60GB SSD for Osd Journal(sdf). -RAM = 32GB on each server with 2 GB network link. hostname for servers: Server1 -Server6 Charles, What you are describing on the ceph.conf below is definitely not a good idea. 
If you really want to use just one SSD and share it across multiple OSDs, then you have two possible approaches: - partition that disk and assign a *different* partition to each OSD; or - keep only one partition, format it with some filesystem, and assign a *different* journal file within that fs to each OSD. What you are describing has you using the same partition for all OSDs. This will likely create issues due to multiple OSDs writing and reading from a single journal. TBH I'm not familiar enough with the journal mechanism to know whether the OSDs will detect that situation. -Joao [osd.0] host = server1 devs = /dev/sdb osd journal = /dev/sdf1 [osd.1] host = server1 devs = /dev/sdc osd journal = /dev/sdf2 [osd.3] host = server1 devs = /dev/sdd osd journal = /dev/sdf2 [osd.4] host = server1 devs = /dev/sde osd journal = /dev/sdf2 [osd.5] host = server2 devs = /dev/sdb osd journal = /dev/sdf2 ... [osd.23] host = server6 devs = /dev/sde osd journal = /dev/sdf2 Thanks. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
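The 'Could not create partition 1 from 34 to 2047' message usually means /dev/sda on server1 already carries partition or GPT data. A hedged sketch of clearing it before retrying - note that this wipes the disk completely:

  # either let ceph-deploy zap it from the admin node ...
  ceph-deploy disk zap server1:sda
  # ... or wipe the partition tables by hand on server1
  sgdisk --zap-all /dev/sda
  # then retry
  ceph-deploy osd create server1:sda:/dev/sdj1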
Re: [ceph-users] Openstack Instances and RBDs
I don't know of any guide besides the official install docs from grizzly/havana, but I'm running openstack grizzly on top of rbd storage using glance cinder and it makes (almost) no use of /var/lib/nova/instances. Live migrations also work. The only files there should be config.xml and console - otherwise, live-migrations won't work OR the path should be a mounted shared storage (NFS, GlusterFS etc). Nova-compute stores disk* files under that path in the following cases: - when one starts an instance only by using --image image id argument to nova-boot, without a pre-created cinder volume and without the --block-device-mapping argument - when one uses a config disk for bootstrapping instances - when one configures a swap disk in the flavor used to start the instance On Nov 2, 2013, at 2:32 AM, Gaylord Holder ghol...@cs.drexel.edu wrote: http://www.sebastien-han.fr/blog/2013/06/03/ceph-integration-in-openstack-grizzly-update-and-roadmap-for-havana/ suggests it is possible to run openstack instances (not only images) off of RBDs in grizzly and havana (which I'm running), and to use RBDs in lieu of a shared file system. I've followed http://ceph.com/docs/next/rbd/libvirt/ but I can only get boot-from-volume to work. Instances still are being housed in /var/lib/nova/instances, making live-migration a non-starter. Is there a better guide for running openstack instances out of RBDs, or is it just not ready yet? Thanks, -Gaylord ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
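A sketch of the boot-from-volume flow referred to above, with grizzly-era CLI syntax and placeholder ids (some builds also insist on a dummy --image argument):

  # create a bootable 20 GB cinder volume from a glance image (it lands in the RBD pool)
  cinder create --image-id <image-uuid> --display-name vm1-root 20
  # boot directly from that volume; only the libvirt config/console files end up
  # under /var/lib/nova/instances
  nova boot --flavor m1.small \
      --block-device-mapping vda=<volume-uuid>:::0 \
      vm1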
Re: [ceph-users] ceph cluster performance
Any other options or ideas? Thanks, Dinu On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs): WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s Ok, that looks like what I'd expect to see given the controller being used. SSDs are probably limited by total aggregate throughput. Random: Run status group 0 (all jobs): WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101) This is on just one of the osd servers. Where the ceph tests to one OSD server or across all servers? It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens. 
Thanks, Dinu On Oct 30, 2013, at 6:38 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 09:05 AM, Dinu Vlad wrote: Hello, I've been doing some tests on a newly installed ceph cluster: # ceph osd create bench1 2048 2048 # ceph osd create bench2 2048 2048 # rbd -p bench1 create test # rbd -p bench1 bench-write test --io-pattern rand elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 # rados -p bench2 bench 300 write --show-time # (run 1) Total writes made: 20665 Write size: 4194304 Bandwidth (MB/sec): 274.923 Stddev Bandwidth: 96.3316 Max bandwidth (MB/sec): 748 Min bandwidth (MB/sec): 0 Average Latency:0.23273 Stddev Latency: 0.262043 Max latency:1.69475 Min latency:0.057293 These results seem to be quite poor for the configuration: MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller. All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals. Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side. I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio on top of a filesystem on RBD. Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf pretty much out of the box (diff from default follows) osd_journal_size = 10240 osd mount options xfs = rw,noatime,nobarrier,inode64 osd mkfs options xfs = -f
Re: [ceph-users] ceph cluster performance
I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs): WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s Ok, that looks like what I'd expect to see given the controller being used. SSDs are probably limited by total aggregate throughput. Random: Run status group 0 (all jobs): WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101) This is on just one of the osd servers. Where the ceph tests to one OSD server or across all servers? It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens. 
Thanks, Dinu On Oct 30, 2013, at 6:38 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 09:05 AM, Dinu Vlad wrote: Hello, I've been doing some tests on a newly installed ceph cluster: # ceph osd create bench1 2048 2048 # ceph osd create bench2 2048 2048 # rbd -p bench1 create test # rbd -p bench1 bench-write test --io-pattern rand elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 # rados -p bench2 bench 300 write --show-time # (run 1) Total writes made: 20665 Write size: 4194304 Bandwidth (MB/sec): 274.923 Stddev Bandwidth: 96.3316 Max bandwidth (MB/sec): 748 Min bandwidth (MB/sec): 0 Average Latency:0.23273 Stddev Latency: 0.262043 Max latency:1.69475 Min latency:0.057293 These results seem to be quite poor for the configuration: MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller. All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals. Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side. I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio on top of a filesystem on RBD. Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf pretty much out of the box (diff from default follows) osd_journal_size = 10240 osd mount options xfs = rw,noatime,nobarrier,inode64 osd mkfs options xfs = -f -i size=2048 [osd] public network = 10.4.0.0/24 cluster network = 10.254.254.0/24 All tests were run from a server outside
[ceph-users] ceph cluster performance
Hello, I've been doing some tests on a newly installed ceph cluster: # ceph osd pool create bench1 2048 2048 # ceph osd pool create bench2 2048 2048 # rbd -p bench1 create test # rbd -p bench1 bench-write test --io-pattern rand elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 # rados -p bench2 bench 300 write --show-time # (run 1) Total writes made: 20665 Write size: 4194304 Bandwidth (MB/sec): 274.923 Stddev Bandwidth: 96.3316 Max bandwidth (MB/sec): 748 Min bandwidth (MB/sec): 0 Average Latency:0.23273 Stddev Latency: 0.262043 Max latency:1.69475 Min latency:0.057293 These results seem to be quite poor for the configuration: MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller. All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 x 10 GB partitions for journals. Using Ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf pretty much out of the box (diff from default follows) osd_journal_size = 10240 osd mount options xfs = rw,noatime,nobarrier,inode64 osd mkfs options xfs = -f -i size=2048 [osd] public network = 10.4.0.0/24 cluster network = 10.254.254.0/24 All tests were run from a server outside the cluster, connected to the storage network with 2x 10 GE nics. I've done a few other tests of the individual components: - network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000) - md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput - fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS I'd appreciate any suggestion that might help improve the performance or identify a bottleneck. Thanks Dinu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
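A sketch of running several rados bench writers in parallel against the same pool, which is how the multi-process numbers elsewhere in this thread were obtained (each instance uses its own object prefix, so they don't collide):

  for i in $(seq 1 8); do
      rados -p bench2 bench 300 write -t 16 --no-cleanup &
  done
  wait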
Re: [ceph-users] ceph cluster performance
Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs): WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s Random: Run status group 0 (all jobs): WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101) This is on just one of the osd servers. Thanks, Dinu On Oct 30, 2013, at 6:38 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 09:05 AM, Dinu Vlad wrote: Hello, I've been doing some tests on a newly installed ceph cluster: # ceph osd create bench1 2048 2048 # ceph osd create bench2 2048 2048 # rbd -p bench1 create test # rbd -p bench1 bench-write test --io-pattern rand elapsed: 483 ops: 396579 ops/sec: 820.23 bytes/sec: 2220781.36 # rados -p bench2 bench 300 write --show-time # (run 1) Total writes made: 20665 Write size: 4194304 Bandwidth (MB/sec): 274.923 Stddev Bandwidth: 96.3316 Max bandwidth (MB/sec): 748 Min bandwidth (MB/sec): 0 Average Latency:0.23273 Stddev Latency: 0.262043 Max latency:1.69475 Min latency:0.057293 These results seem to be quite poor for the configuration: MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller. All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals. Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side. I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio on top of a filesystem on RBD. Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf pretty much out of the box (diff from default follows) osd_journal_size = 10240 osd mount options xfs = rw,noatime,nobarrier,inode64 osd mkfs options xfs = -f -i size=2048 [osd] public network = 10.4.0.0/24 cluster network = 10.254.254.0/24 All tests were run from a server outside the cluster, connected to the storage network with 2x 10 GE nics. I've done a few other tests of the individual components: - network: avg. 
7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000) - md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput - fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS What you might want to try doing is 4M direct IO writes using libaio and a high iodepth to all drives (spinning disks and SSDs) concurrently and see how both the per-drive and aggregate throughput is. With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is something interesting about the way the hardware is setup on your system. I'd appreciate any suggestion that might help improve the performance or identify a bottleneck. Thanks Dinu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com