Re: [ceph-users] requests are blocked - problem
Hi,

On 08/19/2015 11:01 AM, Christian Balzer wrote:
> Hello,
>
> That's a pretty small cluster all things considered, so your rather
> intensive test setup is likely to run into any or all of the following
> issues:
>
> 1) The amount of data you're moving around is going to cause a lot of
> promotions from and to the cache tier. This is expensive and slow.
> 2) EC coded pools are slow. You may actually have better results with a
> Ceph classic approach, 2-4 HDDs per journal SSD. Also, 6TB HDDs
> combined with EC may look nice to you from a cost/density perspective,
> but more HDDs means more IOPS and thus speed.
> 3) Scrubbing (unless configured very aggressively down) will impact
> your performance on top of the items above.
> 4) You already noted the kernel versus userland bit.
> 5) Having all your storage in a single JBOD chassis strikes me as
> ill-advised, though I don't think it's an actual bottleneck at 4x12Gb/s.

We use two of these (I forgot to mention that). Each chassis has two internal controllers, both exposing all the disks to the connected hosts. There are two OSD nodes connected to each chassis.

> When you ran the fio tests I assume nothing else was going on and the
> dataset size would have fit easily into the cache pool, right?
> Look at your nodes with atop or iostat, I venture all your HDDs are at
> 100%.
>
> Christian

Yes, the problem was a full cache pool. I'm currently wondering how to tune the cache pool parameters so that the whole cluster doesn't slow down that much when the cache is full... I'm thinking of doing some tests on a pool w/o the cache tier so I can compare the results. Any suggestions would be greatly appreciated.

J

--
Jacek Jarosiewicz
Systems Administrator, SUPERMEDIA Sp. z o.o. - http://www.supermedia.pl
Re: [ceph-users] requests are blocked - problem
On 08/19/2015 10:58 AM, Nick Fisk wrote:
> I would suggest running the fio tests again, just to make sure that
> there isn't a problem with your newer config, but I suspect you will
> see equally bad performance with the fio tests now that the cache tier
> has begun to be more populated.

Ok, I did the tests and you're right - the full cache was the problem. After flushing the cache and running fio, the results are again good (fast).

Is there a way to tune the cache parameters so that the whole cluster doesn't slow down that much and doesn't block requests? We use the defaults for the cache pool from the documentation:

hit_set_period 3600
cache_min_flush_age 600
cache_min_evict_age 1800

J

--
Jacek Jarosiewicz
Systems Administrator, SUPERMEDIA Sp. z o.o. - http://www.supermedia.pl
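These settings are applied per pool. A minimal sketch of the knobs that control when flushing/eviction kicks in, assuming a cache pool named "cachepool" (name and values are illustrative, not a recommendation):

ceph osd pool set cachepool target_max_bytes 200000000000   # the ratios below are relative to this, so it must be set
ceph osd pool set cachepool cache_target_dirty_ratio 0.4    # start flushing dirty objects at 40% of target_max_bytes
ceph osd pool set cachepool cache_target_full_ratio 0.8     # start evicting clean objects at 80%
ceph osd pool set cachepool cache_min_flush_age 600
ceph osd pool set cachepool cache_min_evict_age 1800

Without target_max_bytes (or target_max_objects) the dirty/full ratios have nothing to work against, and the tier fills until the OSDs themselves are full.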
[ceph-users] Latency impact on RBD performance
Hi,

We are currently using 2 OSD hosts with SSDs to provide RBD-backed volumes for KVM hypervisors. This 'cluster' is currently set up in 'Location A'.

We are looking to move our hypervisors/VMs over to a new location, and will have a 1Gbit link between the two datacenters. We can run Layer 2 over the link, and it should have ~10ms of latency. Call the new datacenter 'Location B'.

One proposed solution for the migration is to set up new RBD hosts in the new location, set up a new pool, and move the VM volumes to it. The potential issue with this solution is that we can end up in a scenario where the VM is running on a hypervisor in 'Location A', but writing/reading to a volume in 'Location B'.

My question is: what kind of performance impact should we expect when reading/writing over a link with ~10ms of latency? Will it bring I/O intensive operations (like databases) to a halt, or will it be 'tolerable' for a short period (a few days)? Most of the VMs are running database-backed e-commerce sites.

My expectation is that 10ms for every I/O operation will cause a significant impact, but we wanted to verify that before ruling it out as a solution. We will also be doing some internal testing of course.

I appreciate any feedback the community has.

- Logan
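One way to put a rough number on this before migrating: run a queue-depth-1 synchronous write test from a test VM against a volume mounted over the link, and compare with a local volume. A hedged fio sketch (file path, size and runtime are illustrative):

fio --name=latency-test --filename=/mnt/test/fio.dat --size=1G \
    --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k \
    --iodepth=1 --runtime=60 --time_based
# compare completion latency ('clat') and IOPS between locations;
# every extra 10ms of RTT lands on each synchronous write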
Re: [ceph-users] ceph cluster_network with linklocal ipv6
On 08/18/2015 03:39 PM, Björn Lässig wrote:
> For not having any dependencies in my cluster network, i want to use
> only ipv6 link-local addresses on interface 'cephnet'.
>
> cluster_network = fe80::%cephnet/64

RFC 4007, section 11.7:

   The IPv6 addressing architecture [1] also defines the syntax of IPv6
   prefixes. If the address portion of a prefix is non-global and its
   scope zone should be disambiguated, the address portion SHOULD be in
   the <address>%<zone_id> format. For example, a link-local prefix
   fe80::/64 on the second link can be represented as follows:

      fe80::%2/64

   In this combination, it is important to place the zone index portion
   before the prefix length when we consider parsing the format by a
   name-to-address library function [11]. That is, we can first separate
   the address with the zone index from the prefix length, and just pass
   the former to the library function.

So my original format would be the correct one.

I just looked into the code... and this is exactly what happens in src/common/ipaddr.cc:

bool parse_network(const char *s, ...) {

It looks for a '/' and tries to cast everything after it to a long int. Everything before the '/' is thrown into inet_pton():

ok = inet_pton(AF_INET6, addr, &((struct sockaddr_in6*)network)->sin6_addr);

and inet_pton() from glibc does not look at '%'. There is no chance this code could work with an interface name before or after the '/'.

I'm using addresses with global scope now.

Björn
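The distinction is easy to reproduce outside of Ceph, assuming python3 is installed and an interface named eth0 exists (getaddrinfo() resolves the zone ID to a scope id; inet_pton() never does):

python3 -c "import socket; print(socket.getaddrinfo('fe80::1%eth0', None, socket.AF_INET6))"
# -> resolves; the returned sockaddr tuple carries the scope id
python3 -c "import socket; socket.inet_pton(socket.AF_INET6, 'fe80::1%eth0')"
# -> OSError: illegal IP address string passed to inet_pton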
Re: [ceph-users] requests are blocked - problem
-----Original Message-----
From: Jacek Jarosiewicz
Sent: 19 August 2015 14:28
Subject: Re: [ceph-users] requests are blocked - problem

> On 08/19/2015 10:58 AM, Nick Fisk wrote:
>> I would suggest running the fio tests again, just to make sure that
>> there isn't a problem with your newer config, but I suspect you will
>> see equally bad performance with the fio tests now that the cache
>> tier has begun to be more populated.
>
> Ok, I did the tests and you're right - the full cache was the problem.
> After flushing the cache and running fio, the results are again good
> (fast). Is there a way to tune the cache parameters so that the whole
> cluster doesn't slow down that much and doesn't block requests? We use
> the defaults for the cache pool from the documentation:
>
> hit_set_period 3600
> cache_min_flush_age 600
> cache_min_evict_age 1800

Although you may get some benefit from tweaking parameters, I suspect you are near the performance ceiling of the current implementation of the tiering code. Could you post all the variables you set for the tiering, including target_max_bytes and the dirty/full ratios?

Since you are doing maildirs, which will have lots of small files, you might also want to try making the object size of the RBD smaller. This will mean less data needs to be shifted on each promotion/flush.
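For reference, the RBD object size is fixed at image creation time via the order (object size = 2^order bytes; the default order of 22 gives 4 MB objects). A hedged sketch, with an illustrative image name and sizes:

rbd create --size 102400 --order 20 rbd/mailstore   # 100 GB image with 1 MB objects instead of 4 MB
rbd info rbd/mailstore                              # the printed 'order 20 (1024 kB objects)' confirms it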
Re: [ceph-users] Latency impact on RBD performance
I would suspect that you will notice a significant slowdown. Don't forget that's an extra 10ms on top of however long each IO already takes. Also, when the cluster does any sort of recovery it will likely get much worse.

-----Original Message-----
From: Logan Barfield
Sent: 19 August 2015 15:20
Subject: [ceph-users] Latency impact on RBD performance

> [original question quoted in full - snipped, see above]
Re: [ceph-users] Latency impact on RBD performance
This simply depends on what your workload is. I know this is a non-answer for you, but that's how it is.

Databases are the worst, because they tend to hit the disks with every transaction, and the transaction throughput is in direct proportion to the number of IOPS you can get. And the number of IOPS you can get (in this scenario) is basically:

   iops = 1000 / latency_in_ms

There's not much parallelism when committing. So if your storage has a 2ms latency now, it can achieve at most 500 IOPS (single thread / queue depth 1, synchronous) - and that can in theory equal 500 durable transactions per second in a database.

But in practice:

a) mysql/innodb has an option to flush not every transaction but every Xth transaction. With X=10 it's like having a 5000 IOPS disk, more or less.

b) there is filesystem overhead - if you are appending to the transaction log you not only have to flush the data, but also filesystem metadata, and that could be many more IOPS before you even get to the transaction - I've seen a factor of 10(!) with no preallocation and XFS. That's terrible. But not every database does this; for example mysql creates and preallocates the ib_logfileX files, so that should not be an issue if you run mysql.

c) there can be other IOs blocking the submission; you have a limited queue depth of in-flight IOs, so it can clog up even if you turn the synchronous IO into asynchronous, and you often have to flush even the asynchronous writes.

d) we must not forget about reads - if there's a webserver connecting to the database then it will need more memory because all requests will take longer (and they also often consume CPU even when waiting), and if there are multiple requests or subrequests it cascades and goes up fast.

---

Take a look in iostat at the drive utilization and latency of your database's disk. Then calculate:

   resulting_disk_utilization = (latency + 10) / latency * utilization

Taking the figures from the previous example, let's say you have a 2ms latency and the drive is 20% utilized:

   (2 + 10) / 2 * 20 = 120%

My guess is that it's going to be horrible unless you turn everything into async IO, enable cache=unsafe in qemu and pray really hard :/

Jan

> On 19 Aug 2015, at 16:20, Logan Barfield lbarfi...@tqhosting.com wrote:
>
> [original question quoted in full - snipped, see above]
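A hedged way to do Jan's arithmetic from live numbers (device name is illustrative; 'await' is per-IO latency in ms, '%util' is utilization in percent):

iostat -x 1 5 /dev/sda    # note the 'await' and '%util' columns for the DB disk
awk 'BEGIN { lat=2; util=20; printf "projected util: %.0f%%\n", (lat+10)/lat*util }'
# prints 120% for the worked example above: (2+10)/2*20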
Re: [ceph-users] ceph distributed osd
By default, all pools will use all OSDs. Each RBD, for instance, is broken up into 4 MB objects, and those objects are somewhat uniformly distributed between the OSDs. When you add another OSD, the CRUSH map is recalculated and the OSDs shuffle the objects to their new locations, again distributing them somewhat uniformly across all available OSDs. I say "somewhat uniformly" because placement is based on hashing the object name; size is not taken into account, so you may end up with more large objects on some OSDs than on others. The number of PGs affects how uniformly the data can be distributed (more hash buckets for data to land in).

You can create CRUSH rules that limit the selection of OSDs to a subset and then configure a pool to use those rules (see the sketch after this message). This is a pretty advanced configuration option.

I hope that helps with your question.

Robert LeBlanc

On Tue, Aug 18, 2015 at 8:26 AM, gjprabu gjpr...@zohocorp.com wrote:
> Hi Luis,
>
> What I mean is: we have three OSDs, each on a 1TB hard disk, and two
> pools (poolA and poolB) with replica 2. The writing behavior is what
> confuses us. Our assumption is:
>
> PoolA -- may write to OSD1 and OSD2 (is this correct?)
> PoolB -- may write to OSD3 and OSD1 (is this correct?)
>
> Suppose the hard disks get full - how many OSDs need to be added, and
> what will the writing behavior to the new OSDs be? After adding a few
> OSDs:
>
> PoolA -- may write to OSD4 and OSD5 (is this correct?)
> PoolB -- may write to OSD5 and OSD6 (is this correct?)
>
> Regards
> Prabu
>
> On Mon, 17 Aug 2015 19:41:53 +0530 Luis Periquito periqu...@gmail.com wrote:
>> I don't understand your question. You created a 1G RBD/disk and it's
>> full. You are able to grow it though - but that's a Linux management
>> issue, not ceph. As everything is thin-provisioned you can create an
>> RBD with an arbitrary size - I've created one with 1PB when the
>> cluster only had 600G raw available.
>
> On Mon, Aug 17, 2015 at 1:18 PM, gjprabu wrote:
>> Hi All,
>> Anybody can help on this issue?
>> Regards
>> Prabu
>
> On Mon, 17 Aug 2015 12:08:28 +0530 gjprabu wrote:
>> Hi All,
>>
>> Also please find the osd information:
>>
>> ceph osd dump | grep 'replicated size'
>> pool 2 'repo' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 126 pgp_num 126 last_change 21573 flags hashpspool stripe_width 0
>
> On Mon, 17 Aug 2015 11:58:55 +0530 gjprabu wrote:
>> Hi All,
>>
>> We need to test three OSDs and one image with replica 2 (size 1GB).
>> While testing, data is not written above 1GB. Is there any option to
>> write on the third OSD?
>> ceph osd pool get repo pg_num
>> pg_num: 126
>>
>> # rbd showmapped
>> id pool image          snap device
>> 0  rbd  integdownloads -    /dev/rbd0   -- already existing
>> 2  repo integrepotest  -    /dev/rbd2   -- newly created
>>
>> [root@hm2 repository]# df -Th
>> Filesystem           Type       Size  Used Avail Use% Mounted on
>> /dev/sda5            ext4       289G   18G  257G   7% /
>> devtmpfs             devtmpfs   252G     0  252G   0% /dev
>> tmpfs                tmpfs      252G     0  252G   0% /dev/shm
>> tmpfs                tmpfs      252G  538M  252G   1% /run
>> tmpfs                tmpfs      252G     0  252G   0% /sys/fs/cgroup
>> /dev/sda2            ext4       488M  212M  241M  47% /boot
>> /dev/sda4            ext4       1.9T   20G  1.8T   2% /var
>> /dev/mapper/vg0-zoho ext4       8.6T  1.7T  6.5T  21% /zoho
>> /dev/rbd0            ocfs2      977G  101G  877G  11% /zoho/build/downloads
>> /dev/rbd2            ocfs2     1000M 1000M     0 100% /zoho/build/repository
>>
>> @:~$ scp -r sample.txt root@integ-hm2:/zoho/build/repository/
>> root@integ-hm2's password:
>> sample.txt                          100% 1024MB 4.5MB/s 03:48
>> scp: /zoho/build/repository//sample.txt: No space left on device
>>
>> Regards
>> Prabu
>>
>> On Thu, 13 Aug 2015 19:42:11 +0530 gjprabu wrote:
>>> Dear Team,
>>> We are using two
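On Robert's point about CRUSH rules restricting a pool to a subset of OSDs, a minimal sketch, assuming a CRUSH bucket named "ssd-root" already groups the wanted hosts (names and the ruleset id are illustrative):

ceph osd crush rule create-simple ssd-only ssd-root host   # rule that picks OSDs under ssd-root, one replica per host
ceph osd crush rule dump                                   # look up the new rule's ruleset id
ceph osd pool set poolA crush_ruleset 1                    # point the pool at that ruleset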
Re: [ceph-users] Bad performances in recovery
Hi,

Thank you for the quick reply. However, we do have those exact settings for recovery and it still strongly affects client IO. I have looked at various ceph logs and OSD logs and nothing is out of the ordinary.

Here's an idea though, please tell me if I am wrong. We use Intel SSDs for journaling and Samsung SSDs as the OSDs proper. As was explained several times on this mailing list, Samsung SSDs suck in ceph. They have horrible O_DSYNC speed and die easily when used as journals. That's why we're using Intel SSDs for journaling, so that we didn't end up putting 96 Samsung SSDs in the trash.

In recovery though, what is the ceph behaviour? What kind of writes does it do on the OSD SSDs? Does it write directly to the SSDs or through the journal?

Additionally, something else we noticed: the ceph cluster is MUCH slower after recovery than before. Clearly there is a bottleneck somewhere, and that bottleneck does not get cleared up after the recovery is done.

On 2015-08-19 3:32 PM, Somnath Roy wrote:
> If you are concerned about *client io performance* during recovery,
> use these settings:
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> If you are concerned about *recovery performance*, you may want to
> bump these up, but I doubt it will help much from the default settings.
>
> Thanks & Regards
> Somnath
>
> [original "Bad performances in recovery" post quoted in full - snipped]

--
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
jpmet...@gtcomm.net
http://www.gtcomm.net
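On the Samsung O_DSYNC point: the usual check for whether an SSD can sustain journal-style writes is a queue-depth-1 synchronous write test straight at the device. Note this destroys data on the device, and the device name is illustrative:

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based
# journal-grade SSDs sustain tens of thousands of IOPS here;
# unsuitable ones collapse to a few hundred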
Re: [ceph-users] Bad performances in recovery
All the writes will go through the journal. It may be that your SSDs are not preconditioned well, and after a lot of writes during recovery the IOs stabilize at a lower number. This is quite common for SSDs, if that is the case.

Thanks & Regards
Somnath

-----Original Message-----
From: J-P Methot
Sent: Wednesday, August 19, 2015 1:03 PM
Subject: Re: [ceph-users] Bad performances in recovery

> [J-P's message and the rest of the thread quoted in full - snipped, see above]
Re: [ceph-users] Latency impact on RBD performance
You didn't specify your database, but if you are using mysql you can use:

[mysqld]
# Change disk flush to every second instead of after each transaction.
innodb_flush_log_at_trx_commit=2

to flush the logs once a second instead of after every transaction. It really depends on whether you can afford to lose any transactions. This helped on a machine that had some high disk waits and where I felt comfortable enough losing two seconds of transactions. In my situation it took an almost-30-minute query down to ~30 seconds. Now that the disk issue is resolved, I don't use that option any more.

Robert LeBlanc

On Wed, Aug 19, 2015 at 8:20 AM, Logan Barfield lbarfi...@tqhosting.com wrote:
> [original question quoted in full - snipped, see above]
[ceph-users] Bad performances in recovery
Hi,

Our setup is currently comprised of 5 OSD nodes with 12 OSDs each, for a total of 60 OSDs. All of these are SSDs, with 4 SSD journals on each node. The ceph version is hammer v0.94.1. There is a performance overhead because we're using SSDs (I've heard it gets better in infernalis, but we're not upgrading just yet), but we can reach numbers that I would consider alright.

Now, the issue is, when the cluster goes into recovery it's very fast at first, but then slows down to ridiculous levels as it moves forward. You can go from 7% to 2% left to recover in ten minutes, but it may take 2 hours to recover the last 2%. While this happens, the attached openstack setup becomes incredibly slow, even though there is only a small fraction of objects still recovering (less than 1%). The settings that may affect recovery speed are very low, as they are by default, yet they still affect client IO speed way more than they should.

Why would ceph recovery become so slow as it progresses and affect client IO even though it's recovering at a snail's pace? And by a snail's pace, I mean a few kB/second on 10Gbps uplinks.

--
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
jpmet...@gtcomm.net
http://www.gtcomm.net
Re: [ceph-users] Bad performances in recovery
If you are concerned about *client io performance* during recovery, use these settings:

osd recovery max active = 1
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1

If you are concerned about *recovery performance*, you may want to bump these up, but I doubt it will help much from the default settings.

Thanks & Regards
Somnath

-----Original Message-----
From: J-P Methot
Sent: Wednesday, August 19, 2015 12:17 PM
Subject: [ceph-users] Bad performances in recovery

> [original post quoted in full - snipped, see above]
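These can also be changed on a running cluster without restarting anything; a sketch using injectargs (same values as above):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# takes effect immediately; add the same values to the [osd] section
# of ceph.conf to persist across restarts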
Re: [ceph-users] any recommendation of using EnhanceIO?
> Probably the big question is what are the pain points? The most common
> answer we get when asking folks what applications they run on top of
> Ceph is "everything!". This is wonderful, but not helpful when trying
> to figure out what performance issues matter most! :) I.e., should we
> be focusing on IOPS? Latency? Finding a way to avoid journal overhead
> for large writes? Are there specific use cases where we should
> specifically be focusing attention? General iSCSI? S3? Databases
> directly on RBD? etc. There's tons of different areas that we can work
> on (general OSD threading improvements, different messenger
> implementations, newstore, client side bottlenecks, etc), but all of
> those things tackle different kinds of problems.

We run everything, or it sure seems like it. I did some computation on a large sampling of our servers and found that the average request size was ~12K/~18K (read/write) with a roughly 30%/70% read/write mix (it looks like I didn't save that spreadsheet, so I can't get exact numbers). So any optimization for smaller I/O sizes would really benefit us.

Robert LeBlanc
Re: [ceph-users] Bad performances in recovery
Also, check whether scrubbing has started in the cluster or not. That may slow the cluster down considerably.

-----Original Message-----
From: Somnath Roy
Sent: Wednesday, August 19, 2015 1:35 PM
Subject: RE: [ceph-users] Bad performances in recovery

> All the writes will go through the journal. It may be that your SSDs
> are not preconditioned well, and after a lot of writes during recovery
> the IOs stabilize at a lower number. This is quite common for SSDs, if
> that is the case.
>
> [rest of the thread quoted in full - snipped, see above]
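To quickly rule scrubbing in or out as the cause, the cluster-wide flags can be set temporarily (remember to unset them again):

ceph -s                      # active scrubs show up in the PG states
ceph osd set noscrub         # pause scheduled scrubs
ceph osd set nodeep-scrub    # pause deep scrubs as well
# if client IO recovers, tune the osd_scrub_* options instead of leaving
# these set; clear with 'ceph osd unset noscrub' / 'ceph osd unset nodeep-scrub'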
[ceph-users] Ceph OSD nodes in XenServer VMs
Hi all,

We are experimenting with the idea of running OSD nodes in XenServer VMs. We believe this could provide better flexibility, backups for the nodes, etc.

For example: a XenServer host with 4 HDDs dedicated to Ceph. We would introduce 1 VM (one OSD node) with raw/direct access to the 4 HDDs, or 2 VMs (two OSD nodes) with 2 HDDs each.

Do you have any experience with this? Any thoughts? Good or bad idea?

Thank you

Jiri
Re: [ceph-users] any recommendation of using EnhanceIO?
On Wed, 19 Aug 2015 10:02:25 +0100 Nick Fisk wrote:
>> [from Christian Balzer, 19 August 2015 03:32:]
>> [mega snip]
>>> 4. Disk based OSD with SSD Journal performance
>>> As I touched on above earlier, I would expect a disk based OSD with
>>> SSD journal to have similar performance to a pure SSD OSD when
>>> dealing with sequential small IO's. Currently the levelDB sync and
>>> potentially other things slow this down.
>>
>> Has anybody tried symlinking the omap directory to an SSD and tested
>> whether that makes a (significant) difference?
>
> I thought I remembered reading somewhere that all these items need to
> remain on the OSD itself, so that when the OSD calls fsync it can be
> sure they are all in sync at the same time. Would be nice to have this
> confirmed by the devs.

It being leveldb, you'd think it would be in sync by default. But even if it were potentially unsafe (not crash safe) in the current incarnation, the results of such a test might make any needed changes attractive. Unfortunately I don't have anything resembling an SSD in my test cluster.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
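For anyone wanting to try the experiment despite the open safety question, a sketch of the mechanics on a test cluster only (OSD id and SSD mountpoint are illustrative; with FileStore the leveldb omap lives under current/omap, and crash safety is exactly what is in doubt here):

service ceph stop osd.3
mv /var/lib/ceph/osd/ceph-3/current/omap /mnt/ssd/osd.3-omap
ln -s /mnt/ssd/osd.3-omap /var/lib/ceph/osd/ceph-3/current/omap
service ceph start osd.3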
Re: [ceph-users] requests are blocked - problem
Hello,

On Wed, 19 Aug 2015 15:27:29 +0200 Jacek Jarosiewicz wrote:
> [snip]
> We use two of these (I forgot to mention that). Each chassis has two
> internal controllers - both exposing all the disks to the connected
> hosts. There are two OSD nodes connected to each chassis.

Ah, so you have the dual controller version.

> Yes, the problem was a full cache pool. I'm currently wondering how to
> tune the cache pool parameters so that the whole cluster doesn't slow
> down that much when the cache is full...

Nick already gave you some advice on this; however, with the current versions of Ceph, cache tiering is simply expensive and slow.

> I'm thinking of doing some tests on a pool w/o the cache tier so I can
> compare the results. Any suggestions would be greatly appreciated..

For a realistic comparison with your current setup, a total rebuild would be in order, provided your cluster is for testing only at this point. Given your current HW, that means the same 2-3 HDDs per storage node and 1 SSD as journal. What exact make/model are your SSDs?

Again, more HDDs means more (sustainable) IOPS, so unless your space requirements (data and physical) are very demanding, double the number of 3TB HDDs would be noticeably better.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] ceph-osd suddenly dies and no longer can be started
I kind of fixed it by creating a new journal in a file instead of a separate partition, which probably caused some data loss, but at least allowed the OSD to start and join the cluster. Backfilling is now in progress. The old journal is still there on the separate device, if it can help in the investigation.

But this cycle (malloc failure -> ENOMEM/OOM killer -> corrupted journal -> trying to recover -> ENOMEM/OOM killer -> ...) looks like a bug.

2015-08-19 0:13 GMT+03:00 Евгений Д. ineu.m...@gmail.com:
> Hello.
>
> I have a small Ceph cluster running 9 OSDs, using XFS on separate disks
> and dedicated partitions on the system disk for journals. After
> creation it worked fine for a while. Then suddenly one of the OSDs
> stopped and didn't start, so I had to recreate it. Recovery started.
> After a few days of recovery, an OSD on another machine also stopped.
> I try to start it; it runs for a few minutes and dies - it looks like
> it is not able to recover the journal. According to strace, it tries
> to allocate too much memory and stops with ENOMEM. Sometimes it is
> killed by the kernel's OOM killer. I tried flushing the journal
> manually with `ceph-osd -i 3 --flush-journal`, but that didn't work
> either. The error log is as follows:
>
> [root@assets-2 ~]# ceph-osd -i 3 --flush-journal
> SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 2015-08-18 23:00:37.956714 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find 225eff8c/default.4323.18_22783306dc51892b40b49e3e26f79baf_55c38b33172600566c46_s.jpeg/head//8 in index: (2) No such file or directory
> 2015-08-18 23:00:37.956741 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find 235eff8c/default.4323.16_3018ff7c6066bddc0c867b293724d7b1_dolar7_106_m.jpg/head//8 in index: (2) No such file or directory
> [skipped]
> 2015-08-18 23:00:37.958424 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find c//head//8 in index: (2) No such file or directory
> tcmalloc: large alloc 1073741824 bytes == 0x66b1 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
> tcmalloc: large alloc 2147483648 bytes == 0xbf49 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
> tcmalloc: large alloc 4294967296 bytes == 0x16e32 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
> tcmalloc: large alloc 8589934592 bytes == (nil) @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  std::bad_alloc
> *** Caught signal (Aborted) **
>  in thread 7ff102040880
>  ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>  1: ceph-osd() [0xac5642]
>  2: (()+0xf130) [0x7ff1009d4130]
>  3: (gsignal()+0x37) [0x7ff0ff3ee5d7]
>  4: (abort()+0x148) [0x7ff0ff3efcc8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7ff0ffcf29b5]
>  6: (()+0x5e926) [0x7ff0ffcf0926]
>  7: (()+0x5e953) [0x7ff0ffcf0953]
>  8: (()+0x5eb73) [0x7ff0ffcf0b73]
>  9: (()+0x15d3e) [0x7ff10115ad3e]
>  10: (tc_new()+0x1e0) [0x7ff10117ade0]
>  11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7ff0ffd4fc29]
>  12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7ff0ffd5086b]
>  13: (std::string::reserve(unsigned long)+0x44) [0x7ff0ffd50914]
>  14: (std::string::append(char const*, unsigned long)+0x4f) [0x7ff0ffd50b7f]
>  15: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string const&)+0xdf) [0x968a0f]
>  16: (DBObjectMap::clear_header(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xd3) [0xa572b3]
>  17: (DBObjectMap::_clear(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xa1) [0xa5c6b1]
>  18: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x202) [0xa5f762]
>  19: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x16a) [0x9018ba]
>  20: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x9e) [0x90238e]
>  21: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x252c) [0x911b2c]
>  22: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64)
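For reference, "creating a new journal in a file" roughly amounts to the following, accepting the loss of whatever was still in the old journal (paths are illustrative):

service ceph stop osd.3
# in ceph.conf, point the journal for osd.3 at a file, e.g.:
#   osd journal = /var/lib/ceph/osd/ceph-3/journal
ceph-osd -i 3 --mkjournal    # writes a fresh, empty journal at the new location
service ceph start osd.3     # the OSD rejoins and backfills, as described above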
Re: [ceph-users] any recommendation of using EnhanceIO?
Am 18.08.2015 um 15:43 schrieb Campbell, Bill:
> Hey Stefan,
> Are you using your Ceph cluster for virtualization storage?

Yes.

> Is dm-writeboost configured on the OSD nodes themselves?

Yes.

Stefan

From: Stefan Priebe - Profihost AG
To: Mark Nelson, ceph-users@lists.ceph.com
Sent: Tuesday, August 18, 2015 7:36:10 AM
Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

We've been using an extra caching layer for ceph since the beginning for our older ceph deployments. All new deployments go with full SSDs.

I've tested so far:
- EnhanceIO
- Flashcache
- Bcache
- dm-cache
- dm-writeboost

The best working solution was and is bcache, except for its buggy code. The current code in the 4.2-rc7 vanilla kernel still contains bugs, e.g. discards result in crashed filesystems after reboots, and so on. But it's still the fastest for ceph.

The 2nd best solution, which we already use in production, is dm-writeboost (https://github.com/akiradeveloper/dm-writeboost).

Everything else is too slow.

Stefan

Am 18.08.2015 um 13:33 schrieb Mark Nelson:
> Hi Jan,
>
> Out of curiosity, did you ever try dm-cache? I've been meaning to give
> it a spin but haven't had the spare cycles.
>
> Mark
>
> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>> backported 3.10 and 4.0 kernel-lt, if I remember correctly). It
>> worked fine during benchmarks and stress tests, but once we ran DB2
>> on it, it panicked within minutes and took all the data with it
>> (almost literally - files that weren't touched, like OS binaries,
>> were b0rked and the filesystem was unsalvageable).
>>
>> If you disregard this warning, the performance gains weren't that
>> great either, at least in a VM. It had problems when flushing to disk
>> after reaching the dirty watermark, and the block size has some
>> not-well-documented implications (not sure now, but I think it only
>> cached IO _larger_ than the block size, so if your database keeps
>> incrementing an XX-byte counter it will go straight to disk).
>>
>> Flashcache doesn't respect barriers (or does it now?) - if that's OK
>> for you then go for it; it should be stable, and I used it in the
>> past in production without problems.
>>
>> bcache seemed to work fine, but I needed to a) use it for root,
>> b) disable and enable it on the fly (doh), c) make it non-persistent
>> (flush it) before reboot - not sure if that was possible either - and
>> d) do all that in a customer's VM, and that customer didn't have a
>> strong technical background to be able to fiddle with it... so I
>> haven't tested it heavily.
>>
>> bcache should be the obvious choice if you are in control of the
>> environment. At least you can cry on LKML's shoulder when you lose
>> data :-)
>>
>> Jan
>>
>> On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote:
>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>> months ago, but no external contributors :( The nice thing about
>>> EnhanceIO is that there is no need to change the device name, unlike
>>> bcache, flashcache etc.
>>>
>>> Best regards,
>>> Alex
>>>
>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote:
>>>> I did some (non-ceph) work on these, and concluded that bcache was
>>>> the best supported, most stable, and fastest. This was ~1 year ago,
>>>> so take it with a grain of salt, but that's what I would recommend.
>>>> Daniel
>>>
>>> From: Dominik Zalewski dzalew...@optlink.net
>>> To: German Anders gand...@despegar.com
>>> Cc: ceph-users ceph-users@lists.ceph.com
>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>> Hi,
>>>
>>> I asked the same question in the last week or so (just search the
>>> mailing list archives for EnhanceIO :) and got some interesting
>>> answers. Looks like the project is pretty much dead since it was
>>> bought out by HGST. Even their website has some broken links in
>>> regards to EnhanceIO.
>>>
>>> I'm keen to try flashcache or bcache (it's been in the mainline
>>> kernel for some time).
>>>
>>> Dominik
>>>
>>> On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote:
>>>> Hi cephers,
>>>>
>>>> Is anyone out there that implemented EnhanceIO in a production
>>>> environment? Any recommendation? Any perf output to share with the
>>>> diff between using it and not?
>>>>
>>>> Thanks in advance,
>>>> German
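For anyone wanting to reproduce the bcache setup mentioned above, a minimal sketch (device names are illustrative; note Stefan's warning about bugs, especially around discards):

make-bcache -C /dev/nvme0n1     # format the SSD as a cache device
make-bcache -B /dev/sdb         # format the HDD as a backing device; /dev/bcache0 appears
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach    # uuid from 'bcache-super-show /dev/nvme0n1'
echo writeback > /sys/block/bcache0/bcache/cache_mode       # write cache in front of the OSD filesystem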
Re: [ceph-users] Rename Ceph cluster
Hi!

I think that renaming a cluster is not only a matter of moving the config file. We tried to change the name of a test Hammer cluster created with ceph-deploy and hit some issues.

In a default install, the naming of many parts is derived from the cluster name. For example, cephx keys are stored not in ceph.client.admin.keyring but in $CLUSTERNAME.client.admin.keyring, so we had to rename the keyrings as well. The same goes for the OSD/MON mount points: instead of /var/lib/ceph/osd/ceph-$OSDNUM, after renaming the cluster the daemons try to run OSDs from /var/lib/ceph/osd/$CLUSTERNAME-$OSDNUM. Of course, there are no such mountpoints, so we manually created them, mounted the filesystems and re-ran the OSDs.

There is one unresolved issue with udev rules: after a node reboot, the filesystems are mounted by udev into the old mountpoints. As the cluster is for testing, this is not a big thing.

So, be careful when renaming a production or loaded cluster.

PS: All of the above is my IMHO and I may be wrong. ;)

Megov Igor
CIO, Yuterra

From: ceph-users on behalf of Jan Schermer j...@schermer.cz
Sent: 18 August 2015 15:18
To: Erik McCormick
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Rename Ceph cluster

> I think it's pretty clear:
> http://ceph.com/docs/master/install/manual-deployment/
>
> "For example, when you run multiple clusters in a federated
> architecture, the cluster name (e.g., us-west, us-east) identifies the
> cluster for the current CLI session. Note: To identify the cluster name
> on the command line interface, specify a Ceph configuration file with
> the cluster name (e.g., ceph.conf, us-west.conf, us-east.conf, etc.).
> Also see CLI usage (ceph --cluster {cluster-name})."
>
> But it could be tricky on the OSDs that are running, depending on the
> distribution initscripts - you could find out that you can't "service
> ceph stop osd..." anymore after the change, since it can't find its
> pidfile anymore. Looking at the CentOS initscript, it looks like it
> accepts a "-c conffile" argument though. (So you should be managing
> OSDs with "-c ceph-prod.conf" now?)
>
> Jan
>
> On 18 Aug 2015, at 14:13, Erik McCormick emccorm...@cirrusseven.com wrote:
>> I've got a custom-named cluster integrated with OpenStack (Juno) and
>> didn't run into any hard-coded name issues that I can recall. Where
>> are you seeing that?
>>
>> As to the name change itself, I think it's really just a label
>> applying to a configuration set. The name doesn't actually appear *in*
>> the configuration files. It stands to reason you should be able to
>> rename the configuration files on the client side and leave the
>> cluster alone. It'd be worth trying in a test environment anyway.
>>
>> -Erik
>>
>> On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote:
>>> This should be simple enough:
>>>
>>> mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf
>>>
>>> No? :-)
>>>
>>> Or you could set this in nova.conf:
>>> images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf
>>>
>>> Obviously since different parts of openstack have their own configs,
>>> you'd have to do something similar for cinder/glance... so not worth
>>> the hassle.
>>>
>>> Jan
>>>
>>> On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote:
>>>> Hi,
>>>>
>>>> Does anyone know what steps should be taken to rename a Ceph
>>>> cluster? Btw, is it at all possible without data loss?
>>>>
>>>> Background: I have a cluster named "ceph-prod" integrated with
>>>> OpenStack, however I found out that the default cluster name "ceph"
>>>> is very much hardcoded into OpenStack, so I decided to change it
>>>> back to the default value.
>>>>
>>>> Regards, Vasily.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rename Ceph cluster
Thanks to all! Everything worked like a charm:

1) Stopped the cluster (I guess it's faster than moving the OSDs one by one)
2) Unmounted the OSDs and fixed their fstab entries
3) Renamed the MON and OSD folders
4) Renamed the config file and keyrings, and fixed the paths to the keyrings in the config
5) Mounted the OSDs back (mount -a)
6) Started everything
7) Fixed the path to the config in nova.conf, cinder.conf and glance-api.conf all over the OpenStack deployment

Everything works as expected. It took about half an hour to do the whole job. Again, thanks to all! Regards, Vasily.

2015-08-19 15:10 GMT+08:00 Межов Игорь Александрович me...@yuterra.ru: Hi! I think that renaming a cluster is not only a matter of moving the config file. [...]
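Sketched as shell commands, the procedure above might look roughly like this for a rename from ceph-prod back to the default ceph - a sketch only; the hostname cf01, OSD number 0, and the upstart invocations are illustrative placeholders, so adjust for your nodes and init system:

# 1) stop the whole cluster (Ubuntu 14.04 upstart shown)
stop ceph-all
# 2) unmount the OSDs and fix their fstab entries
umount /var/lib/ceph/osd/ceph-prod-0
sed -i 's/ceph-prod-/ceph-/g' /etc/fstab
# 3) rename the MON and OSD folders
mv /var/lib/ceph/mon/ceph-prod-cf01 /var/lib/ceph/mon/ceph-cf01
mv /var/lib/ceph/osd/ceph-prod-0 /var/lib/ceph/osd/ceph-0
# 4) rename the config file and keyrings, fix keyring paths inside the config
mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf
mv /etc/ceph/ceph-prod.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
# 5) remount the OSDs and 6) start everything back up
mount -a
start ceph-all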
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] requests are blocked - problem
-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jacek Jarosiewicz Sent: 19 August 2015 09:29 To: ceph-us...@ceph.com Subject: [ceph-users] requests are blocked - problem

Hi, Our setup is this: 4 x OSD nodes: E5-1630 CPU, 32 GB RAM, Mellanox MT27520 56Gbps network cards, SATA controller LSI Logic SAS3008. The storage nodes are connected to a SuperMicro chassis: 847E1C-R1K28JBOD. Each node has 2-3 spinning OSDs (6TB drives) and 2 SSD OSDs (240GB drives). 3 monitors run on the OSD nodes. ceph hammer 0.94.2, Ubuntu 14.04, cache tier with an EC pool (3+1). We've run some tests on the cluster and the results were promising - speeds, iops etc. as expected - but now we've tried to use more than one client for testing and have run into some problems:

When you ran this test, did you fill the RBD enough that it would have been flushing the dirty contents down to the base tier?

We've created a couple of rbd images and mapped them on the clients (kernel rbd), running two rsync processes and one dd on a large number of files (~250 GB of maildirs rsync'ed from one rbd image to the other, and a dd process writing one big 2TB file to another rbd image). And the speeds now are less than OK, plus we get a lot of "requests are blocked" warnings. We left the processes running overnight, but this morning none of them had finished - they are able to write data, but at a very, very slow rate.

Once you have written to a block, if the underlying object now lives on the EC pool it will have to be promoted to the cache pool before it can be written to again. This can have a severe impact on performance, especially if you are hitting lots of different blocks and the tiering agent can't keep up with the promotion requests.

Please help me diagnose this problem - everything seems to work, just very, very slowly... When we ran the tests with fio (librbd engine) everything seemed fine. I know the kernel implementation is slower, but is this normal? I can't understand why so many requests are blocked.

I would suggest running the fio tests again, just to make sure that there isn't a problem with your newer config, but I suspect you will see equally bad performance with the fio tests now that the cache tier has begun to be more populated.
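A sketch of such a librbd fio run (the pool and image names here, rbd and test-img, are placeholders - adjust to match the earlier tests):

fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=test-img \
    --rw=randwrite --bs=4k --iodepth=32 --direct=1 \
    --time_based --runtime=60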
Some diagnostic data:

root@cf01:/var/log/ceph# ceph -s
    cluster 3469081f-9852-4b6e-b7ed-900e77c48bb5
     health HEALTH_WARN
            31 requests are blocked > 32 sec
     monmap e1: 3 mons at {cf01=10.4.10.211:6789/0,cf02=10.4.10.212:6789/0,cf03=10.4.10.213:6789/0}
            election epoch 202, quorum 0,1,2 cf01,cf02,cf03
     osdmap e1319: 18 osds: 18 up, 18 in
      pgmap v933010: 2112 pgs, 19 pools, 10552 GB data, 2664 kobjects
            14379 GB used, 42812 GB / 57192 GB avail
                2111 active+clean
                   1 active+clean+scrubbing
  client io 0 B/s rd, 12896 kB/s wr, 35 op/s

root@cf01:/var/log/ceph# ceph health detail
HEALTH_WARN 23 requests are blocked > 32 sec; 6 osds have slow requests
1 ops are blocked > 131.072 sec
22 ops are blocked > 65.536 sec
1 ops are blocked > 65.536 sec on osd.2
1 ops are blocked > 65.536 sec on osd.3
1 ops are blocked > 65.536 sec on osd.4
1 ops are blocked > 131.072 sec on osd.7
18 ops are blocked > 65.536 sec on osd.10
1 ops are blocked > 65.536 sec on osd.12
6 osds have slow requests

root@cf01:/var/log/ceph# grep WRN ceph.log | tail -50
2015-08-19 10:23:34.505669 osd.14 10.4.10.213:6810/21207 17942 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.575870 secs
2015-08-19 10:23:34.505796 osd.14 10.4.10.213:6810/21207 17943 : cluster [WRN] slow request 30.575870 seconds old, received at 2015-08-19 10:23:03.929722: osd_op(client.9203.1:22822591 rbd_data.37e02ae8944a.000180ca [set-alloc-hint object_size 4194304 write_size 4194304,write 0~462848] 5.1c2aff5f ondisk+write e1319) currently waiting for blocked object
2015-08-19 10:23:34.505803 osd.14 10.4.10.213:6810/21207 17944 : cluster [WRN] slow request 30.560009 seconds old, received at 2015-08-19 10:23:03.945583: osd_op(client.9203.1:22822592 rbd_data.37e02ae8944a.000180ca [set-alloc-hint object_size 4194304 write_size 4194304,write 462848~524288] 5.1c2aff5f ondisk+write e1319) currently waiting for blocked object
2015-08-19 10:23:35.489927 osd.1 10.4.10.211:6812/9198 7921 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.112783 secs
2015-08-19 10:23:35.490326 osd.1 10.4.10.211:6812/9198 7922 : cluster [WRN] slow request 30.112783 seconds old, received at 2015-08-19 10:23:05.376339: osd_op(osd.14.1299:731492 rbd_data.37e02ae8944a.00017a90 [copy-from ver 61293] 4.5aa9fb69 ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e1319) currently commit_sent
2015-08-19 10:23:36.799861 osd.6 10.4.10.213:6806/22569 22663 : cluster [WRN] 2 slow requests, 1 included below; oldest
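One way to dig further into output like the above is the OSD admin socket - a sketch, to be run on the node hosting the OSD in question (osd.10 chosen here only because it holds most of the blocked ops listed above):

ceph daemon osd.10 dump_ops_in_flight    # ops currently in flight, with their age and state
ceph daemon osd.10 dump_historic_ops     # the slowest recently completed ops
iostat -x 1                              # check whether the spinning disks sit near 100% util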
Re: [ceph-users] requests are blocked - problem
Hello,

That's a pretty small cluster all things considered, so your rather intensive test setup is likely to run into any or all of the following issues:

1) The amount of data you're moving around is going to cause a lot of promotions from and to the cache tier. This is expensive and slow.
2) EC pools are slow. You may actually have better results with a classic Ceph approach, 2-4 HDDs per journal SSD. Also, 6TB HDDs combined with EC may look nice from a cost/density perspective, but more (smaller) HDDs means more IOPS and thus speed.
3) Scrubbing (unless configured very aggressively down) will impact your performance on top of the items above.
4) You already noted the kernel versus userland bit.
5) Having all your storage in a single JBOD chassis strikes me as ill advised, though I don't think it's an actual bottleneck at 4x12Gb/s.

When you ran the fio tests I assume nothing else was going on and the dataset size would have fit easily into the cache pool, right? Look at your nodes with atop or iostat; I venture all your HDDs are at 100%.

Christian

On Wed, 19 Aug 2015 10:29:28 +0200 Jacek Jarosiewicz wrote:

Hi, Our setup is this: 4 x OSD nodes: E5-1630 CPU, 32 GB RAM, Mellanox MT27520 56Gbps network cards, SATA controller LSI Logic SAS3008. The storage nodes are connected to a SuperMicro chassis: 847E1C-R1K28JBOD. Each node has 2-3 spinning OSDs (6TB drives) and 2 SSD OSDs (240GB drives). 3 monitors run on the OSD nodes. ceph hammer 0.94.2, Ubuntu 14.04, cache tier with an EC pool (3+1). We've run some tests on the cluster and the results were promising - speeds, iops etc. as expected - but now we've tried to use more than one client for testing and have run into some problems: We've created a couple of rbd images and mapped them on the clients (kernel rbd), running two rsync processes and one dd on a large number of files (~250 GB of maildirs rsync'ed from one rbd image to the other, and a dd process writing one big 2TB file to another rbd image). And the speeds now are less than OK, plus we get a lot of "requests are blocked" warnings. We left the processes running overnight, but this morning none of them had finished - they are able to write data, but at a very, very slow rate. Please help me diagnose this problem - everything seems to work, just very, very slowly... When we ran the tests with fio (librbd engine) everything seemed fine. I know the kernel implementation is slower, but is this normal? I can't understand why so many requests are blocked.
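For point 1), the standard Hammer knobs for bounding a cache tier so it flushes and evicts before filling up are along these lines - a sketch; the pool name hot-pool and the sizes are placeholders to fit your SSD capacity:

ceph osd pool set hot-pool target_max_bytes 400000000000    # hard cap on the cache pool (~400 GB here)
ceph osd pool set hot-pool cache_target_dirty_ratio 0.4     # start flushing dirty objects at 40% of the cap
ceph osd pool set hot-pool cache_target_full_ratio 0.8      # start evicting clean objects at 80% of the cap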
Re: [ceph-users] any recommendation of using EnhanceIO?
-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer Sent: 19 August 2015 03:32 To: ceph-users@lists.ceph.com Cc: Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote:

[mega snip]

4. Disk-based OSD with SSD journal performance. As I touched on earlier, I would expect a disk-based OSD with an SSD journal to have similar performance to a pure SSD OSD when dealing with sequential small IOs. Currently the levelDB sync, and potentially other things, slow this down.

Has anybody tried symlinking the omap directory to an SSD and tested whether that makes a (significant) difference?

I thought I remembered reading somewhere that all these items need to remain on the OSD itself, so that when the OSD calls fsync it can be sure they are all in sync at the same time.

Christian
--
Christian Balzer    Network/Systems Engineer
ch...@gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
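For what it's worth, the experiment Christian floats would look something like this - purely a sketch of an untested idea from this thread, not a supported procedure; the OSD id (0), the FileStore omap path, and the SSD mount point are placeholders, and Nick's fsync/consistency caveat applies:

# Take the OSD down before touching anything (Ubuntu upstart shown).
stop ceph-osd id=0
# Relocate the leveldb omap directory onto the SSD and symlink it back.
mv /var/lib/ceph/osd/ceph-0/current/omap /mnt/ssd/osd0-omap
ln -s /mnt/ssd/osd0-omap /var/lib/ceph/osd/ceph-0/current/omap
start ceph-osd id=0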