Re: [ceph-users] Poor performance on all SSD cluster
On 24/06/14 17:37, Alexandre DERUMIER wrote:

Hi Greg,

So the only way to improve performance would be to not use O_DIRECT (as this should bypass rbd cache as well, right?).

Yes, indeed, O_DIRECT bypasses the cache. BTW, do you need to use MySQL with O_DIRECT? The default innodb_flush_method is fdatasync, so it should work with the cache (but you can lose some writes in case of a crash).

While this suggestion is good, I don't believe the "you could lose data" statement is correct with respect to fdatasync (or fsync) [1]. With all modern kernels I think you will find that fdatasync will actually flush modified buffers to the device (i.e. write through the file buffer cache). All of which means that MySQL performance (looking at you, binlog) may still suffer due to lots of small-block-size sync writes.

regards

Mark

[1] See kernel archives concerning REQ_FLUSH and friends.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
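The fdatasync behaviour described above can be sketched in a few lines of Python (an illustrative demo only, not Ceph- or MySQL-specific; the file path is arbitrary):

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and fdatasync it; return the byte count written.

    Minimal sketch of the mechanism InnoDB's default flush method
    relies on: fdatasync() returns only once the file *data* (not
    necessarily metadata such as mtime) has reached stable storage,
    unlike a bare write(), which may linger in the page cache.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, data)
        os.fdatasync(fd)  # flush through the page cache to the device
        return written
    finally:
        os.close(fd)

path = os.path.join(tempfile.gettempdir(), "commit_demo")
print(durable_write(path, b"commit record"))  # -> 13
os.unlink(path)
```

Each such call is one small synchronous round trip to the device, which is exactly why lots of them (as with the binlog) hurt on high-latency storage.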
Re: [ceph-users] Deep scrub versus osd scrub load threshold
Hello,

On Mon, 23 Jun 2014 21:50:50 -0700 David Zafman wrote:

By default osd_scrub_max_interval and osd_deep_scrub_interval are 1 week, 604800 seconds (60*60*24*7), and osd_scrub_min_interval is 1 day, 86400 seconds (60*60*24). As long as osd_scrub_max_interval = osd_deep_scrub_interval, the load won't impact when deep scrub occurs. I suggest that osd_scrub_min_interval = osd_scrub_max_interval = osd_deep_scrub_interval. I'd like to know how you have those 3 values set, so I can confirm that this explains the issue.

They are and were, unsurprisingly, set to the default values.

Now to provide some more information: shortly after the inception of this cluster I initiated a deep scrub on all OSDs at 00:30 on a Sunday morning (the things we do for Ceph; a scheduler with a variety of rules would be nice, but I digress). This took until 05:30 despite the cluster being idle and holding close to no data. In retrospect it seems clear to me that this was already influenced by the load threshold (a scrub I initiated with the new threshold value of 1.5 finished in just 30 minutes last night). Consequently all the normal scrubs happened in the same time frame until this weekend, on the 21st (normal scrub). The deep scrub on the 22nd clearly ran into the load threshold.

So if I understand you correctly, setting osd_scrub_max_interval to 6 days should have deep scrubs ignore the load threshold, as per the documentation?

Regards,

Christian

David Zafman
Senior Developer
http://www.inktank.com http://www.redhat.com

On Jun 23, 2014, at 7:01 PM, Christian Balzer ch...@gol.com wrote:

Hello,

On Mon, 23 Jun 2014 14:20:37 -0400 Gregory Farnum wrote: Looks like it's a doc error (at least on master), but it might have changed over time. If you're running Dumpling we should change the docs.

Nope, I'm running 0.80.1 currently.
Christian

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer ch...@gol.com wrote:

Hello,

This weekend I noticed that the deep scrubbing took a lot longer than usual (long periods without a scrub running/finishing), even though the cluster wasn't all that busy. It was, however, busier than in the past, and the load average was frequently above 0.5. Now, according to the documentation, osd scrub load threshold is ignored when it comes to deep scrubs. However, after setting it to 1.5 and restarting the OSDs, the floodgates opened and all those deep scrubs are now running at full speed.

Documentation error, or did I unstick something with the OSD restart?

Regards,

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Poor performance on all SSD cluster
All of which means that MySQL performance (looking at you, binlog) may still suffer due to lots of small-block-size sync writes.

Which begs the question: is anyone running a reasonably busy MySQL server on Ceph-backed storage? We tried, and it did not perform well enough. We have a small Ceph cluster: 3 machines with 2 SSD journals and 10 spinning disks each. Using Ceph through KVM RBD we were seeing performance equal to about 1-2 spinning disks. Reading this thread, it now looks a bit as if there are inherent architecture and latency issues that would prevent it from performing well as a MySQL database store.

I'd be interested in example setups where people are running busy databases on Ceph-backed volumes.

Cheers,
Robert
Re: [ceph-users] Poor performance on all SSD cluster
On 24/06/14 18:15, Robert van Leeuwen wrote:

All of which means that MySQL performance (looking at you, binlog) may still suffer due to lots of small-block-size sync writes.

Which begs the question: is anyone running a reasonably busy MySQL server on Ceph-backed storage? We tried, and it did not perform well enough. We have a small Ceph cluster: 3 machines with 2 SSD journals and 10 spinning disks each. Using Ceph through KVM RBD we were seeing performance equal to about 1-2 spinning disks. Reading this thread, it now looks a bit as if there are inherent architecture and latency issues that would prevent it from performing well as a MySQL database store. I'd be interested in example setups where people are running busy databases on Ceph-backed volumes.

Yes indeed. We have looked extensively at Postgres performance on RBD, and while it is not MySQL, the underlying mechanism for durable writes (i.e. commit) is essentially very similar (fsync, fdatasync and friends). We achieved quite reasonable performance (by that I mean sufficiently encouraging to be happy to host real datastores for our moderately busy systems, and we are continuing to investigate using it for our really busy ones).

I have not experimented extensively with the various choices of flush method (called sync method in Postgres, but the same idea), as we found quite good performance with the default (fdatasync). However, this is clearly an area worth investigating.

Regards

Mark
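For readers wanting to experiment, the Postgres knob Mark refers to lives in postgresql.conf; fdatasync is the stock default on Linux (shown here for context, not as a tuning recommendation):

```ini
# postgresql.conf (excerpt)
wal_sync_method = fdatasync    # alternatives include fsync, open_datasync, open_sync
```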
Re: [ceph-users] Firefly OSDs : set_extsize: FSSETXATTR: (22) Invalid argument
On Tue, Jun 24, 2014 at 12:02 PM, Florent B flor...@coppint.com wrote:

Hi all,

On 2 Firefly clusters, I have a lot of errors like this on my OSDs:

2014-06-24 09:54:39.088469 7fb5b8628700 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) set_extsize: FSSETXATTR: (22) Invalid argument

Both are using XFS, *without* filestore_xattr_use_omap = true. I read that this was not necessary for XFS... What could be the problem? Both clusters are using a Red Hat 3.10 kernel on Debian Wheezy.

Have you done a Ceph upgrade recently? This is most probably an artifact of the upgrade, caused by a bug (omission) that has since been fixed. Nothing serious: set_extsize simply tries to set an allocation size hint; it doesn't affect anything else.

Thanks,

Ilya
Re: [ceph-users] Firefly OSDs : set_extsize: FSSETXATTR: (22) Invalid argument
On Tue, Jun 24, 2014 at 1:15 PM, Florent B flor...@coppint.com wrote:

On 06/24/2014 11:13 AM, Ilya Dryomov wrote:

Have you done a Ceph upgrade recently? This is most probably an artifact of the upgrade, caused by a bug (omission) that has since been fixed. Nothing serious: set_extsize simply tries to set an allocation size hint; it doesn't affect anything else.

Yes, of course, I upgraded from Emperor when Firefly was released. What did I miss?

You missed nothing; the set_extsize code just didn't take upgrades into account. The fix for that should be in the next Firefly release.

Thanks,

Ilya
Re: [ceph-users] Poor performance on all SSD cluster
On 23/06/14 19:16, Mark Kirkwood wrote:

For database types (and yes, I'm one of those)... you want to know that your writes (particularly your commit writes) are actually making it to persistent storage (that ACID thing, you know). Now, I see the RBD cache as very like battery-backed RAID cards: your commits (i.e. fsync or O_DIRECT writes) are not actually written, but are cached, so you are depending on the reliability of a) your RAID controller, battery etc. in that case, or, more interestingly, b) your Ceph topology, to withstand node failures. Given we usually design a Ceph cluster with these things in mind, it is probably ok!

Thinking about this a bit more (and noting Mark N's comment too), this is a bit more subtle than what I indicated above: the rbd cache lives at the *client* level, so (thinking in OpenStack terms): if your VM fails, no problem, the compute node has the write cache in memory... ok, but what about if the compute node itself fails? This is analogous to: what if your battery-backed RAID card self-destructs? The answer would appear to be data loss, so rbd cache reliability looks to be dependent on the resilience of the client/compute design.

Regards

Mark
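For context, the client-side cache being discussed is configured per client in ceph.conf; a typical stanza looks roughly like this (values shown are the stock defaults of the era, illustrative rather than a recommendation; check the documentation for your release):

```ini
[client]
rbd cache = true
# Treat the cache as write-through until the guest issues its first
# flush, so guests unaware of flushing are not silently exposed:
rbd cache writethrough until flush = true
rbd cache size = 33554432         # 32 MB
rbd cache max dirty = 25165824    # 24 MB
```

Note the cache contents live in the qemu/librbd process on the compute node, which is exactly the failure domain discussed above.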
[ceph-users] 'osd pool set-quota' behaviour with CephFS
Last week I decided to take a look at the 'osd pool set-quota' option. I have a directory in CephFS that uses a pool called pool-2 (configured by following this: http://www.sebastien-han.fr/blog/2013/02/11/mount-a-specific-pool-with-cephfs/). That directory is filled with cat pictures.

I ran 'rados df'. I then copied a couple more cat pictures into my directory using 'cp file destination && sync'. I then ran 'rados df' again; this showed an increase in the object count for the pool equal to the number of additional cat pictures, and an increase in the pool size equal to the size of the cat pictures, as expected.

I then used the command 'ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}]', as per http://ceph.com/docs/master/rados/operations/pools/, and set an object limit a couple of objects bigger than the current pool size. I then ran a loop copying more cat pictures one at a time (again with '&& sync' each time). Whilst doing this I ran 'rados df'; the number of objects in the pool increased up to the limit and stopped. However, on the machine copying the cat pictures the copying appeared to work fine, and running ls showed more pictures than the 'rados df' output would suggest should be there. If I accessed the same directory from a different machine, I saw only the pictures that were copied up to the limit. If I then removed the limit, the images would appear in the directory and 'rados df' would report a larger number of objects. Similar behaviour was observed when setting a size limit.

What's going on? Is this expected behaviour?

George Ryall
Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell Oxford | Didcot | OX11 0QX
(01235 44) 5021
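For reproducibility, the quota commands involved look roughly like this (pool name and limits are examples; setting a quota to 0 removes it):

```shell
# Cap the pool at 100 objects and/or 1 GB:
ceph osd pool set-quota pool-2 max_objects 100
ceph osd pool set-quota pool-2 max_bytes $((1024 * 1024 * 1024))

# Remove the quotas again by setting them back to 0:
ceph osd pool set-quota pool-2 max_objects 0
ceph osd pool set-quota pool-2 max_bytes 0
```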
Re: [ceph-users] Poor performance on all SSD cluster
On 06/24/2014 03:45 AM, Mark Kirkwood wrote:

On 24/06/14 18:15, Robert van Leeuwen wrote:

All of which means that MySQL performance (looking at you, binlog) may still suffer due to lots of small-block-size sync writes.

Which begs the question: is anyone running a reasonably busy MySQL server on Ceph-backed storage? We tried, and it did not perform well enough. We have a small Ceph cluster: 3 machines with 2 SSD journals and 10 spinning disks each. Using Ceph through KVM RBD we were seeing performance equal to about 1-2 spinning disks. Reading this thread, it now looks a bit as if there are inherent architecture and latency issues that would prevent it from performing well as a MySQL database store. I'd be interested in example setups where people are running busy databases on Ceph-backed volumes.

Yes indeed. We have looked extensively at Postgres performance on RBD, and while it is not MySQL, the underlying mechanism for durable writes (i.e. commit) is essentially very similar (fsync, fdatasync and friends). We achieved quite reasonable performance (by that I mean sufficiently encouraging to be happy to host real datastores for our moderately busy systems, and we are continuing to investigate using it for our really busy ones). I have not experimented extensively with the various choices of flush method (called sync method in Postgres, but the same idea), as we found quite good performance with the default (fdatasync). However, this is clearly an area worth investigating.

FWIW, I ran through the DBT-3 benchmark suite on MariaDB on top of qemu/kvm RBD with a 3x replication pool on 30 OSDs. I kept buffer sizes small to try to force disk IO, and benchmarked against a local disk passed through to the VM. We typically did about 3-4x better on queries than the local disk, but there were a couple of queries where we were slower. I didn't look at how multiple databases scale, though. That may have its own benefits and challenges.
I'm encouraged overall, though. It looks, from your comments and from my own testing, as if it's possible to have at least passable performance with a single database, and, as we reduce latency in Ceph, potentially make it even better. With multiple databases, it's entirely possible that we can do pretty well even now.

Regards

Mark
Re: [ceph-users] Poor performance on all SSD cluster
On 06/24/2014 04:46 AM, Mark Kirkwood wrote:

On 23/06/14 19:16, Mark Kirkwood wrote:

For database types (and yes, I'm one of those)... you want to know that your writes (particularly your commit writes) are actually making it to persistent storage (that ACID thing, you know). Now, I see the RBD cache as very like battery-backed RAID cards: your commits (i.e. fsync or O_DIRECT writes) are not actually written, but are cached, so you are depending on the reliability of a) your RAID controller, battery etc. in that case, or, more interestingly, b) your Ceph topology, to withstand node failures. Given we usually design a Ceph cluster with these things in mind, it is probably ok!

Thinking about this a bit more (and noting Mark N's comment too), this is a bit more subtle than what I indicated above: the rbd cache lives at the *client* level, so (thinking in OpenStack terms): if your VM fails, no problem, the compute node has the write cache in memory... ok, but what about if the compute node itself fails? This is analogous to: what if your battery-backed RAID card self-destructs? The answer would appear to be data loss, so rbd cache reliability looks to be dependent on the resilience of the client/compute design.

Well, it's the same problem you have with the cache on most spinning disks. You just have to assume that anything that wasn't flushed might not have made it. Depending on the use case, that might or might not be an acceptable assumption. In terms of data loss, the way I like to look at this is that there is always a spectrum. Even with battery-backed RAID cards you don't have any guarantee that any given write is going to make it out of RAM and to the controller before a system crash. What's more important, imho, is making sure you know exactly what the granularity is and what kind of guarantees you do get.
Regards

Mark
Re: [ceph-users] Poor performance on all SSD cluster
On Mon, Jun 23, 2014 at 3:03 PM, Mark Nelson mark.nel...@inktank.com wrote:

Well, for random IO you often can't do much coalescing. You have to bite the bullet and either parallelize things or reduce per-op latency. Ceph already handles parallelism very well: you just throw more disks at the problem and, so long as there are enough client requests, it more or less just scales (limited by things like network bisection bandwidth or other complications). On the latency side, spinning disks aren't fast enough for Ceph's extra latency overhead to matter much, but with SSDs the story is different. That's why we are very interested in reducing latency.

Regarding journals: journal writes are always sequential (even for random IO!), but are O_DIRECT, so they'll skip the Linux buffer cache. If you have hardware that is fast at writing small sequential IO (say a controller with WB cache, or an SSD), you can do journal writes very quickly. For bursts of small random IO, performance can be quite good. The downside is that you can hit journal limits very quickly, meaning you have to flush and wait for the underlying filestore to catch up. This results in performance that starts out super fast, then stalls once the journal limits are hit, goes back to super fast for a bit, then stalls again, etc. This is less than ideal given the way CRUSH distributes data across OSDs. The alternative is setting a soft limit on how much data is in the journal and flushing smaller amounts of data more quickly to limit the spiky behaviour. On the whole that can be good, but it limits the burst potential and also limits the amount of data that could potentially be coalesced in the journal.

Mark,

What settings are you suggesting for setting a soft limit on journal size and flushing smaller amounts of data? Something like this?
filestore_queue_max_bytes: 10485760
filestore_queue_committing_max_bytes: 10485760
journal_max_write_bytes: 10485760
journal_queue_max_bytes: 10485760
ms_dispatch_throttle_bytes: 10485760
objecter_inflight_op_bytes: 10485760

(see "Small bytes" in http://ceph.com/community/ceph-bobtail-jbod-performance-tuning)

Luckily with RBD you can (when applicable) coalesce on the client with the RBD cache instead, which is arguably better anyway, since you can send bigger IOs to the OSDs earlier in the write path. So long as you are ok with what the RBD cache does and does not guarantee, it's definitely worth enabling imho.

Thanks,
Jake
Re: [ceph-users] 'osd pool set-quota' behaviour with CephFS
Hi George,

I actually asked Sage about a similar scenario at the OpenStack summit in Atlanta this year, namely whether I could use the new pool quota functionality to enforce quotas on CephFS. The answer was no: the pool quota functionality is mostly intended for radosgw, and the existing CephFS clients have no support for it. He said the quota should work, actually, but that you were likely to see some very strange behavior in CephFS. That sounds like what you've seen. It won't be a graceful failure at all.

Quota support in CephFS is a different task, and one that I'm following as well. See here: https://github.com/ceph/ceph/pull/1122

The pull request is old, but Sage did mention he was in contact with the team working on the code and was hopeful to see it finished.

- Travis

On Tue, Jun 24, 2014 at 7:06 AM, george.ry...@stfc.ac.uk wrote:

Last week I decided to take a look at the 'osd pool set-quota' option. I have a directory in CephFS that uses a pool called pool-2 (configured by following this: http://www.sebastien-han.fr/blog/2013/02/11/mount-a-specific-pool-with-cephfs/). That directory is filled with cat pictures. I ran 'rados df'. I then copied a couple more cat pictures into my directory using 'cp file destination && sync'. I then ran 'rados df' again; this showed an increase in the object count for the pool equal to the number of additional cat pictures, and an increase in the pool size equal to the size of the cat pictures, as expected. I then used the command 'ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}]', as per http://ceph.com/docs/master/rados/operations/pools/, and set an object limit a couple of objects bigger than the current pool size. I then ran a loop copying more cat pictures one at a time (again with '&& sync' each time). Whilst doing this I ran 'rados df'; the number of objects in the pool increased up to the limit and stopped.
However, on the machine copying the cat pictures the copying appeared to work fine, and running ls showed more pictures than the 'rados df' output would suggest should be there. If I accessed the same directory from a different machine, I saw only the pictures that were copied up to the limit. If I then removed the limit, the images would appear in the directory and 'rados df' would report a larger number of objects. Similar behaviour was observed when setting a size limit.

What's going on? Is this expected behaviour?

George Ryall
Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell Oxford | Didcot | OX11 0QX
(01235 44) 5021
[ceph-users] limitations of erasure coded pools
Hi All,

Could someone point me to a document (possibly a FAQ :) ) describing the limitations of erasure-coded pools? Hopefully it would contain the when and how to use them as well. E.g. I read about people using replicated pools as a front end to erasure-coded pools, but I don't know why they're deciding to do this, or how they are setting it up.

Thanks!

Chad.
Re: [ceph-users] Multiple hierarchies and custom placement
There's not really a simple way to do this. There are functions in the OSDMap structure to calculate the location of a particular PG, but there are a lot of independent places that map objects into PGs.

On Monday, June 23, 2014, Shayan Saeed shayansaee...@gmail.com wrote:

Thanks for getting back with a helpful reply. Assuming that I change the source code to do custom placement, what are the places I need to look at in the code to do that? I am currently trying to change the CRUSH code, but is there any place else I need to be concerned about?

Regards,
Shayan Saeed

On Mon, Jun 23, 2014 at 2:14 PM, Gregory Farnum g...@inktank.com wrote:

On Fri, Jun 20, 2014 at 4:23 PM, Shayan Saeed shayansaee...@gmail.com wrote:

Is it allowed for CRUSH maps to have multiple hierarchies for different pools? So, for example, I want one pool to treat my cluster as flat, with every host being equal, but the other pool to have a more hierarchical idea of hosts-racks-root.

Yes. It can get complicated, so make sure you know exactly what you're doing, but you can create different root buckets and link the OSDs into each root in different ways.

Also, is it currently possible in Ceph to have custom placement of erasure-coded chunks? So, for example, within a pool I want objects to reside exactly on the OSDs I choose instead of doing placement for load balancing. Can I specify something like: for object 1, I want systematic chunks on rack1 and non-systematic chunks distributed between rack2 and rack3, and then for object 2, I want systematic chunks on rack2 and non-systematic chunks distributed between rack1 and rack3?

Not generally, no: you need to let the CRUSH algorithm place them. You can do things like specify specific buckets within a CRUSH rule, but that applies at a pool level.

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

I would greatly appreciate any suggestions I get.
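As a sketch of the multiple-root idea Greg describes, a decompiled CRUSH map can declare an extra root that references the same OSDs, plus a rule that draws from it (bucket names, IDs, and weights below are made up for illustration):

```
# Flat view: OSDs directly under a second root, ignoring hosts/racks
root flat {
        id -10          # hypothetical, must not clash with existing buckets
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}

# Rule for pools that should use the flat hierarchy
rule flat_ruleset {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take flat
        step chooseleaf firstn 0 type osd
        step emit
}
```

A pool is then pointed at the rule with something like `ceph osd pool set <pool> crush_ruleset 1`.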
Regards,
Shayan Saeed

--
Software Engineer #42 @ http://inktank.com | http://ceph.com
[ceph-users] Continuing placement group problems
We are using two clusters of Ceph version 0.80, both on ZFS. We are frequently getting inconsistent placement groups on both clusters. We suspect that there is a problem with the network that is randomly corrupting the update of placement groups. Does anyone have any suggestions as to where and how to look for the problem? The network does not seem to have any problems, ZFS is not reporting any problems with the disks, and the OSDs are fine.

Thanks

Peter.

Log as follows:

 health HEALTH_ERR 50 pgs inconsistent; 121 scrub errors
 monmap e8: 6 mons at {broll=10.5.8.9:6789/0,gelbin=10.5.8.10:6789/0,magni=10.5.8.12:6789/0,sicco=10.5.8.11:6789/0,tyrande=10.5.8.8:6789/0,varian=10.5.8.14:6789/0}, election epoch 272, quorum 0,1,2,3,4,5 tyrande,broll,gelbin,sicco,magni,varian
 mdsmap e430: 1/1/1 up {0=broll=up:active}, 5 up:standby
 osdmap e18928: 7 osds: 7 up, 7 in
 pgmap v4910054: 512 pgs, 4 pools, 13043 MB data, 3681 objects
       40800 MB used, 856 GB / 895 GB avail
            462 active+clean
             50 active+clean+inconsistent
 client io 12769 B/s rd, 5 op/s
Re: [ceph-users] Ceph RGW + S3 Client (s3cmd)
Hi Vickey,

This really looks like a DNS issue. Are you sure that the host from which s3cmd is running is able to resolve the host 'bmi-pocfe2.scc.fi'? Does a regular ping work?

$ ping bmi-pocfe2.scc.fi

François

On 23. 06. 14 16:24, Vickey Singh wrote:

# s3cmd ls
WARNING: Retrying failed request: / ([Errno -2] Name or service not known)
WARNING: Waiting 3 sec...
WARNING: Retrying failed request: / ([Errno -2] Name or service not known)
WARNING: Waiting 6 sec...
WARNING: Retrying failed request: / ([Errno -2] Name or service not known)
WARNING: Waiting 9 sec...
Re: [ceph-users] Deep scrub versus osd scrub load threshold
Unfortunately, decreasing osd_scrub_max_interval to 6 days isn't going to fix it. There is a sort of quirk in the way deep scrubs are initiated: a deep scrub isn't triggered until a regular scrub is about to start. So with osd_scrub_max_interval set to 1 week and a high load, the next possible scrub or deep scrub is 1 week from the last REGULAR scrub, even if the last deep scrub was more than 7 days ago. The longest wait between deep scrubs is therefore osd_scrub_max_interval + osd_deep_scrub_interval.

For example, a deep scrub happens on Jan 1. Each day after that, for six days, a regular scrub happens under low load. After 6 regular scrubs, ending on Jan 7, the load goes high. Now, with the load high, no scrub can start until Jan 14, because you must get past osd_scrub_max_interval since the last regular scrub on Jan 7. At that time it will be a deep scrub, because it has been more than 7 days since the last deep scrub on Jan 1.

See also http://tracker.ceph.com/issues/6735

There may be a need for more documentation clarification in this area, or a change to the behavior.

David Zafman
Senior Developer
http://www.inktank.com http://www.redhat.com

On Jun 23, 2014, at 11:10 PM, Christian Balzer ch...@gol.com wrote:

Hello,

On Mon, 23 Jun 2014 21:50:50 -0700 David Zafman wrote:

By default osd_scrub_max_interval and osd_deep_scrub_interval are 1 week, 604800 seconds (60*60*24*7), and osd_scrub_min_interval is 1 day, 86400 seconds (60*60*24). As long as osd_scrub_max_interval = osd_deep_scrub_interval, the load won't impact when deep scrub occurs. I suggest that osd_scrub_min_interval = osd_scrub_max_interval = osd_deep_scrub_interval. I'd like to know how you have those 3 values set, so I can confirm that this explains the issue.

They are and were, unsurprisingly, set to the default values.
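David's alignment suggestion, expressed as a ceph.conf fragment (one possible reading; values are the one-week defaults quoted above, in seconds):

```ini
[osd]
# Align all three intervals so a deep scrub is not deferred behind
# a pending regular scrub:
osd scrub min interval = 604800
osd scrub max interval = 604800
osd deep scrub interval = 604800
```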
Now to provide some more information: shortly after the inception of this cluster I initiated a deep scrub on all OSDs at 00:30 on a Sunday morning (the things we do for Ceph; a scheduler with a variety of rules would be nice, but I digress). This took until 05:30 despite the cluster being idle and holding close to no data. In retrospect it seems clear to me that this was already influenced by the load threshold (a scrub I initiated with the new threshold value of 1.5 finished in just 30 minutes last night). Consequently all the normal scrubs happened in the same time frame until this weekend, on the 21st (normal scrub). The deep scrub on the 22nd clearly ran into the load threshold.

So if I understand you correctly, setting osd_scrub_max_interval to 6 days should have deep scrubs ignore the load threshold, as per the documentation?

Regards,

Christian

David Zafman
Senior Developer
http://www.inktank.com http://www.redhat.com

On Jun 23, 2014, at 7:01 PM, Christian Balzer ch...@gol.com wrote:

Hello,

On Mon, 23 Jun 2014 14:20:37 -0400 Gregory Farnum wrote: Looks like it's a doc error (at least on master), but it might have changed over time. If you're running Dumpling we should change the docs.

Nope, I'm running 0.80.1 currently.

Christian

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer ch...@gol.com wrote:

Hello,

This weekend I noticed that the deep scrubbing took a lot longer than usual (long periods without a scrub running/finishing), even though the cluster wasn't all that busy. It was, however, busier than in the past, and the load average was frequently above 0.5. Now, according to the documentation, osd scrub load threshold is ignored when it comes to deep scrubs. However, after setting it to 1.5 and restarting the OSDs, the floodgates opened and all those deep scrubs are now running at full speed.

Documentation error, or did I unstick something with the OSD restart?
Regards,

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Ceph RGW + S3 Client (s3cmd)
On 06/23/2014 04:24 AM, Vickey Singh wrote:

host_bucket = %(bucket)s.bmi-pocfe2.scc.fi

Should there be a '.' (period) between %(bucket) and s.bmi-pocfe2.scc.fi?

-Stephan
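Incidentally, s3cmd expands host_bucket with Python percent-style string interpolation, so the `s` belongs to the `%(bucket)s` placeholder and the period after it is already a literal separator. A quick illustration (the hostname is just the one from the thread, and "mybucket" is a made-up bucket name):

```python
# %(bucket)s is a Python mapping-key placeholder ("s" = string
# conversion); the "." that follows it is a literal character.
template = "%(bucket)s.bmi-pocfe2.scc.fi"
expanded = template % {"bucket": "mybucket"}
print(expanded)  # mybucket.bmi-pocfe2.scc.fi
```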
Re: [ceph-users] limitations of erasure coded pools
Date: Tue, 24 Jun 2014 09:39:50 -0500
From: Chad Seys cws...@physics.wisc.edu
To: ceph-users@lists.ceph.com
Subject: [ceph-users] limitations of erasure coded pools

Hi All,

Could someone point me to a document (possibly a FAQ :) ) describing the limitations of erasure coded pools? Hopefully it would contain the when and how to use them as well.

Hi Chad, this Ceph Enterprise 1.2 FAQ provides a good overview: https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf

E.g. I read about people using replicated pools as a front end to erasure coded pools, but I don't know why they're deciding to do this, or how they are setting this up.

Unless you have a very specific use case, you don't want to interact directly with an EC pool, for a number of reasons, but here's one really good one: objects cannot be modified in place, so to speak; the OSDs have to read (at least) k chunks, compute the data, make the write, and recompute the updated object's erasure coding. This extra overhead probably makes EC unsuitable for block storage, whereas it might be OK for particularly read-dominated object storage. The replicated cache front-end/tier helps to service random IO bursts (in writeback mode) and keeps hot objects available to serve to clients without recomputing.

--
Cheers,
~Blairo
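To make that read-modify-write overhead concrete, here is a toy sketch using plain XOR parity (k=2 data chunks, m=1 parity chunk). This is not Ceph's actual jerasure plugin, just the simplest possible erasure code, but it shows why even a small overwrite forces the parity to be recomputed from the data chunks:

```python
def xor_parity(chunks):
    """Compute a single XOR parity chunk over equal-sized data chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# k=2 data chunks of an object
k_chunks = [b"AAAA", b"BBBB"]
parity = xor_parity(k_chunks)

# Overwriting even one byte of one chunk invalidates the parity:
# the OSDs must read the chunks and recompute it before the write
# is durable, which is the extra overhead described above.
k_chunks[0] = b"AAAX"
new_parity = xor_parity(k_chunks)
print(parity != new_parity)  # True
```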
Re: [ceph-users] Poor performance on all SSD cluster
On 24/06/14 23:39, Mark Nelson wrote: On 06/24/2014 03:45 AM, Mark Kirkwood wrote: On 24/06/14 18:15, Robert van Leeuwen wrote:

All of which means that Mysql performance (looking at you, binlog) may still suffer due to lots of small block size sync writes.

Which begs the question: is anyone running a reasonably busy Mysql server on Ceph-backed storage? We tried, and it did not perform well enough. We have a small ceph cluster: 3 machines with 2 SSD journals and 10 spinning disks each. Using ceph through kvm rbd we were seeing performance equal to about 1-2 spinning disks. Reading this thread, it now looks a bit as if there are inherent architecture and latency issues that would prevent it from performing great as a Mysql database store. I'd be interested in example setups where people are running busy databases on Ceph-backed volumes.

Yes indeed. We have looked extensively at Postgres performance on rbd, and while it is not Mysql, the underlying mechanism for durable writes (i.e. commit) is essentially very similar (fsync, fdatasync and friends). We achieved quite reasonable performance (by that I mean sufficiently encouraging to be happy to host real datastores for our moderately busy systems, and we are continuing to investigate using it for our really busy ones). I have not experimented extensively with the various choices of flush method (called sync method in Postgres, but the same idea), as we found quite good performance with the default (fdatasync). However, this is clearly an area that is worth investigating.

FWIW, I ran through the DBT-3 benchmark suite on MariaDB on top of qemu/kvm RBD with a 3x replication pool on 30 OSDs. I kept buffer sizes small to try to force disk IO and benchmarked against a local disk passed through to the VM. We were typically about 3-4x faster on queries than the local disk, but there were a couple of queries where we were slower. I didn't look at how multiple databases scaled, though.
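The durable-write path both databases rely on boils down to a write followed by fdatasync on the log file descriptor; each such sync is a small synchronous write that must reach the rbd device. A minimal sketch of that commit pattern (the "wal.log" filename is made up for illustration; fdatasync is POSIX, so this assumes a Unix-like host):

```python
import os
import tempfile

# Minimal model of a database commit: append a record to the log,
# then force it to stable storage with fdatasync (which flushes the
# data but not all file metadata, unlike fsync).
path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"commit record\n")
os.fdatasync(fd)  # the small sync write that hits the backing device
os.close(fd)
```

Many such tiny synced writes per second is exactly the workload where per-operation latency, rather than throughput, dominates.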
That may have its own benefits and challenges. I'm encouraged overall, though. It looks from your comments and from my own testing that it's possible to have at least passable performance with a single database, and potentially, as we reduce latency in Ceph, to make it even better. With multiple databases, it's entirely possible that we can do pretty well even now.

Yes - same kind of findings, specifically:

- random read and write (e.g. index access): faster than local disk
- sequential write (e.g. batch inserts): similar to or faster than local disk
- sequential read (e.g. table scan): slower than local disk

Regards

Mark