[ceph-users] Ceph cluster on AWS EC2 VMs using public ips
Hi Experts,

I need quick advice on deploying a Ceph cluster on AWS EC2 VMs.

1) I have two separate AWS accounts. I am trying to create the Ceph cluster in one account and a ceph client in the other, and connect them: (Account A: EC2 VMs + ceph client) ---public IP--- (Account B: EC2 + Ceph cluster (1 mon + 3 OSD VMs)).

2) I have allowed ALL inbound and outbound traffic, so there are no restrictions on either account.

3) I have configured my Ceph cluster based on the *public IPs* assigned to the EC2 instances. I know this is not recommended, but there is no other way I can contact the cluster from the other account's VM.

4) After adding the public IP to the cluster VMs, I am able to make my monitor run, but I am still not able to connect from the other account's ceph client. I used:

ifconfig eth0:0 52.24.62.171 netmask 255.255.255.0 up

and, in ceph.conf:

public network = 52.24.73.240/24

Is it at all possible to set up Ceph on AWS EC2 based on *public IPs*, so that my ceph client in another account can contact this cluster?

Thanks a lot,
sumit
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
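For reference, a minimal ceph.conf sketch for the setup described above (the IPs are the example addresses from the message, and the fsid is a placeholder, not values to copy). Note two things: the aliased address 52.24.62.171 does not fall inside the configured 52.24.73.240/24 network, which alone would keep the mon from binding; and, as far as I know, EC2 public IPs are NAT'd 1:1 and never appear on an instance interface, which is why an alias or VPN address is needed at all:

```ini
[global]
    fsid = <your-fsid>
    # This /24 must contain the address the mon actually binds to;
    # 52.24.62.171 is NOT inside 52.24.73.240/24.
    public network = 52.24.62.0/24

[mon.a]
    host = mon-a
    # mon addr must be an address present on a local interface; on EC2
    # the public IP is NAT'd, so the eth0:0 alias (as attempted above)
    # or a VPC-peering/VPN address is needed instead.
    mon addr = 52.24.62.171:6789
```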
Re: [ceph-users] Civet RadosGW S3 not storing complete objects; civetweb logs stop after rotation
Hey there,

Sorry for the delay, I have been moving apartments. UGH. Our dev team found a way to quickly identify the files that download at a smaller size: iterate through all of the objects in a bucket, take key.size for each item, and compare it to conn.get_bucket().get_key().size for the same key. Where the sizes differ, the keys correspond exactly to the objects that seem to have missing chunks in Ceph. The differences also always seem to be multiples of 512k, which is really odd.

== http://pastebin.com/R34wF7PB ==

My main question is: why are these sizes different at all? Shouldn't they be exactly the same? And why are they off by multiples of 512k?

Finally, I need a way to rule out that this is a Ceph issue, and the only way I can think of is grabbing a list of all of the data files and concatenating them together in order, in the hope that the manifest is wrong and I get the whole file. For example:

implicit size 7745820218
explicit size 7744771642
absolute 1048576
name = 86b6fad8-3c53-465f-8758-2009d6df01e9/TCGA-A2-A0T7-01A-21D-A099-09_IlluminaGA-DNASeq_exome.bam

I explicitly called one of the gateways and piped the output to a text file while downloading this bam: https://drive.google.com/file/d/0B16pfLB7yY6GcTZXalBQM3RHT0U/view?usp=sharing (25 MB of text)

As we can see above, Ceph is saying the size is 7745820218 bytes somewhere, but when we download the object we get a 7744771642 byte file. Finally, if I do a range request for all of the bytes from 7744771642 to the end, I get a "cannot complete request":

http://pastebin.com/CVvmex4m -- traceback of the python range request.
http://pastebin.com/4sd1Jc0G -- the radosgw log of the range request.

If I request the file with a shorter range (say 7744771642 - 2 bytes, i.e. from 7744771640), I am left with just a 2 byte file:

http://pastebin.com/Sn7Y0t9G -- range request of file - 2 bytes to end of file.
lacadmin@kh10-9:~$ ls -lhab 7gtest-range.bam
-rw-r--r-- 1 lacadmin lacadmin 2 Feb 24 01:00 7gtest-range.bam

I think rados-gw may not be keeping track of multipart chunk errors, possibly? How did rados get the original, correct file size, and why does it come up short when it returns the actual chunks? Finally, why are the corrupt/missing chunks always a multiple of 512K? I do not see anything obvious set to 512K on the configuration/user side. Sorry for the questions and babbling, but I am at a loss as to how to address this.

On 04/28/2015 05:03 PM, Yehuda Sadeh-Weinraub wrote:

----- Original Message -----
From: Sean seapasu...@uchicago.edu
To: ceph-users@lists.ceph.com
Sent: Tuesday, April 28, 2015 2:52:35 PM
Subject: [ceph-users] Civet RadosGW S3 not storing complete objects; civetweb logs stop after rotation

Hey yall! I have a weird issue and I am not sure where to look, so any help would be appreciated. I have a large Ceph Giant cluster that has been stable and healthy almost entirely since its inception. We have stored over 1.5PB into the cluster through RGW and everything seems to be functioning great. We have downloaded smaller objects without issue, but last night we did a test on our largest file (almost 1 terabyte) and it continuously times out at almost exactly the same place. Investigating further, it looks like civetweb/RGW is reporting that the uploads completed even though the objects are truncated -- at least, when we download the objects they seem to be truncated.

I tried searching the mailing list archives for what may be going on, but it looks like the mailing list DB may be going through some maintenance:

Unable to read word database file '/dh/mailman/dap/archives/private/ceph-users-ceph.com/htdig/db.words.db'

After checking through the gzipped logs, I see that civetweb just stops logging after a rotation for some reason, and my last log is from the 28th of March.
I tried manually running /etc/init.d/radosgw reload, but this didn't seem to work. As running the download again could take all day to error out, we instead used a range request to try to pull the missing bytes.

https://gist.github.com/MurphyMarkW/8e356823cfe00de86a48 -- the code we are using to download via S3/boto, as well as the returned size report and an overview of our issue.
http://pastebin.com/cVLdQBMF -- some of the log from the civetweb server they are hitting.
http://pastebin.com/2SGfSDYG -- our current config.
http://pastebin.com/3f6iJEbu -- current output of ceph health.

I am thinking this must be a civetweb/radosgw bug of some kind. My questions are: 1) Is there a way to try to download the object via rados directly? I am guessing I will need to find the prefix and then just cat all the pieces together and hope I get it right. 2) Why would Ceph say the upload went fine but then return a smaller object? Note that the returned http
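The size-comparison check described earlier in the thread can be sketched without touching a live cluster. The helper below compares the size a bucket listing reports against the size a per-key HEAD reports, and flags mismatches plus whether the delta is a multiple of 512 KiB. The sample sizes are the ones from the message; in a real run the two dicts would come from boto's bucket.list() and bucket.get_key() calls:

```python
# Sketch of the consistency check described above. With boto you would
# fill `listed` via {k.name: k.size for k in bucket.list()} and `headed`
# via bucket.get_key(name).size per key; the data here is a stand-in.

CHUNK = 512 * 1024  # 512 KiB, the granularity the thread keeps seeing

def find_mismatches(listed, headed):
    """Return {key: (listed_size, headed_size, delta_is_512k_multiple)}
    for every key whose two reported sizes disagree."""
    out = {}
    for name, lsize in listed.items():
        hsize = headed.get(name)
        if hsize is not None and hsize != lsize:
            delta = abs(lsize - hsize)
            out[name] = (lsize, hsize, delta % CHUNK == 0)
    return out

listed = {"good.bam": 1000, "bad.bam": 7745820218}
headed = {"good.bam": 1000, "bad.bam": 7744771642}

print(find_mismatches(listed, headed))
```

For the sizes reported in the message, the delta is 1048576 bytes, i.e. exactly two 512 KiB chunks, which matches the "absolute 1048576" line above.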
Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
On Fri, 1 May 2015, tuomas.juntu...@databasement.fi wrote:

Hi

I deleted the images and img pools and started the osd's; they still die. Here's a log of one of the osd's after this, if you need it: http://beta.xaasbox.com/ceph/ceph-osd.19.log

I've pushed another commit that should avoid this case, sha1 425bd4e1dba00cc2243b0c27232d1f9740b04e34. Note that once the pools are fully deleted (it shouldn't take too long once the osds are up and stabilized) you should switch back to the normal packages that don't have these workarounds.

sage

Br, Tuomas

Thanks man. I'll try it tomorrow. Have a good one. Br, T

-------- Original message --------
From: Sage Weil s...@newdream.net
Date: 30/04/2015 18:23 (GMT+02:00)
To: Tuomas Juntunen tuomas.juntu...@databasement.fi
Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

On Thu, 30 Apr 2015, tuomas.juntu...@databasement.fi wrote:

Hey

Yes, I can drop the images data. Do you think this will fix it?

It's a slightly different assert that (I believe) should not trigger once the pool is deleted. Please give that a try, and if you still hit it I'll whip up a workaround. Thanks!

sage

Br, Tuomas

On Wed, 29 Apr 2015, Tuomas Juntunen wrote:

Hi

I updated to that version and it seems that something did happen: the osd's stayed up for a while and 'ceph status' got updated. But then in a couple of minutes they all went down the same way. I have attached a new 'ceph osd dump -f json-pretty' and got a new log from one of the osd's with osd debug = 20: http://beta.xaasbox.com/ceph/ceph-osd.15.log

Sam mentioned that you had said earlier that this was not critical data? If not, I think the simplest thing is to just drop those pools. The important thing (from my perspective at least :) is that we understand the root cause and can prevent this in the future.

sage

Thank you! Br, Tuomas

-----Original Message-----
From: Sage Weil [mailto:s...@newdream.net]
Sent: 28 April 2015 23:57
To: Tuomas Juntunen
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

Hi Tuomas,

I've pushed an updated wip-hammer-snaps branch. Can you please try it? The build will appear here: http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286 (or a similar url; adjust for your distro). Thanks!

sage

On Tue, 28 Apr 2015, Sage Weil wrote:

[adding ceph-devel]

Okay, I see the problem. This seems to be unrelated to the giant -> hammer move... it's a result of the tiering changes you made:

ceph osd tier add img images --force-nonempty
ceph osd tier cache-mode images forward
ceph osd tier set-overlay img images

Specifically, --force-nonempty bypassed important safety checks:

1. images had snapshots (and removed_snaps).
2. images was added as a tier *of* img, and img's removed_snaps was copied to images, clobbering images' removed_snaps value (see OSDMap::Incremental::propagate_snaps_to_tiers).
3. The tiering relation was undone, but removed_snaps was still gone.
4. On OSD startup, when we load the PG, removed_snaps is initialized from the older map. Later, in PGPool::update(), we assume that removed_snaps always grows (never shrinks) and we trigger an assert.

To fix this I think we need to do two things:

1. Make the OSD forgiving about removed_snaps getting smaller. This is probably a good thing anyway: once we know snaps are removed on all OSDs we can prune the interval_set in the OSDMap. Maybe.
2. Fix the mon to prevent this from happening, *even* when --force-nonempty is specified. (This is the root cause.)

I've opened http://tracker.ceph.com/issues/11493 to track this.

sage

The idea was to make images a tier of img, move the data to img, then change clients to use the new img pool.

Br, Tuomas

Can you explain exactly what you mean by: "Also I created one pool for tier to be able to move data without outage."?

-Sam

----- Original Message -----
From: tuomas juntunen tuomas.juntu...@databasement.fi
To: Ian Colle ico...@redhat.com
Cc: ceph-users@lists.ceph.com
Sent: Monday, April 27, 2015 4:23:44 AM
Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
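The failure mode Sage describes in step 4 can be shown with a toy model (this is illustrative Python, not actual Ceph code): the OSD treats the pool's removed_snaps as monotonically growing, so a map in which the set has shrunk trips the assertion.

```python
# Toy illustration of the removed_snaps invariant described above.
# Real Ceph keeps removed_snaps as an interval_set in the OSDMap and
# asserts in PGPool::update() that it never shrinks; a plain set of
# snap ids is enough to show the failure mode.

class PGPoolModel:
    def __init__(self):
        self.removed_snaps = set()

    def update(self, new_removed_snaps):
        # The code assumes removed snaps only ever accumulate...
        assert new_removed_snaps >= self.removed_snaps, \
            "removed_snaps shrank -- the invariant the OSD asserts on"
        self.removed_snaps = set(new_removed_snaps)

pool = PGPoolModel()
pool.update({1, 2, 3})   # snaps 1-3 removed: fine, the set grew

# ...but --force-nonempty tiering clobbered the tier's removed_snaps
# with the base pool's value, so the next map appears to "shrink":
try:
    pool.update({1})
    print("no assert")
except AssertionError:
    print("assert tripped, osd would crash here")
```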
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Hello,

On Fri, 1 May 2015 12:03:59 -0400 Anthony Levesque wrote:

From what I read in some of the topics, is it you guys' opinion that Ceph cannot scale nicely on a full SSD cluster? Meaning that no matter how many OSD nodes we add, at some point you won't be able to scale past some throughput.

No, that's not what at least I'm saying at all. Ceph scales quite well, much better than some other distributed storage solutions. The more nodes and/or OSDs, the better. However, those nodes need to be balanced and well designed; your original try with the 1TB EVOs was limited by those SSDs. Having 16 (fast) SSDs per node is going to be limited by the CPU resources needed to handle the potential IOPS they're capable of. Your network might be another limiting factor at some point. The exercise with Ceph is to deploy well balanced storage nodes, where "well" is the closest fit to your IOPS needs, budget and other constraints (rack space, power).

Christian

---
Anthony Lévesque
GloboTech Communications
Phone: 1-514-907-0050 x 208
Toll Free: 1-(888)-GTCOMM1 x 208
Phone Urgency: 1-(514) 907-0047
1-(866)-500-1555
Fax: 1-(514)-907-0750
aleves...@gtcomm.net
http://www.gtcomm.net

On Apr 30, 2015, at 9:32 PM, Christian Balzer ch...@gol.com wrote:

On Thu, 30 Apr 2015 18:01:44 -0400 Anthony Levesque wrote:

I'm planning to set up 4-6 POCs in the next 2 weeks to test various scenarios here. I'm checking to get POCs with the S3610, S3710, and P3500 (which seems to be new; I know the lifespan is lower), and maybe the P3700.

Don't ignore the S3700: it is faster in sequential writes than the 3710 because it uses older, less dense flash modules, and thus has more parallelism. And with Ceph, especially when looking at the journals, you will hit the max sequential write speed limit of the SSD long, long before you'll hit the IOPS limit, both due to the nature of journal writes and the little detail that you'll hit the CPU performance wall before that.
The speed of the 400GB P3500 seems very nice and the price is alright. The major differences will be the durability between the P3700 and P3500, and the IOPS.

Read the link below about write amplification; that is something that happens mostly on the OSD side, which in your case of 1TB EVOs is already a scary prospect in my book.

In both options they are the models with the lowest price per MB/s when compared to the S series.

Price per MB/s is a good start, but don't forget to factor in TBW/$ and try to estimate what write loads your cluster will see. But all of this is irrelevant if your typical write patterns exceed your CPU resources while your SSDs are bored. For example, this fio in a VM here:

---
# fio --size=4G --ioengine=libaio --invalidate=1 --direct=0 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32

write: io=1381.4MB, bw=16364KB/s, iops=4091, runt= 86419msec
---

will utilize all 8 3.1 GHz cores here, on a 3 node firefly cluster with 8 HDD OSDs and 4 journal SSDs (100GB S3700) per node, while the journal SSDs are at 11% and the OSD HDDs at 30-40% utilization. When changing that fio to direct=1, the IOPS drop to half of that. With a block size of 4MB things of course change: the OSDs are 100% busy, the SSDs about 60% (they can only do 200MB/s), and 3-4 cores' worth sit idle or in IOwait.

Model      Capacity   Price per MB/s
DC S3500   120GB      $1.10
           240GB      $1.01
           300GB      $1.03
           480GB      $1.28
DC S3610   200GB      $0.99
           400GB      $1.14
           480GB      $1.24
DC S3710   200GB      $1.17
DC P3500   400GB      $0.64
DC P3700   400GB      $0.96

As a side note, the expense doesn't scare me directly. It's more that we are going in blind here, since it seems not a lot of people do full SSD setups (or share their experiences).

See this: http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

I'd suggest you try the above tests yourself; you seem to have a significant amount of hardware already.
There are many SSD threads, but so far there's at best one example of a setup going from Firefly to Giant and Hammer. So for me it's hard to qualify and quantify the improvements Hammer brings to SSD based clusters, other than "better, maybe about 50%". Which, while significant, is obviously nowhere near the raw performance the hardware would be capable of. But then again, my guesstimate is that aside from the significant amount of code that gets executed per Ceph IOP, any such Ceph IOP results in 5-10 real IOPs down the line.

Christian

Anyway, still brainstorming this so we can work on some POCs. Will keep you guys posted here.

---
Anthony Lévesque

On Apr 29, 2015, at 11:27 PM, Christian Balzer ch...@gol.com wrote:

Hello,

On Wed, 29 Apr 2015 15:01:49 -0400 Anthony Levesque wrote:

We redid the test with 4MB
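As an aside, the fio numbers Christian quotes are internally consistent, which is a useful sanity check when reading such reports: bandwidth should equal IOPS times block size, and total IO should equal bandwidth times runtime. Pure arithmetic, no cluster needed:

```python
# Sanity-check the fio result quoted above:
#   write: io=1381.4MB, bw=16364KB/s, iops=4091, runt=86419msec
bs_kb = 4          # --blocksize=4K
iops = 4091
bw_kb_s = 16364
runt_s = 86.419    # 86419 msec

# bandwidth == iops * block size
print(iops * bs_kb)             # -> 16364, matches bw=16364KB/s
# total io == bandwidth * runtime (in MB)
print(bw_kb_s * runt_s / 1024)  # ~1381 MB, matches io=1381.4MB
```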
Re: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
Thanks, I'll do this when the commit is available and report back. And indeed, I'll change to the official packages after everything is ok.

Br, Tuomas

On Fri, 1 May 2015, tuomas.juntu...@databasement.fi wrote:

Hi

I deleted the images and img pools and started the osd's; they still die. Here's a log of one of the osd's after this, if you need it: http://beta.xaasbox.com/ceph/ceph-osd.19.log

I've pushed another commit that should avoid this case, sha1 425bd4e1dba00cc2243b0c27232d1f9740b04e34. Note that once the pools are fully deleted (it shouldn't take too long once the osds are up and stabilized) you should switch back to the normal packages that don't have these workarounds.

sage

Br, Tuomas

Thanks man. I'll try it tomorrow. Have a good one. Br, T

-------- Original message --------
From: Sage Weil s...@newdream.net
Date: 30/04/2015 18:23 (GMT+02:00)
To: Tuomas Juntunen tuomas.juntu...@databasement.fi
Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

On Thu, 30 Apr 2015, tuomas.juntu...@databasement.fi wrote:

Hey

Yes, I can drop the images data. Do you think this will fix it?

It's a slightly different assert that (I believe) should not trigger once the pool is deleted. Please give that a try, and if you still hit it I'll whip up a workaround. Thanks!

sage

Br, Tuomas

On Wed, 29 Apr 2015, Tuomas Juntunen wrote:

Hi

I updated to that version and it seems that something did happen: the osd's stayed up for a while and 'ceph status' got updated. But then in a couple of minutes they all went down the same way. I have attached a new 'ceph osd dump -f json-pretty' and got a new log from one of the osd's with osd debug = 20: http://beta.xaasbox.com/ceph/ceph-osd.15.log

Sam mentioned that you had said earlier that this was not critical data? If not, I think the simplest thing is to just drop those pools.
Re: [ceph-users] Ceph Fuse Crashed when Reading and How to Backup the data
On 30/04/2015 09:21, flisky wrote:

When I read the file through ceph-fuse, the process crashed. Here is the log:

terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7fe0814d3700
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: (()+0x249805) [0x7fe08670b805]
2: (()+0x10d10) [0x7fe085c39d10]
3: (gsignal()+0x37) [0x7fe0844d3267]
4: (abort()+0x16a) [0x7fe0844d4eca]
5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fe084de706d]
6: (()+0x5eee6) [0x7fe084de4ee6]
7: (()+0x5ef31) [0x7fe084de4f31]
8: (()+0x5f149) [0x7fe084de5149]
9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned int, unsigned int)+0x24b) [0x7fe08688993b]
10: (ObjectCacher::_readx(ObjectCacher::OSDRead*, ObjectCacher::ObjectSet*, Context*, bool)+0x1423) [0x7fe0866c6b73]
11: (ObjectCacher::C_RetryRead::finish(int)+0x20) [0x7fe0866cd870]
12: (Context::complete(int)+0x9) [0x7fe086687eb9]
13: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7fe0866ca73c]
14: (ObjectCacher::bh_read_finish(long, sobject_t, unsigned long, long, unsigned long, ceph::buffer::list&, int, bool)+0x29e) [0x7fe0866bfd2e]
15: (ObjectCacher::C_ReadFinish::finish(int)+0x7f) [0x7fe0866cc85f]
16: (Context::complete(int)+0x9) [0x7fe086687eb9]
17: (C_Lock::finish(int)+0x29) [0x7fe086688269]
18: (Context::complete(int)+0x9) [0x7fe086687eb9]
19: (Finisher::finisher_thread_entry()+0x1b4) [0x7fe08671f184]
20: (()+0x76aa) [0x7fe085c306aa]
21: (clone()+0x6d) [0x7fe0845a4eed]

This part may be interesting:

-11 2015-04-30 15:55:59.063828 7fd6a816c700 10 -- 172.30.11.188:0/10443 >> 172.16.3.153:6820/1532355 pipe(0x7fd6740344c0 sd=8 :58596 s=2 pgs=3721 cs=1 l=1 c=0x7fd674038760).reader got message 1 0x7fd65c001940 osd_op_reply(1 119. [read 0~4390] v0'0 uv0 ack = -1 ((1) Operation not permitted)) v6
-10 2015-04-30 15:55:59.063848 7fd6a816c700 1 -- 172.30.11.188:0/10443 <== osd.9 172.16.3.153:6820/1532355 1 ==== osd_op_reply(1 119. [read 0~4390] v0'0 uv0 ack = -1 ((1) Operation not permitted)) v6 ==== 187+0+0 (689339676 0 0) 0x7fd65c001940 con 0x7fd674038760

And the cephfs journal seems okay. Could anyone tell me why this is happening?

Hmm, the backtrace is the same as http://tracker.ceph.com/issues/11510 -- this isn't the same cluster by any chance?

John
Re: [ceph-users] Quick question - version query
ceph --admin-daemon path/to/admin/socket version

On Fri, May 1, 2015 at 10:44 AM, Tony Harris neth...@gmail.com wrote:

Hi all,

I feel a bit like an idiot at the moment. I know there is a command to query the monitor and OSD daemons for their version, but I can't remember what it is to save my life, and I'm having trouble locating it in the docs. I need to make sure the entire cluster is running 0.94.1 at this point.

-Tony
[ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Is there any way to confirm (beforehand) that using SSDs for journals will help?

We're seeing very disappointing Ceph performance. We have 10GigE interconnect (as a shared public/internal network). We're wondering whether it makes sense to buy SSDs and put journals on them, but we're looking for a way to verify that this will actually help BEFORE we splash cash on SSDs.

The problem is that the way we have things configured now, with journals on spinning HDDs (shared with the OSDs as the backend storage), apart from the slow read/write performance to Ceph I already mentioned, we're also seeing fairly low disk utilization on the OSDs. This low utilization suggests that the journals are not really used to their max, which raises the question of whether buying SSDs for journals will help: it suggests the bottleneck is NOT the disk. But, yeah, we cannot really confirm that.

Our typical data access use case is a lot of small random reads/writes. We're doing a lot of rsyncing (of entire regular Linux filesystems) from one VM to another. We're using Ceph for OpenStack storage (kvm). Enabling the RBD cache didn't really help all that much.

So, is there any way to confirm beforehand that using SSDs for journals will help in our case?

Kind Regards,
Piotr
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Thanks for your answer, Nick.

Typically it's a single rsync session at a time (sometimes two, but rarely more concurrently). So it's a single ~5GB typical Linux filesystem going from one random VM to another random VM.

Apart from using the RBD cache, is there any other way to improve the overall performance of such a use case in a Ceph cluster? In theory I guess we could always tarball it and rsync the tarball, thus effectively using sequential IO rather than random, but that's simply not feasible for us at the moment. Any other ways?

Side question: does using the RBD cache impact the way data is stored on the client? (e.g. a write call returning after data has been written to the journal (fast) vs written all the way to the OSD data store (slow)). I'm guessing it's always the first one, regardless of whether the client uses the RBD cache or not, right? My logic here is that otherwise that would imply that clients can impact the way OSDs behave, which could be dangerous in some situations.

Kind Regards,
Piotr

On Fri, May 1, 2015 at 10:59 AM, Nick Fisk n...@fisk.me.uk wrote:

How many rsyncs are you doing at a time? If it is only a couple, you will not be able to take advantage of the full number of OSDs, as each block of data is only located on 1 OSD (not including replicas). When you look at disk statistics you are seeing an average over time, so it will look like the OSDs are not very busy, when in fact each one is busy for a very brief period.

SSD journals will help your write latency, probably going down from around 15-30ms to under 5ms.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr Wachowicz
Sent: 01 May 2015 09:31
To: ceph-users@lists.ceph.com
Subject: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

Is there any way to confirm (beforehand) that using SSDs for journals will help? We're seeing very disappointing Ceph performance. We have 10GigE interconnect (as a shared public/internal network).
[rest of quoted message snipped]
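Nick's latency numbers translate directly into single-threaded throughput: at queue depth 1, each write must wait for the previous one to complete, so IOPS is roughly 1 / latency. A back-of-the-envelope model (the latencies are the ones quoted above; the 4 KiB block size is an assumption for illustration):

```python
# Rough model of why journal latency dominates a single rsync stream:
# at queue depth 1, the client can only issue one write per round trip.

def single_stream_throughput(latency_s, block_kib=4):
    iops = 1.0 / latency_s          # one op completes per latency period
    return iops, iops * block_kib   # (IOPS, KiB/s)

for name, lat in [("HDD journal, 30ms", 0.030),
                  ("HDD journal, 15ms", 0.015),
                  ("SSD journal,  5ms", 0.005)]:
    iops, kib_s = single_stream_throughput(lat)
    print(f"{name}: ~{iops:.0f} IOPS, ~{kib_s:.0f} KiB/s at 4 KiB writes")
```

Going from 30ms to 5ms is thus roughly a 6x improvement for a serial writer, independent of how many OSDs the cluster has.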
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
On 01-05-15 11:42, Nick Fisk wrote:

Yeah, that's your problem: doing a single-threaded rsync when you have quite poor write latency will not be quick. SSD journals should give you a fair performance boost; otherwise you need to coalesce the writes at the client so that Ceph is given bigger IOs at higher queue depths.

Exactly. Ceph doesn't excel at serial I/O streams like these; it performs best when I/O is done in parallel. So if you can figure out a way to run multiple rsyncs at the same time, you might see a great performance boost. That way all the OSDs can process I/O instead of one at a time.

The RBD cache can help here, as can potentially FS tuning to buffer more aggressively. If writeback RBD cache is enabled, data will be buffered by RBD until a sync is called by the client, so data loss can occur during this period if the app is not issuing fsyncs properly. Once a sync is called, data is flushed to the journals and then later to the actual OSD store.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr Wachowicz
Sent: 01 May 2015 10:14
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?

Thanks for your answer, Nick.

Typically it's a single rsync session at a time (sometimes two, but rarely more concurrently). So it's a single ~5GB typical Linux filesystem from one random VM to another random VM. Apart from using the RBD cache, is there any other way to improve the overall performance of such a use case in a Ceph cluster? In theory I guess we could always tarball it and rsync the tarball, thus effectively using sequential IO rather than random, but that's simply not feasible for us at the moment. Any other ways?

Side question: does using the RBD cache impact the way data is stored on the client? (e.g. a write call returning after data has been written to the journal (fast) vs written all the way to the OSD data store (slow)).
[rest of quoted thread snipped]
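Wido's suggestion to run multiple rsyncs concurrently can be prototyped with any parallel copier. The sketch below uses a thread pool to run several copy jobs at once; shutil.copytree is only a stand-in for rsync, and the directory layout is a made-up example, but the shape is the same: independent I/O streams in flight simultaneously, so different OSDs are busy at the same time.

```python
# Parallel-copy sketch: several copy jobs in flight at once, the same
# idea as running multiple rsyncs concurrently. shutil.copytree is a
# stand-in for rsync; real use would shell out to rsync per directory.
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_tree(src: Path, dst: Path) -> str:
    shutil.copytree(src, dst)
    return f"{src.name} done"

# Build a few hypothetical source trees (stand-ins for the top-level
# directories you would hand to one rsync process each).
work = Path(tempfile.mkdtemp())
sources = []
for i in range(4):
    d = work / f"src{i}"
    d.mkdir()
    (d / "file.txt").write_text(f"payload {i}\n")
    sources.append(d)

# Four copies in flight at once instead of one serial stream.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(copy_tree,
                            sources,
                            [work / f"dst{i}" for i in range(4)]))

print(sorted(results))
```

With real rsync, the equivalent is one rsync process per top-level directory, capped at a handful of workers.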
[ceph-users] Ceph hammer rgw : unable to create bucket
I have a fresh install of the Ceph Hammer version 0.94.1. I am facing problems while configuring the Rados gateway. I want to map specific users to specific pools. For this I followed these links: (1) http://comments.gmane.org/gmane.comp.file-systems.ceph.user/4992 (2) http://cephnotes.ksperis.com/blog/2014/11/28/placement-pools-on-rados-gw I followed the methods mentioned there. The problem I am facing is that I am not able to create a new bucket in federated mode. Also, at one point I was able to create a bucket with a dot prefix, but that also stopped working after a restart of rgw. At present it works with the default .rgw.bucket pool but not with other pools. Any help in this regard is appreciated ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
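For reference, the approach described in the cephnotes post boils down to declaring a placement target in the region, matching pools in the zone, and a default_placement on the user. A minimal sketch of the zone-side fragment (the "custom-placement" key and pool names here are illustrative, not from the original mail):

```json
{
  "placement_pools": [
    {
      "key": "custom-placement",
      "val": {
        "index_pool": ".rgw.custom.index",
        "data_pool": ".rgw.custom.buckets",
        "data_extra_pool": ".rgw.custom.extra"
      }
    }
  ]
}
```

The region needs a matching entry in its placement_targets list, the pools must actually exist (ceph osd pool create), and the user's default_placement is edited via radosgw-admin metadata get/put on the user object, followed by an rgw restart. A mismatch between any of these three pieces typically produces exactly the "cannot create bucket after restart" symptom described above.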
Re: [ceph-users] Ceph Fuse Crashed when Reading and How to Backup the data
It turns out to be a permission problem. When I change to ceph.admin, I can read the file, and the file content seems to be garbage. Best regards, On 2015年05月01日 02:07, Gregory Farnum wrote: The not permitted bit usually means that your client doesn't have access permissions to the data pool in use. I'm not sure why it would be getting aborted without any output though — is there any traceback at all in the logs? A message about the OOM-killer zapping it or something? -Greg On Thu, Apr 30, 2015 at 1:45 AM, flisky yinjif...@lianjia.com wrote: Sorry, I cannot reproduce the Operation not permitted log. Here is a small portion of the log with debug_client = 20/20 == -22 2015-04-30 16:29:12.858309 7fe9757f2700 10 client.58272 check_caps on 115.head(ref=2 ll_ref=10 cap_refs={} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) wanted pFscr used - is_delayed=1 -21 2015-04-30 16:29:12.858326 7fe9757f2700 10 client.58272 cap mds.0 issued pAsLsXsFscr implemented pAsLsXsFscr revoking - -20 2015-04-30 16:29:12.858333 7fe9757f2700 10 client.58272 send_cap 115.head(ref=2 ll_ref=10 cap_refs={} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) mds.0 seq 1 used - want pFscr flush - retain pAsxLsxXsxFsxcrwbl held pAsLsXsFscr revoking - dropping - -19 2015-04-30 16:29:12.858358 7fe9757f2700 15 client.58272 auth cap, setting max_size = 0 -18 2015-04-30 16:29:12.858368 7fe9757f2700 10 client.58272 _create_fh 115 mode 1 -17 2015-04-30 16:29:12.858376 7fe9757f2700 20 client.58272 trim_cache size 14 max 16384 -16 2015-04-30 16:29:12.858378 7fe9757f2700 3 client.58272 ll_open 115.head 32768 = 0 (0x7fe95c0052c0) -15 2015-04-30 16:29:12.858385 7fe9757f2700 3 client.58272 ll_forget 115 1 -14 2015-04-30 16:29:12.858386 7fe9757f2700 20 client.58272 
_ll_put 0x7fe968021c30 115 1 - 9 -13 2015-04-30 16:29:12.858500 7fe974ff1700 20 client.58272 _ll_get 0x7fe968021c30 115 - 10 -12 2015-04-30 16:29:12.858503 7fe974ff1700 3 client.58272 ll_getattr 115.head -11 2015-04-30 16:29:12.858505 7fe974ff1700 10 client.58272 _getattr mask pAsLsXsFs issued=1 -10 2015-04-30 16:29:12.858509 7fe974ff1700 10 client.58272 fill_stat on 115 snap/devhead mode 0100664 mtime 2015-04-20 14:14:57.961482 ctime 2015-04-20 14:14:57.960359 -9 2015-04-30 16:29:12.858518 7fe974ff1700 3 client.58272 ll_getattr 115.head = 0 -8 2015-04-30 16:29:12.858525 7fe974ff1700 3 client.58272 ll_forget 115 1 -7 2015-04-30 16:29:12.858526 7fe974ff1700 20 client.58272 _ll_put 0x7fe968021c30 115 1 - 9 -6 2015-04-30 16:29:12.858536 7fe9577fe700 3 client.58272 ll_read 0x7fe95c0052c0 115 0~4096 -5 2015-04-30 16:29:12.858539 7fe9577fe700 10 client.58272 get_caps 115.head(ref=3 ll_ref=9 cap_refs={} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) have pAsLsXsFscr need Fr want Fc but not Fc revoking - -4 2015-04-30 16:29:12.858561 7fe9577fe700 10 client.58272 _read_async 115.head(ref=3 ll_ref=9 cap_refs={2048=1} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[115 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) 0~4096 -3 2015-04-30 16:29:12.858575 7fe9577fe700 10 client.58272 max_byes=0 max_periods=4 -2 2015-04-30 16:29:12.858692 7fe9577fe700 5 client.58272 get_cap_ref got first FILE_CACHE ref on 115.head(ref=3 ll_ref=9 cap_refs={1024=0,2048=1} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[115 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) -1 2015-04-30 16:29:12.867657 7fe9797fa700 10 client.58272 ms_handle_connect on 172.16.3.149:6823/982446 0 2015-04-30 
16:29:12.872787 7fe97bfff700 -1 *** Caught signal (Aborted) ** On 2015年04月30日 16:21, flisky wrote: When I read the file through the ceph-fuse, the process crashed. Here is the log - terminate called after throwing an instance of 'ceph::buffer::end_of_buffer' what(): buffer::end_of_buffer *** Caught signal (Aborted) ** in thread 7fe0814d3700 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff) 1: (()+0x249805) [0x7fe08670b805] 2: (()+0x10d10) [0x7fe085c39d10] 3: (gsignal()+0x37) [0x7fe0844d3267] 4: (abort()+0x16a) [0x7fe0844d4eca] 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d)
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
How many rsyncs are you doing at a time? If it is only a couple, you will not be able to take advantage of the full number of OSDs, as each block of data is only located on one OSD (not including replicas). When you look at disk statistics you are seeing an average over time, so it will look like the OSDs are not very busy, when in fact each one is busy for a very brief period. SSD journals will help your write latency, probably going down from around 15-30ms to under 5ms. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr Wachowicz Sent: 01 May 2015 09:31 To: ceph-users@lists.ceph.com Subject: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? Is there any way to confirm (beforehand) that using SSDs for journals will help? We're seeing very disappointing Ceph performance. We have a 10GigE interconnect (as a shared public/internal network). We're wondering whether it makes sense to buy SSDs and put journals on them. But we're looking for a way to verify that this will actually help BEFORE we splash cash on SSDs. The problem is that the way we have things configured now, with journals on spinning HDDs (shared with OSDs as the backend storage), apart from the slow read/write performance to Ceph I already mentioned, we're also seeing fairly low disk utilization on OSDs. This low disk utilization suggests that journals are not really used to their max, which begs the question whether buying SSDs for journals will help. This kind of suggests that the bottleneck is NOT the disk. But, yeah, we cannot really confirm that. Our typical data access use case is a lot of small random reads/writes. We're doing a lot of rsyncing (entire regular Linux filesystems) from one VM to another. We're using Ceph for OpenStack storage (kvm). Enabling the RBD cache didn't really help all that much. So, is there any way to confirm beforehand that using SSDs for journals will help in our case? 
Kind Regards, Piotr
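One way to sanity-check the 15-30ms vs. under-5ms latency numbers before buying SSDs is to measure small synchronous write latency on the journal device directly with fio. A sketch of a job file, assuming a spare test device (the /dev/sdX path is a placeholder — this job overwrites the device, so never point it at a disk holding data):

```
; journal-latency.fio -- approximate the journal's IO pattern:
; small synchronous writes at queue depth 1
[journal-latency]
filename=/dev/sdX
rw=randwrite
bs=4k
iodepth=1
numjobs=1
direct=1
sync=1
runtime=30
time_based
```

Run it with `fio journal-latency.fio` on the current HDDs and on a candidate SSD, and compare the completion-latency percentiles fio reports; if the HDD's sync-write latency dominates your observed client latency, SSD journals should help.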
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Also remember to drive your Ceph cluster as hard as you have the means to, e.g. tuning the VM OSes/IO subsystems: using multiple RBD devices per VM (to issue more outstanding IOPS from the VM IO subsystem), the best IO scheduler, CPU power + memory per VM; also ensure low network latency + bandwidth between your rsyncing VMs, etc. On 01/05/2015, at 11.13, Piotr Wachowicz piotr.wachow...@brightcomputing.com wrote: Thanks for your answer, Nick. Typically it's a single rsync session at a time (sometimes two, but rarely more concurrently). So it's a single ~5GB typical Linux filesystem from one random VM to another random VM. Apart from using the RBD cache, is there any other way to improve the overall performance of such a use case in a Ceph cluster? In theory I guess we could always tarball it and rsync the tarball, thus effectively using sequential IO rather than random. But that's simply not feasible for us at the moment. Any other ways? Side question: does using the RBD cache impact the way data is stored on the client? (e.g. a write call returning after data has been written to the journal (fast) vs. written all the way to the OSD data store (slow)). I'm guessing it's always the first one, regardless of whether the client uses the RBD cache or not, right? My logic here is that otherwise it would imply that clients can impact the way OSDs behave, which could be dangerous in some situations. Kind Regards, Piotr On Fri, May 1, 2015 at 10:59 AM, Nick Fisk n...@fisk.me.uk wrote: [...]
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Yeah, that's your problem: doing a single-threaded rsync when you have quite poor write latency will not be quick. SSD journals should give you a fair performance boost; otherwise you need to coalesce the writes at the client so that Ceph is given bigger IOs at higher queue depths. The RBD cache can help here, as well as potentially FS tuning to buffer more aggressively. If writeback RBD cache is enabled, data will be buffered by RBD until a sync is called by the client, so data loss can occur during this period if the app is not issuing fsyncs properly. Once a sync is called, data is flushed to the journals and then later to the actual OSD store. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr Wachowicz Sent: 01 May 2015 10:14 To: Nick Fisk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? Thanks for your answer, Nick. [...]
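Nick's point about coalescing writes at the client can be illustrated with a toy sketch (plain Python, nothing Ceph-specific; the merge logic is a deliberate simplification of what a writeback cache does):

```python
# Toy illustration of write coalescing: a writeback cache merges
# adjacent dirty extents and only sends them to the backend on
# flush/fsync, so the OSDs see far fewer, larger IOs than the
# client issued.

def coalesce(writes):
    """Merge adjacent or overlapping (offset, length) extents."""
    merged = []
    for off, length in sorted(writes):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # extends or overlaps the previous extent -> grow it
            last_off, last_len = merged[-1]
            new_end = max(last_off + last_len, off + length)
            merged[-1] = (last_off, new_end - last_off)
        else:
            merged.append((off, length))
    return merged

# 1024 sequential 4 KiB writes, roughly how rsync writing a file
# might dirty a cached RBD image
writes = [(i * 4096, 4096) for i in range(1024)]
flushed = coalesce(writes)
print(len(writes), "client writes ->", len(flushed), "backend IO(s)")
# prints: 1024 client writes -> 1 backend IO(s)
```

With random, non-adjacent writes the cache merges far less, which is why the rsync workload above still hurts without SSD journals.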
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Piotr, you may also investigate whether a cache tier made of a couple of SSDs could help you. Not sure how the data is used in your company, but if you have a bunch of hot data that moves around from one VM to another, it might greatly speed up the rsync. On the other hand, if a lot of the rsync data is cold, it might have an adverse effect on performance. As a test, you could try to create a small pool with a couple of SSDs in a cache tier on top of your spinning OSDs. You don't need to purchase tons of SSDs in advance; as a test case, I would suggest 2-4 SSDs in a cache tier should be okay for the PoC. Andrei - Original Message - From: Nick Fisk n...@fisk.me.uk To: Piotr Wachowicz piotr.wachow...@brightcomputing.com Cc: ceph-users@lists.ceph.com Sent: Friday, 1 May, 2015 10:42:12 AM Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? [...]
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Hi, On 01.05.2015 10:30, Piotr Wachowicz wrote: Is there any way to confirm (beforehand) that using SSDs for journals will help? Yes, SSD journals help a lot (if you use the right SSDs) for write speed, and in my experience this also helped (but not too much) with read performance. We're seeing very disappointing Ceph performance. We have 10GigE interconnect (as a shared public/internal network). Which kind of CPU do you use for the OSD hosts? We're wondering whether it makes sense to buy SSDs and put journals on them. But we're looking for a way to verify that this will actually help BEFORE we splash cash on SSDs. I can recommend the Intel DC S3700 SSD for journaling! In the beginning I started with different, much cheaper models, but this was the wrong decision. The problem is that the way we have things configured now, with journals on spinning HDDs (shared with OSDs as the backend storage), apart from the slow read/write performance to Ceph I already mentioned, we're also seeing fairly low disk utilization on OSDs. This low disk utilization suggests that journals are not really used to their max, which begs the question whether buying SSDs for journals will help. This kind of suggests that the bottleneck is NOT the disk. But, yeah, we cannot really confirm that. Our typical data access use case is a lot of small random reads/writes. We're doing a lot of rsyncing (entire regular Linux filesystems) from one VM to another. We're using Ceph for OpenStack storage (kvm). Enabling the RBD cache didn't really help all that much. The read speed can be optimized with a bigger read-ahead cache inside the VM, like: echo 4096 > /sys/block/vda/queue/read_ahead_kb Udo
[ceph-users] Radosgw agent and federated config problems
We run a ceph cluster with radosgw on top of it. During the installation we never specified any regions or zones, which means that every bucket currently resides in the default region. To support a federated config we have built a test cluster that replicates the current production setup with the same default region and zone. Once that setup was running we went through the following steps to make the switch to a federated config. Our second zone is completely empty to begin with and has no data in it at this point. 1) We created a new region that includes the api_name, master_zone and endpoints for our two zones. 2) We created two users in zone1 and zone2 with the same access and secret key across the two zones. 3) We created two zones with default pools and specified the access and secret key. 4) We changed ceph.conf to include the new region and zone and pushed it to our nodes. 5) The default region was set to our new region through radosgw-admin and the default was removed. 6) The regionmap was updated to reflect the changes we made to our regions. This last step proved to be a little difficult, as radosgw-admin regionmap update returns: 7f7b36b7b840 -1 cannot update region map, master_region conflict The master_region is set to 'ams' in both clusters. It may be that we run into issues later on because we have solved this the 'hard way' by changing the regionmap manually. 7) As we changed our region and zones, we restarted radosgw. As expected this takes our objects offline. 8) We updated all buckets to sit in the new region. After our buckets were changed, all of our objects were back online again. We have not made any changes to our pools. The new region points to the existing pool, so this never resulted in any physical movement of data. Once this was all done the cluster was up and running, as expected, but serving its content from the new zone. 
At this point we set up radosgw-agent with the users from steps 2 and 3 matching our zones. The first time we did this we ran into a couple of problems. The first was that the radosgw-agent that's available in the repository is a little older than the one on github. This version lacks a lot of exception handling and proper error output, making it difficult to diagnose issues as they come up. We've switched to the latest available version from github, which has helped us a lot to get where we are now. We had to switch radosgw from sockets to tcp first, but the manual didn't include a specific parameter, which led to radosgw not being able to handle /-characters properly. Once we added AllowEncodedSlashes it all magically worked. As it took us quite some time and fiddling around to get to this point, we wanted to replicate the exact same situation on another test environment again to make sure we know what to do when we change this in a live environment. And this is where it all fails. We are unable to get this setup back up again. We've compared configurations and checked every single setting we've played with, but we're unable to find what's going wrong. The error message is pretty obvious though: 2015-04-24 15:37:55,073 9406 [radosgw_agent.worker][DEBUG ] syncing object object/test.txt 2015-04-24 15:37:55,089 9406 [radosgw_agent.worker][DEBUG ] object object/test.txt not found on master, deleting from secondary I was expecting to find this entry in our Apache log files. Surely it would trigger a 404. It turns out though that we're not seeing any log entries at all. It's not being found at all. 
Though when I look at the logs in zone2 I see the following: [24/Apr/2015:15:45:01 +] PUT /object/test.txt?rgwx-op-id=radosgw1%3A9727%3A1rgwx-source-zone=zone1rgwx-client-id=radosgw-agent HTTP/1.1 404 242 - Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic [24/Apr/2015:15:45:01 +] GET /object/?max-keys=0 HTTP/1.1 200 408 - Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic [24/Apr/2015:15:45:01 +] DELETE /object/test.txt HTTP/1.1 204 126 - Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic We're running ceph and radosgw 0.94.1; the agent comes from github, as the one that's in the repository doesn't seem entirely stable nor very clear on error messages. I'm sure we may be missing something, but it feels like radosgw-agent isn't production ready yet. Any thoughts? Thanks, Thomas
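For anyone retracing step 1 above: the region description that gets loaded with radosgw-admin region set --infile region.json looks roughly like the sketch below. The hostnames are placeholders (the original mail doesn't give them); the region name 'ams' and the zone names come from the post. Both clusters must agree on this file, which is also where the master_region conflict mentioned above originates if the maps diverge.

```json
{
  "name": "ams",
  "api_name": "ams",
  "is_master": "true",
  "endpoints": ["http://radosgw1.example.com:80/"],
  "master_zone": "zone1",
  "zones": [
    {"name": "zone1", "endpoints": ["http://radosgw1.example.com:80/"],
     "log_meta": "true", "log_data": "true"},
    {"name": "zone2", "endpoints": ["http://radosgw2.example.com:80/"],
     "log_meta": "true", "log_data": "true"}
  ],
  "placement_targets": [{"name": "default-placement", "tags": []}],
  "default_placement": "default-placement"
}
```

After loading it, radosgw-admin region default and radosgw-admin regionmap update must be run on each cluster (steps 5 and 6), and the per-zone files are loaded the same way with radosgw-admin zone set.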
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
yes SSD-Journal helps a lot (if you use the right SSDs) What SSDs should we avoid for journaling, from your experience? Why? We're seeing very disappointing Ceph performance. We have 10GigE interconnect (as a shared public/internal network). Which kind of CPU do you use for the OSD-hosts? Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz. FYI, we are hosting VMs on our OSD nodes, but the VMs use very small amounts of CPU and RAM. We're wondering whether it makes sense to buy SSDs and put journals on them. But we're looking for a way to verify that this will actually help BEFORE we splash cash on SSDs. I can recommend the Intel DC S3700 SSD for journaling! In the beginning I started with different much cheaper models, but this was the wrong decision. What, apart from the price, made the difference? Sustained read/write bandwidth? IOPS? We're considering this one (PCIe SSD). What do you think? http://www.plextor-digital.com/index.php/en/M6e-BK/m6e-bk.html PX-128M6e-BK Also, we're thinking about sharing one SSD between two OSDs. Any reason why this would be a bad idea? We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't really help all that much. The read speed can be optimized with a bigger read-ahead cache inside the VM, like: echo 4096 > /sys/block/vda/queue/read_ahead_kb Thanks, we will try that.
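Note that the echo into /sys does not survive a reboot of the VM. One way to persist it is a udev rule inside the guest; a sketch (the file name and the vd[a-z] match pattern are assumptions — adjust for your device naming):

```
# /etc/udev/rules.d/99-readahead.rules
# Set a 4 MiB read-ahead on all virtio block devices at add/change time
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="vd[a-z]", ATTR{queue/read_ahead_kb}="4096"
```

After dropping the file in place, udevadm control --reload-rules followed by udevadm trigger applies it without a reboot.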