[PATCH] vstart: allow minimum pool size of one
I needed this patch after some simple 1-OSD vstart environments refused to allow clients to connect. A minimum pool size of 2 was introduced by 13486857cf; this sets the minimum to one so that basic vstart environments work.

Signed-off-by: Noah Watkins

diff --git a/src/vstart.sh b/src/vstart.sh
index 4565efa..bdf02f3 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -290,6 +290,7 @@ if [ "$start_mon" -eq 1 ]; then
 [global]
 osd pg bits = 3
 osd pgp bits = 5 ; (invalid, but ceph should cope!)
+osd pool default min size = 1
 EOF
 [ "$cephx" -eq 1 ] && cat <<EOF >> $conf
 auth supported = cephx
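The same knob can also be applied to a cluster that is already running; a minimal sketch, assuming a pool named "data" and the ceph CLI of this era:

# relax the replication requirements on an existing pool
ceph osd pool set data size 1
ceph osd pool set data min_size 1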
Re: bobtail timing
I've got wip_recovery_qos and wip_persist_missing that should go into bobtail. wip_recovery_qos passed regression (mostly; the failures were due to fsx, a bug fixed in master, and timeouts waiting for machines) and is waiting on review. wip_persist_missing has a teuthology test I'll push tomorrow (wip_divergent_priors). The second commit in wip_persist_missing I think still needs review (formerly wip_divergent_entries).

-Sam

On Thu, Nov 8, 2012 at 5:30 PM, Yehuda Sadeh wrote:
> On Wed, Oct 31, 2012 at 1:46 PM, Sage Weil wrote:
>> I would like to freeze v0.55, the "bobtail" stable release, at the end of
>> next week. If there is any functionality you are working on that should
>> be included, we need to get it into master (preferably well) before that.
>> There will be several weeks of testing in the 'next' branch after that
>> (probably 3 weeks) before it is released.
>
> I merged (against current master) and pushed all the pending rgw stuff
> to wip-rgw-integration. This includes:
>
> wip-post-cleaned
> wip-stripe
> wip-keystone
> wip-3452
> wip-3453
> wip-swift-token
>
> All that stuff needs to go into bobtail, but is still waiting for review.
> The bottom 3 are trivial.
>
> Yehuda
Re: rbd map command hangs for 15 minutes during system start up
On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> We are seeing a somewhat random, but frequent hang on our systems during
> startup. The hang happens at the point where an "rbd map" command is run.
> I've attached the ceph logs from the cluster. The map command happens at
> Nov 8 18:41:09 on server 172.18.0.15. The process which hung can be seen
> in the log as 172.18.0.15:0/1143980479. It appears as if the TCP socket
> is opened to the OSD, but it then times out 15 minutes later; the process
> gets data when the socket is closed on the client server and it retries.
> Please help.
>
> We are using ceph version 0.48.2argonaut
> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe). We are using a 3.5.7
> kernel with the following list of patches applied:
>
> [list of 23 backported patches snipped; it appears in full in the
> original message below]
>
> Any suggestions?

The log shows your monitors don't have their time synchronized closely enough among them to make much progress (including authenticating new connections). That's probably the real issue; 0.2s is pretty large clock drift.

> One thought is that the following patch (which we could not apply) is
> what is required:
>
> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch

This is certainly useful too, but I don't think it's the cause of the delay in this case.

Josh
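A hedged sketch of checking and working around monitor clock drift; the option name and its ~0.05s default are recalled from the ceph.conf options of this era, and the real fix is NTP on every monitor host:

# query each monitor host's offset against an NTP server
ntpdate -q pool.ntp.org

# ceph.conf on the monitors: loosen the allowed drift as a stopgap
[mon]
    mon clock drift allowed = 0.2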
Re: bobtail timing
On Wed, Oct 31, 2012 at 1:46 PM, Sage Weil wrote:
> I would like to freeze v0.55, the "bobtail" stable release, at the end of
> next week. If there is any functionality you are working on that should
> be included, we need to get it into master (preferably well) before that.
> There will be several weeks of testing in the 'next' branch after that
> (probably 3 weeks) before it is released.

I merged (against current master) and pushed all the pending rgw stuff to wip-rgw-integration. This includes:

wip-post-cleaned
wip-stripe
wip-keystone
wip-3452
wip-3453
wip-swift-token

All that stuff needs to go into bobtail, but is still waiting for review. The bottom 3 are trivial.

Yehuda
Re: SSD journal suggestion / rsockets
Joseph,

I've downloaded and read the presentation from 'Sean Hefty / Intel Corporation' about rsockets, which sounds very promising to me. Can you please teach me how to get access to the rsockets source?

Thanks,
-Dieter

On Thu, Nov 08, 2012 at 09:12:45PM +0100, Joseph Glanville wrote:
> [snip quoted IPoIB/netperf thread; it appears in full under "Re: SSD
> journal suggestion" below]
>
> If you are running Ceph purely in userspace you could try using rsockets.
> rsockets is a pure userspace implementation of sockets over RDMA. It has
> much, much lower latency and close to native throughput. My guess is
> rsockets will probably work perfectly and should give you 95% of
> theoretical max performance.
>
> I would like to see a somewhat native implementation of RDMA in Ceph one
> day. I was doing some preliminary work on it 1.5 years ago when Ceph was
> first gaining traction, but we didn't end up putting our focus on Ceph
> and as such I never got anywhere with it. In theory one only needs to use
> RDMA for the fast path to gain a lot of benefit. This can be done even in
> the RBD kernel module with the RDMA-CM, which will interact nicely across
> kernelspace and userspace (they actually share the same API, thankfully).
>
> Joseph.
Re: trying to import crushmap results in max_devices > osdmap max_osd
On 11/07/2012 07:28 AM, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> I've added two nodes with 4 devices each and modified the crushmap. But
> importing the new map results in:
>
> crushmap max_devices 55 > osdmap max_osd 35
>
> What's wrong?

I think this is an obsolete check since ee541c0f8d871172ec61962372efca943308e5fe. wip-max-devices removes these checks. Sage, is there any reason to keep them?

Josh
Review request branch wip-java-test
I have a 3-line change to the file qa/workunits/libcephfs-java/test.sh that tweaks how LD_LIBRARY_PATH is set for the test execution. The branch is wip-java-test in ceph.git.

Best,
-Joe Buck
Re: extreme ceph-osd cpu load for rand. 4k write
Am 08.11.2012 22:50, schrieb Josh Durgin:
> It looks like a not insignificant portion of time is spent in the
> logging infrastructure. Could you add this to the osds' configuration
> to prevent any debug log gathering (it's logged/gathered):
>
> debug lockdep = 0/0
> ...
> debug throttle = 0/0

New one attached.

Stefan

[Attachment: out.pdf]
rbd map command hangs for 15 minutes during system start up
We are seeing a somewhat random, but frequent hang on our systems during startup. The hang happens at the point where an "rbd map" command is run. I've attached the ceph logs from the cluster. The map command happens at Nov 8 18:41:09 on server 172.18.0.15. The process which hung can be seen in the log as 172.18.0.15:0/1143980479. It appears as if the TCP socket is opened to the OSD, but it then times out 15 minutes later; the process gets data when the socket is closed on the client server and it retries. Please help.

We are using ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe). We are using a 3.5.7 kernel with the following list of patches applied:

1-libceph-encapsulate-out-message-data-setup.patch
2-libceph-dont-mark-footer-complete-before-it-is.patch
3-libceph-move-init-of-bio_iter.patch
4-libceph-dont-use-bio_iter-as-a-flag.patch
5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
8-libceph-protect-ceph_con_open-with-mutex.patch
9-libceph-reset-connection-retry-on-successfully-negotiation.patch
10-rbd-only-reset-capacity-when-pointing-to-head.patch
11-rbd-set-image-size-when-header-is-updated.patch
12-libceph-fix-crypto-key-null-deref-memory-leak.patch
13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
17-libceph-check-for-invalid-mapping.patch
18-ceph-propagate-layout-error-on-osd-request-creation.patch
19-rbd-BUG-on-invalid-layout.patch
20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
21-ceph-avoid-32-bit-page-index-overflow.patch
23-ceph-fix-dentry-reference-leak-in-encode_fh.patch

Any suggestions? One thought is that the following patch (which we could not apply) is what is required:

22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch

Regards,
Mandell Degerness

[Attachment: hanglog_ceph.log.gz]
Re: extreme ceph-osd cpu load for rand. 4k write
Am 08.11.2012 22:58, schrieb Mark Nelson:
> Also, I'm not sure what version you are running, but you may want to try
> testing master and see if that helps. Sam has done some work on our
> threading and locking code that might help.

This is git master (two hours old).

Stefan
Re: unexpected problem with radosgw fcgi
OK, I will dig into nginx. Thanks.

On 8 Nov 2012, at 22:48, Yehuda Sadeh wrote:

> On Wed, Nov 7, 2012 at 6:16 AM, Sławomir Skowron wrote:
>> I have realized that requests from fastcgi in nginx from radosgw return:
>>
>> HTTP/1.1 200, not HTTP/1.1 200 OK
>>
>> Any other cgi that I run, for example php via fastcgi, returns this the
>> way the RFC says, with OK.
>>
>> Has anyone experienced this problem?
>
> I have seen a similar issue in the past with nginx. It doesn't happen
> with apache. My guess is that it's either something with the way nginx
> is configured, or some difference in the fastcgi module implementation.
>
>> I see in the code:
>>
>> ./src/rgw/rgw_rest.cc line 36
>>
>> const static struct rgw_html_errors RGW_HTML_ERRORS[] = {
>>     { 0, 200, "" },
>>
>> What if I change this into:
>>
>> { 0, 200, "OK" },
>
> The third field there specifies the error code embedded in the returned
> XML with S3, so it wouldn't fix anything.
>
> Yehuda
Re: SSD journal suggestion / rsockets
On 9 November 2012 08:21, Dieter Kasper wrote:
> Joseph,
>
> I've downloaded and read the presentation from 'Sean Hefty / Intel
> Corporation' about rsockets, which sounds very promising to me.
> Can you please teach me how to get access to the rsockets source?
>
> Thanks,
> -Dieter

rsockets is distributed as part of librdmacm. You can clone the git repository here:

git://beany.openfabrics.org/~shefty/librdmacm.git

I recommend using the latest master, as it features much better support for forking.

Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846
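A hedged sketch of building that tree and preloading rsockets under an unmodified sockets application; the install layout and preload library name are assumptions based on the librdmacm packaging of this era:

git clone git://beany.openfabrics.org/~shefty/librdmacm.git
cd librdmacm
./autogen.sh && ./configure && make
sudo make install

# run an existing TCP application over rsockets without rebuilding it;
# the shim path may differ depending on --prefix
LD_PRELOAD=/usr/local/lib/rsocket/librspreload.so netperf -H server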
Re: extreme ceph-osd cpu load for rand. 4k write
On 11/08/2012 03:50 PM, Josh Durgin wrote:
> On 11/08/2012 01:27 PM, Stefan Priebe wrote:
>> [snip profiling discussion]
>>
>> I've now used google perftools / the google CPU profiler. It was the
>> only tool that worked out of the box ;-) Attached is a PDF with a
>> profiled ceph-osd process during 4k random writes.
>
> It looks like a not insignificant portion of time is spent in the
> logging infrastructure. Could you add this to the osds' configuration
> to prevent any debug log gathering (it's logged/gathered):
>
> [debug settings list snipped; Josh's full message appears below]
>
> Josh

Also, I'm not sure what version you are running, but you may want to try testing master and see if that helps. Sam has done some work on our threading and locking code that might help.
Re: less cores more iops / speed
On Thu, Nov 8, 2012 at 7:53 PM, Alexandre DERUMIER wrote:
>>> So it is a problem of KVM which lets the processes jump between cores
>>> a lot.
>
> Maybe numad from redhat can help?
> http://fedoraproject.org/wiki/Features/numad
>
> It tries to keep a process on the same numa node, and I think it also
> does some dynamic pinning.

Numad only keeps memory chunks on the preferred node. CPU pinning, which is a primary goal here, should be done separately via libvirt, or manually for the qemu process via cpuset. (libvirt does pinning via taskset, and that seems to be broken at least in Debian wheezy: even with an affinity mask set for the qemu process, load spreads all over the numa node, including cpus outside the set.)

> [snip rest of quoted thread; see "Re: less cores more iops / speed" below]
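A hedged sketch of the manual pinning being suggested; the core numbers are made up:

# pin a running qemu/kvm guest to two physical cores; note that taskset -p
# only changes the named task, so vcpu threads may need pinning individually
taskset -cp 0,1 $(pidof qemu-system-x86_64)

# or declare the pinning in the libvirt domain XML instead:
# <cputune>
#   <vcpupin vcpu='0' cpuset='0'/>
#   <vcpupin vcpu='1' cpuset='1'/>
# </cputune>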
Re: extreme ceph-osd cpu load for rand. 4k write
On 11/08/2012 01:27 PM, Stefan Priebe wrote:
> [snip perf discussion; it appears in full in the next message]
>
> I've now used google perftools / the google CPU profiler. It was the
> only tool that worked out of the box ;-) Attached is a PDF with a
> profiled ceph-osd process during 4k random writes.

It looks like a not insignificant portion of time is spent in the logging infrastructure. Could you add this to the osds' configuration to prevent any debug log gathering (it's logged/gathered):

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

Josh
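If restarting the OSDs is inconvenient, the debug levels of this era could also be lowered at runtime; a hedged sketch, with the injectargs spelling recalled from memory:

# push a few of the same settings into all running osds without a restart
ceph osd tell \* injectargs '--debug-osd 0/0 --debug-ms 0/0 --debug-filestore 0/0'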
Re: unexpected problem with radosgw fcgi
On Wed, Nov 7, 2012 at 6:16 AM, Sławomir Skowron wrote:
> I have realized that requests from fastcgi in nginx from radosgw return:
>
> HTTP/1.1 200, not HTTP/1.1 200 OK
>
> Any other cgi that I run, for example php via fastcgi, returns this the
> way the RFC says, with OK.
>
> Has anyone experienced this problem?

I have seen a similar issue in the past with nginx. It doesn't happen with apache. My guess is that it's either something with the way nginx is configured, or some difference in the fastcgi module implementation.

> I see in the code:
>
> ./src/rgw/rgw_rest.cc line 36
>
> const static struct rgw_html_errors RGW_HTML_ERRORS[] = {
>     { 0, 200, "" },
>
> What if I change this into:
>
> { 0, 200, "OK" },

The third field there specifies the error code embedded in the returned XML with S3, so it wouldn't fix anything.

Yehuda
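For context, a hedged sketch of the sort of nginx fastcgi front-end involved here; the socket path and server name are assumptions, not a reference configuration:

server {
    listen 80;
    server_name rgw.example.com;

    location / {
        fastcgi_pass_header Authorization;
        fastcgi_pass_request_headers on;
        include fastcgi_params;
        fastcgi_pass unix:/var/run/ceph/radosgw.sock;
    }
}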
Re: extreme ceph-osd cpu load for rand. 4k write
Am 08.11.2012 17:06, schrieb Mark Nelson:
> On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:
>> Am 08.11.2012 16:01, schrieb Sage Weil:
>>> On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:
>>>> Is there any way to find out why a ceph-osd process takes around 10
>>>> times more load on random 4k writes than on 4k reads?
>>>
>>> Something like perf or oprofile is probably your best bet. perf can be
>>> tedious to deploy, depending on where your kernel is coming from.
>>> oprofile seems to be deprecated, although I've had good results with
>>> it in the past.
>>
>> I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
>> I've no idea what to do with it next.
>
> Pour yourself a stiff drink! (haha!) Try just doing a "perf report" in
> the directory where you've got the data file. Here's a nice tutorial:
>
> https://perf.wiki.kernel.org/index.php/Tutorial
>
> Also, if you see missing symbols you might benefit by chowning the file
> to root and running perf report as root. If you still see missing
> symbols, you may want to just give up and try sysprof.

I've now used google perftools / the google CPU profiler. It was the only tool that worked out of the box ;-) Attached is a PDF with a profiled ceph-osd process during 4k random writes.

Stefan

[Attachment: out.pdf]
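For anyone wanting to reproduce this, a hedged sketch of profiling a daemon with the google perftools CPU profiler; the library path is an assumption, and gperftools samples whenever CPUPROFILE is set at startup:

# start the daemon with the profiler preloaded
CPUPROFILE=/tmp/osd.prof LD_PRELOAD=/usr/lib/libprofiler.so.0 ceph-osd -i 0

# after the benchmark run, render the collected profile
google-pprof --pdf /usr/bin/ceph-osd /tmp/osd.prof > out.pdf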
Re: SSD journal suggestion
On 9 November 2012 02:00, Atchley, Scott wrote:
> On Nov 8, 2012, at 9:39 AM, Mark Nelson wrote:
>
>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta wrote:
>>>
>>>> 2012/11/8 Mark Nelson:
>>>>> I haven't done much with IPoIB (just RDMA), but my understanding is
>>>>> that it tends to top out at like 15Gb/s. Some others on this mailing
>>>>> list can probably speak more authoritatively. Even with RDMA you are
>>>>> going to top out at around 3.1-3.2GB/s.
>>>>
>>>> 15Gb/s is still faster than 10GbE.
>>>> But this speed limit seems to be kernel-related and should be the same
>>>> even in a 10GbE environment, or not?
>>>
>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using
>>> Verbs (the native IB API), I see ~27 Gb/s between two hosts. When
>>> running Sockets over these devices using IPoIB, I see 13-22 Gb/s
>>> depending on whether I use interrupt affinity and process binding.
>>>
>>> For our Ceph testing, we will set the affinity of two of the mlx4
>>> interrupt handlers to cores 0 and 1 and we will not use process
>>> binding. For single stream Netperf, we do use process binding and bind
>>> it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple,
>>> concurrent Netperf runs, we do not use process binding but we still
>>> see ~22 Gb/s.
>>
>> Scott, this is very interesting! Does setting the interrupt affinity
>> make the biggest difference then when you have concurrent netperf
>> processes going? For some reason I thought that setting interrupt
>> affinity wasn't even guaranteed in linux any more, but this is just
>> some half-remembered recollection from a year or two ago.
>
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf
> with and without affinity:
>
> Default (irqbalance running)   12.8 Gb/s
> IRQ balance off                13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I
> get ~22 Gb/s for a single stream.
>
>>> We used all of the Mellanox tuning recommendations for IPoIB available
>>> in their tuning pdf:
>>>
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>
>>> We looked at their interrupt affinity setting scripts and then wrote
>>> our own.
>>>
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode.
>>> Connected mode is less scalable, but currently I only get ~3 Gb/s with
>>> datagram mode. Mellanox claims that we should get identical performance
>>> with both modes and we are looking into it.
>>>
>>> We are getting a new test cluster with FDR HCAs and I will look into
>>> those as well.
>>
>> Nice! At some point I'll probably try to justify getting some FDR cards
>> in house. I'd definitely like to hear how FDR ends up working for you.
>
> I'll post the numbers when I get access after they are set up.
>
> Scott

If you are running Ceph purely in userspace you could try using rsockets. rsockets is a pure userspace implementation of sockets over RDMA. It has much, much lower latency and close to native throughput. My guess is rsockets will probably work perfectly and should give you 95% of theoretical max performance.

I would like to see a somewhat native implementation of RDMA in Ceph one day. I was doing some preliminary work on it 1.5 years ago when Ceph was first gaining traction, but we didn't end up putting our focus on Ceph and as such I never got anywhere with it. In theory one only needs to use RDMA for the fast path to gain a lot of benefit. This can be done even in the RBD kernel module with the RDMA-CM, which will interact nicely across kernelspace and userspace (they actually share the same API, thankfully).

Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846
Re: Ignoresync hack no longer applies on 3.6.5
Sorry about that, I think it got chopped. Here's a full trace from another run, using kernel 3.6.6 and definitely with the patch applied:

https://gist.github.com/4041120

There are no instances of "sync_fs_one_sb skipping" in the logs.

On Mon, Nov 5, 2012 at 1:29 AM, Sage Weil wrote:
> On Sun, 4 Nov 2012, Nick Bartos wrote:
>> Unfortunately I'm still seeing deadlocks. The trace was taken after a
>> 'sync' from the command line was hung for a couple minutes.
>>
>> There was only one debug message (one fs on the system was mounted with
>> 'mand'):
>
> This was with the updated patch applied?
>
> The dump below doesn't look complete, btw.. I don't see any ceph-osd
> processes, among other things.
>
> sage
>
>> kernel: [11441.168954] [] ? sync_fs_one_sb+0x4d/0x4d
>>
>> [partial stack trace of the blocked java threads snipped; it was
>> truncated in transit, and the full version is at the gist linked above]
Re: SSD journal suggestion
On Nov 8, 2012, at 11:19 AM, Andrey Korolyov wrote:

> On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott wrote:
>> [snip; the full IPoIB affinity numbers appear earlier in this thread]
>>
>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I
>> get ~22 Gb/s for a single stream.
>
> Did you try the Mellanox-baked modules for 2.6.32 before that?

The ones that came with RHEL6? No.

Scott

>> Note, I used hwloc to determine which socket was closer to the mlx4
>> device on our dual socket machines. On these nodes, hwloc reported that
>> both sockets were equally close, but a colleague has machines where one
>> socket is closer than the other. In that case, bind to the closer socket
>> (or to cores within the closer socket).
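A hedged sketch of the inspection and pinning described above; the IRQ number and interface name are illustrative:

# show which socket/NUMA node the HCA hangs off
hwloc-ls
cat /sys/class/net/ib0/device/numa_node

# pin one mlx4 interrupt to cores 0-1 (cpumask 0x3)
echo 3 > /proc/irq/52/smp_affinity

# bind netperf to core 0 locally and core 0 on the remote side
netperf -H server -T 0,0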
Re: problems creating new ceph cluster when using journal on block device
On 11/08/2012 11:36 AM, Travis Rhoden wrote:
> Solved! I stumbled into the solution while switching from the block
> device to a file. I was being bitten by running mkcephfs multiple times
> -- it wasn't really failing on the journal, it was failing because the
> OSD data disk had been initialized before. I couldn't see that until I
> used a file for the journal, and then I saw log output like:
>
> === osd.0 ===
> 2012-11-08 16:41:37.677620 7ffc3cfcd780 -1 provided osd id 0 != superblock's -1
> 2012-11-08 16:41:37.678726 7ffc3cfcd780 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument
>
> [snip; the rest of the message, including the final ceph.conf, appears below]

Yeah, that was a change that landed a couple of months ago. It's really important now to blow away the old data (I just reformat) if you want a totally clean ceph deployment, rather than just re-running mkcephfs.

Josh
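A hedged sketch of "blowing away the old data" before re-running mkcephfs; device names and mount points are examples:

# wipe and recreate the osd data filesystem, then remount it
umount /var/lib/ceph/osd/ceph-0
mkfs.xfs -f /dev/sdb1
mount /dev/sdb1 /var/lib/ceph/osd/ceph-0

# clobber any stale journal header on the journal partition as well
dd if=/dev/zero of=/dev/sda5 bs=1M count=10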
Re: problems creating new ceph cluster when using journal on block device
Solved! I stumbled into the solution while switching from the block device to a file. I was being bitten by running mkcephfs multiple times -- it wasn't really failing on the journal, it was failing because the OSD data disk had been initialized before. I couldn't see that until I used a file for the journal, and then I saw log output like:

=== osd.0 ===
2012-11-08 16:41:37.677620 7ffc3cfcd780 -1 provided osd id 0 != superblock's -1
2012-11-08 16:41:37.678726 7ffc3cfcd780 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument

I unmounted the OSDs that had been touched before, reformatted them, and then remounted. I set up ceph.conf to use block devices for the journals, and then everything proceeded normally. So the final relevant bits from my ceph.conf file look like:

[osd]
osd journal size = 0
journal dio = true
journal aio = true

[osd.0]
host = ceph1
osd journal = /dev/sda5

[osd.1]
host = ceph1
osd journal = /dev/sda6

...

Thanks,
- Travis

On Thu, Nov 8, 2012 at 10:08 AM, Travis Rhoden wrote:
> One more thing -- a Google search says this is harmless -- I see quite a
> few of these in syslog:
>
> hdparm: sending ioctl 2285 to a partition!
Re: Review request for branch wip-java-tests
Merged, thanks!

sage

On Thu, 8 Nov 2012, Joe Buck wrote:
> I have a branch for review that reworks the tests for the java bindings
> and builds them if both --enable-cephfs-java and --with-debug are
> specified. The tests can also be built and run via ant.
>
> Branch name is wip-java-tests.
>
> Regards,
> -Joe Buck
Review request for branch wip-java-tests
I have a branch for review that reworks the tests for the java bindings and builds them if both --enable-cephfs-java and --with-debug are specified. The tests can also be built and run via ant.

Branch name is wip-java-tests.

Regards,
-Joe Buck
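A hedged sketch of exercising such a build; the configure flags come from the message above, while the ant target name is an assumption:

./autogen.sh
./configure --enable-cephfs-java --with-debug
make

# or drive the java tests directly through ant
cd src/java && ant test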
Re: some snapshot problems
Hi Liu,

Sorry for the late reply; I have had a very busy week. :)

On Thu, 1 Nov 2012, liu yaqi wrote:
> Dear Mr. Weil,
>
> I am a student at the Institute of Computing Technology, Chinese Academy
> of Sciences, and I am studying the implementation of snapshots in the
> ceph system. There are some things that puzzle me, and I want to ask you
> some questions. First question: there is a command "ceph osd cluster_snap
> {name}", but I cannot find the complete implementation. Has the snapshot
> for the whole cluster been realized?

The idea was to have a low-level cluster-wide snapshot that could be used for recovery if ceph itself went haywire and corrupted itself. The idea was for the OSDs to create btrfs-level snapshots of their data. It was never completely implemented, though, and the OSD bits have mostly been removed. In particular, we never made a way for the monitor state to be checkpointed, which would be necessary for the whole scheme to work properly.

> Second question: there seem to be snapshots for pools and images. What do
> pool and image mean? Is an image an osd?

Lots of different snapshots:

- librados lets you do 'selfmanaged snaps' in its API, which let an application control which snapshots apply to which objects.
- you can create a 'pool' snapshot on an entire librados pool. This cannot be used at the same time as rbd, fs, or the above 'selfmanaged' snaps.
- rbd lets you snapshot block device images (by using the librados selfmanaged snap API).
- the ceph file system lets you snapshot any subdirectory (again utilizing the underlying RADOS functionality).

> Third question: in the "mds" folder, there are files like "snapserver"
> and "MClientSnap"; are these files used to snapshot the metadata only?

Yes.

> Do they have some relationship with the pool or image snapshots?

Not really.

> The last question: are there snapshots for a file path in ceph? Or must
> the snapshots be done on metadata and data separately?

For the file system, you create a snapshot on a directory and it affects all files in that directory and beneath it, including the data in those files.

Hope that helps!

sage

> If you would be kind enough to help me with the above questions, I would
> be grateful. I am looking forward to your reply.
>
> With best wishes,
>
> Yours, Yaqi Liu
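A hedged illustration of the snapshot flavors listed above, using command syntax of roughly this era; the pool, image, snapshot, and path names are made up:

# pool snapshot of an entire librados pool
ceph osd pool mksnap mypool mysnap

# rbd image snapshot (built on the selfmanaged snap API)
rbd --pool mypool snap create --snap before-upgrade myimage

# ceph file system: snapshot a subdirectory via its .snap directory
mkdir /mnt/ceph/somedir/.snap/monday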
Re: SSD journal suggestion
On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott wrote:
> On Nov 8, 2012, at 10:00 AM, Scott Atchley wrote:
>
>> [snip; the full IPoIB discussion appears earlier in this thread]
>>
>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf
>> with and without affinity:
>>
>> Default (irqbalance running)   12.8 Gb/s
>> IRQ balance off                13.0 Gb/s
>> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>>
>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I
>> get ~22 Gb/s for a single stream.

Did you try the Mellanox-baked modules for 2.6.32 before that?

> Note, I used hwloc to determine which socket was closer to the mlx4
> device on our dual socket machines. On these nodes, hwloc reported that
> both sockets were equally close, but a colleague has machines where one
> socket is closer than the other. In that case, bind to the closer socket
> (or to cores within the closer socket).
Re: extreme ceph-osd cpu load for rand. 4k write
On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:
> Am 08.11.2012 16:01, schrieb Sage Weil:
>> On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:
>>> Is there any way to find out why a ceph-osd process takes around 10
>>> times more load on random 4k writes than on 4k reads?
>>
>> Something like perf or oprofile is probably your best bet. perf can be
>> tedious to deploy, depending on where your kernel is coming from.
>> oprofile seems to be deprecated, although I've had good results with it
>> in the past.
>
> I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
> I've no idea what to do with it next.

Pour yourself a stiff drink! (haha!) Try just doing a "perf report" in the directory where you've got the data file. Here's a nice tutorial:

https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file to root and running perf report as root. If you still see missing symbols, you may want to just give up and try sysprof.

> I would love to see where the CPU is spending most of its time.
>
>> This is on current master?
>
> Yes
>
>> I expect there are still some low-hanging fruit that can bring CPU
>> utilization down (or even boost iops).
>
> Would be great to find them.
>
> Stefan
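A hedged sketch of the workflow being described; the pid lookup and the 10-second window are illustrative:

# sample a running ceph-osd with call graphs for 10 seconds
perf record -g -p $(pidof ceph-osd) sleep 10

# inspect the profile from the same directory; run as root if symbols
# come up missing
perf report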
Re: less cores more iops / speed
>> So it is a problem of KVM which lets the processes jump between cores a lot.

Maybe numad from redhat can help?
http://fedoraproject.org/wiki/Features/numad

It tries to keep a process on the same numa node, and I think it also does some dynamic pinning.

- Original message -

From: "Stefan Priebe - Profihost AG"
To: "Mark Nelson"
Cc: "Joao Eduardo Luis", ceph-devel@vger.kernel.org
Sent: Thursday 8 November 2012 16:14:32
Subject: Re: less cores more iops / speed

Am 08.11.2012 14:19, schrieb Mark Nelson:
> On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:
>> Am 08.11.2012 01:59, schrieb Mark Nelson:
>>> There's also the context switching overhead. It'd be interesting to
>>> know how much the writer processes were shifting around on cores.
>> What do you mean by that? I'm talking about the KVM guest, not about
>> the ceph nodes.
>
> in this case, is fio bouncing around between cores?

Thanks, you're correct. If I bind fio to two cores on an 8-core VM it runs with 16,000 iops.

So it is a problem of KVM which lets the processes jump between cores a lot.

Greets,
Stefan
Re: extreme ceph-osd cpu load for rand. 4k write
Am 08.11.2012 16:01, schrieb Mark Nelson:
> Hi Stefan,
>
> You might want to try running sysprof or perf while the OSDs are running
> during the tests and see where CPU time is being spent. Also, how are
> you determining how much CPU usage is being used?

Hi Mark,

I have a 300MB perf.data file and no idea what to do next ;-)

Stefan
Re: extreme ceph-osd cpu load for rand. 4k write
Am 08.11.2012 16:01, schrieb Sage Weil:
> On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:
>> Is there any way to find out why a ceph-osd process takes around 10
>> times more load on random 4k writes than on 4k reads?
>
> Something like perf or oprofile is probably your best bet. perf can be
> tedious to deploy, depending on where your kernel is coming from.
> oprofile seems to be deprecated, although I've had good results with it
> in the past.

I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly I've no idea what to do with it next. I would love to see where the CPU is spending most of its time.

> This is on current master?

Yes

> I expect there are still some low-hanging fruit that can bring CPU
> utilization down (or even boost iops).

Would be great to find them.

Stefan
Re: less cores more iops / speed
Am 08.11.2012 14:19, schrieb Mark Nelson:
> On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:
>> Am 08.11.2012 01:59, schrieb Mark Nelson:
>>> There's also the context switching overhead. It'd be interesting to
>>> know how much the writer processes were shifting around on cores.
>> What do you mean by that? I'm talking about the KVM guest, not about
>> the ceph nodes.
>
> in this case, is fio bouncing around between cores?

Thanks, you're correct. If I bind fio to two cores on an 8-core VM it runs with 16,000 iops.

So it is a problem of KVM which lets the processes jump between cores a lot.

Greets,
Stefan
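A hedged sketch of pinning fio inside the guest; the job parameters and device path are illustrative:

# restrict fio's workers to cores 0 and 1
fio --name=randwrite --rw=randwrite --bs=4k --iodepth=32 \
    --ioengine=libaio --direct=1 --filename=/dev/vdb \
    --cpus_allowed=0,1

# or pin an existing job file with taskset
taskset -c 0,1 fio jobfile.fio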
Re: SSD journal suggestion
On Nov 8, 2012, at 10:00 AM, Scott Atchley wrote:

> [snip; the full IPoIB discussion and affinity numbers appear earlier in
> this thread]
>
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf
> with and without affinity:
>
> Default (irqbalance running)   12.8 Gb/s
> IRQ balance off                13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I
> get ~22 Gb/s for a single stream.

Note, I used hwloc to determine which socket was closer to the mlx4 device on our dual socket machines. On these nodes, hwloc reported that both sockets were equally close, but a colleague has machines where one socket is closer than the other. In that case, bind to the closer socket (or to cores within the closer socket).
Re: problems creating new ceph cluster when using journal on block device
>>> [osd] >>> osd journal size = 4000 >> >> >> Not sure if this is the problem, but when using a block device you don't >> have to specify the size for the journal. So happy to know that, Wido! I had hoped there was a way to skip that. Tried without it -- the only difference in the logs was seeing that it picked up the full size of the partition. So, same result. > Also might be useful to know make/model of ssd, plus motherboard make/model > (in case commenting out size does not fix)! It's an Intel X25-E, 64GB. It's a place-holder until some bigger ones we have on order show up. The motherboard is a SuperMicro X8DT6. SSDs are connected to onboard SATA ports, data drives are connected to an LSI 9211-8i (SAS2008). Maybe there is a special way I need to do the partition? My goal was to throw 6 journals on this disk, and it is partitioned like so:

Model: ATA SSDSA2SH064G1GC (scsi)
Disk /dev/sda: 64.0GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      1049kB  512MB   511MB   primary                raid
 2      512MB   2511MB  2000MB  primary                raid
 3      2511MB  6512MB  4000MB  primary                raid
 4      6512MB  64.0GB  57.5GB  extended
 5      6513MB  15.1GB  8590MB  logical
 6      15.1GB  23.7GB  8590MB  logical
 7      23.7GB  32.3GB  8590MB  logical
 8      32.3GB  40.9GB  8590MB  logical
 9      40.9GB  49.5GB  8590MB  logical
10      49.5GB  58.1GB  8590MB  logical

So, sda5-10 are my journal partitions. I know that I have consumed most of the drive here, and that is bad for the SSD and such, but it really is a temporary setup. - Travis On Thu, Nov 8, 2012 at 3:24 AM, Mark Kirkwood wrote: > On 08/11/12 21:08, Wido den Hollander wrote: >> >> >> On 08-11-12 08:29, Travis Rhoden wrote: >>> >>> Hey folks, >>> >>> I'm trying to set up a brand new Ceph cluster, based on v0.53. My >>> hardware has SSDs for journals, and I'm trying to get mkcephfs to >>> initialize everything for me. However, the command hangs forever and I >>> eventually have to kill it. >>> >>> After poking around a bit, it's clear that the problem has something >>> to do with the journal. If I comment out the journal in ceph.conf, >>> the commands proceed just fine. This is the first time I've tried to >>> throw a journal on a block device rather than a file, so maybe I've >>> done something wrong with that. >>> >>> Here is the info from ceph.conf: >>> >>> >>> [osd] >>> osd journal size = 4000 >> >> >> Not sure if this is the problem, but when using a block device you don't >> have to specify the size for the journal. > > > Also might be useful to know make/model of ssd, plus motherboard make/model > (in case commenting out size does not fix)! > > Regards > > Mark > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
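A ceph.conf mapping six OSDs onto those six partitions would look roughly like this (hostname and OSD ids are illustrative, not from the original post; data paths omitted):

[osd.0]
        host = ceph1
        osd journal = /dev/sda5
[osd.1]
        host = ceph1
        osd journal = /dev/sda6
; ... osd.2 through osd.4 continue with /dev/sda7 through /dev/sda9 ...
[osd.5]
        host = ceph1
        osd journal = /dev/sda10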
Re: extreme ceph-osd cpu load for rand. 4k write
Hi Stefan, You might want to try running sysprof or perf while the OSDs are running during the tests and see where CPU time is being spent. Also, how are you determining how much CPU is being used? Mark On 11/08/2012 08:58 AM, Stefan Priebe - Profihost AG wrote: Is there any way to find out why a ceph-osd process takes around 10 times more load on rand 4k writes than on 4k reads? Stefan On 07.11.2012 21:41, Stefan Priebe wrote: Hello list, while benchmarking I was wondering why the ceph-osd load is so extremely high with random 4k write i/o. Here is an example from benchmarking:

random 4k write: 16,000 iop/s  180% CPU load in top from EACH ceph-osd process
random 4k read:  16,000 iop/s   19% CPU load in top from EACH ceph-osd process
seq 4M write:    800MB/s        14% CPU load in top from EACH ceph-osd process
seq 4M read:     1600MB/s        9% CPU load in top from EACH ceph-osd process

I can't understand why in this single case the load is so EXTREMELY high. Greets Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
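A minimal perf session against a running OSD might look like this (the pid lookup assumes a single ceph-osd per host; the sampling period is arbitrary):

# Sample one ceph-osd with call graphs for 30 seconds, then view the report:
perf record -g -p $(pidof ceph-osd) -- sleep 30
perf report --sort=symbol

# Or watch the hot functions live:
perf top -p $(pidof ceph-osd)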
Re: extreme ceph-osd cpu load for rand. 4k write
On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote: > Is there any way to find out why a ceph-osd process takes around 10 times more > load on rand 4k writes than on 4k reads? Something like perf or oprofile is probably your best bet. perf can be tedious to deploy, depending on where your kernel is coming from. oprofile seems to be deprecated, although I've had good results with it in the past. Would love to see where the CPU is spending most of its time. This is on current master? I expect there is still some low-hanging fruit that can bring CPU utilization down (or even boost iops). sage > > Stefan > > On 07.11.2012 21:41, Stefan Priebe wrote: > > Hello list, > > > > while benchmarking I was wondering why the ceph-osd load is so > > extremely high with random 4k write i/o. > > > > Here is an example from benchmarking:
> > random 4k write: 16,000 iop/s  180% CPU load in top from EACH ceph-osd process
> > random 4k read:  16,000 iop/s   19% CPU load in top from EACH ceph-osd process
> > seq 4M write:    800MB/s        14% CPU load in top from EACH ceph-osd process
> > seq 4M read:     1600MB/s        9% CPU load in top from EACH ceph-osd process
> > I can't understand why in this single case the load is so EXTREMELY high. > > > > Greets > > Stefan > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
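For the oprofile route, the legacy opcontrol workflow is roughly as follows (assumes oprofile is installed; --no-vmlinux skips kernel symbol resolution, which is fine for profiling a userspace daemon):

opcontrol --init
opcontrol --no-vmlinux
opcontrol --start
# ... run the 4k random-write benchmark here ...
opcontrol --dump
opreport -l $(which ceph-osd) | head -20
opcontrol --shutdown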
Re: SSD journal suggestion
On Nov 8, 2012, at 9:39 AM, Mark Nelson wrote: > On 11/08/2012 07:55 AM, Atchley, Scott wrote: >> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta >> wrote: >> >>> 2012/11/8 Mark Nelson : I haven't done much with IPoIB (just RDMA), but my understanding is that it tends to top out at like 15Gb/s. Some others on this mailing list can probably speak more authoritatively. Even with RDMA you are going to top out at around 3.1-3.2GB/s. >>> >>> 15Gb/s is still faster than 10Gbe >>> But this speed limit seems to be kernel-related and should be the same >>> even in a 10Gbe environment, or not? >> >> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. >> >> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not use process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s. > > Scott, this is very interesting! Does setting the interrupt affinity > make the biggest difference then when you have concurrent netperf > processes going? For some reason I thought that setting interrupt > affinity wasn't even guaranteed in linux any more, but this is just some > half-remembered recollection from a year or two ago. We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and without affinity:

Default (irqbalance running)   12.8 Gb/s
IRQ balance off                13.0 Gb/s
Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script

When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 Gb/s for a single stream. Note, I used hwloc to determine which socket was closer to the mlx4 device on our dual socket machines. On these nodes, hwloc reported that both sockets were equally close, but a colleague has machines where one socket is closer than the other. In that case, bind to the closer socket (or to cores within the closer socket). > >>> We used all of the Mellanox tuning recommendations for IPoIB available in >>> their tuning pdf: >>> >>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf >>> >>> We looked at their interrupt affinity setting scripts and then wrote our >>> own. >>> >>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. >>> Connected mode is less scalable, but currently I only get ~3 Gb/s with >>> datagram mode. Mellanox claims that we should get identical performance >>> with both modes and we are looking into it. >>> >>> We are getting a new test cluster with FDR HCAs and I will look into those >>> as well. >> >> Nice! At some point I'll probably try to justify getting some FDR cards >> in house. I'd definitely like to hear how FDR ends up working for you. > > I'll post the numbers when I get access after they are set up. > > Scott > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
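The manual equivalent of the Mellanox affinity script is to write CPU masks into /proc; the IRQ numbers below are placeholders, look up the real ones in /proc/interrupts first:

service irqbalance stop             # keep irqbalance from rewriting the masks
grep mlx4 /proc/interrupts          # find the IRQ numbers of the HCA vectors
echo 1 > /proc/irq/52/smp_affinity  # hex mask 0x1 -> core 0
echo 2 > /proc/irq/53/smp_affinity  # hex mask 0x2 -> core 1
taskset -c 0 netperf -H 10.0.0.2 -t TCP_STREAM   # bind the benchmark to core 0

The socket-locality check mentioned above can be done with lstopo (or hwloc-ls) from the hwloc package, which shows which socket the mlx4 PCI device hangs off.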
Re: extreme ceph-osd cpu load for rand. 4k write
Is there any way to find out why a ceph-osd process takes around 10 times more load on rand 4k writes than on 4k reads? Stefan On 07.11.2012 21:41, Stefan Priebe wrote: Hello list, while benchmarking I was wondering why the ceph-osd load is so extremely high with random 4k write i/o. Here is an example from benchmarking:

random 4k write: 16,000 iop/s  180% CPU load in top from EACH ceph-osd process
random 4k read:  16,000 iop/s   19% CPU load in top from EACH ceph-osd process
seq 4M write:    800MB/s        14% CPU load in top from EACH ceph-osd process
seq 4M read:     1600MB/s        9% CPU load in top from EACH ceph-osd process

I can't understand why in this single case the load is so EXTREMELY high. Greets Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SSD journal suggestion
On 11/08/2012 07:55 AM, Atchley, Scott wrote: On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta wrote: 2012/11/8 Mark Nelson : I haven't done much with IPoIB (just RDMA), but my understanding is that it tends to top out at like 15Gb/s. Some others on this mailing list can probably speak more authoritatively. Even with RDMA you are going to top out at around 3.1-3.2GB/s. 15Gb/s is still faster than 10Gbe But this speed limit seems to be kernel-related and should be the same even in a 10Gbe environment, or not? We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not using process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s. Scott, this is very interesting! Does setting the interrupt affinity make the biggest difference then when you have concurrent netperf processes going? For some reason I thought that setting interrupt affinity wasn't even guaranteed in linux any more, but this is just some half-remembered recollection from a year or two ago. We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf: http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf We looked at their interrupt affinity setting scripts and then wrote our own. Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it. We are getting a new test cluster with FDR HCAs and I will look into those as well. Nice! At some point I'll probably try to justify getting some FDR cards in house. I'd definitely like to hear how FDR ends up working for you. Scott Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] rbd: end request on error in rbd_do_request() caller
Only one of the three callers of rbd_do_request() provide a collection structure to aggregate status. If an error occurs in rbd_do_request(), have the caller take care of calling rbd_coll_end_req() if necessary in that one spot. Signed-off-by: Alex Elder --- drivers/block/rbd.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index fb727c0..835153e 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1128,12 +1128,8 @@ static int rbd_do_request(struct request *rq, struct ceph_osd_client *osdc; rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO); - if (!rbd_req) { - if (coll) - rbd_coll_end_req_index(rq, coll, coll_index, - (s32) -ENOMEM, len); + if (!rbd_req) return -ENOMEM; - } if (coll) { rbd_req->coll = coll; @@ -1208,7 +1204,6 @@ done_err: bio_chain_put(rbd_req->bio); ceph_osdc_put_request(osd_req); done_pages: - rbd_coll_end_req(rbd_req, (s32) ret, len); kfree(rbd_req); return ret; } @@ -1361,7 +1356,9 @@ static int rbd_do_op(struct request *rq, ops, coll, coll_index, rbd_req_cb, 0, NULL); - + if (ret < 0) + rbd_coll_end_req_index(rq, coll, coll_index, + (s32) ret, seg_len); rbd_destroy_ops(ops); done: kfree(seg_name); -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] rbd: a little more cleanup of rbd_rq_fn()
Now that a big hunk in the middle of rbd_rq_fn() has been moved into its own routine we can simplify it a little more. Signed-off-by: Alex Elder --- drivers/block/rbd.c | 50 +++--- 1 file changed, 23 insertions(+), 27 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 6aed59b..fb727c0 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1649,53 +1649,49 @@ static int rbd_dev_do_request(struct request *rq, static void rbd_rq_fn(struct request_queue *q) { struct rbd_device *rbd_dev = q->queuedata; + bool read_only = rbd_dev->mapping.read_only; struct request *rq; while ((rq = blk_fetch_request(q))) { - struct bio *bio; - bool do_write; - unsigned int size; - u64 ofs; - struct ceph_snap_context *snapc; + struct ceph_snap_context *snapc = NULL; int result; dout("fetched request\n"); - /* filter out block requests we don't understand */ + /* Filter out block requests we don't understand */ + if ((rq->cmd_type != REQ_TYPE_FS)) { __blk_end_request_all(rq, 0); continue; } + spin_unlock_irq(q->queue_lock); - /* deduce our operation (read, write) */ - do_write = (rq_data_dir(rq) == WRITE); - if (do_write && rbd_dev->mapping.read_only) { - __blk_end_request_all(rq, -EROFS); - continue; - } + /* Stop writes to a read-only device */ - spin_unlock_irq(q->queue_lock); + result = -EROFS; + if (read_only && rq_data_dir(rq) == WRITE) + goto out_end_request; + + /* Grab a reference to the snapshot context */ down_read(&rbd_dev->header_rwsem); + if (rbd_dev->exists) { + snapc = ceph_get_snap_context(rbd_dev->header.snapc); + rbd_assert(snapc != NULL); + } + up_read(&rbd_dev->header_rwsem); - if (!rbd_dev->exists) { + if (!snapc) { rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP); - up_read(&rbd_dev->header_rwsem); dout("request for non-existent snapshot"); - spin_lock_irq(q->queue_lock); - __blk_end_request_all(rq, -ENXIO); - continue; + result = -ENXIO; + goto out_end_request; } - snapc = ceph_get_snap_context(rbd_dev->header.snapc); - - up_read(&rbd_dev->header_rwsem); - - size = blk_rq_bytes(rq); - ofs = blk_rq_pos(rq) * SECTOR_SIZE; - bio = rq->bio; - - result = rbd_dev_do_request(rq, rbd_dev, snapc, ofs, size, bio); + result = rbd_dev_do_request(rq, rbd_dev, snapc, + blk_rq_pos(rq) * SECTOR_SIZE, + blk_rq_bytes(rq), rq->bio); +out_end_request: ceph_put_snap_context(snapc); spin_lock_irq(q->queue_lock); if (result < 0) -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] rbd: encapsulate handling for a single request
In rbd_rq_fn(), requests are fetched from the block layer and each request is processed, looping through the request's list of bio's until they've all been consumed. Separate the handling for a single request into its own function to make it a bit easier to see what's going on. Signed-off-by: Alex Elder --- drivers/block/rbd.c | 119 +++ 1 file changed, 63 insertions(+), 56 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index be18b5f..6aed59b 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1585,6 +1585,64 @@ static struct rbd_req_coll *rbd_alloc_coll(int num_reqs) return coll; } +static int rbd_dev_do_request(struct request *rq, + struct rbd_device *rbd_dev, + struct ceph_snap_context *snapc, + u64 ofs, unsigned int size, + struct bio *bio_chain) +{ + int num_segs; + struct rbd_req_coll *coll; + unsigned int bio_offset; + int cur_seg = 0; + + dout("%s 0x%x bytes at 0x%llx\n", + rq_data_dir(rq) == WRITE ? "write" : "read", + size, (unsigned long long) blk_rq_pos(rq) * SECTOR_SIZE); + + num_segs = rbd_get_num_segments(&rbd_dev->header, ofs, size); + if (num_segs <= 0) + return num_segs; + + coll = rbd_alloc_coll(num_segs); + if (!coll) + return -ENOMEM; + + bio_offset = 0; + do { + u64 limit = rbd_segment_length(rbd_dev, ofs, size); + unsigned int clone_size; + struct bio *bio_clone; + + BUG_ON(limit > (u64) UINT_MAX); + clone_size = (unsigned int) limit; + dout("bio_chain->bi_vcnt=%hu\n", bio_chain->bi_vcnt); + + kref_get(&coll->kref); + + /* Pass a cloned bio chain via an osd request */ + + bio_clone = bio_chain_clone_range(&bio_chain, + &bio_offset, clone_size, + GFP_ATOMIC); + if (bio_clone) + (void) rbd_do_op(rq, rbd_dev, snapc, + ofs, clone_size, + bio_clone, coll, cur_seg); + else + rbd_coll_end_req_index(rq, coll, cur_seg, + (s32) -ENOMEM, + clone_size); + size -= clone_size; + ofs += clone_size; + + cur_seg++; + } while (size > 0); + kref_put(&coll->kref, rbd_coll_release); + + return 0; +} + /* * block device queue callback */ @@ -1598,10 +1656,8 @@ static void rbd_rq_fn(struct request_queue *q) bool do_write; unsigned int size; u64 ofs; - int num_segs, cur_seg = 0; - struct rbd_req_coll *coll; struct ceph_snap_context *snapc; - unsigned int bio_offset; + int result; dout("fetched request\n"); @@ -1639,60 +1695,11 @@ static void rbd_rq_fn(struct request_queue *q) ofs = blk_rq_pos(rq) * SECTOR_SIZE; bio = rq->bio; - dout("%s 0x%x bytes at 0x%llx\n", -do_write ? "write" : "read", -size, (unsigned long long) blk_rq_pos(rq) * SECTOR_SIZE); - - num_segs = rbd_get_num_segments(&rbd_dev->header, ofs, size); - if (num_segs <= 0) { - spin_lock_irq(q->queue_lock); - __blk_end_request_all(rq, num_segs); - ceph_put_snap_context(snapc); - continue; - } - coll = rbd_alloc_coll(num_segs); - if (!coll) { - spin_lock_irq(q->queue_lock); - __blk_end_request_all(rq, -ENOMEM); - ceph_put_snap_context(snapc); - continue; - } - - bio_offset = 0; - do { - u64 limit = rbd_segment_length(rbd_dev, ofs, size); - unsigned int chain_size; - struct bio *bio_chain; - - BUG_ON(limit > (u64) UINT_MAX); - chain_size = (unsigned int) limit; - dout("rq->bio->bi_vcnt=%hu\n", rq->bio->bi_vcnt); - - kref_get(&coll->kref); - - /* Pass a cloned bio chain via an osd request */ - - bio_chain = bio_chain_clone_range(&bio, - &bio_offset, chain_size, - GFP_ATOMIC); - if (bio_chain) -
[PATCH 0/2] rbd: clean up rbd_rq_fn()
Some refactoring to improve readability. -Alex [PATCH 1/2] rbd: encapsulate handling for a single request [PATCH 2/2] rbd: a little more cleanup of rbd_rq_fn() -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] rbd: be picky about osd request status type
The result field in a ceph osd reply header is a signed 32-bit type, but rbd code often casually uses int to represent it. The following changes the types of variables that handle this result value to be "s32" instead of "int" to be completely explicit about it. Only at the point we pass that result to __blk_end_request() does the type get converted to the plain old int defined for that interface. There is almost certainly no binary impact of this change, but I prefer to show the exact size and signedness of the value since we know it. Signed-off-by: Alex Elder --- drivers/block/rbd.c | 23 --- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index caff180..be18b5f 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -171,7 +171,7 @@ struct rbd_client { */ struct rbd_req_status { int done; - int rc; + s32 rc; u64 bytes; }; @@ -1055,13 +1055,13 @@ static void rbd_destroy_ops(struct ceph_osd_req_op *ops) static void rbd_coll_end_req_index(struct request *rq, struct rbd_req_coll *coll, int index, - int ret, u64 len) + s32 ret, u64 len) { struct request_queue *q; int min, max, i; dout("rbd_coll_end_req_index %p index %d ret %d len %llu\n", -coll, index, ret, (unsigned long long) len); +coll, index, (int) ret, (unsigned long long) len); if (!rq) return; @@ -1082,7 +1082,7 @@ static void rbd_coll_end_req_index(struct request *rq, max++; for (i = min; istatus[i].rc, + __blk_end_request(rq, (int) coll->status[i].rc, coll->status[i].bytes); coll->num_done++; kref_put(&coll->kref, rbd_coll_release); @@ -1091,7 +1091,7 @@ static void rbd_coll_end_req_index(struct request *rq, } static void rbd_coll_end_req(struct rbd_request *rbd_req, -int ret, u64 len) +s32 ret, u64 len) { rbd_coll_end_req_index(rbd_req->rq, rbd_req->coll, rbd_req->coll_index, @@ -1131,7 +1131,7 @@ static int rbd_do_request(struct request *rq, if (!rbd_req) { if (coll) rbd_coll_end_req_index(rq, coll, coll_index, - -ENOMEM, len); + (s32) -ENOMEM, len); return -ENOMEM; } @@ -1208,7 +1208,7 @@ done_err: bio_chain_put(rbd_req->bio); ceph_osdc_put_request(osd_req); done_pages: - rbd_coll_end_req(rbd_req, ret, len); + rbd_coll_end_req(rbd_req, (s32) ret, len); kfree(rbd_req); return ret; } @@ -1221,7 +1221,7 @@ static void rbd_req_cb(struct ceph_osd_request *osd_req, struct ceph_msg *msg) struct rbd_request *rbd_req = osd_req->r_priv; struct ceph_osd_reply_head *replyhead; struct ceph_osd_op *op; - __s32 rc; + s32 rc; u64 bytes; int read_op; @@ -1229,14 +1229,14 @@ static void rbd_req_cb(struct ceph_osd_request *osd_req, struct ceph_msg *msg) replyhead = msg->front.iov_base; WARN_ON(le32_to_cpu(replyhead->num_ops) == 0); op = (void *)(replyhead + 1); - rc = le32_to_cpu(replyhead->result); + rc = (s32) le32_to_cpu(replyhead->result); bytes = le64_to_cpu(op->extent.length); read_op = (le16_to_cpu(op->op) == CEPH_OSD_OP_READ); dout("rbd_req_cb bytes=%llu readop=%d rc=%d\n", (unsigned long long) bytes, read_op, (int) rc); - if (rc == -ENOENT && read_op) { + if (rc == (s32) -ENOENT && read_op) { zero_bio_chain(rbd_req->bio, 0); rc = 0; } else if (rc == 0 && read_op && bytes < rbd_req->len) { @@ -1681,7 +1681,8 @@ static void rbd_rq_fn(struct request_queue *q) bio_chain, coll, cur_seg); else rbd_coll_end_req_index(rq, coll, cur_seg, - -ENOMEM, chain_size); + (s32) -ENOMEM, + chain_size); size -= chain_size; ofs += chain_size; -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info 
at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] rbd: standardize ceph_osd_request variable names
There are spots where a ceph_osds_request pointer variable is given the name "req". Since we're dealing with (at least) three types of requests (block layer, rbd, and osd), I find this slightly distracting. Change such instances to use "osd_req" consistently to make the abstraction represented a little more obvious. Signed-off-by: Alex Elder --- drivers/block/rbd.c | 60 ++- 1 file changed, 31 insertions(+), 29 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 9d8b406..caff180 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1113,12 +1113,12 @@ static int rbd_do_request(struct request *rq, struct ceph_osd_req_op *ops, struct rbd_req_coll *coll, int coll_index, - void (*rbd_cb)(struct ceph_osd_request *req, -struct ceph_msg *msg), + void (*rbd_cb)(struct ceph_osd_request *, +struct ceph_msg *), struct ceph_osd_request **linger_req, u64 *ver) { - struct ceph_osd_request *req; + struct ceph_osd_request *osd_req; struct ceph_file_layout *layout; int ret; u64 bno; @@ -1145,67 +1145,68 @@ static int rbd_do_request(struct request *rq, (unsigned long long) len, coll, coll_index); osdc = &rbd_dev->rbd_client->client->osdc; - req = ceph_osdc_alloc_request(osdc, flags, snapc, ops, + osd_req = ceph_osdc_alloc_request(osdc, flags, snapc, ops, false, GFP_NOIO, pages, bio); - if (!req) { + if (!osd_req) { ret = -ENOMEM; goto done_pages; } - req->r_callback = rbd_cb; + osd_req->r_callback = rbd_cb; rbd_req->rq = rq; rbd_req->bio = bio; rbd_req->pages = pages; rbd_req->len = len; - req->r_priv = rbd_req; + osd_req->r_priv = rbd_req; - reqhead = req->r_request->front.iov_base; + reqhead = osd_req->r_request->front.iov_base; reqhead->snapid = cpu_to_le64(CEPH_NOSNAP); - strncpy(req->r_oid, object_name, sizeof(req->r_oid)); - req->r_oid_len = strlen(req->r_oid); + strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid)); + osd_req->r_oid_len = strlen(osd_req->r_oid); - layout = &req->r_file_layout; + layout = &osd_req->r_file_layout; memset(layout, 0, sizeof(*layout)); layout->fl_stripe_unit = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER); layout->fl_stripe_count = cpu_to_le32(1); layout->fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER); layout->fl_pg_pool = cpu_to_le32((int) rbd_dev->spec->pool_id); ret = ceph_calc_raw_layout(osdc, layout, snapid, ofs, &len, &bno, - req, ops); + osd_req, ops); rbd_assert(ret == 0); - ceph_osdc_build_request(req, ofs, &len, + ceph_osdc_build_request(osd_req, ofs, &len, ops, snapc, &mtime, - req->r_oid, req->r_oid_len); + osd_req->r_oid, osd_req->r_oid_len); if (linger_req) { - ceph_osdc_set_request_linger(osdc, req); - *linger_req = req; + ceph_osdc_set_request_linger(osdc, osd_req); + *linger_req = osd_req; } - ret = ceph_osdc_start_request(osdc, req, false); + ret = ceph_osdc_start_request(osdc, osd_req, false); if (ret < 0) goto done_err; if (!rbd_cb) { - ret = ceph_osdc_wait_request(osdc, req); + u64 version; + + ret = ceph_osdc_wait_request(osdc, osd_req); + version = le64_to_cpu(osd_req->r_reassert_version.version); if (ver) - *ver = le64_to_cpu(req->r_reassert_version.version); - dout("reassert_ver=%llu\n", - (unsigned long long) - le64_to_cpu(req->r_reassert_version.version)); - ceph_osdc_put_request(req); + *ver = version; + dout("reassert_ver=%llu\n", (unsigned long long) version); + ceph_osdc_put_request(osd_req); } return ret; done_err: bio_chain_put(rbd_req->bio); - ceph_osdc_put_request(req); + ceph_osdc_put_request(osd_req); done_pages: rbd_coll_end_req(rbd_req, ret, len); kfree(rbd_req); @@ -1215,9 +1216,9 @@ done_pages: /* * 
Ceph osd op callback */ -static void rbd_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg) +static void rbd_req_cb(struct ceph_osd_request *os
[PATCH 1/3] rbd: standardize rbd_request variable names
There are two names used for items of rbd_request structure type: "req" and "req_data". The former name is also used to represent items of pointers to struct ceph_osd_request. Change all variables that have these names so they are instead called "rbd_req" consistently. Signed-off-by: Alex Elder --- drivers/block/rbd.c | 50 ++ 1 file changed, 26 insertions(+), 24 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 5de49a1..9d8b406 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1090,10 +1090,12 @@ static void rbd_coll_end_req_index(struct request *rq, spin_unlock_irq(q->queue_lock); } -static void rbd_coll_end_req(struct rbd_request *req, +static void rbd_coll_end_req(struct rbd_request *rbd_req, int ret, u64 len) { - rbd_coll_end_req_index(req->rq, req->coll, req->coll_index, ret, len); + rbd_coll_end_req_index(rbd_req->rq, + rbd_req->coll, rbd_req->coll_index, + ret, len); } /* @@ -1121,12 +1123,12 @@ static int rbd_do_request(struct request *rq, int ret; u64 bno; struct timespec mtime = CURRENT_TIME; - struct rbd_request *req_data; + struct rbd_request *rbd_req; struct ceph_osd_request_head *reqhead; struct ceph_osd_client *osdc; - req_data = kzalloc(sizeof(*req_data), GFP_NOIO); - if (!req_data) { + rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO); + if (!rbd_req) { if (coll) rbd_coll_end_req_index(rq, coll, coll_index, -ENOMEM, len); @@ -1134,8 +1136,8 @@ static int rbd_do_request(struct request *rq, } if (coll) { - req_data->coll = coll; - req_data->coll_index = coll_index; + rbd_req->coll = coll; + rbd_req->coll_index = coll_index; } dout("rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n", @@ -1152,12 +1154,12 @@ static int rbd_do_request(struct request *rq, req->r_callback = rbd_cb; - req_data->rq = rq; - req_data->bio = bio; - req_data->pages = pages; - req_data->len = len; + rbd_req->rq = rq; + rbd_req->bio = bio; + rbd_req->pages = pages; + rbd_req->len = len; - req->r_priv = req_data; + req->r_priv = rbd_req; reqhead = req->r_request->front.iov_base; reqhead->snapid = cpu_to_le64(CEPH_NOSNAP); @@ -1202,11 +1204,11 @@ static int rbd_do_request(struct request *rq, return ret; done_err: - bio_chain_put(req_data->bio); + bio_chain_put(rbd_req->bio); ceph_osdc_put_request(req); done_pages: - rbd_coll_end_req(req_data, ret, len); - kfree(req_data); + rbd_coll_end_req(rbd_req, ret, len); + kfree(rbd_req); return ret; } @@ -1215,7 +1217,7 @@ done_pages: */ static void rbd_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg) { - struct rbd_request *req_data = req->r_priv; + struct rbd_request *rbd_req = req->r_priv; struct ceph_osd_reply_head *replyhead; struct ceph_osd_op *op; __s32 rc; @@ -1234,20 +1236,20 @@ static void rbd_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg) (unsigned long long) bytes, read_op, (int) rc); if (rc == -ENOENT && read_op) { - zero_bio_chain(req_data->bio, 0); + zero_bio_chain(rbd_req->bio, 0); rc = 0; - } else if (rc == 0 && read_op && bytes < req_data->len) { - zero_bio_chain(req_data->bio, bytes); - bytes = req_data->len; + } else if (rc == 0 && read_op && bytes < rbd_req->len) { + zero_bio_chain(rbd_req->bio, bytes); + bytes = rbd_req->len; } - rbd_coll_end_req(req_data, rc, bytes); + rbd_coll_end_req(rbd_req, rc, bytes); - if (req_data->bio) - bio_chain_put(req_data->bio); + if (rbd_req->bio) + bio_chain_put(rbd_req->bio); ceph_osdc_put_request(req); - kfree(req_data); + kfree(rbd_req); } static void rbd_simple_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg) -- 1.7.9.5 -- To 
unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] rbd: a few picky changes
These three changes are pretty trivial. -Alex [PATCH 1/3] rbd: standardize rbd_request variable names [PATCH 2/3] rbd: standardize ceph_osd_request variable names [PATCH 3/3] rbd: be picky about osd request status type -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SSD journal suggestion
On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta wrote: > 2012/11/8 Mark Nelson : >> I haven't done much with IPoIB (just RDMA), but my understanding is that it >> tends to top out at like 15Gb/s. Some others on this mailing list can >> probably speak more authoritatively. Even with RDMA you are going to top >> out at around 3.1-3.2GB/s. > > 15Gb/s is still faster than 10Gbe > But this speed limit seems to be kernel-related and should be the same > even in a 10Gbe environment, or not? We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use interrupt affinity and process binding. For our Ceph testing, we will set the affinity of two of the mlx4 interrupt handlers to cores 0 and 1 and we will not use process binding. For single stream Netperf, we do use process binding and bind it to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use process binding but we still see ~22 Gb/s. We used all of the Mellanox tuning recommendations for IPoIB available in their tuning pdf: http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf We looked at their interrupt affinity setting scripts and then wrote our own. Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we should get identical performance with both modes and we are looking into it. We are getting a new test cluster with FDR HCAs and I will look into those as well. Scott -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
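For what it's worth, the IPoIB mode can be inspected and flipped through sysfs (the interface name is an assumption, and writability depends on the kernel's IPoIB connected-mode support being built in); connected mode also allows a much larger MTU:

cat /sys/class/net/ib0/mode         # prints "connected" or "datagram"
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520           # connected-mode IPoIB supports up to 64K MTU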
Re: less cores more iops / speed
On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote: On 08.11.2012 01:59, Mark Nelson wrote: There's also the context switching overhead. It'd be interesting to know how much the writer processes were shifting around on cores. What do you mean by that? I'm talking about the KVM guest, not about the ceph nodes. In this case, is fio bouncing around between cores? Stefan, what tool were you using to do writes? as always: fio ;-) You could try using numactl to pin fio to a specific core. Also, it may be interesting to try multiple concurrent fio processes, and then concurrent fio processes with each pinned. Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
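A sketch of the concurrent-and-pinned variant suggested here (device path and job options are placeholders; both jobs share the device, so treat the numbers as relative only):

# Two fio processes, each pinned to its own core:
numactl --physcpubind=0 fio --name=w0 --filename=/dev/vdb --direct=1 \
    --rw=randwrite --bs=4k --runtime=60 &
numactl --physcpubind=1 fio --name=w1 --filename=/dev/vdb --direct=1 \
    --rw=randwrite --bs=4k --runtime=60 &
wait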
[PATCH 2/2] mds: Clear lock flushed if replica is waiting for AC_LOCKFLUSHED
From: "Yan, Zheng" So eval_gather() will not skip calling scatter_writebehind(), otherwise the replica lock may be in flushing state forever. Signed-off-by: Yan, Zheng --- src/mds/Locker.cc | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc index a1f957a..e2b1ff4 100644 --- a/src/mds/Locker.cc +++ b/src/mds/Locker.cc @@ -4383,8 +4383,12 @@ void Locker::handle_file_lock(ScatterLock *lock, MLock *m) if (lock->get_state() == LOCK_MIX_LOCK || lock->get_state() == LOCK_MIX_LOCK2 || lock->get_state() == LOCK_MIX_EXCL || - lock->get_state() == LOCK_MIX_TSYN) + lock->get_state() == LOCK_MIX_TSYN) { lock->decode_locked_state(m->get_data()); + // replica is waiting for AC_LOCKFLUSHED, eval_gather() should not + // delay calling scatter_writebehind(). + lock->clear_flushed(); +} if (lock->is_gathering()) { dout(7) << "handle_file_lock " << *in << " from " << from -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] mds: Don't expire log segment before it's fully flushed
From: "Yan, Zheng" Expiring log segment before it's fully flushed may cause various issues during log replay. Signed-off-by: Yan, Zheng --- src/leveldb | 2 +- src/mds/MDLog.cc | 8 +--- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/src/mds/MDLog.cc b/src/mds/MDLog.cc index cac5615..b02c181 100644 --- a/src/mds/MDLog.cc +++ b/src/mds/MDLog.cc @@ -330,6 +330,11 @@ void MDLog::trim(int m) assert(ls); p++; +if (ls->end > journaler->get_write_safe_pos()) { + dout(5) << "trim segment " << ls->offset << ", not fully flushed yet, safe " + << journaler->get_write_safe_pos() << " < end " << ls->end << dendl; + break; +} if (expiring_segments.count(ls)) { dout(5) << "trim already expiring segment " << ls->offset << ", " << ls->num_events << " events" << dendl; } else if (expired_segments.count(ls)) { @@ -412,9 +417,6 @@ void MDLog::_expired(LogSegment *ls) if (!capped && ls == get_current_segment()) { dout(5) << "_expired not expiring " << ls->offset << ", last one and !capped" << dendl; - } else if (ls->end > journaler->get_write_safe_pos()) { -dout(5) << "_expired not expiring " << ls->offset << ", not fully flushed yet, safe " - << journaler->get_write_safe_pos() << " < end " << ls->end << dendl; } else { // expired. expired_segments.insert(ls); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: clock synchronisation
On 08.11.2012 13:00, Wido den Hollander wrote: On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote: Hello list, is there any preferred way to use clock synchronisation? I've tried running openntpd and ntpd on all servers but I'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 0.063360s in the future, clocks not synchronized
What NTP server are you using? Network latency might cause the clocks not to be synchronised. pool.ntp.org. But I've now switched to Debian's chrony instead of ntpd and that seems to work fine. I haven't seen any of these messages since. Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
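A minimal chrony.conf along these lines would do it (the pool servers and step threshold are examples, not the actual config used here; on Debian it lives in /etc/chrony/chrony.conf):

server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
driftfile /var/lib/chrony/chrony.drift
makestep 1.0 3    # step the clock during the first 3 updates if it is off by >1s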
Re: clock synchronisation
On Thu, Nov 8, 2012 at 4:00 PM, Wido den Hollander wrote: > On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote: >> Hello list, >> is there any preferred way to use clock synchronisation? >> I've tried running openntpd and ntpd on all servers but I'm still getting:
>> 2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 0.063136s in the future, clocks not synchronized
>> 2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 0.063285s in the future, clocks not synchronized
>> 2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 0.063301s in the future, clocks not synchronized
>> 2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 0.063360s in the future, clocks not synchronized
> What NTP server are you using? Network latency might cause the clocks not to be synchronised. There is no real reason to worry here; quorum only suffers from large desync delays of several seconds or more. If you have unsynchronised clocks on mon nodes with such big delays, requests issued from the CLI, e.g. creating a new connection, may wait as long as the delay itself, depending on the clock value of the selected monitor node. Clock drift is caused mostly by heavy load, but of course playing with clocksources may have some effect (since most systems already use the HPET timer, there is only one way: sync with an NTP server as frequently as you need to prevent drift). -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
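If a small residual offset cannot be eliminated, the monitors' warning threshold itself is tunable; the option below existed in this era's ceph (default 0.05s), though raising it only hides the symptom rather than fixing the drift:

[mon]
        mon clock drift allowed = 0.1    ; default is 0.05 seconds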
Re: clock synchronisation
On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote: Hello list, is there any preferred way to use clock synchronisation? I've tried running openntpd and ntpd on all servers but I'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 0.063360s in the future, clocks not synchronized
What NTP server are you using? Network latency might cause the clocks not to be synchronised. Wido -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problem with hanging cluster
On 08.11.2012 12:14, Adam Ochmański wrote: Hi, our test cluster gets stuck every time one of our OSD hosts goes down; even after the missing OSDs come back "up" and recovery reaches 100%, the cluster still does not work properly. I forgot to add the version of ceph I use: 0.53-422-g2d20f3a -- Best, blink -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
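Useful first data points for a report like this, using the ceph CLI of this era (output will show which placement groups stay stuck after recovery claims to finish):

ceph -s                        # overall health and pg state summary
ceph health detail             # which pgs/osds are implicated
ceph pg dump_stuck unclean     # pgs stuck unclean after the osd came back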
Re: less cores more iops / speed
On 08.11.2012 10:05, Alexandre DERUMIER wrote: Have you tried to compare virtio-blk and virtio-scsi? How to change? Right now I'm using the PVE defaults => scsi-hd. (virtio-blk is "classic" virtio ;) Have you tried directly from the host with the rbd kernel module? No, I didn't know how to use it ;-) http://ceph.com/docs/master/rbd/rbd-ko/
# modprobe rbd
# sudo rbd map {image-name} --pool {pool-name} --id {user-name}
This also gives me 8000 iops on the host at 3.6 GHz, so it is the same as in KVM. Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
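A concrete run of that kernel-client test, with placeholder pool/image/user names, including the cleanup step:

modprobe rbd
rbd map testimg --pool rbd --id admin
ls /dev/rbd*                   # the image shows up as e.g. /dev/rbd1
fio --name=t --filename=/dev/rbd1 --direct=1 --rw=randwrite --bs=4k --runtime=30
rbd unmap /dev/rbd1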
Re: less cores more iops / speed
> Have you tried to compare virtio-blk and virtio-scsi? >> How to change? Right now I'm using the PVE defaults => scsi-hd. (virtio-blk is "classic" virtio ;) >> Have you tried directly from the host with the rbd kernel module? >> No, I don't know how to use it ;-) http://ceph.com/docs/master/rbd/rbd-ko/
# modprobe rbd
# sudo rbd map {image-name} --pool {pool-name} --id {user-name}
(then you'll have a /dev/rbd1) - Original message - From: "Stefan Priebe - Profihost AG" To: "Alexandre DERUMIER" Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org, "Mark Nelson" Sent: Thursday, 8 November 2012 10:02:23 Subject: Re: less cores more iops / speed On 08.11.2012 09:58, Alexandre DERUMIER wrote: >>> What do you mean by that? I'm talking about the KVM guest not about the >>> ceph nodes. > > Have you tried to compare virtio-blk and virtio-scsi? How to change? Right now I'm using the PVE defaults => scsi-hd. > Have you tried directly from the host with the rbd kernel module? No, I don't know how to use it ;-) Stefan > - Original message - > > From: "Stefan Priebe - Profihost AG" > To: "Mark Nelson" > Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org > Sent: Thursday, 8 November 2012 09:45:17 > Subject: Re: less cores more iops / speed > > On 08.11.2012 01:59, Mark Nelson wrote: >> There's also the context switching overhead. It'd be interesting to >> know how much the writer processes were shifting around on cores. > What do you mean by that? I'm talking about the KVM guest not about the > ceph nodes. > >> Stefan, what tool were you using to do writes? > as always: fio ;-) > > Stefan > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
clock synchronisation
Hello list, is there any preferred way to use clock synchronisation? I've tried running openntpd and ntpd on all servers but I'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 0.063360s in the future, clocks not synchronized
Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
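A quick way to see whether ntpd has actually converged on each mon host (the offset column in ntpq's output is in milliseconds; the reference server below is an example):

ntpq -pn                      # '*' marks the selected peer; check the offset column
ntpdate -q 0.pool.ntp.org     # query-only comparison against a reference server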
Re: less cores more iops / speed
On 08.11.2012 09:58, Alexandre DERUMIER wrote: What do you mean by that? I'm talking about the KVM guest, not about the ceph nodes. Have you tried to compare virtio-blk and virtio-scsi? How to change? Right now I'm using the PVE defaults => scsi-hd. Have you tried directly from the host with the rbd kernel module? No, I don't know how to use it ;-) Stefan - Original message - From: "Stefan Priebe - Profihost AG" To: "Mark Nelson" Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org Sent: Thursday, 8 November 2012 09:45:17 Subject: Re: less cores more iops / speed On 08.11.2012 01:59, Mark Nelson wrote: There's also the context switching overhead. It'd be interesting to know how much the writer processes were shifting around on cores. What do you mean by that? I'm talking about the KVM guest, not about the ceph nodes. Stefan, what tool were you using to do writes? as always: fio ;-) Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
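To answer the "how to change" question in raw qemu terms (Proxmox generates this command line itself; pool/image names are placeholders and the rest of the VM options are omitted): virtio-blk attaches the rbd image directly with if=virtio, while virtio-scsi routes it through a scsi controller plus a scsi-hd device:

# virtio-blk ("classic" virtio):
qemu-system-x86_64 -m 1024 \
    -drive file=rbd:rbd/vm-disk,if=virtio,cache=none

# virtio-scsi:
qemu-system-x86_64 -m 1024 \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=rbd:rbd/vm-disk,if=none,id=drive0,cache=none \
    -device scsi-hd,drive=drive0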
Re: less cores more iops / speed
>> What do you mean by that? I'm talking about the KVM guest, not about the >> ceph nodes. Have you tried to compare virtio-blk and virtio-scsi? Have you tried directly from the host with the rbd kernel module? - Original message - From: "Stefan Priebe - Profihost AG" To: "Mark Nelson" Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org Sent: Thursday, 8 November 2012 09:45:17 Subject: Re: less cores more iops / speed On 08.11.2012 01:59, Mark Nelson wrote: > There's also the context switching overhead. It'd be interesting to > know how much the writer processes were shifting around on cores. What do you mean by that? I'm talking about the KVM guest, not about the ceph nodes. > Stefan, what tool were you using to do writes? as always: fio ;-) Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: syncfs slower than without syncfs
Done: http://tracker.newdream.net/issues/3461 On 08.11.2012 04:09, Josh Durgin wrote: On 11/07/2012 08:26 AM, Stefan Priebe wrote: On 07.11.2012 16:04, Mark Nelson wrote: Whew, glad you found the problem Stefan! I was starting to wonder what was going on. :) Do you mind filing a bug about the control dependencies? Sure, where should I file it? http://www.tracker.newdream.net/projects/ceph/issues/new -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: less cores more iops / speed
On 08.11.2012 01:59, Mark Nelson wrote: There's also the context switching overhead. It'd be interesting to know how much the writer processes were shifting around on cores. What do you mean by that? I'm talking about the KVM guest, not about the ceph nodes. Stefan, what tool were you using to do writes? as always: fio ;-) Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problems creating new ceph cluster when using journal on block device
On 08/11/12 21:08, Wido den Hollander wrote: On 08-11-12 08:29, Travis Rhoden wrote: Hey folks, I'm trying to set up a brand new Ceph cluster, based on v0.53. My hardware has SSDs for journals, and I'm trying to get mkcephfs to initialize everything for me. However, the command hangs forever and I eventually have to kill it. After poking around a bit, it's clear that the problem has something to do with the journal. If I comment out the journal in ceph.conf, the commands proceed just fine. This is the first time I've tried to throw a journal on a block device rather than a file, so maybe I've done something wrong with that. Here is the info from ceph.conf: [osd] osd journal size = 4000 Not sure if this is the problem, but when using a block device you don't have to specify the size for the journal. Also might be useful to know make/model of ssd, plus motherboard make/model (in case commenting out size does not fix)! Regards Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
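One quick sanity check while debugging a hang like this is to verify that the journal partition accepts direct, aligned writes the way the OSD opens it (the mkfs log above shows "directio = 1"); the partition name is from this thread, and note that this overwrites data on it:

# Write 4MB of zeros with O_DIRECT; if this stalls, the problem is below ceph:
dd if=/dev/zero of=/dev/sda5 bs=4096 count=1024 oflag=direct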
Re: problems creating new ceph cluster when using journal on block device
On 08-11-12 08:29, Travis Rhoden wrote: Hey folks, I'm trying to set up a brand new Ceph cluster, based on v0.53. My hardware has SSDs for journals, and I'm trying to get mkcephfs to intialize everything for me. However, the command hangs forever and I eventually have to kill it. After poking around a bit, it's clear that the problem has something to do with the journal. If I comment out the journal in ceph.conf, the commands proceed just find. This is the first time I've tried to throw a journal on a block device rather than a file, so maybe I've done something wrong with that. Here is the info from ceph.conf: [osd] osd journal size = 4000 Not sure if this is the problem, but when using a block device you don't have to specify the size for the journal. Wido [osd.0] host = ceph1 osd journal = /dev/sda5 when I log in the log file, here is what I see: 2012-11-07 23:18:20.578623 7fe2743e3780 1 filestore(/var/lib/ceph/osd/ceph-0) mkfs in /var/lib/ceph/osd/ceph-0 2012-11-07 23:18:20.578699 7fe2743e3780 1 filestore(/var/lib/ceph/osd/ceph-0) mkfs fsid is already set to 4aac6842-8d71-4405-88ad-e3e9e4da308d 2012-11-07 23:18:20.632138 7fe2743e3780 1 filestore(/var/lib/ceph/osd/ceph-0) leveldb db exists/created 2012-11-07 23:18:20.634338 7fe2743e3780 0 journal kernel version is 3.2.0 2012-11-07 23:18:20.634579 7fe2743e3780 1 journal _open /dev/sda5 fd 9: 4194304000 bytes, block size 4096 bytes, directio = 1, aio = 0 2012-11-07 23:18:20.634995 7fe2743e3780 1 journal check: header looks ok 2012-11-07 23:18:20.636020 7fe2743e3780 1 filestore(/var/lib/ceph/osd/ceph-0) mkfs done in /var/lib/ceph/osd/ceph-0 2012-11-07 23:18:20.682113 7fe2743e3780 0 filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is supported and appears to work 2012-11-07 23:18:20.682125 7fe2743e3780 0 filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option 2012-11-07 23:18:20.682424 7fe2743e3780 0 filestore(/var/lib/ceph/osd/ceph-0) mount did NOT detect btrfs 2012-11-07 23:18:20.781938 7fe2743e3780 0 filestore(/var/lib/ceph/osd/ceph-0) mount syncfs(2) syscall fully supported (by glibc and kernel) 2012-11-07 23:18:20.782061 7fe2743e3780 0 filestore(/var/lib/ceph/osd/ceph-0) mount found snaps <> 2012-11-07 23:18:20.823915 7fe2743e3780 0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2012-11-07 23:18:20.826137 7fe2743e3780 0 journal kernel version is 3.2.0 2012-11-07 23:18:20.826386 7fe2743e3780 1 journal _open /dev/sda5 fd 15: 4194304000 bytes, block size 4096 bytes, directio = 1, aio = 0 So I know it is trying to use the right partition/block device. It just never get's past that line. Finally, I tried to track things down myself to see what was hanging using strace. 
I ran: strace /usr/bin/ceph-osd -c /tmp/travis/conf --monmap /tmp/travis/monmap -i 0 --mkfs --mkkey And the final output from that is: open("/dev/sda5", O_RDONLY) = 15 fstat(15, {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 5), ...}) = 0 ioctl(15, BLKGETSIZE64, 0x7fffe7a587a8) = 0 geteuid() = 0 pipe2([16, 17], O_CLOEXEC) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5365f28a50) = 707 close(17) = 0 fcntl(16, F_SETFD, 0) = 0 fstat(16, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5365f14000 read(16, "\n/dev/sda5:\n write-caching = 1 "..., 4096) = 37 open("/proc/version", O_RDONLY) = 17 read(17, "Linux version 3.2.0-23-generic ("..., 127) = 127 futex(0x2db807c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2db8078, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x2db8028, FUTEX_WAKE_PRIVATE, 1) = 1 close(17) = 0 close(16) = 0 wait4(707, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 707 munmap(0x7f5365f14000, 4096)= 0 io_setup(128, {139996169318400})= 0 futex(0x2db807c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2db8078, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x2db8028, FUTEX_WAKE_PRIVATE, 1) = 1 pread(15, "\2\0\0\\0\0\0\1\0\0\0\0\0\0\0J\254hB\215qD\5\210\255\343\351\344\3320\215"..., 4096, 0) = 4096 And that's as far as it gets. Any thoughts? After some sleep, I'll try throwing the journal back on a file instead of a block device and see if that does it. Can anyone confirm that using a block device instead of a file is actually better performance? Thanks, - Travis -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http