[PATCH] vstart: allow minimum pool size of one

2012-11-08 Thread Noah Watkins
I needed this patch after some simple 1 OSD vstart environments
refused to allow clients to connect.

--

A minimum pool size of 2 was introduced by 13486857cf. This sets the
minimum to one so that basic vstart environments work.

Signed-off-by: Noah Watkins 

diff --git a/src/vstart.sh b/src/vstart.sh
index 4565efa..bdf02f3 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -290,6 +290,7 @@ if [ "$start_mon" -eq 1 ]; then
 [global]
 osd pg bits = 3
 osd pgp bits = 5  ; (invalid, but ceph should cope!)
+osd pool default min size = 1
 EOF
[ "$cephx" -eq 1 ] && cat<> $conf
 auth supported = cephx
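
The same effect can be had at runtime on an existing cluster, without
editing vstart.sh; a sketch, assuming the default 'data' pool and that
the pool-set interface of this vintage already accepts min_size:

  # pool name and subcommand availability are assumptions for this era
  ceph osd pool set data min_size 1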


Re: bobtail timing

2012-11-08 Thread Samuel Just
I've got wip_recovery_qos and wip_persist_missing that should go into bobtail.

wip_recovery_qos passed regression (mostly; the failures were due to fsx, a
bug already fixed in master, and timeouts waiting for machines), and is
waiting on review.

wip_persist_missing has a teuthology test I'll push tomorrow
(wip_divergent_priors).  I think the second commit in wip_persist_missing
(formerly wip_divergent_entries) still needs review.
-Sam

On Thu, Nov 8, 2012 at 5:30 PM, Yehuda Sadeh  wrote:
> On Wed, Oct 31, 2012 at 1:46 PM, Sage Weil  wrote:
>> I would like to freeze v0.55, the "bobtail" stable release, at the end of
>> next week.  If there is any functionality you are working on that should
>> be included, we need to get it into master (preferably well) before that.
>> There will be several weeks of testing in the 'next' branch after that
>> (probably 3 weeks) before it is released.
>
> I merged (against current master) and pushed all the pending rgw stuff
> to wip-rgw-integration. This includes:
>
> wip-post-cleaned
> wip-stripe
> wip-keystone
> wip-3452
> wip-3453
> wip-swift-token
>
> All that stuff needs to go into bobtail, but still waiting for review.
> The bottom 3 are trivial.
>
> Yehuda


Re: rbd map command hangs for 15 minutes during system start up

2012-11-08 Thread Josh Durgin

On 11/08/2012 02:10 PM, Mandell Degerness wrote:

We are seeing a somewhat random, but frequent hang on our systems
during startup.  The hang happens at the point where an "rbd map
" command is run.

I've attached the ceph logs from the cluster.  The map command happens
at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
be seen in the log as 172.18.0.15:0/1143980479.

It appears as if the TCP socket is opened to the OSD, but then times
out 15 minutes later, the process gets data when the socket is closed
on the client server and it retries.

Please help.

We are using ceph version 0.48.2argonaut
(commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).

We are using a 3.5.7 kernel with the following list of patches applied:

1-libceph-encapsulate-out-message-data-setup.patch
2-libceph-dont-mark-footer-complete-before-it-is.patch
3-libceph-move-init-of-bio_iter.patch
4-libceph-dont-use-bio_iter-as-a-flag.patch
5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
8-libceph-protect-ceph_con_open-with-mutex.patch
9-libceph-reset-connection-retry-on-successfully-negotiation.patch
10-rbd-only-reset-capacity-when-pointing-to-head.patch
11-rbd-set-image-size-when-header-is-updated.patch
12-libceph-fix-crypto-key-null-deref-memory-leak.patch
13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
17-libceph-check-for-invalid-mapping.patch
18-ceph-propagate-layout-error-on-osd-request-creation.patch
19-rbd-BUG-on-invalid-layout.patch
20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
21-ceph-avoid-32-bit-page-index-overflow.patch
23-ceph-fix-dentry-reference-leak-in-encode_fh.patch

Any suggestions?


The log shows your monitors don't have their clocks synchronized closely
enough to make much progress (including authenticating new connections).
That's probably the real issue; 0.2s is a pretty large clock drift.
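
The right fix is to run ntp on the monitor hosts so their clocks agree.
As a stopgap you can also widen the monitors' tolerance in ceph.conf; a
sketch, assuming this era's option name (the default is 0.05s):

  [mon]
  mon clock drift allowed = 0.3  ; illustrative value, option name assumed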


One thought is that the following patch (which we could not apply) is
what is required:

22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch


This is certainly useful too, but I don't think it's the cause of
the delay in this case.

Josh


Re: bobtail timing

2012-11-08 Thread Yehuda Sadeh
On Wed, Oct 31, 2012 at 1:46 PM, Sage Weil  wrote:
> I would like to freeze v0.55, the "bobtail" stable release, at the end of
> next week.  If there is any functionality you are working on that should
> be included, we need to get it into master (preferably well) before that.
> There will be several weeks of testing in the 'next' branch after that
> (probably 3 weeks) before it is released.

I merged (against current master) and pushed all the pending rgw stuff
to wip-rgw-integration. This includes:

wip-post-cleaned
wip-stripe
wip-keystone
wip-3452
wip-3453
wip-swift-token

All that stuff needs to go into bobtail, but still waiting for review.
The bottom 3 are trivial.

Yehuda


Re: SSD journal suggestion / rsockets

2012-11-08 Thread Dieter Kasper
Joseph,

I've downloaded and read the presentation from 'Sean Hefty / Intel Corporation'
about rsockets, which sounds very promising to me.
Can you please teach me how to get access to the rsockets source ?

Thanks,
-Dieter


On Thu, Nov 08, 2012 at 09:12:45PM +0100, Joseph Glanville wrote:
> On 9 November 2012 02:00, Atchley, Scott  wrote:
> > On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
> >
> >> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
> >>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
> >>>  wrote:
> >>>
>  2012/11/8 Mark Nelson :
> > I haven't done much with IPoIB (just RDMA), but my understanding is 
> > that it
> > tends to top out at like 15Gb/s.  Some others on this mailing list can
> > probably speak more authoritatively.  Even with RDMA you are going to 
> > top
> > out at around 3.1-3.2GB/s.
> 
>  15Gb/s is still faster than 10Gbe
>  But this speed limit seems to be kernel-related and should be the same
>  even in a 10Gbe environment, or not?
> >>>
> >>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using 
> >>> Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running 
> >>> Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on 
> >>> whether I use interrupt affinity and process binding.
> >>>
> >>> For our Ceph testing, we will set the affinity of two of the mlx4 
> >>> interrupt handlers to cores 0 and 1 and we will not be using process 
> >>> binding. For single stream Netperf, we do use process binding and bind it 
> >>> to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent 
> >>> Netperf runs, we do not use process binding but we still see ~22 Gb/s.
> >>
> >> Scott, this is very interesting!  Does setting the interrupt affinity
> >> make the biggest difference then when you have concurrent netperf
> >> processes going?  For some reason I thought that setting interrupt
> >> affinity wasn't even guaranteed in linux any more, but this is just some
> >> half-remembered recollection from a year or two ago.
> >
> > We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
> > and without affinity:
> >
> > Default (irqbalance running)   12.8 Gb/s
> > IRQ balance off                13.0 Gb/s
> > Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
> >
> > When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
> > ~22 Gb/s for a single stream.
> >
> >>> We used all of the Mellanox tuning recommendations for IPoIB available in 
> >>> their tuning pdf:
> >>>
> >>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
> >>>
> >>> We looked at their interrupt affinity setting scripts and then wrote our 
> >>> own.
> >>>
> >>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
> >>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
> >>> datagram mode. Mellanox claims that we should get identical performance 
> >>> with both modes and we are looking into it.
> >>>
> >>> We are getting a new test cluster with FDR HCAs and I will look into 
> >>> those as well.
> >>
> >> Nice!  At some point I'll probably try to justify getting some FDR cards
> >> in house.  I'd definitely like to hear how FDR ends up working for you.
> >
> > I'll post the numbers when I get access after they are set up.
> >
> > Scott
> >
> 
> If you are running Ceph purely in userspace you could try using rsockets.
> rsockets is a pure userspace implementation of sockets over RDMA. It
> has much much lower latency and close to native throughput.
> My guess is rsockets will probably work perfectly and should give you
> 95% of theoretical max performance.
> 
> I would like to see a somewhat native implementation of RDMA in Ceph one day.
> I was doing some preliminary work on it 1.5 years ago when Ceph was
> first gaining traction but we didn't end up putting our focus on Ceph
> and as such I never got anywhere with it.
> In theory one only needs to use RDMA for the fast path to gain a lot of
> benefit. This can be done even in the RBD kernel module with the
> RDMA-CM, which will interact nicely across kernelspace and userspace
> (they actually share the same API, thankfully).
> 
> Joseph.
> 
> -- 
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846

Re: trying to import crushmap results in max_devices > osdmap max_osd

2012-11-08 Thread Josh Durgin

On 11/07/2012 07:28 AM, Stefan Priebe - Profihost AG wrote:

Hello,

i've added two nodes with 4 devices each and modified the crushmap.

But importing the new map results in:
crushmap max_devices 55 > osdmap max_osd 35

What's wrong?


I think this is an obsolete check since 
ee541c0f8d871172ec61962372efca943308e5fe.


wip-max-devices removes these checks.
Sage, is there any reason to keep them?
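
In the meantime, bumping the osdmap's max_osd up past the crushmap's
max_devices should let the import through; a sketch, with the value
matching the error above (subcommand syntax as of this era):

  # 55 matches the crushmap max_devices from the error message
  ceph osd setmaxosd 55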

Josh


Review request branch wip-java-test

2012-11-08 Thread Joe Buck
I have a 3 line change to the file qa/workunits/libcephfs-java/test.sh that 
tweaks how LD_LIBRARY_PATH is set for the test execution.

The branch is wip-java-test in ceph.git.

Best,
-Joe Buck


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Stefan Priebe

On 08.11.2012 22:50, Josh Durgin wrote:

It looks like a not insignificant portion of time is spent in the
logging infrastructure. Could you add this to the osds' configuration
to prevent any debug log gathering (the 0/0 values are the logged/gathered levels):

debug lockdep = 0/0

...


debug throttle = 0/0


New one attached.

Stefan


out.pdf
Description: Adobe PDF document


rbd map command hangs for 15 minutes during system start up

2012-11-08 Thread Mandell Degerness
We are seeing a somewhat random, but frequent hang on our systems
during startup.  The hang happens at the point where an "rbd map
" command is run.

I've attached the ceph logs from the cluster.  The map command happens
at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
be seen in the log as 172.18.0.15:0/1143980479.

It appears as if the TCP socket is opened to the OSD, but then times
out 15 minutes later, the process gets data when the socket is closed
on the client server and it retries.

Please help.

We are using ceph version 0.48.2argonaut
(commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).

We are using a 3.5.7 kernel with the following list of patches applied:

1-libceph-encapsulate-out-message-data-setup.patch
2-libceph-dont-mark-footer-complete-before-it-is.patch
3-libceph-move-init-of-bio_iter.patch
4-libceph-dont-use-bio_iter-as-a-flag.patch
5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
8-libceph-protect-ceph_con_open-with-mutex.patch
9-libceph-reset-connection-retry-on-successfully-negotiation.patch
10-rbd-only-reset-capacity-when-pointing-to-head.patch
11-rbd-set-image-size-when-header-is-updated.patch
12-libceph-fix-crypto-key-null-deref-memory-leak.patch
13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
17-libceph-check-for-invalid-mapping.patch
18-ceph-propagate-layout-error-on-osd-request-creation.patch
19-rbd-BUG-on-invalid-layout.patch
20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
21-ceph-avoid-32-bit-page-index-overflow.patch
23-ceph-fix-dentry-reference-leak-in-encode_fh.patch

Any suggestions?

One thought is that the following patch (which we could not apply) is
what is required:

22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch

Regards,
Mandell Degerness


hanglog_ceph.log.gz
Description: GNU Zip compressed data


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Stefan Priebe

On 08.11.2012 22:58, Mark Nelson wrote:

Also, I'm not sure what version you are running, but you may want to try
testing master and see if that helps.  Sam has done some work on our
threading and locking code that might help.


This is git master (two hours old).

Stefan


Re: unexpected problem with radosgw fcgi

2012-11-08 Thread Sławomir Skowron
Ok, I will dig into nginx, thanks.

On 8 Nov 2012, at 22:48, Yehuda Sadeh  wrote:

> On Wed, Nov 7, 2012 at 6:16 AM, Sławomir Skowron  wrote:
>> I have realized that requests from fastcgi in nginx for radosgw return:
>>
>> HTTP/1.1 200, not a HTTP/1.1 200 OK
>>
>> Any other cgi that I run, for example php via fastcgi, returns this the
>> way the RFC says, with OK.
>>
>> Has anyone else experienced this problem?
>
> I have seen a similar issue in the past with nginx. It doesn't happen
> with apache. My guess is that it's either something with the way nginx
> is configured, or some difference in the fastcgi module
> implementation.
>
>>
>> I see in code:
>>
>> ./src/rgw/rgw_rest.cc line 36
>>
>> const static struct rgw_html_errors RGW_HTML_ERRORS[] = {
>>{ 0, 200, "" },
>> 
>>
>> What if i change this into:
>>
>> { 0, 200, "OK" },
>
> The third field there specifies the error code embedded in the
> returned XML with S3, so it wouldn't fix anything.
>
>
> Yehuda


Re: SSD journal suggestion / rsockets

2012-11-08 Thread Joseph Glanville
On 9 November 2012 08:21, Dieter Kasper  wrote:
> Joseph,
>
> I've downloaded and read the presentation from 'Sean Hefty / Intel 
> Corporation'
> about rsockets, which sounds very promising to me.
> Can you please teach me how to get access to the rsockets source ?
>
> Thanks,
> -Dieter
>
>

rsockets is distributed as part of librdmacm. You can clone the git
repository here:
git://beany.openfabrics.org/~shefty/librdmacm.git

I recommend using the latest master as it features much better support
for forking.
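
A rough sketch of trying it out against an unmodified binary follows;
the autotools steps are the usual ones, and the preload library name and
install path may differ by version, so treat both as illustrative:

  git clone git://beany.openfabrics.org/~shefty/librdmacm.git
  cd librdmacm
  ./autogen.sh && ./configure && make && sudo make install
  # route an existing program's socket calls through rsockets
  # (preload path is an assumption; check where your build installs it)
  LD_PRELOAD=/usr/local/lib/rsocket/librspreload.so netperf -H <server>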

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Mark Nelson

On 11/08/2012 03:50 PM, Josh Durgin wrote:

On 11/08/2012 01:27 PM, Stefan Priebe wrote:

On 08.11.2012 17:06, Mark Nelson wrote:

On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

On 08.11.2012 16:01, Sage Weil wrote:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf
can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
I've no idea what to do with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the
data file.  Here's a nice tutorial:

https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file
to root and running perf report as root.  If you still see missing
symbols, you may want to just give up and try sysprof.


I've now used google perftools / google CPU profiler. It was the only
tool that worked out of the box ;-)

Attached is a PDF with a profiled ceph-osd process while 4k random write.


It looks like a not insignificant portion of time is spent in the
logging infrastructure. Could you add this to the osds' configuration
to prevent any debug log gathering (the 0/0 values are the logged/gathered levels):

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

Josh


Also, I'm not sure what version you are running, but you may want to try 
testing master and see if that helps.  Sam has done some work on our 
threading and locking code that might help.



Re: less cores more iops / speed

2012-11-08 Thread Andrey Korolyov
On Thu, Nov 8, 2012 at 7:53 PM, Alexandre DERUMIER  wrote:
>>>So it is a problem of KVM which lets the processes jump between cores a
>>>lot.
>
> maybe numad from redhat can help ?
> http://fedoraproject.org/wiki/Features/numad
>
> It tries to keep a process on the same numa node and I think it also does 
> some dynamic pinning.

Numad only keeps memory chunks on the preferred node; cpu pinning, which
is the primary goal here, should be done separately via libvirt, or
manually for the qemu process via cpuset. (libvirt does pinning via
taskset, and that seems to be broken at least in debian wheezy - even
with an affinity mask set for the qemu process, load spreads all over the
numa node, including cpus outside the set.)
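
Doing the pinning by hand looks roughly like this (the domain name, vcpu
numbers and host core lists are illustrative):

  # pin guest vcpus 0 and 1 to host cores 2 and 3 via libvirt
  virsh vcpupin guest0 0 2
  virsh vcpupin guest0 1 3
  # or pin the whole qemu process directly (assumes a single qemu pid)
  taskset -pc 2,3 $(pidof qemu-system-x86_64)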

>
> - Original Message -
>
> De: "Stefan Priebe - Profihost AG" 
> À: "Mark Nelson" 
> Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org
> Envoyé: Jeudi 8 Novembre 2012 16:14:32
> Objet: Re: less cores more iops / speed
>
> On 08.11.2012 14:19, Mark Nelson wrote:
>> On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:
>>> On 08.11.2012 01:59, Mark Nelson wrote:
 There's also the context switching overhead. It'd be interesting to
 know how much the writer processes were shifting around on cores.
>>> What do you mean by that? I'm talking about the KVM guest not about the
>>> ceph nodes.
>>
>> in this case, is fio bouncing around between cores?
>
> Thanks, you're correct. If I bind fio to two cores on an 8 core VM it runs
> with 16,000 iops.
>
> So it is a problem of KVM which lets the processes jump between cores a
> lot.
>
> Greets,
> Stefan


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Josh Durgin

On 11/08/2012 01:27 PM, Stefan Priebe wrote:

On 08.11.2012 17:06, Mark Nelson wrote:

On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

On 08.11.2012 16:01, Sage Weil wrote:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
I've no idea what to do with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the
data file.  Here's a nice tutorial:

https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file
to root and running perf report as root.  If you still see missing
symbols, you may want to just give up and try sysprof.


I've now used google perftools / google CPU profiler. It was the only
tool that worked out of the box ;-)

Attached is a PDF with a profiled ceph-osd process while 4k random write.


It looks like a not insignificant portion of time is spent in the
logging infrastructure. Could you add this to the osds' configuration
to prevent any debug log gathering (the 0/0 values are the logged/gathered levels):

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
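
If restarting the osds is inconvenient, the same settings can usually be
injected into a running daemon instead; a sketch, assuming osd.0 and the
injectargs syntax of this era:

  # osd id and the flag list shown are illustrative
  ceph osd tell 0 injectargs '--debug-ms 0/0 --debug-osd 0/0 --debug-filestore 0/0'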

Josh


Re: unexpected problem with radosgw fcgi

2012-11-08 Thread Yehuda Sadeh
On Wed, Nov 7, 2012 at 6:16 AM, Sławomir Skowron  wrote:
> I have realized that requests from fastcgi in nginx for radosgw return:
>
> HTTP/1.1 200, not a HTTP/1.1 200 OK
>
> Any other cgi that I run, for example php via fastcgi, returns this the
> way the RFC says, with OK.
>
> Has anyone else experienced this problem?

I have seen a similar issue in the past with nginx. It doesn't happen
with apache. My guess is that it's either something with the way nginx
is configured, or some difference in the fastcgi module
implementation.

>
> I see in code:
>
> ./src/rgw/rgw_rest.cc line 36
>
> const static struct rgw_html_errors RGW_HTML_ERRORS[] = {
> { 0, 200, "" },
> 
>
> What if i change this into:
>
> { 0, 200, "OK" },

The third field there specifies the error code embedded in the
returned XML with S3, so it wouldn't fix anything.


Yehuda


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Stefan Priebe

On 08.11.2012 17:06, Mark Nelson wrote:

On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

On 08.11.2012 16:01, Sage Weil wrote:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
I've no idea what to do with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the
data file.  Here's a nice tutorial:

https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file
to root and running perf report as root.  If you still see missing
symbols, you may want to just give up and try sysprof.


I've now used google perftools / google CPU profiler. It was the only 
tool that worked out of the box ;-)


Attached is a PDF with a profiled ceph-osd process while 4k random write.

Stefan


out.pdf
Description: Adobe PDF document


Re: SSD journal suggestion

2012-11-08 Thread Joseph Glanville
On 9 November 2012 02:00, Atchley, Scott  wrote:
> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
>
>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>>  wrote:
>>>
 2012/11/8 Mark Nelson :
> I haven't done much with IPoIB (just RDMA), but my understanding is that 
> it
> tends to top out at like 15Gb/s.  Some others on this mailing list can
> probably speak more authoritatively.  Even with RDMA you are going to top
> out at around 3.1-3.2GB/s.

 15Gb/s is still faster than 10Gbe
 But this speed limit seems to be kernel-related and should be the same
 even in a 10Gbe environment, or not?
>>>
>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>>> interrupt affinity and process binding.
>>>
>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>>> handlers to cores 0 and 1 and we will not using process binding. For single 
>>> stream Netperf, we do use process binding and bind it to the same core 
>>> (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do 
>>> not use process binding but we still see ~22 Gb/s.
>>
>> Scott, this is very interesting!  Does setting the interrupt affinity
>> make the biggest difference then when you have concurrent netperf
>> processes going?  For some reason I thought that setting interrupt
>> affinity wasn't even guaranteed in linux any more, but this is just some
>> half-remembered recollection from a year or two ago.
>
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
> and without affinity:
>
> Default (irqbalance running)   12.8 Gb/s
> IRQ balance off                13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
> ~22 Gb/s for a single stream.
>
>>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>>> their tuning pdf:
>>>
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>
>>> We looked at their interrupt affinity setting scripts and then wrote our 
>>> own.
>>>
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>>> datagram mode. Mellanox claims that we should get identical performance 
>>> with both modes and we are looking into it.
>>>
>>> We are getting a new test cluster with FDR HCAs and I will look into those 
>>> as well.
>>
>> Nice!  At some point I'll probably try to justify getting some FDR cards
>> in house.  I'd definitely like to hear how FDR ends up working for you.
>
> I'll post the numbers when I get access after they are set up.
>
> Scott
>

If you are running Ceph purely in userspace you could try using rsockets.
rsockets is a pure userspace implementation of sockets over RDMA. It
has much much lower latency and close to native throughput.
My guess is rsockets will probably work perfectly and should give you
95% of theoretical max performance.

I would like to see a somewhat native implementation of RDMA in Ceph one day.
I was doing some preliminary work on it 1.5 years ago when Ceph was
first gaining traction but we didn't end up putting our focus on Ceph
and as such I never got anywhere with it.
In theory one only needs to use RDMA for the fast path to gain a lot of
benefit. This can be done even in the RBD kernel module with the
RDMA-CM, which will interact nicely across kernelspace and userspace
(they actually share the same API, thankfully).

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846


Re: Ignoresync hack no longer applies on 3.6.5

2012-11-08 Thread Nick Bartos
Sorry about that, I think it got chopped.  Here's a full trace from
another run, using kernel 3.6.6 and definitely has the patch applied:
https://gist.github.com/4041120

There are no instances of "sync_fs_one_sb skipping" in the logs.



On Mon, Nov 5, 2012 at 1:29 AM, Sage Weil  wrote:
> On Sun, 4 Nov 2012, Nick Bartos wrote:
>> Unfortunately I'm still seeing deadlocks.  The trace was taken after a
>> 'sync' from the command line was hung for a couple minutes.
>>
>> There was only one debug message (one fs on the system was mounted with 
>> 'mand'):
>
> This was with the updated patch applied?
>
> The dump below doesn't look complete, btw.. I don't see any ceph-osd
> processes, among other things.
>
> sage
>
>>
>> kernel: [11441.168954]  [] ? sync_fs_one_sb+0x4d/0x4d
>>
>> Here's the trace:
>>
>> javaS 88040b06ba08 0  1623  1 0x
>>  88040cb6dd08 0082  880405da8b30
>>   00012b40 00012b40 00012b40
>>  88040cb6dfd8 00012b40 00012b40 88040cb6dfd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? vfs_write+0xd0/0xdf
>>  [] ? fput+0x18/0xb6
>>  [] ? fput_light+0xd/0xf
>>  [] ? sys_write+0x61/0x6e
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040ca4ba48 0  1624  1 0x
>>  88040cb0bd08 0082 88040cb0bc88 81813410
>>  88040cb0bd28 00012b40 00012b40 00012b40
>>  88040cb0bfd8 00012b40 00012b40 88040cb0bfd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] ? blkdev_issue_flush+0xc0/0xd2
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? fput+0x18/0xb6
>>  [] ? do_device_not_available+0xe/0x10
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040ca4b058 0  1625  1 0x
>>  880429d1fd08 0082 0400 81813410
>>  88040b06b4a8 00012b40 00012b40 00012b40
>>  880429d1ffd8 00012b40 00012b40 880429d1ffd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? do_device_not_available+0xe/0x10
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040cd11a08 0  1632  1 0x
>>  88040c40fd08 0082 88040c40fd68 88042b17f4e0
>>  88040c40ff38 00012b40 00012b40 00012b40
>>  88040c40ffd8 00012b40 00012b40 88040c40ffd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] ? update_rmtp+0x65/0x65
>>  [] ? hrtimer_start_range_ns+0x14/0x16
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? vfs_write+0xd0/0xdf
>>  [] ? do_device_not_available+0xe/0x10
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040cd10628 0  1633  1 0x
>>  88040cd7da88 0082 0cd7da18 81813410
>>  88040cccecc0 00012b40 00012b40 00012b40
>>  88040cd7dfd8 00012b40 00012b40 88040cd7dfd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] schedule_timeout+0x36/0xe3
>>  [] ? _local_bh_enable_ip.clone.8+0x20/0x89
>>  [] ? local_bh_enable_ip+0xe/0x10
>>  [] ? _raw_spin_unlock_bh+0x16/0x18
>>  [] ? release_sock+0x128/0x131
>>  [] sk_wait_data+0x82/0xc5
>>  [] ? wake_up_bit+0x2a/0x2a
>>  [] ? local_bh_enable+0xe/0x10
>>  [] tcp_recvmsg+0x4c5/0x92e
>>  [] ? update_curr+0xd6/0x110
>>  [] ? __switch_to+0x1ac/0x33c
>>  [] inet_recvmsg+0x5e/0x73
>>  [] __sock_recvmsg+0x75/0x84
>>  [] sock_aio_read+0xf2/0x106
>>  [] do_sync_read+0x70/0xad
>>  [] vfs_read+0xbc/0xdc
>>  [] ? fput+0x18/0xb6
>>  [] sys_read+0x4a/0x6e
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040ce11a88 0  1634  1 0x
>>  88040c9699f8 0082 0098967f 88042b17f4e0
>>   00012b40 00012b40 00012b40
>>  88040c969fd8 00012b40 00012b40 88040c969fd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] schedule_hrtimeout_range_clock+0xd2/0x11b
>>  [] ? update_rmtp+0x65/0x65
>>  [] ? hrtimer_start_range_ns+0x14/0x16
>>  [] schedule_hrtimeout_range+0x13/0x15
>>  [] poll_schedule_timeout+0x48/0x64
>>  [] do_poll.clone.3+0x1d0/0x1f1
>>  [] do_sys_poll+0x146/0x1bd
>>  [] ? __pollwait+0xcc/0xcc
>>  [] ? __sock_recvmsg+0x75/0x84
>>  [] ? sock_recvmsg+0x5b/0x7a
>>  [] ? get_futex_key+0x94/0x224
>>  [] ? _raw_spin_lock+0xe/0x10
>>  [] ? double_lock_hb+0x31/0x36
>>  [] ? fget_light+0x6d/0x84
>>  [] ? fput_light+0xd/0xf
>>  [] ? sys_recvf

Re: SSD journal suggestion

2012-11-08 Thread Atchley, Scott
On Nov 8, 2012, at 11:19 AM, Andrey Korolyov  wrote:

> On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott  wrote:
>> On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:
>> 
>>> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
>>> 
 On 11/08/2012 07:55 AM, Atchley, Scott wrote:
> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>  wrote:
> 
>> 2012/11/8 Mark Nelson :
>>> I haven't done much with IPoIB (just RDMA), but my understanding is 
>>> that it
>>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>>> probably speak more authoritatively.  Even with RDMA you are going to 
>>> top
>>> out at around 3.1-3.2GB/s.
>> 
>> 15Gb/s is still faster than 10Gbe
>> But this speed limit seems to be kernel-related and should be the same
>> even in a 10Gbe environment, or not?
> 
> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using 
> Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running 
> Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on 
> whether I use interrupt affinity and process binding.
> 
> For our Ceph testing, we will set the affinity of two of the mlx4 
> interrupt handlers to cores 0 and 1 and we will not be using process 
> binding. For single stream Netperf, we do use process binding and bind it 
> to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent 
> Netperf runs, we do not use process binding but we still see ~22 Gb/s.
 
 Scott, this is very interesting!  Does setting the interrupt affinity
 make the biggest difference then when you have concurrent netperf
 processes going?  For some reason I thought that setting interrupt
 affinity wasn't even guaranteed in linux any more, but this is just some
 half-remembered recollection from a year or two ago.
>>> 
>>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
>>> and without affinity:
>>> 
>>> Default (irqbalance running)   12.8 Gb/s
>>> IRQ balance off                13.0 Gb/s
>>> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>>> 
>>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
>>> ~22 Gb/s for a single stream.
>> 
> 
> Did you try the Mellanox-baked modules for 2.6.32 before that?

That came with RHEL6? No.

Scott

> 
>> Note, I used hwloc to determine which socket was closer to the mlx4 device 
>> on our dual socket machines. On these nodes, hwloc reported that both 
>> sockets were equally close, but a colleague has machines where one socket is 
>> closer than the other. In that case, bind to the closer socket (or to cores 
>> within the closer socket).
>> 
>>> 
> We used all of the Mellanox tuning recommendations for IPoIB available in 
> their tuning pdf:
> 
> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
> 
> We looked at their interrupt affinity setting scripts and then wrote our 
> own.
> 
> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
> datagram mode. Mellanox claims that we should get identical performance 
> with both modes and we are looking into it.
> 
> We are getting a new test cluster with FDR HCAs and I will look into 
> those as well.
 
 Nice!  At some point I'll probably try to justify getting some FDR cards
 in house.  I'd definitely like to hear how FDR ends up working for you.
>>> 
>>> I'll post the numbers when I get access after they are set up.
>>> 
>>> Scott
>>> 


Re: problems creating new ceph cluster when using journal on block device

2012-11-08 Thread Mark Nelson

On 11/08/2012 11:36 AM, Travis Rhoden wrote:

Solved!

I stumbled into the solution while switching from block device to a
file.  I was being bit by running mkcephfs multiple times -- it wasn't
really failing on the journal, it was failing because the OSD data
disk had been initialized before.  I couldn't see that until I used a
file for the journal and then I see log output like:


Yeah, that was a change that landed a couple of months ago.  It's really 
important now to blow away the old data (I just reformat) if you want a 
totally clean ceph deployment rather than just running mkcephfs.
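
i.e., something along these lines for each osd before re-running mkcephfs
(the device and mount point here are illustrative):

  umount /var/lib/ceph/osd/ceph-0
  mkfs.xfs -f /dev/sdb1
  mount /dev/sdb1 /var/lib/ceph/osd/ceph-0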




=== osd.0 ===
2012-11-08 16:41:37.677620 7ffc3cfcd780 -1 provided osd id 0 != superblock's -1
2012-11-08 16:41:37.678726 7ffc3cfcd780 -1  ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument

I unmounted the OSD's that had been touched before, reformatted them,
and then remounted.  I setup ceph.conf to use block devices for the
journals, and then everything proceeded normally.

So the final relevant bits from my ceph.conf file look like:

[osd]
 osd journal size = 0
 journal dio = true
 journal aio = true

[osd.0]
 host = ceph1
 osd journal = /dev/sda5

[osd.1]
 host = ceph1
 osd journal = /dev/sda6
...

Thanks,

  - Travis

On Thu, Nov 8, 2012 at 10:08 AM, Travis Rhoden  wrote:

One more thing -- Google search says this is harmless -- I see quite a
few of these in syslog:

hdparm: sending ioctl 2285 to a partition!






Re: problems creating new ceph cluster when using journal on block device

2012-11-08 Thread Travis Rhoden
Solved!

I stumbled into the solution while switching from block device to a
file.  I was being bit by running mkcephfs multiple times -- it wasn't
really failing on the journal, it was failing because the OSD data
disk had been initialized before.  I couldn't see that until I used a
file for the journal and then I see log output like:

=== osd.0 ===
2012-11-08 16:41:37.677620 7ffc3cfcd780 -1 provided osd id 0 != superblock's -1
2012-11-08 16:41:37.678726 7ffc3cfcd780 -1  ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument

I unmounted the OSD's that had been touched before, reformatted them,
and then remounted.  I setup ceph.conf to use block devices for the
journals, and then everything proceeded normally.

So the final relevant bits from my ceph.conf file look like:

[osd]
osd journal size = 0
journal dio = true
journal aio = true

[osd.0]
host = ceph1
osd journal = /dev/sda5

[osd.1]
host = ceph1
osd journal = /dev/sda6
...

Thanks,

 - Travis

On Thu, Nov 8, 2012 at 10:08 AM, Travis Rhoden  wrote:
> One more thing -- Google search says this is harmless -- I see quite a
> few of these in syslog:
>
> hdparm: sending ioctl 2285 to a partition!


Re: Review request for branch wip-java-tests

2012-11-08 Thread Sage Weil
Merged, thanks!

sage

On Thu, 8 Nov 2012, Joe Buck wrote:

> I have a branch for review that reworks the tests for the java bindings and
> builds them if both --enable-cephfs-java and --with-debug are specified. The
> tests can also be built and run via ant.
> 
> Branch name is wip-java-tests.
> 
> Regards,
> -Joe Buck
> 
> 


Review request for branch wip-java-tests

2012-11-08 Thread Joe Buck
I have a branch for review that reworks the tests for the java bindings 
and builds them if both --enable-cephfs-java and --with-debug are 
specified. The tests can also be built and run via ant.


Branch name is wip-java-tests.

Regards,
-Joe Buck


Re: some snapshot problems

2012-11-08 Thread Sage Weil
Hi Liu,

Sorry for the late reply; I have had a very busy week.  :)

On Thu, 1 Nov 2012, liu yaqi wrote:
> Dear Mr. Weil
> 
> I am a student at the Institute of Computing Technology, Chinese Academy of
> Sciences, and I am studying the implementation of snapshots in the ceph
> system. There are some things that puzzle me, and I want to ask you some
> questions. First question: there is a command "ceph osd cluster_snap
> {name}", but I cannot find its complete implementation, and I want to ask
> whether the snapshot for the whole cluster has been implemented?

The idea was to have a low-level cluster-wide snapshot that could be used 
for recovery if ceph itself went haywire and corrupted itself.  The idea 
would be for the OSDs to create btrfs-level snapshots of their data.  It 
was never completely implemented, though, and the OSD bits have mostly 
been removed.  In particular, we never made a way for the monitor state to 
be checkpointed, which would be necessary for the whole scheme to work 
properly.

> Second question, there
> seem to be snapshots for pools and images. I want to ask what pool and
> image mean. Does an image mean an osd?

Lots of different snapshots:

 - librados lets you do 'selfmanaged snaps' in its API, which let an 
   application control which snapshots apply to which objects. 
 - you can create a 'pool' snapshot on an entire librados pool.  this 
   cannot be used at the same time as rbd, fs, or the above 'selfmanaged' 
   snaps.
 - rbd lets you snapshot block device images (by using the librados 
   selfmanaged snap API).
 - the ceph file system lets you snapshot any subdirectory (again 
   utilizing the underlying RADOS functionality).
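
Concretely, the rbd and file system cases look something like this (the
pool, image, snapshot and mount point names are illustrative):

  # snapshot an rbd image
  rbd snap create --snap mysnap rbd/myimage
  # snapshot a cephfs subdirectory via its hidden .snap directory
  mkdir /mnt/ceph/somedir/.snap/mysnap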

> Third question, in the "mds" folder,
> there are files like "snapserver", "MClientSnap" and so on; are these files
> used to snapshot the metadata only? 

Yes.

> Do they have some relationship
> with the pool or image snapshots? 

Not really.

> The last question: are there snapshots
> for a file path in ceph? Or must the snapshots be done on metadata and
> data separately?

For the file system, you create a snapshot on a directory and it affects 
all files in that directory and beneath it, including the data in those 
files.

Hope that helps!
sage

> If you would be kind enough to help me on the above questions, I will be
> grateful. And I am looking forward to your reply.
> 
> With best wishes for you.
> 
> Yours, YaqiLiu
> 

Re: SSD journal suggestion

2012-11-08 Thread Andrey Korolyov
On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott  wrote:
> On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:
>
>> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
>>
>>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
 On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
  wrote:

> 2012/11/8 Mark Nelson :
>> I haven't done much with IPoIB (just RDMA), but my understanding is that 
>> it
>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>> probably speak more authoritatively.  Even with RDMA you are going to top
>> out at around 3.1-3.2GB/s.
>
> 15Gb/s is still faster than 10Gbe
> But this speed limit seems to be kernel-related and should be the same
> even in a 10Gbe environment, or not?

 We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using 
 Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running 
 Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on 
 whether I use interrupt affinity and process binding.

 For our Ceph testing, we will set the affinity of two of the mlx4 
 interrupt handlers to cores 0 and 1 and we will not be using process binding. 
 For single stream Netperf, we do use process binding and bind it to the 
 same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf 
 runs, we do not use process binding but we still see ~22 Gb/s.
>>>
>>> Scott, this is very interesting!  Does setting the interrupt affinity
>>> make the biggest difference then when you have concurrent netperf
>>> processes going?  For some reason I thought that setting interrupt
>>> affinity wasn't even guaranteed in linux any more, but this is just some
>>> half-remembered recollection from a year or two ago.
>>
>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
>> and without affinity:
>>
>> Default (irqbalance running)   12.8 Gb/s
>> IRQ balance off                13.0 Gb/s
>> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>>
>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
>> ~22 Gb/s for a single stream.
>

Did you try the Mellanox-baked modules for 2.6.32 before that?

> Note, I used hwloc to determine which socket was closer to the mlx4 device on 
> our dual socket machines. On these nodes, hwloc reported that both sockets 
> were equally close, but a colleague has machines where one socket is closer 
> than the other. In that case, bind to the closer socket (or to cores within 
> the closer socket).
>
>>
 We used all of the Mellanox tuning recommendations for IPoIB available in 
 their tuning pdf:

 http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

 We looked at their interrupt affinity setting scripts and then wrote our 
 own.

 Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
 Connected mode is less scalable, but currently I only get ~3 Gb/s with 
 datagram mode. Mellanox claims that we should get identical performance 
 with both modes and we are looking into it.

 We are getting a new test cluster with FDR HCAs and I will look into those 
 as well.
>>>
>>> Nice!  At some point I'll probably try to justify getting some FDR cards
>>> in house.  I'd definitely like to hear how FDR ends up working for you.
>>
>> I'll post the numbers when I get access after they are set up.
>>
>> Scott
>>
>


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Mark Nelson

On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

On 08.11.2012 16:01, Sage Weil wrote:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
I've no idea what to do with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the 
data file.  Here's a nice tutorial:


https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file 
to root and running perf report as root.  If you still see missing 
symbols, you may want to just give up and try sysprof.
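
The whole loop is roughly the following (the record flags here are the
common ones, not necessarily the exact ones used for this capture):

  # sample all cpus with call graphs for 10 seconds
  perf record -a -g -- sleep 10
  # then, in the directory containing perf.data
  perf report --sort symbol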





I would love to see where the CPU is spending most of its time.  This is
on current master?

Yes


 I expect there are still some low-hanging fruit that
can bring CPU utilization down (or even boost iops).

Would be great to find them.

Stefan


Re: less cores more iops / speed

2012-11-08 Thread Alexandre DERUMIER
>>So it is a problem of KVM which lets the processes jump between cores a 
>>lot. 

maybe numad from redhat can help ?
http://fedoraproject.org/wiki/Features/numad

It tries to keep a process on the same numa node and I think it also does 
some dynamic pinning.

- Original Message - 

De: "Stefan Priebe - Profihost AG"  
À: "Mark Nelson"  
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org 
Envoyé: Jeudi 8 Novembre 2012 16:14:32 
Objet: Re: less cores more iops / speed 

On 08.11.2012 14:19, Mark Nelson wrote: 
> On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote: 
>> On 08.11.2012 01:59, Mark Nelson wrote: 
>>> There's also the context switching overhead. It'd be interesting to 
>>> know how much the writer processes were shifting around on cores. 
>> What do you mean by that? I'm talking about the KVM guest not about the 
>> ceph nodes. 
> 
> in this case, is fio bouncing around between cores? 

Thanks, you're correct. If I bind fio to two cores on an 8 core VM it runs 
with 16,000 iops. 

So it is a problem of KVM which lets the processes jump between cores a 
lot. 

Greets, 
Stefan 


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Stefan Priebe - Profihost AG

On 08.11.2012 16:01, Mark Nelson wrote:

Hi Stefan,

You might want to try running sysprof or perf while the OSDs are running
during the tests and see where CPU time is being spent.  Also, how are
you determining how much CPU usage is being used?


Hi Mark,

I have a 300MB perf.data file and no idea what to do next ;-)

Stefan


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Stefan Priebe - Profihost AG

On 08.11.2012 16:01, Sage Weil wrote:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10 times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly 
I've no idea what to do with it next.



I would love to see where the CPU is spending most of its time.  This is
on current master?

Yes


 I expect there are still some low-hanging fruit that
can bring CPU utilization down (or even boost iops).

Would be great to find them.

Stefan


Re: less cores more iops / speed

2012-11-08 Thread Stefan Priebe - Profihost AG

On 08.11.2012 14:19, Mark Nelson wrote:

On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:

On 08.11.2012 01:59, Mark Nelson wrote:

There's also the context switching overhead.  It'd be interesting to
know how much the writer processes were shifting around on cores.

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


in this case, is fio bouncing around between cores?


Thanks, you're correct. If I bind fio to two cores on an 8 core VM it runs 
with 16,000 iops.


So it is a problem of KVM which lets the processes jump between cores a 
lot.
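
Binding a process like fio to specific cores is just a taskset wrapper;
a sketch, with job options that are illustrative rather than the exact
ones used here:

  # core list, job options and target device are illustrative
  taskset -c 0,1 fio --name=randwrite --rw=randwrite --bs=4k \
      --ioengine=libaio --direct=1 --iodepth=32 --filename=/dev/vdb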


Greets,
Stefan


Re: SSD journal suggestion

2012-11-08 Thread Atchley, Scott
On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:

> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
> 
>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>>  wrote:
>>> 
 2012/11/8 Mark Nelson :
> I haven't done much with IPoIB (just RDMA), but my understanding is that 
> it
> tends to top out at like 15Gb/s.  Some others on this mailing list can
> probably speak more authoritatively.  Even with RDMA you are going to top
> out at around 3.1-3.2GB/s.
 
 15Gb/s is still faster than 10Gbe
 But this speed limit seems to be kernel-related and should be the same
 even in a 10Gbe environment, or not?
>>> 
>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>>> interrupt affinity and process binding.
>>> 
>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>>> handlers to cores 0 and 1 and we will not use process binding. For single 
>>> stream Netperf, we do use process binding and bind it to the same core 
>>> (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do 
>>> not use process binding but we still see ~22 Gb/s.
>> 
>> Scott, this is very interesting!  Does setting the interrupt affinity 
>> make the biggest difference then when you have concurrent netperf 
>> processes going?  For some reason I thought that setting interrupt 
>> affinity wasn't even guaranteed in linux any more, but this is just some 
>> half-remembered recollection from a year or two ago.
> 
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
> and without affinity:
> 
> Default (irqbalance running)   12.8 Gb/s
IRQ balance off                13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
> 
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
> ~22 Gb/s for a single stream.

Note, I used hwloc to determine which socket was closer to the mlx4 device on 
our dual socket machines. On these nodes, hwloc reported that both sockets were 
equally close, but a colleague has machines where one socket is closer than the 
other. In that case, bind to the closer socket (or to cores within the closer 
socket).
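
A minimal sketch of that kind of pinning (the IRQ numbers and the server
name are placeholders; check /proc/interrupts for the real ones):

  grep mlx4 /proc/interrupts            # find the HCA's IRQ numbers
  echo 1 > /proc/irq/52/smp_affinity    # mask 0x1 = core 0 (IRQ 52 is made up)
  echo 2 > /proc/irq/53/smp_affinity    # mask 0x2 = core 1
  taskset -c 0 netperf -H <server>      # bind netperf to core 0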

> 
>>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>>> their tuning pdf:
>>> 
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>> 
>>> We looked at their interrupt affinity setting scripts and then wrote our 
>>> own.
>>> 
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>>> datagram mode. Mellanox claims that we should get identical performance 
>>> with both modes and we are looking into it.
>>> 
>>> We are getting a new test cluster with FDR HCAs and I will look into those 
>>> as well.
>> 
>> Nice!  At some point I'll probably try to justify getting some FDR cards 
>> in house.  I'd definitely like to hear how FDR ends up working for you.
> 
> I'll post the numbers when I get access after they are set up.
> 
> Scott
> 



Re: problems creating new ceph cluster when using journal on block device

2012-11-08 Thread Travis Rhoden
>>> [osd]
>>>  osd journal size = 4000
>>
>>
>> Not sure if this is the problem, but when using a block device you don't
>> have to specify the size for the journal.

So happy to know that, Wido!  I had hoped there was a way to skip that.

Tried without it -- only difference in the logs was seeing that it
picked up the full size of the partition.  So, same result.

> Also might be useful to know make/model of ssd, plus motherboard make/model
> (in case commenting out size does not fix)!

It's an Intel X25-E, 64GB.  It's a place-holder until some bigger ones
we have on order show up.

The mother board is a SuperMicro X8DT6.  SSDs are connected to onboard
SATA ports, data drives are connected to LSI 9211-8i (SAS2008)

Maybe there is a special way I need to do the partition?  My goal was
to throw 6 journals on this disk, and it is partitioned like so:

Model: ATA SSDSA2SH064G1GC (scsi)
Disk /dev/sda: 64.0GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      1049kB  512MB   511MB   primary                raid
 2      512MB   2511MB  2000MB  primary                raid
 3      2511MB  6512MB  4000MB  primary                raid
 4      6512MB  64.0GB  57.5GB  extended
 5      6513MB  15.1GB  8590MB  logical
 6      15.1GB  23.7GB  8590MB  logical
 7      23.7GB  32.3GB  8590MB  logical
 8      32.3GB  40.9GB  8590MB  logical
 9      40.9GB  49.5GB  8590MB  logical
10      49.5GB  58.1GB  8590MB  logical


So, sda5-10 are my journal partitions.  I know that I have consumed
most of the drive here, and that is bad for the SSD and such, but it
really is a temporary setup.
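
If it helps to reproduce, a layout like the one above can be built with
plain parted (a sketch; the offsets are illustrative, not a recommendation):

  parted /dev/sda mklabel msdos
  parted /dev/sda mkpart primary 1049kB 512MB
  parted /dev/sda mkpart extended 6512MB 64.0GB
  parted /dev/sda mkpart logical 6513MB 15.1GB    # repeat for each journal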

 - Travis

On Thu, Nov 8, 2012 at 3:24 AM, Mark Kirkwood
 wrote:
> On 08/11/12 21:08, Wido den Hollander wrote:
>>
>>
>>
>> On 08-11-12 08:29, Travis Rhoden wrote:
>>>
>>> Hey folks,
>>>
>>> I'm trying to set up a brand new Ceph cluster, based on v0.53.  My
>>> hardware has SSDs for journals, and I'm trying to get mkcephfs to
initialize everything for me. However, the command hangs forever and I
>>> eventually have to kill it.
>>>
>>> After poking around a bit, it's clear that the problem has something
>>> to do with the journal.  If I comment out the journal in ceph.conf,
the commands proceed just fine.  This is the first time I've tried to
>>> throw a journal on a block device rather than a file, so maybe I've
>>> done something wrong with that.
>>>
>>> Here is the info from ceph.conf:
>>>
>>>
>>> [osd]
>>>  osd journal size = 4000
>>
>>
>> Not sure if this is the problem, but when using a block device you don't
>> have to specify the size for the journal.
>
>
> Also might be useful to know make/model of ssd, plus motherboard make/model
> (in case commenting out size does not fix)!
>
> Regards
>
> Mark
>


Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Mark Nelson

Hi Stefan,

You might want to try running sysprof or perf while the OSDs are running 
during the tests and see where CPU time is being spent.  Also, how are 
you determining how much CPU is being used?


Mark

On 11/08/2012 08:58 AM, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more load on rand 4k writes than on 4k reads?

Stefan

Am 07.11.2012 21:41, schrieb Stefan Priebe:

Hello list,

while benchmarking I was wondering why the ceph-osd load is so
extremely high under random 4k write i/o.

Here an example while benchmarking:

random 4k write: 16.000 iop/s 180% CPU Load in top from EACH ceph-osd
process

random 4k read: 16.000 iop/s 19% CPU Load in top from EACH ceph-osd
process

seq 4M write: 800MB/s 14% CPU Load in top from EACH ceph-osd process

seq 4M read: 1600MB/s 9% CPU Load in top from EACH ceph-osd process

I can't understand why in this single case the load is so EXTREMELY high.

Greets
Stefan



Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Sage Weil
On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:
> Is there any way to find out why a ceph-osd process takes around 10 times more
> load on rand 4k writes than on 4k reads?

Something like perf or oprofile is probably your best bet.  perf can be 
tedious to deploy, depending on where your kernel is coming from.  
oprofile seems to be deprecated, although I've had good results with it in 
the past.

I would love to see where the CPU is spending most of its time.  This is 
on current master?  I expect there are still some low-hanging fruit that 
can bring CPU utilization down (or even boost iops).
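
For the record, a minimal recording recipe (a sketch; it assumes a single
ceph-osd per node, since perf's -p option wants a comma-separated pid list):

  perf record -g -p $(pidof ceph-osd) -- sleep 10   # 10s sample with call graphs
  perf report --stdio                               # then inspect the profile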

sage



> 
> Stefan
> 
> Am 07.11.2012 21:41, schrieb Stefan Priebe:
> > Hello list,
> > 
> > while benchmarking I was wondering why the ceph-osd load is so
> > extremely high under random 4k write i/o.
> > 
> > Here an example while benchmarking:
> > 
> > random 4k write: 16.000 iop/s 180% CPU Load in top from EACH ceph-osd
> > process
> > 
> > random 4k read: 16.000 iop/s 19% CPU Load in top from EACH ceph-osd process
> > 
> > seq 4M write: 800MB/s 14% CPU Load in top from EACH ceph-osd process
> > 
> > seq 4M read: 1600MB/s 9% CPU Load in top from EACH ceph-osd process
> > 
> > I can't understand why in this single case the load is so EXTREMELY high.
> > 
> > Greets
> > Stefan


Re: SSD journal suggestion

2012-11-08 Thread Atchley, Scott
On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:

> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>  wrote:
>> 
>>> 2012/11/8 Mark Nelson :
 I haven't done much with IPoIB (just RDMA), but my understanding is that it
 tends to top out at like 15Gb/s.  Some others on this mailing list can
 probably speak more authoritatively.  Even with RDMA you are going to top
 out at around 3.1-3.2GB/s.
>>> 
>>> 15Gb/s is still faster than 10Gbe
>>> But this speed limit seems to be kernel-related and should be the same
>>> even in a 10Gbe environment, or not?
>> 
>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>> interrupt affinity and process binding.
>> 
>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>> handlers to cores 0 and 1 and we will not use process binding. For single 
>> stream Netperf, we do use process binding and bind it to the same core (i.e. 
>> 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use 
>> process binding but we still see ~22 Gb/s.
> 
> Scott, this is very interesting!  Does setting the interrupt affinity 
> make the biggest difference then when you have concurrent netperf 
> processes going?  For some reason I thought that setting interrupt 
> affinity wasn't even guaranteed in linux any more, but this is just some 
> half-remembered recollection from a year or two ago.

We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and 
without affinity:

Default (irqbalance running)   12.8 Gb/s
IRQ balance off                13.0 Gb/s
Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script

When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 
Gb/s for a single stream.

>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>> their tuning pdf:
>> 
>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>> 
>> We looked at their interrupt affinity setting scripts and then wrote our own.
>> 
>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>> datagram mode. Mellanox claims that we should get identical performance with 
>> both modes and we are looking into it.
>> 
>> We are getting a new test cluster with FDR HCAs and I will look into those 
>> as well.
> 
> Nice!  At some point I'll probably try to justify getting some FDR cards 
> in house.  I'd definitely like to hear how FDR ends up working for you.

I'll post the numbers when I get access after they are set up.

Scott



Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Stefan Priebe - Profihost AG
Is there any way to find out why a ceph-osd process takes around 10 
times more load on rand 4k writes than on 4k reads?


Stefan

Am 07.11.2012 21:41, schrieb Stefan Priebe:

Hello list,

while benchmarking I was wondering why the ceph-osd load is so
extremely high under random 4k write i/o.

Here an example while benchmarking:

random 4k write: 16.000 iop/s 180% CPU Load in top from EACH ceph-osd
process

random 4k read: 16.000 iop/s 19% CPU Load in top from EACH ceph-osd process

seq 4M write: 800MB/s 14% CPU Load in top from EACH ceph-osd process

seq 4M read: 1600MB/s 9% CPU Load in top from EACH ceph-osd process

I can't understand why in this single case the load is so EXTREMELY high.

Greets
Stefan



Re: SSD journal suggestion

2012-11-08 Thread Mark Nelson

On 11/08/2012 07:55 AM, Atchley, Scott wrote:

On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
 wrote:


2012/11/8 Mark Nelson :

I haven't done much with IPoIB (just RDMA), but my understanding is that it
tends to top out at like 15Gb/s.  Some others on this mailing list can
probably speak more authoritatively.  Even with RDMA you are going to top
out at around 3.1-3.2GB/s.


15Gb/s is still faster than 10Gbe
But this speed limit seems to be kernel-related and should be the same
even in a 10Gbe environment, or not?


We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
(the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
interrupt affinity and process binding.

For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
handlers to cores 0 and 1 and we will not use process binding. For single 
stream Netperf, we do use process binding and bind it to the same core (i.e. 0) 
and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use 
process binding but we still see ~22 Gb/s.


Scott, this is very interesting!  Does setting the interrupt affinity 
make the biggest difference then when you have concurrent netperf 
processes going?  For some reason I thought that setting interrupt 
affinity wasn't even guaranteed in linux any more, but this is just some 
half-remembered recollection from a year or two ago.




We used all of the Mellanox tuning recommendations for IPoIB available in their 
tuning pdf:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

We looked at their interrupt affinity setting scripts and then wrote our own.

Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected 
mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we 
should get identical performance with both modes and we are looking into it.
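
For the record, the IPoIB mode can be checked and switched per interface
through sysfs (a sketch; it assumes the interface is named ib0):

  cat /sys/class/net/ib0/mode            # prints "connected" or "datagram"
  echo connected > /sys/class/net/ib0/mode
  ip link set ib0 mtu 65520              # connected mode allows the large MTU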

We are getting a new test cluster with FDR HCAs and I will look into those as 
well.


Nice!  At some point I'll probably try to justify getting some FDR cards 
in house.  I'd definitely like to hear how FDR ends up working for you.




Scott



Mark


[PATCH] rbd: end request on error in rbd_do_request() caller

2012-11-08 Thread Alex Elder
Only one of the three callers of rbd_do_request() provide a
collection structure to aggregate status.

If an error occurs in rbd_do_request(), have the caller
take care of calling rbd_coll_end_req() if necessary in
that one spot.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index fb727c0..835153e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1128,12 +1128,8 @@ static int rbd_do_request(struct request *rq,
struct ceph_osd_client *osdc;

rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO);
-   if (!rbd_req) {
-   if (coll)
-   rbd_coll_end_req_index(rq, coll, coll_index,
-  (s32) -ENOMEM, len);
+   if (!rbd_req)
return -ENOMEM;
-   }

if (coll) {
rbd_req->coll = coll;
@@ -1208,7 +1204,6 @@ done_err:
bio_chain_put(rbd_req->bio);
ceph_osdc_put_request(osd_req);
 done_pages:
-   rbd_coll_end_req(rbd_req, (s32) ret, len);
kfree(rbd_req);
return ret;
 }
@@ -1361,7 +1356,9 @@ static int rbd_do_op(struct request *rq,
 ops,
 coll, coll_index,
 rbd_req_cb, 0, NULL);
-
+   if (ret < 0)
+   rbd_coll_end_req_index(rq, coll, coll_index,
+   (s32) ret, seg_len);
rbd_destroy_ops(ops);
 done:
kfree(seg_name);
-- 
1.7.9.5



[PATCH 2/2] rbd: a little more cleanup of rbd_rq_fn()

2012-11-08 Thread Alex Elder
Now that a big hunk in the middle of rbd_rq_fn() has been moved
into its own routine we can simplify it a little more.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   50 +++++++++++++++++++++++---------------------------
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6aed59b..fb727c0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1649,53 +1649,49 @@ static int rbd_dev_do_request(struct request *rq,
 static void rbd_rq_fn(struct request_queue *q)
 {
struct rbd_device *rbd_dev = q->queuedata;
+   bool read_only = rbd_dev->mapping.read_only;
struct request *rq;

while ((rq = blk_fetch_request(q))) {
-   struct bio *bio;
-   bool do_write;
-   unsigned int size;
-   u64 ofs;
-   struct ceph_snap_context *snapc;
+   struct ceph_snap_context *snapc = NULL;
int result;

dout("fetched request\n");

-   /* filter out block requests we don't understand */
+   /* Filter out block requests we don't understand */
+
if ((rq->cmd_type != REQ_TYPE_FS)) {
__blk_end_request_all(rq, 0);
continue;
}
+   spin_unlock_irq(q->queue_lock);

-   /* deduce our operation (read, write) */
-   do_write = (rq_data_dir(rq) == WRITE);
-   if (do_write && rbd_dev->mapping.read_only) {
-   __blk_end_request_all(rq, -EROFS);
-   continue;
-   }
+   /* Stop writes to a read-only device */

-   spin_unlock_irq(q->queue_lock);
+   result = -EROFS;
+   if (read_only && rq_data_dir(rq) == WRITE)
+   goto out_end_request;
+
+   /* Grab a reference to the snapshot context */

down_read(&rbd_dev->header_rwsem);
+   if (rbd_dev->exists) {
+   snapc = ceph_get_snap_context(rbd_dev->header.snapc);
+   rbd_assert(snapc != NULL);
+   }
+   up_read(&rbd_dev->header_rwsem);

-   if (!rbd_dev->exists) {
+   if (!snapc) {
rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
-   up_read(&rbd_dev->header_rwsem);
dout("request for non-existent snapshot");
-   spin_lock_irq(q->queue_lock);
-   __blk_end_request_all(rq, -ENXIO);
-   continue;
+   result = -ENXIO;
+   goto out_end_request;
}

-   snapc = ceph_get_snap_context(rbd_dev->header.snapc);
-
-   up_read(&rbd_dev->header_rwsem);
-
-   size = blk_rq_bytes(rq);
-   ofs = blk_rq_pos(rq) * SECTOR_SIZE;
-   bio = rq->bio;
-
-   result = rbd_dev_do_request(rq, rbd_dev, snapc, ofs, size, bio);
+   result = rbd_dev_do_request(rq, rbd_dev, snapc,
+   blk_rq_pos(rq) * SECTOR_SIZE,
+   blk_rq_bytes(rq), rq->bio);
+out_end_request:
ceph_put_snap_context(snapc);
spin_lock_irq(q->queue_lock);
if (result < 0)
-- 
1.7.9.5



[PATCH 1/2] rbd: encapsulate handling for a single request

2012-11-08 Thread Alex Elder
In rbd_rq_fn(), requests are fetched from the block layer and each
request is processed, looping through the request's list of bio's
until they've all been consumed.

Separate the handling for a single request into its own function to
make it a bit easier to see what's going on.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |  119 ++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 63 insertions(+), 56 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index be18b5f..6aed59b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1585,6 +1585,64 @@ static struct rbd_req_coll *rbd_alloc_coll(int
num_reqs)
return coll;
 }

+static int rbd_dev_do_request(struct request *rq,
+   struct rbd_device *rbd_dev,
+   struct ceph_snap_context *snapc,
+   u64 ofs, unsigned int size,
+   struct bio *bio_chain)
+{
+   int num_segs;
+   struct rbd_req_coll *coll;
+   unsigned int bio_offset;
+   int cur_seg = 0;
+
+   dout("%s 0x%x bytes at 0x%llx\n",
+   rq_data_dir(rq) == WRITE ? "write" : "read",
+   size, (unsigned long long) blk_rq_pos(rq) * SECTOR_SIZE);
+
+   num_segs = rbd_get_num_segments(&rbd_dev->header, ofs, size);
+   if (num_segs <= 0)
+   return num_segs;
+
+   coll = rbd_alloc_coll(num_segs);
+   if (!coll)
+   return -ENOMEM;
+
+   bio_offset = 0;
+   do {
+   u64 limit = rbd_segment_length(rbd_dev, ofs, size);
+   unsigned int clone_size;
+   struct bio *bio_clone;
+
+   BUG_ON(limit > (u64) UINT_MAX);
+   clone_size = (unsigned int) limit;
+   dout("bio_chain->bi_vcnt=%hu\n", bio_chain->bi_vcnt);
+
+   kref_get(&coll->kref);
+
+   /* Pass a cloned bio chain via an osd request */
+
+   bio_clone = bio_chain_clone_range(&bio_chain,
+   &bio_offset, clone_size,
+   GFP_ATOMIC);
+   if (bio_clone)
+   (void) rbd_do_op(rq, rbd_dev, snapc,
+   ofs, clone_size,
+   bio_clone, coll, cur_seg);
+   else
+   rbd_coll_end_req_index(rq, coll, cur_seg,
+   (s32) -ENOMEM,
+   clone_size);
+   size -= clone_size;
+   ofs += clone_size;
+
+   cur_seg++;
+   } while (size > 0);
+   kref_put(&coll->kref, rbd_coll_release);
+
+   return 0;
+}
+
 /*
  * block device queue callback
  */
@@ -1598,10 +1656,8 @@ static void rbd_rq_fn(struct request_queue *q)
bool do_write;
unsigned int size;
u64 ofs;
-   int num_segs, cur_seg = 0;
-   struct rbd_req_coll *coll;
struct ceph_snap_context *snapc;
-   unsigned int bio_offset;
+   int result;

dout("fetched request\n");

@@ -1639,60 +1695,11 @@ static void rbd_rq_fn(struct request_queue *q)
ofs = blk_rq_pos(rq) * SECTOR_SIZE;
bio = rq->bio;

-   dout("%s 0x%x bytes at 0x%llx\n",
-do_write ? "write" : "read",
-size, (unsigned long long) blk_rq_pos(rq) * SECTOR_SIZE);
-
-   num_segs = rbd_get_num_segments(&rbd_dev->header, ofs, size);
-   if (num_segs <= 0) {
-   spin_lock_irq(q->queue_lock);
-   __blk_end_request_all(rq, num_segs);
-   ceph_put_snap_context(snapc);
-   continue;
-   }
-   coll = rbd_alloc_coll(num_segs);
-   if (!coll) {
-   spin_lock_irq(q->queue_lock);
-   __blk_end_request_all(rq, -ENOMEM);
-   ceph_put_snap_context(snapc);
-   continue;
-   }
-
-   bio_offset = 0;
-   do {
-   u64 limit = rbd_segment_length(rbd_dev, ofs, size);
-   unsigned int chain_size;
-   struct bio *bio_chain;
-
-   BUG_ON(limit > (u64) UINT_MAX);
-   chain_size = (unsigned int) limit;
-   dout("rq->bio->bi_vcnt=%hu\n", rq->bio->bi_vcnt);
-
-   kref_get(&coll->kref);
-
-   /* Pass a cloned bio chain via an osd request */
-
-   bio_chain = bio_chain_clone_range(&bio,
-   &bio_offset, chain_size,
-   GFP_ATOMIC);
-   if (bio_chain)
-   

[PATCH 0/2] rbd: clean up rbd_rq_fn()

2012-11-08 Thread Alex Elder
Some refactoring to improve readability.  -Alex

[PATCH 1/2] rbd: encapsulate handling for a single request
[PATCH 2/2] rbd: a little more cleanup of rbd_rq_fn()


[PATCH 3/3] rbd: be picky about osd request status type

2012-11-08 Thread Alex Elder
The result field in a ceph osd reply header is a signed 32-bit type,
but rbd code often casually uses int to represent it.

The following changes the types of variables that handle this result
value to be "s32" instead of "int" to be completely explicit about
it.  Only at the point we pass that result to __blk_end_request()
does the type get converted to the plain old int defined for that
interface.

There is almost certainly no binary impact of this change, but I
prefer to show the exact size and signedness of the value since we
know it.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index caff180..be18b5f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -171,7 +171,7 @@ struct rbd_client {
  */
 struct rbd_req_status {
int done;
-   int rc;
+   s32 rc;
u64 bytes;
 };

@@ -1055,13 +1055,13 @@ static void rbd_destroy_ops(struct
ceph_osd_req_op *ops)
 static void rbd_coll_end_req_index(struct request *rq,
   struct rbd_req_coll *coll,
   int index,
-  int ret, u64 len)
+  s32 ret, u64 len)
 {
struct request_queue *q;
int min, max, i;

dout("rbd_coll_end_req_index %p index %d ret %d len %llu\n",
-coll, index, ret, (unsigned long long) len);
+coll, index, (int) ret, (unsigned long long) len);

if (!rq)
return;
@@ -1082,7 +1082,7 @@ static void rbd_coll_end_req_index(struct request *rq,
max++;

	for (i = min; i < max; i++) {
-   __blk_end_request(rq, coll->status[i].rc,
+   __blk_end_request(rq, (int) coll->status[i].rc,
  coll->status[i].bytes);
coll->num_done++;
kref_put(&coll->kref, rbd_coll_release);
@@ -1091,7 +1091,7 @@ static void rbd_coll_end_req_index(struct request *rq,
 }

 static void rbd_coll_end_req(struct rbd_request *rbd_req,
-int ret, u64 len)
+s32 ret, u64 len)
 {
rbd_coll_end_req_index(rbd_req->rq,
rbd_req->coll, rbd_req->coll_index,
@@ -1131,7 +1131,7 @@ static int rbd_do_request(struct request *rq,
if (!rbd_req) {
if (coll)
rbd_coll_end_req_index(rq, coll, coll_index,
-  -ENOMEM, len);
+  (s32) -ENOMEM, len);
return -ENOMEM;
}

@@ -1208,7 +1208,7 @@ done_err:
bio_chain_put(rbd_req->bio);
ceph_osdc_put_request(osd_req);
 done_pages:
-   rbd_coll_end_req(rbd_req, ret, len);
+   rbd_coll_end_req(rbd_req, (s32) ret, len);
kfree(rbd_req);
return ret;
 }
@@ -1221,7 +1221,7 @@ static void rbd_req_cb(struct ceph_osd_request
*osd_req, struct ceph_msg *msg)
struct rbd_request *rbd_req = osd_req->r_priv;
struct ceph_osd_reply_head *replyhead;
struct ceph_osd_op *op;
-   __s32 rc;
+   s32 rc;
u64 bytes;
int read_op;

@@ -1229,14 +1229,14 @@ static void rbd_req_cb(struct ceph_osd_request
*osd_req, struct ceph_msg *msg)
replyhead = msg->front.iov_base;
WARN_ON(le32_to_cpu(replyhead->num_ops) == 0);
op = (void *)(replyhead + 1);
-   rc = le32_to_cpu(replyhead->result);
+   rc = (s32) le32_to_cpu(replyhead->result);
bytes = le64_to_cpu(op->extent.length);
read_op = (le16_to_cpu(op->op) == CEPH_OSD_OP_READ);

dout("rbd_req_cb bytes=%llu readop=%d rc=%d\n",
(unsigned long long) bytes, read_op, (int) rc);

-   if (rc == -ENOENT && read_op) {
+   if (rc == (s32) -ENOENT && read_op) {
zero_bio_chain(rbd_req->bio, 0);
rc = 0;
} else if (rc == 0 && read_op && bytes < rbd_req->len) {
@@ -1681,7 +1681,8 @@ static void rbd_rq_fn(struct request_queue *q)
bio_chain, coll, cur_seg);
else
rbd_coll_end_req_index(rq, coll, cur_seg,
-  -ENOMEM, chain_size);
+  (s32) -ENOMEM,
+  chain_size);
size -= chain_size;
ofs += chain_size;

-- 
1.7.9.5



[PATCH 2/3] rbd: standardize ceph_osd_request variable names

2012-11-08 Thread Alex Elder
There are spots where a ceph_osds_request pointer variable is given
the name "req".  Since we're dealing with (at least) three types of
requests (block layer, rbd, and osd), I find this slightly
distracting.

Change such instances to use "osd_req" consistently to make the
abstraction represented a little more obvious.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   60 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9d8b406..caff180 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1113,12 +1113,12 @@ static int rbd_do_request(struct request *rq,
  struct ceph_osd_req_op *ops,
  struct rbd_req_coll *coll,
  int coll_index,
- void (*rbd_cb)(struct ceph_osd_request *req,
-struct ceph_msg *msg),
+ void (*rbd_cb)(struct ceph_osd_request *,
+struct ceph_msg *),
  struct ceph_osd_request **linger_req,
  u64 *ver)
 {
-   struct ceph_osd_request *req;
+   struct ceph_osd_request *osd_req;
struct ceph_file_layout *layout;
int ret;
u64 bno;
@@ -1145,67 +1145,68 @@ static int rbd_do_request(struct request *rq,
(unsigned long long) len, coll, coll_index);

osdc = &rbd_dev->rbd_client->client->osdc;
-   req = ceph_osdc_alloc_request(osdc, flags, snapc, ops,
+   osd_req = ceph_osdc_alloc_request(osdc, flags, snapc, ops,
false, GFP_NOIO, pages, bio);
-   if (!req) {
+   if (!osd_req) {
ret = -ENOMEM;
goto done_pages;
}

-   req->r_callback = rbd_cb;
+   osd_req->r_callback = rbd_cb;

rbd_req->rq = rq;
rbd_req->bio = bio;
rbd_req->pages = pages;
rbd_req->len = len;

-   req->r_priv = rbd_req;
+   osd_req->r_priv = rbd_req;

-   reqhead = req->r_request->front.iov_base;
+   reqhead = osd_req->r_request->front.iov_base;
reqhead->snapid = cpu_to_le64(CEPH_NOSNAP);

-   strncpy(req->r_oid, object_name, sizeof(req->r_oid));
-   req->r_oid_len = strlen(req->r_oid);
+   strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid));
+   osd_req->r_oid_len = strlen(osd_req->r_oid);

-   layout = &req->r_file_layout;
+   layout = &osd_req->r_file_layout;
memset(layout, 0, sizeof(*layout));
layout->fl_stripe_unit = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
layout->fl_stripe_count = cpu_to_le32(1);
layout->fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
layout->fl_pg_pool = cpu_to_le32((int) rbd_dev->spec->pool_id);
ret = ceph_calc_raw_layout(osdc, layout, snapid, ofs, &len, &bno,
-  req, ops);
+  osd_req, ops);
rbd_assert(ret == 0);

-   ceph_osdc_build_request(req, ofs, &len,
+   ceph_osdc_build_request(osd_req, ofs, &len,
ops,
snapc,
&mtime,
-   req->r_oid, req->r_oid_len);
+   osd_req->r_oid, osd_req->r_oid_len);

if (linger_req) {
-   ceph_osdc_set_request_linger(osdc, req);
-   *linger_req = req;
+   ceph_osdc_set_request_linger(osdc, osd_req);
+   *linger_req = osd_req;
}

-   ret = ceph_osdc_start_request(osdc, req, false);
+   ret = ceph_osdc_start_request(osdc, osd_req, false);
if (ret < 0)
goto done_err;

if (!rbd_cb) {
-   ret = ceph_osdc_wait_request(osdc, req);
+   u64 version;
+
+   ret = ceph_osdc_wait_request(osdc, osd_req);
+   version = le64_to_cpu(osd_req->r_reassert_version.version);
if (ver)
-   *ver = le64_to_cpu(req->r_reassert_version.version);
-   dout("reassert_ver=%llu\n",
-   (unsigned long long)
-   le64_to_cpu(req->r_reassert_version.version));
-   ceph_osdc_put_request(req);
+   *ver = version;
+   dout("reassert_ver=%llu\n", (unsigned long long) version);
+   ceph_osdc_put_request(osd_req);
}
return ret;

 done_err:
bio_chain_put(rbd_req->bio);
-   ceph_osdc_put_request(req);
+   ceph_osdc_put_request(osd_req);
 done_pages:
rbd_coll_end_req(rbd_req, ret, len);
kfree(rbd_req);
@@ -1215,9 +1216,9 @@ done_pages:
 /*
  * Ceph osd op callback
  */
-static void rbd_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg)
+static void rbd_req_cb(struct ceph_osd_request *os

[PATCH 1/3] rbd: standardize rbd_request variable names

2012-11-08 Thread Alex Elder
There are two names used for items of rbd_request structure type:
"req" and "req_data".  The former name is also used to represent
items of pointers to struct ceph_osd_request.

Change all variables that have these names so they are instead
called "rbd_req" consistently.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   50 ++++++++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5de49a1..9d8b406 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1090,10 +1090,12 @@ static void rbd_coll_end_req_index(struct
request *rq,
spin_unlock_irq(q->queue_lock);
 }

-static void rbd_coll_end_req(struct rbd_request *req,
+static void rbd_coll_end_req(struct rbd_request *rbd_req,
 int ret, u64 len)
 {
-   rbd_coll_end_req_index(req->rq, req->coll, req->coll_index, ret, len);
+   rbd_coll_end_req_index(rbd_req->rq,
+   rbd_req->coll, rbd_req->coll_index,
+   ret, len);
 }

 /*
@@ -1121,12 +1123,12 @@ static int rbd_do_request(struct request *rq,
int ret;
u64 bno;
struct timespec mtime = CURRENT_TIME;
-   struct rbd_request *req_data;
+   struct rbd_request *rbd_req;
struct ceph_osd_request_head *reqhead;
struct ceph_osd_client *osdc;

-   req_data = kzalloc(sizeof(*req_data), GFP_NOIO);
-   if (!req_data) {
+   rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO);
+   if (!rbd_req) {
if (coll)
rbd_coll_end_req_index(rq, coll, coll_index,
   -ENOMEM, len);
@@ -1134,8 +1136,8 @@ static int rbd_do_request(struct request *rq,
}

if (coll) {
-   req_data->coll = coll;
-   req_data->coll_index = coll_index;
+   rbd_req->coll = coll;
+   rbd_req->coll_index = coll_index;
}

dout("rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n",
@@ -1152,12 +1154,12 @@ static int rbd_do_request(struct request *rq,

req->r_callback = rbd_cb;

-   req_data->rq = rq;
-   req_data->bio = bio;
-   req_data->pages = pages;
-   req_data->len = len;
+   rbd_req->rq = rq;
+   rbd_req->bio = bio;
+   rbd_req->pages = pages;
+   rbd_req->len = len;

-   req->r_priv = req_data;
+   req->r_priv = rbd_req;

reqhead = req->r_request->front.iov_base;
reqhead->snapid = cpu_to_le64(CEPH_NOSNAP);
@@ -1202,11 +1204,11 @@ static int rbd_do_request(struct request *rq,
return ret;

 done_err:
-   bio_chain_put(req_data->bio);
+   bio_chain_put(rbd_req->bio);
ceph_osdc_put_request(req);
 done_pages:
-   rbd_coll_end_req(req_data, ret, len);
-   kfree(req_data);
+   rbd_coll_end_req(rbd_req, ret, len);
+   kfree(rbd_req);
return ret;
 }

@@ -1215,7 +1217,7 @@ done_pages:
  */
 static void rbd_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg)
 {
-   struct rbd_request *req_data = req->r_priv;
+   struct rbd_request *rbd_req = req->r_priv;
struct ceph_osd_reply_head *replyhead;
struct ceph_osd_op *op;
__s32 rc;
@@ -1234,20 +1236,20 @@ static void rbd_req_cb(struct ceph_osd_request
*req, struct ceph_msg *msg)
(unsigned long long) bytes, read_op, (int) rc);

if (rc == -ENOENT && read_op) {
-   zero_bio_chain(req_data->bio, 0);
+   zero_bio_chain(rbd_req->bio, 0);
rc = 0;
-   } else if (rc == 0 && read_op && bytes < req_data->len) {
-   zero_bio_chain(req_data->bio, bytes);
-   bytes = req_data->len;
+   } else if (rc == 0 && read_op && bytes < rbd_req->len) {
+   zero_bio_chain(rbd_req->bio, bytes);
+   bytes = rbd_req->len;
}

-   rbd_coll_end_req(req_data, rc, bytes);
+   rbd_coll_end_req(rbd_req, rc, bytes);

-   if (req_data->bio)
-   bio_chain_put(req_data->bio);
+   if (rbd_req->bio)
+   bio_chain_put(rbd_req->bio);

ceph_osdc_put_request(req);
-   kfree(req_data);
+   kfree(rbd_req);
 }

 static void rbd_simple_req_cb(struct ceph_osd_request *req, struct
ceph_msg *msg)
-- 
1.7.9.5



[PATCH 0/3] rbd: a few picky changes

2012-11-08 Thread Alex Elder
These three changes are pretty trivial. -Alex

[PATCH 1/3] rbd: standardize rbd_request variable names
[PATCH 2/3] rbd: standardize ceph_osd_request variable names
[PATCH 3/3] rbd: be picky about osd request status type


Re: SSD journal suggestion

2012-11-08 Thread Atchley, Scott
On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
 wrote:

> 2012/11/8 Mark Nelson :
>> I haven't done much with IPoIB (just RDMA), but my understanding is that it
>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>> probably speak more authoritatively.  Even with RDMA you are going to top
>> out at around 3.1-3.2GB/s.
> 
> 15Gb/s is still faster than 10Gbe
> But this speed limit seems to be kernel-related and should be the same
> even in a 10Gbe environment, or not?

We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
(the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
interrupt affinity and process binding.

For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
handlers to cores 0 and 1 and we will not use process binding. For single 
stream Netperf, we do use process binding and bind it to the same core (i.e. 0) 
and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use 
process binding but we still see ~22 Gb/s.

We used all of the Mellanox tuning recommendations for IPoIB available in their 
tuning pdf:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

We looked at their interrupt affinity setting scripts and then wrote our own.

Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected 
mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. 
Mellanox claims that we should get identical performance with both modes and we 
are looking into it.

We are getting a new test cluster with FDR HCAs and I will look into those as 
well.

Scott


Re: less cores more iops / speed

2012-11-08 Thread Mark Nelson

On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:

Am 08.11.2012 01:59, schrieb Mark Nelson:

There's also the context switching overhead.  It'd be interesting to
know how much the writer processes were shifting around on cores.

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


in this case, is fio bouncing around between cores?




Stefan, what tool were you using to do writes?

as always: fio ;-)


You could try using numactl to pin fio to a specific core.  Also, it may 
be interesting to try multiple concurrent fio processes, and then 
concurrent fio processes with each pinned.
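
A sketch of those two variants (the fio parameters are illustrative, not
the job actually being run, and /dev/vdb stands in for the guest's
rbd-backed disk):

  taskset -c 0 fio --name=rand4k --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --filename=/dev/vdb

fio can also pin itself without an external wrapper via --cpus_allowed=0,1.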




Stefan




[PATCH 2/2] mds: Clear lock flushed if replica is waiting for AC_LOCKFLUSHED

2012-11-08 Thread Yan, Zheng
From: "Yan, Zheng" 

So eval_gather() will not skip calling scatter_writebehind(),
otherwise the replica lock may be in flushing state forever.

Signed-off-by: Yan, Zheng 
---
 src/mds/Locker.cc | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index a1f957a..e2b1ff4 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -4383,8 +4383,12 @@ void Locker::handle_file_lock(ScatterLock *lock, MLock 
*m)
 if (lock->get_state() == LOCK_MIX_LOCK ||
lock->get_state() == LOCK_MIX_LOCK2 ||
lock->get_state() == LOCK_MIX_EXCL ||
-   lock->get_state() == LOCK_MIX_TSYN)
+   lock->get_state() == LOCK_MIX_TSYN) {
   lock->decode_locked_state(m->get_data());
+  // replica is waiting for AC_LOCKFLUSHED, eval_gather() should not
+  // delay calling scatter_writebehind().
+  lock->clear_flushed();
+}
 
 if (lock->is_gathering()) {
   dout(7) << "handle_file_lock " << *in << " from " << from
-- 
1.7.11.7



[PATCH 1/2] mds: Don't expire log segment before it's fully flushed

2012-11-08 Thread Yan, Zheng
From: "Yan, Zheng" 

Expiring log segment before it's fully flushed may cause various
issues during log replay.

Signed-off-by: Yan, Zheng 
---
 src/leveldb  | 2 +-
 src/mds/MDLog.cc | 8 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/mds/MDLog.cc b/src/mds/MDLog.cc
index cac5615..b02c181 100644
--- a/src/mds/MDLog.cc
+++ b/src/mds/MDLog.cc
@@ -330,6 +330,11 @@ void MDLog::trim(int m)
 assert(ls);
 p++;
 
+if (ls->end > journaler->get_write_safe_pos()) {
+  dout(5) << "trim segment " << ls->offset << ", not fully flushed yet, 
safe "
+ << journaler->get_write_safe_pos() << " < end " << ls->end << 
dendl;
+  break;
+}
 if (expiring_segments.count(ls)) {
   dout(5) << "trim already expiring segment " << ls->offset << ", " << 
ls->num_events << " events" << dendl;
 } else if (expired_segments.count(ls)) {
@@ -412,9 +417,6 @@ void MDLog::_expired(LogSegment *ls)
 
   if (!capped && ls == get_current_segment()) {
 dout(5) << "_expired not expiring " << ls->offset << ", last one and 
!capped" << dendl;
-  } else if (ls->end > journaler->get_write_safe_pos()) {
-dout(5) << "_expired not expiring " << ls->offset << ", not fully flushed 
yet, safe "
-   << journaler->get_write_safe_pos() << " < end " << ls->end << dendl;
   } else {
 // expired.
 expired_segments.insert(ls);
-- 
1.7.11.7



Re: clock synchronisation

2012-11-08 Thread Stefan Priebe - Profihost AG

Am 08.11.2012 13:00, schrieb Wido den Hollander:



On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:

Hello list,

is there any preferred way to do clock synchronisation?

I've tried running openntpd and ntpd on all servers but i'm still
getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped
0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped
0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped
0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped
0.063360s in the future, clocks not synchronized



What NTP server are you using? Network latency might cause the clocks
not to be synchronised.


pool.ntp.org

But I've now switched to Debian's chrony instead of ntpd and that seems to
work fine.


Haven't seen any messages again.
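
For anyone else hitting this: the chrony side is nothing exotic. A minimal
/etc/chrony/chrony.conf along these lines (a sketch; the pool hostnames are
the public ones, the driftfile path is Debian's default):

  server 0.pool.ntp.org iburst
  server 1.pool.ntp.org iburst
  driftfile /var/lib/chrony/chrony.drift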

Stefan


Re: clock synchronisation

2012-11-08 Thread Andrey Korolyov
On Thu, Nov 8, 2012 at 4:00 PM, Wido den Hollander  wrote:
>
>
> On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:
>>
>> Hello list,
>>
>> is there any preferred way to do clock synchronisation?
>>
>> I've tried running openntpd and ntpd on all servers but i'm still getting:
>> 2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped
>> 0.063136s in the future, clocks not synchronized
>> 2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped
>> 0.063285s in the future, clocks not synchronized
>> 2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped
>> 0.063301s in the future, clocks not synchronized
>> 2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped
>> 0.063360s in the future, clocks not synchronized
>>
>
> What NTP server are you using? Network latency might cause the clocks not to
> be synchronised.
>

There is no real reason to worry; the quorum only suffers from large
desync delays of a few seconds or more. If you have unsynchronized
clocks on mon nodes with such big delays, requests issued from the
CLI, e.g. creating a new connection, may wait as long as the delay
itself, depending on the clock value of the selected monitor node.

Clock drift is caused mostly by heavy load, but of course playing with
clocksources may have some effect (since most systems already use the
HPET timer, the only real remedy is to sync with an NTP server as
frequently as you need to prevent drift).
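
For reference, the active and available clocksources can be checked (and
switched, hardware permitting) through sysfs:

  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource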




Re: clock synchronisation

2012-11-08 Thread Wido den Hollander



On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:

Hello list,

is there any preferred way to do clock synchronisation?

I've tried running openntpd and ntpd on all servers but i'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped
0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped
0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped
0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped
0.063360s in the future, clocks not synchronized



What NTP server are you using? Network latency might cause the clocks 
not to be synchronised.
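
If only the warnings are the concern, the threshold itself is tunable (a
sketch; if memory serves, 'mon clock drift allowed' defaults to 0.05s,
which the 0.063s offsets above just cross):

  [mon]
          mon clock drift allowed = 0.1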


Wido


Stefan


Re: problem with hanging cluster

2012-11-08 Thread Adam Ochmański

W dniu 08.11.2012 12:14, Adam Ochmański pisze:

Hi,
our test cluster gets stuck every time one of our OSD hosts goes down;
when the missing OSD goes back to the "up" state and recovery reaches 100%,
the cluster is still not working properly.


I forgot to add the version of Ceph I use: 0.53-422-g2d20f3a

--
Best,
blink


Re: less cores more iops / speed

2012-11-08 Thread Stefan Priebe - Profihost AG

Am 08.11.2012 10:05, schrieb Alexandre DERUMIER:

Have you tried comparing virtio-blk and virtio-scsi?

How do I change that? Right now I'm using the PVE defaults => scsi-hd.


(virtio-blk is "classic" virtio ;)


Have you tried directly from the host with the rbd kernel module?
No, I don't know how to use it ;-)

http://ceph.com/docs/master/rbd/rbd-ko/
#modprobe rbd
#sudo rbd map {image-name} --pool {pool-name} --id {user-name}


this also gives me 8000 iops on the host at 3.6 GHz. So this is the
same as in KVM.
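
End to end, the host-side run looks roughly like this (a sketch; the
image, pool and user names are placeholders):

  modprobe rbd
  rbd map test-img --pool rbd --id admin     # device appears as /dev/rbd1
  fio --name=rand4k --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
      --iodepth=32 --runtime=60 --filename=/dev/rbd1
  rbd unmap /dev/rbd1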


Stefan


Re: less cores more iops / speed

2012-11-08 Thread Alexandre DERUMIER
> Have you tried comparing virtio-blk and virtio-scsi? 
>>How do I change that? Right now I'm using the PVE defaults => scsi-hd. 

(virtio-blk is "classic" virtio ;)

>> Have you tried directly from the host with the rbd kernel module? 
>>No, I don't know how to use it ;-) 
http://ceph.com/docs/master/rbd/rbd-ko/
#modprobe rbd
#sudo rbd map {image-name} --pool {pool-name} --id {user-name}

(then you'll have a /dev/rbd1)




- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Alexandre DERUMIER"  
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org, 
"Mark Nelson"  
Envoyé: Jeudi 8 Novembre 2012 10:02:23 
Objet: Re: less cores more iops / speed 

Am 08.11.2012 09:58, schrieb Alexandre DERUMIER: 
>>> What do you mean by that? I'm talking about the KVM guest not about the 
>>> ceph nodes. 
> 
> Have you tried comparing virtio-blk and virtio-scsi? 
How do I change that? Right now I'm using the PVE defaults => scsi-hd. 

> Have you tried directly from the host with the rbd kernel module? 
No, I don't know how to use it ;-) 

Stefan 


> - Mail original - 
> 
> De: "Stefan Priebe - Profihost AG"  
> À: "Mark Nelson"  
> Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org 
> Envoyé: Jeudi 8 Novembre 2012 09:45:17 
> Objet: Re: less cores more iops / speed 
> 
> Am 08.11.2012 01:59, schrieb Mark Nelson: 
>> There's also the context switching overhead. It'd be interesting to 
>> know how much the writer processes were shifting around on cores. 
> What do you mean by that? I'm talking about the KVM guest not about the 
> ceph nodes. 
> 
>> Stefan, what tool were you using to do writes? 
> as always: fio ;-) 
> 
> Stefan 
> 


clock synchronisation

2012-11-08 Thread Stefan Priebe - Profihost AG

Hello list,

is there any preferred way to do clock synchronisation?

I've tried running openntpd and ntpd on all servers but i'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 
0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 
0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 
0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 
0.063360s in the future, clocks not synchronized


Stefan


Re: less cores more iops / speed

2012-11-08 Thread Stefan Priebe - Profihost AG

Am 08.11.2012 09:58, schrieb Alexandre DERUMIER:

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


Have you tried comparing virtio-blk and virtio-scsi?

How do I change that? Right now I'm using the PVE defaults => scsi-hd.


Have you tried directly from the host with the rbd kernel module?

No, I don't know how to use it ;-)

Stefan



- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Mark Nelson" 
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org
Envoyé: Jeudi 8 Novembre 2012 09:45:17
Objet: Re: less cores more iops / speed

Am 08.11.2012 01:59, schrieb Mark Nelson:

There's also the context switching overhead. It'd be interesting to
know how much the writer processes were shifting around on cores.

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


Stefan, what tool were you using to do writes?

as always: fio ;-)

Stefan




Re: less cores more iops / speed

2012-11-08 Thread Alexandre DERUMIER
>>What do you mean by that? I'm talking about the KVM guest not about the 
>>ceph nodes. 

Have you tried comparing virtio-blk and virtio-scsi?

Have you tried directly from the host with the rbd kernel module?



- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Mark Nelson"  
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org 
Envoyé: Jeudi 8 Novembre 2012 09:45:17 
Objet: Re: less cores more iops / speed 

Am 08.11.2012 01:59, schrieb Mark Nelson: 
> There's also the context switching overhead. It'd be interesting to 
> know how much the writer processes were shifting around on cores. 
What do you mean by that? I'm talking about the KVM guest not about the 
ceph nodes. 

> Stefan, what tool were you using to do writes? 
as always: fio ;-) 

Stefan 


Re: syncfs slower than without syncfs

2012-11-08 Thread Stefan Priebe - Profihost AG

done:
http://tracker.newdream.net/issues/3461
Am 08.11.2012 04:09, schrieb Josh Durgin:

On 11/07/2012 08:26 AM, Stefan Priebe wrote:

Am 07.11.2012 16:04, schrieb Mark Nelson:

Whew, glad you found the problem Stefan!  I was starting to wonder what
was going on. :)  Do you mind filing a bug about the control
dependencies?


Sure, where should I file it?


http://www.tracker.newdream.net/projects/ceph/issues/new




Re: less cores more iops / speed

2012-11-08 Thread Stefan Priebe - Profihost AG

Am 08.11.2012 01:59, schrieb Mark Nelson:

There's also the context switching overhead.  It'd be interesting to
know how much the writer processes were shifting around on cores.
What do you mean by that? I'm talking about the KVM guest not about the 
ceph nodes.



Stefan, what tool were you using to do writes?

as always: fio ;-)

Stefan


Re: problems creating new ceph cluster when using journal on block device

2012-11-08 Thread Mark Kirkwood

On 08/11/12 21:08, Wido den Hollander wrote:



On 08-11-12 08:29, Travis Rhoden wrote:

Hey folks,

I'm trying to set up a brand new Ceph cluster, based on v0.53.  My
hardware has SSDs for journals, and I'm trying to get mkcephfs to
initialize everything for me. However, the command hangs forever and I
eventually have to kill it.

After poking around a bit, it's clear that the problem has something
to do with the journal.  If I comment out the journal in ceph.conf,
the commands proceed just fine.  This is the first time I've tried to
throw a journal on a block device rather than a file, so maybe I've
done something wrong with that.

Here is the info from ceph.conf:


[osd]
 osd journal size = 4000


Not sure if this is the problem, but when using a block device you don't
have to specify the size for the journal.


Also might be useful to know make/model of ssd, plus motherboard 
make/model (in case commenting out size does not fix)!


Regards

Mark



Re: problems creating new ceph cluster when using journal on block device

2012-11-08 Thread Wido den Hollander



On 08-11-12 08:29, Travis Rhoden wrote:

Hey folks,

I'm trying to set up a brand new Ceph cluster, based on v0.53.  My
hardware has SSDs for journals, and I'm trying to get mkcephfs to
initialize everything for me. However, the command hangs forever and I
eventually have to kill it.

After poking around a bit, it's clear that the problem has something
to do with the journal.  If I comment out the journal in ceph.conf,
the commands proceed just fine.  This is the first time I've tried to
throw a journal on a block device rather than a file, so maybe I've
done something wrong with that.

Here is the info from ceph.conf:


[osd]
 osd journal size = 4000


Not sure if this is the problem, but when using a block device you don't 
have to specify the size for the journal.


Wido
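
(For illustration, a minimal journal-on-block-device config with the size
line simply dropped; the hostname and device below are assumptions
carried over from the original post:

  [osd]
  ; no "osd journal size" needed for a raw block device journal

  [osd.0]
      host = ceph1
      osd journal = /dev/sda5

Ceph sizes the journal from the partition itself in this case.)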


[osd.0]
 host = ceph1
 osd journal = /dev/sda5


when I log in the log file, here is what I see:

2012-11-07 23:18:20.578623 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) mkfs in /var/lib/ceph/osd/ceph-0
2012-11-07 23:18:20.578699 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) mkfs fsid is already set to
4aac6842-8d71-4405-88ad-e3e9e4da308d
2012-11-07 23:18:20.632138 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) leveldb db exists/created
2012-11-07 23:18:20.634338 7fe2743e3780  0 journal  kernel version is 3.2.0
2012-11-07 23:18:20.634579 7fe2743e3780  1 journal _open /dev/sda5 fd
9: 4194304000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-11-07 23:18:20.634995 7fe2743e3780  1 journal check: header looks ok
2012-11-07 23:18:20.636020 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) mkfs done in
/var/lib/ceph/osd/ceph-0
2012-11-07 23:18:20.682113 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is supported
and appears to work
2012-11-07 23:18:20.682125 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is disabled via
'filestore fiemap' config option
2012-11-07 23:18:20.682424 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount did NOT detect btrfs
2012-11-07 23:18:20.781938 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount syncfs(2) syscall fully
supported (by glibc and kernel)
2012-11-07 23:18:20.782061 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount found snaps <>
2012-11-07 23:18:20.823915 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal
mode: btrfs not detected
2012-11-07 23:18:20.826137 7fe2743e3780  0 journal  kernel version is 3.2.0
2012-11-07 23:18:20.826386 7fe2743e3780  1 journal _open /dev/sda5 fd
15: 4194304000 bytes, block size 4096 bytes, directio = 1, aio = 0

So I know it is trying to use the right partition/block device.  It
just never gets past that line.

Finally, I tried to track things down myself to see what was hanging
using strace.  I ran:

strace /usr/bin/ceph-osd -c /tmp/travis/conf --monmap
/tmp/travis/monmap -i 0 --mkfs --mkkey

And the final output from that is:

open("/dev/sda5", O_RDONLY) = 15
fstat(15, {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 5), ...}) = 0
ioctl(15, BLKGETSIZE64, 0x7fffe7a587a8) = 0
geteuid()   = 0
pipe2([16, 17], O_CLOEXEC)  = 0
clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f5365f28a50) = 707
close(17)   = 0
fcntl(16, F_SETFD, 0)   = 0
fstat(16, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f5365f14000
read(16, "\n/dev/sda5:\n write-caching =  1 "..., 4096) = 37
open("/proc/version", O_RDONLY) = 17
read(17, "Linux version 3.2.0-23-generic ("..., 127) = 127
futex(0x2db807c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2db8078,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x2db8028, FUTEX_WAKE_PRIVATE, 1) = 1
close(17)   = 0
close(16)   = 0
wait4(707, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 707
munmap(0x7f5365f14000, 4096)= 0
io_setup(128, {139996169318400})= 0
futex(0x2db807c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2db8078,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x2db8028, FUTEX_WAKE_PRIVATE, 1) = 1
pread(15, 
"\2\0\0\\0\0\0\1\0\0\0\0\0\0\0J\254hB\215qD\5\210\255\343\351\344\3320\215"...,
4096, 0) = 4096

And that's as far as it gets.  Any thoughts?
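
(One illustrative sanity check, not suggested in the thread: since the
strace shows the process stalling in a pread() of /dev/sda5, it may be
worth confirming that direct I/O against the partition works at all:

  # read the first 4 KB of the partition with O_DIRECT
  dd if=/dev/sda5 of=/dev/null bs=4k count=1 iflag=direct

If that dd hangs the same way, the problem is below Ceph, in the kernel
or the drive itself.)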

After some sleep, I'll try throwing the journal back on a file instead
of a block device and see if that does it.

Can anyone confirm that using a block device instead of a file is
actually gives better performance?

Thanks,

  - Travis
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html