date:20121108


On 11/08/2012 03:50 PM, Josh Durgin wrote:

On 11/08/2012 01:27 PM, Stefan Priebe wrote:

Am 08.11.2012 17:06, schrieb Mark Nelson:

On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

Am 08.11.2012 16:01, schrieb Sage Weil:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf
can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
i've no idea what todo with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the
data file.  Here's a nice tutorial:

https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file
to root and running perf report as root.  If you still see missing
symbols, you may want to just give up and try sysprof.


I've now used google perftools / google CPU profiler. It was the only
tool who worked out of the box ;-)

Attached is a PDF with a profiled ceph-osd process while 4k random write.


It looks like a not insignificant portion of time is spent in the
logging infrastructure. Could you add this to the osds' configuration
to prevent any debug log gathering (it's logged/gathered):

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

Josh


Also, I'm not sure what version you are running, but you may want to try 
testing master and see if that helps.  Sam has done some work on our 
threading and locking code that might help.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed

2012-11-08 Thread Andrey Korolyov

On Thu, Nov 8, 2012 at 7:53 PM, Alexandre DERUMIER  wrote:
>>>So it is a problem of KVM which let's the processes jump between cores a
>>>lot.
>
> maybe numad from redhat can help ?
> http://fedoraproject.org/wiki/Features/numad
>
> It's try to keep process on same numa node and I think it's also doing some 
> dynamic pinning.

Numad keeps only memory chunks on the preferred node, cpu pinning,
which is a primary goal there, should be done separately via libvirt
or manually for qemu process via cpuset(libvirt does pinning via
taskset and seems that it is broken at least in debian wheezy - even
affinity mask is set for qemu process, load spreads all over numa
node, including cpus outside the set).

>
> - Mail original -
>
> De: "Stefan Priebe - Profihost AG" 
> À: "Mark Nelson" 
> Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org
> Envoyé: Jeudi 8 Novembre 2012 16:14:32
> Objet: Re: less cores more iops / speed
>
> Am 08.11.2012 14:19, schrieb Mark Nelson:
>> On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:
>>> Am 08.11.2012 01:59, schrieb Mark Nelson:
 There's also the context switching overhead. It'd be interesting to
 know how much the writer processes were shifting around on cores.
>>> What do you mean by that? I'm talking about the KVM guest not about the
>>> ceph nodes.
>>
>> in this case, is fio bouncing around between cores?
>
> Thanks you're correct. If i bind fio to two cores on a 8 core VM it runs
> with 16.000 iops.
>
> So it is a problem of KVM which let's the processes jump between cores a
> lot.
>
> Greets,
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Josh Durgin


On 11/08/2012 01:27 PM, Stefan Priebe wrote:

Am 08.11.2012 17:06, schrieb Mark Nelson:

On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

Am 08.11.2012 16:01, schrieb Sage Weil:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
i've no idea what todo with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the
data file.  Here's a nice tutorial:

https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file
to root and running perf report as root.  If you still see missing
symbols, you may want to just give up and try sysprof.


I've now used google perftools / google CPU profiler. It was the only
tool who worked out of the box ;-)

Attached is a PDF with a profiled ceph-osd process while 4k random write.


It looks like a not insignificant portion of time is spent in the
logging infrastructure. Could you add this to the osds' configuration
to prevent any debug log gathering (it's logged/gathered):

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: unexpected problem with radosgw fcgi

2012-11-08 Thread Yehuda Sadeh

On Wed, Nov 7, 2012 at 6:16 AM, Sławomir Skowron  wrote:
> I have realize that requests from fastcgi in nginx from radosgw returning:
>
> HTTP/1.1 200, not a HTTP/1.1 200 OK
>
> Any other cgi that i run, for example php via fastcgi return this like
> RFC says, with OK.
>
> Is someone experience this problem ??

I have seen a similar issue in the past with nginx. It doesn't happen
with apache. My guess is that it's either something with the way nginx
is configured, or some difference in the fastcgi module
implementation.

>
> I see in code:
>
> ./src/rgw/rgw_rest.cc line 36
>
> const static struct rgw_html_errors RGW_HTML_ERRORS[] = {
> { 0, 200, "" },
> 
>
> What if i change this into:
>
> { 0, 200, "OK" },

The third field there specifies the error code embedded in the
returned XML with S3, so it wouldn't fix anything.

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Stefan Priebe


Am 08.11.2012 17:06, schrieb Mark Nelson:

On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

Am 08.11.2012 16:01, schrieb Sage Weil:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
i've no idea what todo with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the
data file.  Here's a nice tutorial:

https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file
to root and running perf report as root.  If you still see missing
symbols, you may want to just give up and try sysprof.


I've now used google perftools / google CPU profiler. It was the only 
tool who worked out of the box ;-)


Attached is a PDF with a profiled ceph-osd process while 4k random write.

Stefan


out.pdf
Description: Adobe PDF document

Re: SSD journal suggestion

2012-11-08 Thread Joseph Glanville

On 9 November 2012 02:00, Atchley, Scott  wrote:
> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
>
>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>>  wrote:
>>>
 2012/11/8 Mark Nelson :
> I haven't done much with IPoIB (just RDMA), but my understanding is that 
> it
> tends to top out at like 15Gb/s.  Some others on this mailing list can
> probably speak more authoritatively.  Even with RDMA you are going to top
> out at around 3.1-3.2GB/s.

 15Gb/s is still faster than 10Gbe
 But this speed limit seems to be kernel-related and should be the same
 even in a 10Gbe environment, or not?
>>>
>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>>> interrupt affinity and process binding.
>>>
>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>>> handlers to cores 0 and 1 and we will not using process binding. For single 
>>> stream Netperf, we do use process binding and bind it to the same core 
>>> (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do 
>>> not use process binding but we still see ~22 Gb/s.
>>
>> Scott, this is very interesting!  Does setting the interrupt affinity
>> make the biggest difference then when you have concurrent netperf
>> processes going?  For some reason I thought that setting interrupt
>> affinity wasn't even guaranteed in linux any more, but this is just some
>> half-remembered recollection from a year or two ago.
>
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
> and without affinity:
>
> Default (irqbalance running)   12.8 Gb/s
> IRQ balance off13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
> ~22 Gb/s for a single stream.
>
>>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>>> their tuning pdf:
>>>
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>>
>>> We looked at their interrupt affinity setting scripts and then wrote our 
>>> own.
>>>
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>>> datagram mode. Mellanox claims that we should get identical performance 
>>> with both modes and we are looking into it.
>>>
>>> We are getting a new test cluster with FDR HCAs and I will look into those 
>>> as well.
>>
>> Nice!  At some point I'll probably try to justify getting some FDR cards
>> in house.  I'd definitely like to hear how FDR ends up working for you.
>
> I'll post the numbers when I get access after they are set up.
>
> Scott
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

If you are running Ceph purely in userspace you could try using rsockets.
rsockets is a pure userspace implementation of sockets over RDMA. It
has much much lower latency and close to native throughput.
My guess is rsockets will probably work perfectly and should give you
95% of theoretical max performance.

I would like to see a somewhat native implementation of RDMA in Ceph one day.
I was doing some preliminary work on it 1.5 years ago when Ceph was
first gaining traction but we didn't end up putting our focus on Ceph
and as such I never got anywhere with it.
In theory one only needs to use RDMA for the fast path to gain alot of
benefit. This can be done even in the RBD kernel module with the
RDMA-CM which will interact nicely across kernelspace and userspace
(they actually share he same API thankfully).

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Ignoresync hack no longer applies on 3.6.5

2012-11-08 Thread Nick Bartos

Sorry about that, I think it got chopped.  Here's a full trace from
another run, using kernel 3.6.6 and definitely has the patch applied:
https://gist.github.com/4041120

There are no instances of "sync_fs_one_sb skipping" in the logs.



On Mon, Nov 5, 2012 at 1:29 AM, Sage Weil  wrote:
> On Sun, 4 Nov 2012, Nick Bartos wrote:
>> Unfortunately I'm still seeing deadlocks.  The trace was taken after a
>> 'sync' from the command line was hung for a couple minutes.
>>
>> There was only one debug message (one fs on the system was mounted with 
>> 'mand'):
>
> This was with the updated patch applied?
>
> The dump below doesn't look complete, btw.. I don't see any ceph-osd
> processses.  Don't see any ceph-osd processes, among other things.
>
> sage
>
>>
>> kernel: [11441.168954]  [] ? sync_fs_one_sb+0x4d/0x4d
>>
>> Here's the trace:
>>
>> javaS 88040b06ba08 0  1623  1 0x
>>  88040cb6dd08 0082  880405da8b30
>>   00012b40 00012b40 00012b40
>>  88040cb6dfd8 00012b40 00012b40 88040cb6dfd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? vfs_write+0xd0/0xdf
>>  [] ? fput+0x18/0xb6
>>  [] ? fput_light+0xd/0xf
>>  [] ? sys_write+0x61/0x6e
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040ca4ba48 0  1624  1 0x
>>  88040cb0bd08 0082 88040cb0bc88 81813410
>>  88040cb0bd28 00012b40 00012b40 00012b40
>>  88040cb0bfd8 00012b40 00012b40 88040cb0bfd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] ? blkdev_issue_flush+0xc0/0xd2
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? fput+0x18/0xb6
>>  [] ? do_device_not_available+0xe/0x10
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040ca4b058 0  1625  1 0x
>>  880429d1fd08 0082 0400 81813410
>>  88040b06b4a8 00012b40 00012b40 00012b40
>>  880429d1ffd8 00012b40 00012b40 880429d1ffd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? do_device_not_available+0xe/0x10
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040cd11a08 0  1632  1 0x
>>  88040c40fd08 0082 88040c40fd68 88042b17f4e0
>>  88040c40ff38 00012b40 00012b40 00012b40
>>  88040c40ffd8 00012b40 00012b40 88040c40ffd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] futex_wait_queue_me+0xc2/0xe1
>>  [] futex_wait+0x120/0x275
>>  [] ? update_rmtp+0x65/0x65
>>  [] ? hrtimer_start_range_ns+0x14/0x16
>>  [] do_futex+0x96/0x122
>>  [] sys_futex+0x110/0x141
>>  [] ? vfs_write+0xd0/0xdf
>>  [] ? do_device_not_available+0xe/0x10
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040cd10628 0  1633  1 0x
>>  88040cd7da88 0082 0cd7da18 81813410
>>  88040cccecc0 00012b40 00012b40 00012b40
>>  88040cd7dfd8 00012b40 00012b40 88040cd7dfd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] schedule_timeout+0x36/0xe3
>>  [] ? _local_bh_enable_ip.clone.8+0x20/0x89
>>  [] ? local_bh_enable_ip+0xe/0x10
>>  [] ? _raw_spin_unlock_bh+0x16/0x18
>>  [] ? release_sock+0x128/0x131
>>  [] sk_wait_data+0x82/0xc5
>>  [] ? wake_up_bit+0x2a/0x2a
>>  [] ? local_bh_enable+0xe/0x10
>>  [] tcp_recvmsg+0x4c5/0x92e
>>  [] ? update_curr+0xd6/0x110
>>  [] ? __switch_to+0x1ac/0x33c
>>  [] inet_recvmsg+0x5e/0x73
>>  [] __sock_recvmsg+0x75/0x84
>>  [] sock_aio_read+0xf2/0x106
>>  [] do_sync_read+0x70/0xad
>>  [] vfs_read+0xbc/0xdc
>>  [] ? fput+0x18/0xb6
>>  [] sys_read+0x4a/0x6e
>>  [] system_call_fastpath+0x16/0x1b
>> javaS 88040ce11a88 0  1634  1 0x
>>  88040c9699f8 0082 0098967f 88042b17f4e0
>>   00012b40 00012b40 00012b40
>>  88040c969fd8 00012b40 00012b40 88040c969fd8
>> Call Trace:
>>  [] schedule+0x64/0x66
>>  [] schedule_hrtimeout_range_clock+0xd2/0x11b
>>  [] ? update_rmtp+0x65/0x65
>>  [] ? hrtimer_start_range_ns+0x14/0x16
>>  [] schedule_hrtimeout_range+0x13/0x15
>>  [] poll_schedule_timeout+0x48/0x64
>>  [] do_poll.clone.3+0x1d0/0x1f1
>>  [] do_sys_poll+0x146/0x1bd
>>  [] ? __pollwait+0xcc/0xcc
>>  [] ? __sock_recvmsg+0x75/0x84
>>  [] ? sock_recvmsg+0x5b/0x7a
>>  [] ? get_futex_key+0x94/0x224
>>  [] ? _raw_spin_lock+0xe/0x10
>>  [] ? double_lock_hb+0x31/0x36
>>  [] ? fget_light+0x6d/0x84
>>  [] ? fput_light+0xd/0xf
>>  [] ? sys_recvf

Re: SSD journal suggestion

On Nov 8, 2012, at 11:19 AM, Andrey Korolyov  wrote:

> On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott  wrote:
>> On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:
>> 
>>> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
>>> 
 On 11/08/2012 07:55 AM, Atchley, Scott wrote:
> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>  wrote:
> 
>> 2012/11/8 Mark Nelson :
>>> I haven't done much with IPoIB (just RDMA), but my understanding is 
>>> that it
>>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>>> probably speak more authoritatively.  Even with RDMA you are going to 
>>> top
>>> out at around 3.1-3.2GB/s.
>> 
>> 15Gb/s is still faster than 10Gbe
>> But this speed limit seems to be kernel-related and should be the same
>> even in a 10Gbe environment, or not?
> 
> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using 
> Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running 
> Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on 
> whether I use interrupt affinity and process binding.
> 
> For our Ceph testing, we will set the affinity of two of the mlx4 
> interrupt handlers to cores 0 and 1 and we will not using process 
> binding. For single stream Netperf, we do use process binding and bind it 
> to the same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent 
> Netperf runs, we do not use process binding but we still see ~22 Gb/s.
 
 Scott, this is very interesting!  Does setting the interrupt affinity
 make the biggest difference then when you have concurrent netperf
 processes going?  For some reason I thought that setting interrupt
 affinity wasn't even guaranteed in linux any more, but this is just some
 half-remembered recollection from a year or two ago.
>>> 
>>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
>>> and without affinity:
>>> 
>>> Default (irqbalance running)   12.8 Gb/s
>>> IRQ balance off13.0 Gb/s
>>> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>>> 
>>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
>>> ~22 Gb/s for a single stream.
>> 
> 
> Did you tried Mellanox-baked modules for 2.6.32 before that?

That came with RHEL6? No.

Scott

> 
>> Note, I used hwloc to determine which socket was closer to the mlx4 device 
>> on our dual socket machines. On these nodes, hwloc reported that both 
>> sockets were equally close, but a colleague has machines where one socket is 
>> closer than the other. In that case, bind to the closer socket (or to cores 
>> within the closer socket).
>> 
>>> 
> We used all of the Mellanox tuning recommendations for IPoIB available in 
> their tuning pdf:
> 
> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
> 
> We looked at their interrupt affinity setting scripts and then wrote our 
> own.
> 
> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
> datagram mode. Mellanox claims that we should get identical performance 
> with both modes and we are looking into it.
> 
> We are getting a new test cluster with FDR HCAs and I will look into 
> those as well.
 
 Nice!  At some point I'll probably try to justify getting some FDR cards
 in house.  I'd definitely like to hear how FDR ends up working for you.
>>> 
>>> I'll post the numbers when I get access after they are set up.
>>> 
>>> Scott
>>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: problems creating new ceph cluster when using journal on block device


On 11/08/2012 11:36 AM, Travis Rhoden wrote:

Solved!

I stumbled into the solution while switching from block device to a
file.  I was being bit by running mkcephfs multiple times -- it wasn't
really failing on the journal, it was failing because the OSD data
disk had been initialized before.  I couldn't see that until I used a
file for the journal and then I see log output like:


Yeah, that was a change that landed a couple of months ago.  It's really 
important now to blow away the old data (I just reformat) if you want a 
totally clean ceph deployment rather than just running mkcephfs.




=== osd.0 ===
2012-11-08 16:41:37.677620 7ffc3cfcd780 -1 provided osd id 0 != superblock's -1
2012-11-08 16:41:37.678726 7ffc3cfcd780 -1  ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument

I unmounted the OSD's that had been touched before, reformatted them,
and then remounted.  I setup ceph.conf to use block devices for the
journals, and then everything proceeded normally.

So the final relevant bits from my ceph.conf file look like:

[osd]
 osd journal size = 0
 journal dio = true
 journal aio = true

[osd.0]
 host = ceph1
 osd journal = /dev/sda5

[osd.1]
 host = ceph1
 osd journal = /dev/sda6
...

Thanks,

  - Travis

On Thu, Nov 8, 2012 at 10:08 AM, Travis Rhoden  wrote:

One more thing -- Google search says this is harmless -- I see quite a
few of these in syslog:

hdparm: sending ioctl 2285 to a partition!

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: problems creating new ceph cluster when using journal on block device

2012-11-08 Thread Travis Rhoden

Solved!

I stumbled into the solution while switching from block device to a
file.  I was being bit by running mkcephfs multiple times -- it wasn't
really failing on the journal, it was failing because the OSD data
disk had been initialized before.  I couldn't see that until I used a
file for the journal and then I see log output like:

=== osd.0 ===
2012-11-08 16:41:37.677620 7ffc3cfcd780 -1 provided osd id 0 != superblock's -1
2012-11-08 16:41:37.678726 7ffc3cfcd780 -1  ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument

I unmounted the OSD's that had been touched before, reformatted them,
and then remounted.  I setup ceph.conf to use block devices for the
journals, and then everything proceeded normally.

So the final relevant bits from my ceph.conf file look like:

[osd]
osd journal size = 0
journal dio = true
journal aio = true

[osd.0]
host = ceph1
osd journal = /dev/sda5

[osd.1]
host = ceph1
osd journal = /dev/sda6
...

Thanks,

 - Travis

On Thu, Nov 8, 2012 at 10:08 AM, Travis Rhoden  wrote:
> One more thing -- Google search says this is harmless -- I see quite a
> few of these in syslog:
>
> hdparm: sending ioctl 2285 to a partition!
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Review request for branch wip-java-tests

2012-11-08 Thread Sage Weil

Merged, thanks!

sage

On Thu, 8 Nov 2012, Joe Buck wrote:

> I have a branch for review that reworks that tests for the java bindings and
> builds them if both --enable-cephfs-java and --with-debug are specified. The
> tests can also be built and run via ant.
> 
> Branch name is wip-java-tests.
> 
> Regards,
> -Joe Buck
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Review request for branch wip-java-tests

2012-11-08 Thread Joe Buck

I have a branch for review that reworks that tests for the java bindings 
and builds them if both --enable-cephfs-java and --with-debug are 
specified. The tests can also be built and run via ant.


Branch name is wip-java-tests.

Regards,
-Joe Buck
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: some snapshot problems

2012-11-08 Thread Sage Weil

Hi Liu,

Sorry for the late reply; I have had a very busy week.  :)

On Thu, 1 Nov 2012, liu yaqi wrote:
> Dear Mr.Weil
> 
> I am a student of Institute of Computing Technology, Chinese Academy of
> Sciences, and I am learning the realization of snapshot in ceph system.
> There are sometings that puzzle me, and I want to ask you some questions.
> First question, there is a command "ceph osd cluster_snap {name}", but i
> cannot found the complete realization process, and I want to ask is the
> snapshot for the whole cluster has been  realized?

The idea was to have a low-level cluster-wide snapshot that could be used 
for recovery if ceph itself went haywire and corrupted itself.  The idea 
would be for the OSDs to create btrfs-level snapshots of their data.  It 
was never completely implemented, though, and the OSD bits have mostly 
been removed.  In particular, we never made a way for the monitor state to 
be checkpointed, which would be necessary for the whole scheme to work 
properly.

> Second question, there
> seems to be snapshots for pools and images. I want to ask what does pool and
> image mean? Is an image means an osd?

Lots of different snapshots:

 - librados lets you do 'selfmanaged snaps' in its API, which let an 
   application control which snapshots apply to which objects. 
 - you can create a 'pool' snapshot on an entire librados pool.  this 
   cannot be used at the same time as rbd, fs, or the above 'selfmanaged' 
   snaps.
 - rbd let's you snapshot block device images (by usuing the librados 
   selfmanaged snap API).
 - the ceph file system let's you snapshot any subdirectory (again 
   utilizing the underlying RADOS functionality).

> Third question, in the "mds" folder,
> there are files like "snapserver" "MClientSnap" and so on, is there files
> are used to snapshot the metadata only? 

Yes.

> Does they have some relationship
> with the pool or image snapshots? 

Not really.

> The last question, is there snapshots
> for a file path in the ceph? Or, the snapshots must be done on metadata and
> data  separately?

For the file system, you create a snapshot on a directory and it affects 
all files in that directory and beneath it, including the data in those 
files.

Hope that helps!
sage

> If you would kind enough to help me on the above questions, I will be
> grateful. And I am looking forward to your reply.
> 
> With best wishes for you.
> 
> Yours, YaqiLiu
>

Re: SSD journal suggestion

2012-11-08 Thread Andrey Korolyov

On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott  wrote:
> On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:
>
>> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
>>
>>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
 On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
  wrote:

> 2012/11/8 Mark Nelson :
>> I haven't done much with IPoIB (just RDMA), but my understanding is that 
>> it
>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>> probably speak more authoritatively.  Even with RDMA you are going to top
>> out at around 3.1-3.2GB/s.
>
> 15Gb/s is still faster than 10Gbe
> But this speed limit seems to be kernel-related and should be the same
> even in a 10Gbe environment, or not?

 We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using 
 Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running 
 Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on 
 whether I use interrupt affinity and process binding.

 For our Ceph testing, we will set the affinity of two of the mlx4 
 interrupt handlers to cores 0 and 1 and we will not using process binding. 
 For single stream Netperf, we do use process binding and bind it to the 
 same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf 
 runs, we do not use process binding but we still see ~22 Gb/s.
>>>
>>> Scott, this is very interesting!  Does setting the interrupt affinity
>>> make the biggest difference then when you have concurrent netperf
>>> processes going?  For some reason I thought that setting interrupt
>>> affinity wasn't even guaranteed in linux any more, but this is just some
>>> half-remembered recollection from a year or two ago.
>>
>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
>> and without affinity:
>>
>> Default (irqbalance running)   12.8 Gb/s
>> IRQ balance off13.0 Gb/s
>> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>>
>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
>> ~22 Gb/s for a single stream.
>

Did you tried Mellanox-baked modules for 2.6.32 before that?

> Note, I used hwloc to determine which socket was closer to the mlx4 device on 
> our dual socket machines. On these nodes, hwloc reported that both sockets 
> were equally close, but a colleague has machines where one socket is closer 
> than the other. In that case, bind to the closer socket (or to cores within 
> the closer socket).
>
>>
 We used all of the Mellanox tuning recommendations for IPoIB available in 
 their tuning pdf:

 http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

 We looked at their interrupt affinity setting scripts and then wrote our 
 own.

 Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
 Connected mode is less scalable, but currently I only get ~3 Gb/s with 
 datagram mode. Mellanox claims that we should get identical performance 
 with both modes and we are looking into it.

 We are getting a new test cluster with FDR HCAs and I will look into those 
 as well.
>>>
>>> Nice!  At some point I'll probably try to justify getting some FDR cards
>>> in house.  I'd definitely like to hear how FDR ends up working for you.
>>
>> I'll post the numbers when I get access after they are set up.
>>
>> Scott
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write


On 11/08/2012 09:45 AM, Stefan Priebe - Profihost AG wrote:

Am 08.11.2012 16:01, schrieb Sage Weil:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with
it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly
i've no idea what todo with it next.


Pour yourself a stiff drink! (haha!)

Try just doing a "perf report" in the directory where you've got the 
data file.  Here's a nice tutorial:


https://perf.wiki.kernel.org/index.php/Tutorial

Also, if you see missing symbols you might benefit by chowning the file 
to root and running perf report as root.  If you still see missing 
symbols, you may want to just give up and try sysprof.





  would love to see where the CPU is spending most of it's time.  This is
on current master?

Yes


 I expect there are still some low-hanging fruit that
can bring CPU utilization down (or even boost iops).

Would be great to find them.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed

2012-11-08 Thread Alexandre DERUMIER

>>So it is a problem of KVM which let's the processes jump between cores a 
>>lot. 

maybe numad from redhat can help ?
http://fedoraproject.org/wiki/Features/numad

It's try to keep process on same numa node and I think it's also doing some 
dynamic pinning.

- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Mark Nelson"  
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org 
Envoyé: Jeudi 8 Novembre 2012 16:14:32 
Objet: Re: less cores more iops / speed 

Am 08.11.2012 14:19, schrieb Mark Nelson: 
> On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote: 
>> Am 08.11.2012 01:59, schrieb Mark Nelson: 
>>> There's also the context switching overhead. It'd be interesting to 
>>> know how much the writer processes were shifting around on cores. 
>> What do you mean by that? I'm talking about the KVM guest not about the 
>> ceph nodes. 
> 
> in this case, is fio bouncing around between cores? 

Thanks you're correct. If i bind fio to two cores on a 8 core VM it runs 
with 16.000 iops. 

So it is a problem of KVM which let's the processes jump between cores a 
lot. 

Greets, 
Stefan 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write


Am 08.11.2012 16:01, schrieb Mark Nelson:

Hi Stefan,

You might want to try running sysprof or perf while the OSDs are running
during the tests and see where CPU time is being spent.  Also, how are
you determining how much CPU usage is being used?


Hi Mark,

have a 300MB perf.data file and no idea what todo next ;-)

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write


Am 08.11.2012 16:01, schrieb Sage Weil:

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10 times more
load on rand 4k writes than on 4k reads?


Something like perf or oprofile is probably your best bet.  perf can be
tedious to deploy, depending on where your kernel is coming from.
oprofile seems to be deprecated, although I've had good results with it in
the past.


I've recorded 10s with perf - it is now a 300MB perf.data file. Sadly 
i've no idea what todo with it next.



  would love to see where the CPU is spending most of it's time.  This is
on current master?

Yes


 I expect there are still some low-hanging fruit that
can bring CPU utilization down (or even boost iops).

Would be great to find them.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed


Am 08.11.2012 14:19, schrieb Mark Nelson:

On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:

Am 08.11.2012 01:59, schrieb Mark Nelson:

There's also the context switching overhead.  It'd be interesting to
know how much the writer processes were shifting around on cores.

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


in this case, is fio bouncing around between cores?


Thanks you're correct. If i bind fio to two cores on a 8 core VM it runs 
with 16.000 iops.


So it is a problem of KVM which let's the processes jump between cores a 
lot.


Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:

> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
> 
>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>>  wrote:
>>> 
 2012/11/8 Mark Nelson :
> I haven't done much with IPoIB (just RDMA), but my understanding is that 
> it
> tends to top out at like 15Gb/s.  Some others on this mailing list can
> probably speak more authoritatively.  Even with RDMA you are going to top
> out at around 3.1-3.2GB/s.
 
 15Gb/s is still faster than 10Gbe
 But this speed limit seems to be kernel-related and should be the same
 even in a 10Gbe environment, or not?
>>> 
>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>>> interrupt affinity and process binding.
>>> 
>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>>> handlers to cores 0 and 1 and we will not using process binding. For single 
>>> stream Netperf, we do use process binding and bind it to the same core 
>>> (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do 
>>> not use process binding but we still see ~22 Gb/s.
>> 
>> Scott, this is very interesting!  Does setting the interrupt affinity 
>> make the biggest difference then when you have concurrent netperf 
>> processes going?  For some reason I thought that setting interrupt 
>> affinity wasn't even guaranteed in linux any more, but this is just some 
>> half-remembered recollection from a year or two ago.
> 
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
> and without affinity:
> 
> Default (irqbalance running)   12.8 Gb/s
> IRQ balance off13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
> 
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
> ~22 Gb/s for a single stream.

Note, I used hwloc to determine which socket was closer to the mlx4 device on 
our dual socket machines. On these nodes, hwloc reported that both sockets were 
equally close, but a colleague has machines where one socket is closer than the 
other. In that case, bind to the closer socket (or to cores within the closer 
socket).

> 
>>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>>> their tuning pdf:
>>> 
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>> 
>>> We looked at their interrupt affinity setting scripts and then wrote our 
>>> own.
>>> 
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>>> datagram mode. Mellanox claims that we should get identical performance 
>>> with both modes and we are looking into it.
>>> 
>>> We are getting a new test cluster with FDR HCAs and I will look into those 
>>> as well.
>> 
>> Nice!  At some point I'll probably try to justify getting some FDR cards 
>> in house.  I'd definitely like to hear how FDR ends up working for you.
> 
> I'll post the numbers when I get access after they are set up.
> 
> Scott
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: problems creating new ceph cluster when using journal on block device

2012-11-08 Thread Travis Rhoden

>>> [osd]
>>>  osd journal size = 4000
>>
>>
>> Not sure if this is the problem, but when using a block device you don't
>> have to specify the size for the journal.

So happy to know that, Wido!  I had hoped there was a way to skip that.

Tried without it -- only difference in the logs was seeing that it
picked up the full size of the partition.  So, same result.

> Also might be useful to know make/model of ssd, plus motherboard make/model
> (in case commenting out size does not fix)!

It's an Intel X25-E, 64GB.  It's a place-holder until some bigger ones
we have on order show up.

The mother board is a SuperMicro X8DT6.  SSDs are connected to onboard
SATA ports, data drives are connected to LSI 9211-8i (SAS2008)

Maybe there is a special way I need to do the partition?  My goal was
to throw 6 journals on this disk, and it is partitioned like so:

Model: ATA SSDSA2SH064G1GC (scsi)
Disk /dev/sda: 64.0GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End SizeType  File system  Flags
 1  1049kB  512MB   511MB   primaryraid
 2  512MB   2511MB  2000MB  primaryraid
 3  2511MB  6512MB  4000MB  primaryraid
 4  6512MB  64.0GB  57.5GB  extended
 5  6513MB  15.1GB  8590MB  logical
 6  15.1GB  23.7GB  8590MB  logical
 7  23.7GB  32.3GB  8590MB  logical
 8  32.3GB  40.9GB  8590MB  logical
 9  40.9GB  49.5GB  8590MB  logical
10  49.5GB  58.1GB  8590MB  logical


So, sda5-10 are my journal partitions.  I know that I have consumed
most of the drive here, and that is bad for the SSD and such, but it
really is a temporary setup.

 - Travis

On Thu, Nov 8, 2012 at 3:24 AM, Mark Kirkwood
 wrote:
> On 08/11/12 21:08, Wido den Hollander wrote:
>>
>>
>>
>> On 08-11-12 08:29, Travis Rhoden wrote:
>>>
>>> Hey folks,
>>>
>>> I'm trying to set up a brand new Ceph cluster, based on v0.53.  My
>>> hardware has SSDs for journals, and I'm trying to get mkcephfs to
>>> intialize everything for me. However, the command hangs forever and I
>>> eventually have to kill it.
>>>
>>> After poking around a bit, it's clear that the problem has something
>>> to do with the journal.  If I comment out the journal in ceph.conf,
>>> the commands proceed just find.  This is the first time I've tried to
>>> throw a journal on a block device rather than a file, so maybe I've
>>> done something wrong with that.
>>>
>>> Here is the info from ceph.conf:
>>>
>>>
>>> [osd]
>>>  osd journal size = 4000
>>
>>
>> Not sure if this is the problem, but when using a block device you don't
>> have to specify the size for the journal.
>
>
> Also might be useful to know make/model of ssd, plus motherboard make/model
> (in case commenting out size does not fix)!
>
> Regards
>
> Mark
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write


Hi Stefan,

You might want to try running sysprof or perf while the OSDs are running 
during the tests and see where CPU time is being spent.  Also, how are 
you determining how much CPU usage is being used?


Mark

On 11/08/2012 08:58 AM, Stefan Priebe - Profihost AG wrote:

Is there any way to find out why a ceph-osd process takes around 10
times more load on rand 4k writes than on 4k reads?

Stefan

Am 07.11.2012 21:41, schrieb Stefan Priebe:

Hello list,

whiling benchmarking i was wondering, why the ceph-osd load is so
extreme high while having random 4k write i/o.

Here an example while benchmarking:

random 4k write: 16.000 iop/s 180% CPU Load in top from EACH ceph-osd
process

random 4k read: 16.000 iop/s 19% CPU Load in top from EACH ceph-osd
process

seq 4M write: 800MB/s 14% CPU Load in top from EACH ceph-osd process

seq 4M read: 1600MB/s 9% CPU Load in top from EACH ceph-osd process

I can't understand why in this single case the load is so EXTREMELY high.

Greets
Stefan

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write

2012-11-08 Thread Sage Weil

On Thu, 8 Nov 2012, Stefan Priebe - Profihost AG wrote:
> Is there any way to find out why a ceph-osd process takes around 10 times more
> load on rand 4k writes than on 4k reads?

Something like perf or oprofile is probably your best bet.  perf can be 
tedious to deploy, depending on where your kernel is coming from.  
oprofile seems to be deprecated, although I've had good results with it in 
the past.

 would love to see where the CPU is spending most of it's time.  This is 
on current master?  I expect there are still some low-hanging fruit that 
can bring CPU utilization down (or even boost iops).

sage



> 
> Stefan
> 
> Am 07.11.2012 21:41, schrieb Stefan Priebe:
> > Hello list,
> > 
> > whiling benchmarking i was wondering, why the ceph-osd load is so
> > extreme high while having random 4k write i/o.
> > 
> > Here an example while benchmarking:
> > 
> > random 4k write: 16.000 iop/s 180% CPU Load in top from EACH ceph-osd
> > process
> > 
> > random 4k read: 16.000 iop/s 19% CPU Load in top from EACH ceph-osd process
> > 
> > seq 4M write: 800MB/s 14% CPU Load in top from EACH ceph-osd process
> > 
> > seq 4M read: 1600MB/s 9% CPU Load in top from EACH ceph-osd process
> > 
> > I can't understand why in this single case the load is so EXTREMELY high.
> > 
> > Greets
> > Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:

> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>  wrote:
>> 
>>> 2012/11/8 Mark Nelson :
 I haven't done much with IPoIB (just RDMA), but my understanding is that it
 tends to top out at like 15Gb/s.  Some others on this mailing list can
 probably speak more authoritatively.  Even with RDMA you are going to top
 out at around 3.1-3.2GB/s.
>>> 
>>> 15Gb/s is still faster than 10Gbe
>>> But this speed limit seems to be kernel-related and should be the same
>>> even in a 10Gbe environment, or not?
>> 
>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>> interrupt affinity and process binding.
>> 
>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>> handlers to cores 0 and 1 and we will not using process binding. For single 
>> stream Netperf, we do use process binding and bind it to the same core (i.e. 
>> 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use 
>> process binding but we still see ~22 Gb/s.
> 
> Scott, this is very interesting!  Does setting the interrupt affinity 
> make the biggest difference then when you have concurrent netperf 
> processes going?  For some reason I thought that setting interrupt 
> affinity wasn't even guaranteed in linux any more, but this is just some 
> half-remembered recollection from a year or two ago.

We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and 
without affinity:

Default (irqbalance running)   12.8 Gb/s
IRQ balance off13.0 Gb/s
Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script

When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 
Gb/s for a single stream.

>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>> their tuning pdf:
>> 
>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>> 
>> We looked at their interrupt affinity setting scripts and then wrote our own.
>> 
>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>> datagram mode. Mellanox claims that we should get identical performance with 
>> both modes and we are looking into it.
>> 
>> We are getting a new test cluster with FDR HCAs and I will look into those 
>> as well.
> 
> Nice!  At some point I'll probably try to justify getting some FDR cards 
> in house.  I'd definitely like to hear how FDR ends up working for you.

I'll post the numbers when I get access after they are set up.

Scott

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: extreme ceph-osd cpu load for rand. 4k write

Is there any way to find out why a ceph-osd process takes around 10 
times more load on rand 4k writes than on 4k reads?


Stefan

Am 07.11.2012 21:41, schrieb Stefan Priebe:

Hello list,

whiling benchmarking i was wondering, why the ceph-osd load is so
extreme high while having random 4k write i/o.

Here an example while benchmarking:

random 4k write: 16.000 iop/s 180% CPU Load in top from EACH ceph-osd
process

random 4k read: 16.000 iop/s 19% CPU Load in top from EACH ceph-osd process

seq 4M write: 800MB/s 14% CPU Load in top from EACH ceph-osd process

seq 4M read: 1600MB/s 9% CPU Load in top from EACH ceph-osd process

I can't understand why in this single case the load is so EXTREMELY high.

Greets
Stefan

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On 11/08/2012 07:55 AM, Atchley, Scott wrote:

On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta
wrote:

2012/11/8 Mark Nelson :

I haven't done much with IPoIB (just RDMA), but my understanding is that it
tends to top out at like 15Gb/s. Some others on this mailing list can
probably speak more authoritatively. Even with RDMA you are going to top
out at around 3.1-3.2GB/s.

15Gb/s is still faster than 10Gbe
But this speed limit seems to be kernel-related and should be the same
even in a 10Gbe environment, or not?

We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs
(the native IB API), I see ~27 Gb/s between two hosts. When running Sockets
over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use
interrupt affinity and process binding.

For our Ceph testing, we will set the affinity of two of the mlx4 interrupt
handlers to cores 0 and 1 and we will not using process binding. For single
stream Netperf, we do use process binding and bind it to the same core (i.e. 0)
and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use
process binding but we still see ~22 Gb/s.

Scott, this is very interesting! Does setting the interrupt affinity
make the biggest difference then when you have concurrent netperf
processes going? For some reason I thought that setting interrupt
affinity wasn't even guaranteed in linux any more, but this is just some
half-remembered recollection from a year or two ago.

We used all of the Mellanox tuning recommendations for IPoIB available in their
tuning pdf:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

We looked at their interrupt affinity setting scripts and then wrote our own.

Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected
mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we
should get identical performance with both modes and we are looking into it.

We are getting a new test cluster with FDR HCAs and I will look into those as
well.

Nice! At some point I'll probably try to justify getting some FDR cards
in house. I'd definitely like to hear how FDR ends up working for you.

Scott

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[PATCH] rbd: end request on error in rbd_do_request() caller

Only one of the three callers of rbd_do_request() provide a
collection structure to aggregate status.

If an error occurs in rbd_do_request(), have the caller
take care of calling rbd_coll_end_req() if necessary in
that one spot.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index fb727c0..835153e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1128,12 +1128,8 @@ static int rbd_do_request(struct request *rq,
struct ceph_osd_client *osdc;

rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO);
-   if (!rbd_req) {
-   if (coll)
-   rbd_coll_end_req_index(rq, coll, coll_index,
-  (s32) -ENOMEM, len);
+   if (!rbd_req)
return -ENOMEM;
-   }

if (coll) {
rbd_req->coll = coll;
@@ -1208,7 +1204,6 @@ done_err:
bio_chain_put(rbd_req->bio);
ceph_osdc_put_request(osd_req);
 done_pages:
-   rbd_coll_end_req(rbd_req, (s32) ret, len);
kfree(rbd_req);
return ret;
 }
@@ -1361,7 +1356,9 @@ static int rbd_do_op(struct request *rq,
 ops,
 coll, coll_index,
 rbd_req_cb, 0, NULL);
-
+   if (ret < 0)
+   rbd_coll_end_req_index(rq, coll, coll_index,
+   (s32) ret, seg_len);
rbd_destroy_ops(ops);
 done:
kfree(seg_name);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] rbd: a little more cleanup of rbd_rq_fn()

Now that a big hunk in the middle of rbd_rq_fn() has been moved
into its own routine we can simplify it a little more.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   50
+++---
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6aed59b..fb727c0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1649,53 +1649,49 @@ static int rbd_dev_do_request(struct request *rq,
 static void rbd_rq_fn(struct request_queue *q)
 {
struct rbd_device *rbd_dev = q->queuedata;
+   bool read_only = rbd_dev->mapping.read_only;
struct request *rq;

while ((rq = blk_fetch_request(q))) {
-   struct bio *bio;
-   bool do_write;
-   unsigned int size;
-   u64 ofs;
-   struct ceph_snap_context *snapc;
+   struct ceph_snap_context *snapc = NULL;
int result;

dout("fetched request\n");

-   /* filter out block requests we don't understand */
+   /* Filter out block requests we don't understand */
+
if ((rq->cmd_type != REQ_TYPE_FS)) {
__blk_end_request_all(rq, 0);
continue;
}
+   spin_unlock_irq(q->queue_lock);

-   /* deduce our operation (read, write) */
-   do_write = (rq_data_dir(rq) == WRITE);
-   if (do_write && rbd_dev->mapping.read_only) {
-   __blk_end_request_all(rq, -EROFS);
-   continue;
-   }
+   /* Stop writes to a read-only device */

-   spin_unlock_irq(q->queue_lock);
+   result = -EROFS;
+   if (read_only && rq_data_dir(rq) == WRITE)
+   goto out_end_request;
+
+   /* Grab a reference to the snapshot context */

down_read(&rbd_dev->header_rwsem);
+   if (rbd_dev->exists) {
+   snapc = ceph_get_snap_context(rbd_dev->header.snapc);
+   rbd_assert(snapc != NULL);
+   }
+   up_read(&rbd_dev->header_rwsem);

-   if (!rbd_dev->exists) {
+   if (!snapc) {
rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
-   up_read(&rbd_dev->header_rwsem);
dout("request for non-existent snapshot");
-   spin_lock_irq(q->queue_lock);
-   __blk_end_request_all(rq, -ENXIO);
-   continue;
+   result = -ENXIO;
+   goto out_end_request;
}

-   snapc = ceph_get_snap_context(rbd_dev->header.snapc);
-
-   up_read(&rbd_dev->header_rwsem);
-
-   size = blk_rq_bytes(rq);
-   ofs = blk_rq_pos(rq) * SECTOR_SIZE;
-   bio = rq->bio;
-
-   result = rbd_dev_do_request(rq, rbd_dev, snapc, ofs, size, bio);
+   result = rbd_dev_do_request(rq, rbd_dev, snapc,
+   blk_rq_pos(rq) * SECTOR_SIZE,
+   blk_rq_bytes(rq), rq->bio);
+out_end_request:
ceph_put_snap_context(snapc);
spin_lock_irq(q->queue_lock);
if (result < 0)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] rbd: encapsulate handling for a single request

In rbd_rq_fn(), requests are fetched from the block layer and each
request is processed, looping through the request's list of bio's
until they've all been consumed.

Separate the handling for a single request into its own function to
make it a bit easier to see what's going on.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |  119
+++
 1 file changed, 63 insertions(+), 56 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index be18b5f..6aed59b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1585,6 +1585,64 @@ static struct rbd_req_coll *rbd_alloc_coll(int
num_reqs)
return coll;
 }

+static int rbd_dev_do_request(struct request *rq,
+   struct rbd_device *rbd_dev,
+   struct ceph_snap_context *snapc,
+   u64 ofs, unsigned int size,
+   struct bio *bio_chain)
+{
+   int num_segs;
+   struct rbd_req_coll *coll;
+   unsigned int bio_offset;
+   int cur_seg = 0;
+
+   dout("%s 0x%x bytes at 0x%llx\n",
+   rq_data_dir(rq) == WRITE ? "write" : "read",
+   size, (unsigned long long) blk_rq_pos(rq) * SECTOR_SIZE);
+
+   num_segs = rbd_get_num_segments(&rbd_dev->header, ofs, size);
+   if (num_segs <= 0)
+   return num_segs;
+
+   coll = rbd_alloc_coll(num_segs);
+   if (!coll)
+   return -ENOMEM;
+
+   bio_offset = 0;
+   do {
+   u64 limit = rbd_segment_length(rbd_dev, ofs, size);
+   unsigned int clone_size;
+   struct bio *bio_clone;
+
+   BUG_ON(limit > (u64) UINT_MAX);
+   clone_size = (unsigned int) limit;
+   dout("bio_chain->bi_vcnt=%hu\n", bio_chain->bi_vcnt);
+
+   kref_get(&coll->kref);
+
+   /* Pass a cloned bio chain via an osd request */
+
+   bio_clone = bio_chain_clone_range(&bio_chain,
+   &bio_offset, clone_size,
+   GFP_ATOMIC);
+   if (bio_clone)
+   (void) rbd_do_op(rq, rbd_dev, snapc,
+   ofs, clone_size,
+   bio_clone, coll, cur_seg);
+   else
+   rbd_coll_end_req_index(rq, coll, cur_seg,
+   (s32) -ENOMEM,
+   clone_size);
+   size -= clone_size;
+   ofs += clone_size;
+
+   cur_seg++;
+   } while (size > 0);
+   kref_put(&coll->kref, rbd_coll_release);
+
+   return 0;
+}
+
 /*
  * block device queue callback
  */
@@ -1598,10 +1656,8 @@ static void rbd_rq_fn(struct request_queue *q)
bool do_write;
unsigned int size;
u64 ofs;
-   int num_segs, cur_seg = 0;
-   struct rbd_req_coll *coll;
struct ceph_snap_context *snapc;
-   unsigned int bio_offset;
+   int result;

dout("fetched request\n");

@@ -1639,60 +1695,11 @@ static void rbd_rq_fn(struct request_queue *q)
ofs = blk_rq_pos(rq) * SECTOR_SIZE;
bio = rq->bio;

-   dout("%s 0x%x bytes at 0x%llx\n",
-do_write ? "write" : "read",
-size, (unsigned long long) blk_rq_pos(rq) * SECTOR_SIZE);
-
-   num_segs = rbd_get_num_segments(&rbd_dev->header, ofs, size);
-   if (num_segs <= 0) {
-   spin_lock_irq(q->queue_lock);
-   __blk_end_request_all(rq, num_segs);
-   ceph_put_snap_context(snapc);
-   continue;
-   }
-   coll = rbd_alloc_coll(num_segs);
-   if (!coll) {
-   spin_lock_irq(q->queue_lock);
-   __blk_end_request_all(rq, -ENOMEM);
-   ceph_put_snap_context(snapc);
-   continue;
-   }
-
-   bio_offset = 0;
-   do {
-   u64 limit = rbd_segment_length(rbd_dev, ofs, size);
-   unsigned int chain_size;
-   struct bio *bio_chain;
-
-   BUG_ON(limit > (u64) UINT_MAX);
-   chain_size = (unsigned int) limit;
-   dout("rq->bio->bi_vcnt=%hu\n", rq->bio->bi_vcnt);
-
-   kref_get(&coll->kref);
-
-   /* Pass a cloned bio chain via an osd request */
-
-   bio_chain = bio_chain_clone_range(&bio,
-   &bio_offset, chain_size,
-   GFP_ATOMIC);
-   if (bio_chain)
-

[PATCH 0/2] rbd: clean up rbd_rq_fn()

Some refactoring to improve readability.-Alex

[PATCH 1/2] rbd: encapsulate handling for a single request
[PATCH 2/2] rbd: a little more cleanup of rbd_rq_fn()
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] rbd: be picky about osd request status type

The result field in a ceph osd reply header is a signed 32-bit type,
but rbd code often casually uses int to represent it.

The following changes the types of variables that handle this result
value to be "s32" instead of "int" to be completely explicit about
it.  Only at the point we pass that result to __blk_end_request()
does the type get converted to the plain old int defined for that
interface.

There is almost certainly no binary impact of this change, but I
prefer to show the exact size and signedness of the value since we
know it.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index caff180..be18b5f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -171,7 +171,7 @@ struct rbd_client {
  */
 struct rbd_req_status {
int done;
-   int rc;
+   s32 rc;
u64 bytes;
 };

@@ -1055,13 +1055,13 @@ static void rbd_destroy_ops(struct
ceph_osd_req_op *ops)
 static void rbd_coll_end_req_index(struct request *rq,
   struct rbd_req_coll *coll,
   int index,
-  int ret, u64 len)
+  s32 ret, u64 len)
 {
struct request_queue *q;
int min, max, i;

dout("rbd_coll_end_req_index %p index %d ret %d len %llu\n",
-coll, index, ret, (unsigned long long) len);
+coll, index, (int) ret, (unsigned long long) len);

if (!rq)
return;
@@ -1082,7 +1082,7 @@ static void rbd_coll_end_req_index(struct request *rq,
max++;

for (i = min; istatus[i].rc,
+   __blk_end_request(rq, (int) coll->status[i].rc,
  coll->status[i].bytes);
coll->num_done++;
kref_put(&coll->kref, rbd_coll_release);
@@ -1091,7 +1091,7 @@ static void rbd_coll_end_req_index(struct request *rq,
 }

 static void rbd_coll_end_req(struct rbd_request *rbd_req,
-int ret, u64 len)
+s32 ret, u64 len)
 {
rbd_coll_end_req_index(rbd_req->rq,
rbd_req->coll, rbd_req->coll_index,
@@ -1131,7 +1131,7 @@ static int rbd_do_request(struct request *rq,
if (!rbd_req) {
if (coll)
rbd_coll_end_req_index(rq, coll, coll_index,
-  -ENOMEM, len);
+  (s32) -ENOMEM, len);
return -ENOMEM;
}

@@ -1208,7 +1208,7 @@ done_err:
bio_chain_put(rbd_req->bio);
ceph_osdc_put_request(osd_req);
 done_pages:
-   rbd_coll_end_req(rbd_req, ret, len);
+   rbd_coll_end_req(rbd_req, (s32) ret, len);
kfree(rbd_req);
return ret;
 }
@@ -1221,7 +1221,7 @@ static void rbd_req_cb(struct ceph_osd_request
*osd_req, struct ceph_msg *msg)
struct rbd_request *rbd_req = osd_req->r_priv;
struct ceph_osd_reply_head *replyhead;
struct ceph_osd_op *op;
-   __s32 rc;
+   s32 rc;
u64 bytes;
int read_op;

@@ -1229,14 +1229,14 @@ static void rbd_req_cb(struct ceph_osd_request
*osd_req, struct ceph_msg *msg)
replyhead = msg->front.iov_base;
WARN_ON(le32_to_cpu(replyhead->num_ops) == 0);
op = (void *)(replyhead + 1);
-   rc = le32_to_cpu(replyhead->result);
+   rc = (s32) le32_to_cpu(replyhead->result);
bytes = le64_to_cpu(op->extent.length);
read_op = (le16_to_cpu(op->op) == CEPH_OSD_OP_READ);

dout("rbd_req_cb bytes=%llu readop=%d rc=%d\n",
(unsigned long long) bytes, read_op, (int) rc);

-   if (rc == -ENOENT && read_op) {
+   if (rc == (s32) -ENOENT && read_op) {
zero_bio_chain(rbd_req->bio, 0);
rc = 0;
} else if (rc == 0 && read_op && bytes < rbd_req->len) {
@@ -1681,7 +1681,8 @@ static void rbd_rq_fn(struct request_queue *q)
bio_chain, coll, cur_seg);
else
rbd_coll_end_req_index(rq, coll, cur_seg,
-  -ENOMEM, chain_size);
+  (s32) -ENOMEM,
+  chain_size);
size -= chain_size;
ofs += chain_size;

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] rbd: standardize ceph_osd_request variable names

There are spots where a ceph_osds_request pointer variable is given
the name "req".  Since we're dealing with (at least) three types of
requests (block layer, rbd, and osd), I find this slightly
distracting.

Change such instances to use "osd_req" consistently to make the
abstraction represented a little more obvious.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   60
++-
 1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9d8b406..caff180 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1113,12 +1113,12 @@ static int rbd_do_request(struct request *rq,
  struct ceph_osd_req_op *ops,
  struct rbd_req_coll *coll,
  int coll_index,
- void (*rbd_cb)(struct ceph_osd_request *req,
-struct ceph_msg *msg),
+ void (*rbd_cb)(struct ceph_osd_request *,
+struct ceph_msg *),
  struct ceph_osd_request **linger_req,
  u64 *ver)
 {
-   struct ceph_osd_request *req;
+   struct ceph_osd_request *osd_req;
struct ceph_file_layout *layout;
int ret;
u64 bno;
@@ -1145,67 +1145,68 @@ static int rbd_do_request(struct request *rq,
(unsigned long long) len, coll, coll_index);

osdc = &rbd_dev->rbd_client->client->osdc;
-   req = ceph_osdc_alloc_request(osdc, flags, snapc, ops,
+   osd_req = ceph_osdc_alloc_request(osdc, flags, snapc, ops,
false, GFP_NOIO, pages, bio);
-   if (!req) {
+   if (!osd_req) {
ret = -ENOMEM;
goto done_pages;
}

-   req->r_callback = rbd_cb;
+   osd_req->r_callback = rbd_cb;

rbd_req->rq = rq;
rbd_req->bio = bio;
rbd_req->pages = pages;
rbd_req->len = len;

-   req->r_priv = rbd_req;
+   osd_req->r_priv = rbd_req;

-   reqhead = req->r_request->front.iov_base;
+   reqhead = osd_req->r_request->front.iov_base;
reqhead->snapid = cpu_to_le64(CEPH_NOSNAP);

-   strncpy(req->r_oid, object_name, sizeof(req->r_oid));
-   req->r_oid_len = strlen(req->r_oid);
+   strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid));
+   osd_req->r_oid_len = strlen(osd_req->r_oid);

-   layout = &req->r_file_layout;
+   layout = &osd_req->r_file_layout;
memset(layout, 0, sizeof(*layout));
layout->fl_stripe_unit = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
layout->fl_stripe_count = cpu_to_le32(1);
layout->fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
layout->fl_pg_pool = cpu_to_le32((int) rbd_dev->spec->pool_id);
ret = ceph_calc_raw_layout(osdc, layout, snapid, ofs, &len, &bno,
-  req, ops);
+  osd_req, ops);
rbd_assert(ret == 0);

-   ceph_osdc_build_request(req, ofs, &len,
+   ceph_osdc_build_request(osd_req, ofs, &len,
ops,
snapc,
&mtime,
-   req->r_oid, req->r_oid_len);
+   osd_req->r_oid, osd_req->r_oid_len);

if (linger_req) {
-   ceph_osdc_set_request_linger(osdc, req);
-   *linger_req = req;
+   ceph_osdc_set_request_linger(osdc, osd_req);
+   *linger_req = osd_req;
}

-   ret = ceph_osdc_start_request(osdc, req, false);
+   ret = ceph_osdc_start_request(osdc, osd_req, false);
if (ret < 0)
goto done_err;

if (!rbd_cb) {
-   ret = ceph_osdc_wait_request(osdc, req);
+   u64 version;
+
+   ret = ceph_osdc_wait_request(osdc, osd_req);
+   version = le64_to_cpu(osd_req->r_reassert_version.version);
if (ver)
-   *ver = le64_to_cpu(req->r_reassert_version.version);
-   dout("reassert_ver=%llu\n",
-   (unsigned long long)
-   le64_to_cpu(req->r_reassert_version.version));
-   ceph_osdc_put_request(req);
+   *ver = version;
+   dout("reassert_ver=%llu\n", (unsigned long long) version);
+   ceph_osdc_put_request(osd_req);
}
return ret;

 done_err:
bio_chain_put(rbd_req->bio);
-   ceph_osdc_put_request(req);
+   ceph_osdc_put_request(osd_req);
 done_pages:
rbd_coll_end_req(rbd_req, ret, len);
kfree(rbd_req);
@@ -1215,9 +1216,9 @@ done_pages:
 /*
  * Ceph osd op callback
  */
-static void rbd_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg)
+static void rbd_req_cb(struct ceph_osd_request *os

[PATCH 1/3] rbd: standardize rbd_request variable names

There are two names used for items of rbd_request structure type:
"req" and "req_data".  The former name is also used to represent
items of pointers to struct ceph_osd_request.

Change all variables that have these names so they are instead
called "rbd_req" consistently.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   50
++
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5de49a1..9d8b406 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1090,10 +1090,12 @@ static void rbd_coll_end_req_index(struct
request *rq,
spin_unlock_irq(q->queue_lock);
 }

-static void rbd_coll_end_req(struct rbd_request *req,
+static void rbd_coll_end_req(struct rbd_request *rbd_req,
 int ret, u64 len)
 {
-   rbd_coll_end_req_index(req->rq, req->coll, req->coll_index, ret, len);
+   rbd_coll_end_req_index(rbd_req->rq,
+   rbd_req->coll, rbd_req->coll_index,
+   ret, len);
 }

 /*
@@ -1121,12 +1123,12 @@ static int rbd_do_request(struct request *rq,
int ret;
u64 bno;
struct timespec mtime = CURRENT_TIME;
-   struct rbd_request *req_data;
+   struct rbd_request *rbd_req;
struct ceph_osd_request_head *reqhead;
struct ceph_osd_client *osdc;

-   req_data = kzalloc(sizeof(*req_data), GFP_NOIO);
-   if (!req_data) {
+   rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO);
+   if (!rbd_req) {
if (coll)
rbd_coll_end_req_index(rq, coll, coll_index,
   -ENOMEM, len);
@@ -1134,8 +1136,8 @@ static int rbd_do_request(struct request *rq,
}

if (coll) {
-   req_data->coll = coll;
-   req_data->coll_index = coll_index;
+   rbd_req->coll = coll;
+   rbd_req->coll_index = coll_index;
}

dout("rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n",
@@ -1152,12 +1154,12 @@ static int rbd_do_request(struct request *rq,

req->r_callback = rbd_cb;

-   req_data->rq = rq;
-   req_data->bio = bio;
-   req_data->pages = pages;
-   req_data->len = len;
+   rbd_req->rq = rq;
+   rbd_req->bio = bio;
+   rbd_req->pages = pages;
+   rbd_req->len = len;

-   req->r_priv = req_data;
+   req->r_priv = rbd_req;

reqhead = req->r_request->front.iov_base;
reqhead->snapid = cpu_to_le64(CEPH_NOSNAP);
@@ -1202,11 +1204,11 @@ static int rbd_do_request(struct request *rq,
return ret;

 done_err:
-   bio_chain_put(req_data->bio);
+   bio_chain_put(rbd_req->bio);
ceph_osdc_put_request(req);
 done_pages:
-   rbd_coll_end_req(req_data, ret, len);
-   kfree(req_data);
+   rbd_coll_end_req(rbd_req, ret, len);
+   kfree(rbd_req);
return ret;
 }

@@ -1215,7 +1217,7 @@ done_pages:
  */
 static void rbd_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg)
 {
-   struct rbd_request *req_data = req->r_priv;
+   struct rbd_request *rbd_req = req->r_priv;
struct ceph_osd_reply_head *replyhead;
struct ceph_osd_op *op;
__s32 rc;
@@ -1234,20 +1236,20 @@ static void rbd_req_cb(struct ceph_osd_request
*req, struct ceph_msg *msg)
(unsigned long long) bytes, read_op, (int) rc);

if (rc == -ENOENT && read_op) {
-   zero_bio_chain(req_data->bio, 0);
+   zero_bio_chain(rbd_req->bio, 0);
rc = 0;
-   } else if (rc == 0 && read_op && bytes < req_data->len) {
-   zero_bio_chain(req_data->bio, bytes);
-   bytes = req_data->len;
+   } else if (rc == 0 && read_op && bytes < rbd_req->len) {
+   zero_bio_chain(rbd_req->bio, bytes);
+   bytes = rbd_req->len;
}

-   rbd_coll_end_req(req_data, rc, bytes);
+   rbd_coll_end_req(rbd_req, rc, bytes);

-   if (req_data->bio)
-   bio_chain_put(req_data->bio);
+   if (rbd_req->bio)
+   bio_chain_put(rbd_req->bio);

ceph_osdc_put_request(req);
-   kfree(req_data);
+   kfree(rbd_req);
 }

 static void rbd_simple_req_cb(struct ceph_osd_request *req, struct
ceph_msg *msg)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/3] rbd: a few picky changes

These three changes are pretty trivial. -Alex

[PATCH 1/3] rbd: standardize rbd_request variable names
[PATCH 2/3] rbd: standardize ceph_osd_request variable names
[PATCH 3/3] rbd: be picky about osd request status type
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
 wrote:

> 2012/11/8 Mark Nelson :
>> I haven't done much with IPoIB (just RDMA), but my understanding is that it
>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>> probably speak more authoritatively.  Even with RDMA you are going to top
>> out at around 3.1-3.2GB/s.
> 
> 15Gb/s is still faster than 10Gbe
> But this speed limit seems to be kernel-related and should be the same
> even in a 10Gbe environment, or not?

We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
(the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
interrupt affinity and process binding.

For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
handlers to cores 0 and 1 and we will not using process binding. For single 
stream Netperf, we do use process binding and bind it to the same core (i.e. 0) 
and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use 
process binding but we still see ~22 Gb/s.

We used all of the Mellanox tuning recommendations for IPoIB available in their 
tuning pdf:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

We looked at their interrupt affinity setting scripts and then wrote our own.

Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected 
mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. 
Mellanox claims that we should get identical performance with both modes and we 
are looking into it.

We are getting a new test cluster with FDR HCAs and I will look into those as 
well.

Scott--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed


On 11/08/2012 02:45 AM, Stefan Priebe - Profihost AG wrote:

Am 08.11.2012 01:59, schrieb Mark Nelson:

There's also the context switching overhead.  It'd be interesting to
know how much the writer processes were shifting around on cores.

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


in this case, is fio bouncing around between cores?




Stefan, what tool were you using to do writes?

as always: fio ;-)


You could try using numactl to pin fio to a specific core.  Also, it may 
be interesting to try multiple concurrent fio processes, and then 
concurrent fio processes with each pinned.




Stefan


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] mds: Clear lock flushed if replica is waiting for AC_LOCKFLUSHED

2012-11-08 Thread Yan, Zheng

From: "Yan, Zheng" 

So eval_gather() will not skip calling scatter_writebehind(),
otherwise the replica lock may be in flushing state forever.

Signed-off-by: Yan, Zheng 
---
 src/mds/Locker.cc | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index a1f957a..e2b1ff4 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -4383,8 +4383,12 @@ void Locker::handle_file_lock(ScatterLock *lock, MLock 
*m)
 if (lock->get_state() == LOCK_MIX_LOCK ||
lock->get_state() == LOCK_MIX_LOCK2 ||
lock->get_state() == LOCK_MIX_EXCL ||
-   lock->get_state() == LOCK_MIX_TSYN)
+   lock->get_state() == LOCK_MIX_TSYN) {
   lock->decode_locked_state(m->get_data());
+  // replica is waiting for AC_LOCKFLUSHED, eval_gather() should not
+  // delay calling scatter_writebehind().
+  lock->clear_flushed();
+}
 
 if (lock->is_gathering()) {
   dout(7) << "handle_file_lock " << *in << " from " << from
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] mds: Don't expire log segment before it's fully flushed

2012-11-08 Thread Yan, Zheng

From: "Yan, Zheng" 

Expiring log segment before it's fully flushed may cause various
issues during log replay.

Signed-off-by: Yan, Zheng 
---
 src/leveldb  | 2 +-
 src/mds/MDLog.cc | 8 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/mds/MDLog.cc b/src/mds/MDLog.cc
index cac5615..b02c181 100644
--- a/src/mds/MDLog.cc
+++ b/src/mds/MDLog.cc
@@ -330,6 +330,11 @@ void MDLog::trim(int m)
 assert(ls);
 p++;
 
+if (ls->end > journaler->get_write_safe_pos()) {
+  dout(5) << "trim segment " << ls->offset << ", not fully flushed yet, 
safe "
+ << journaler->get_write_safe_pos() << " < end " << ls->end << 
dendl;
+  break;
+}
 if (expiring_segments.count(ls)) {
   dout(5) << "trim already expiring segment " << ls->offset << ", " << 
ls->num_events << " events" << dendl;
 } else if (expired_segments.count(ls)) {
@@ -412,9 +417,6 @@ void MDLog::_expired(LogSegment *ls)
 
   if (!capped && ls == get_current_segment()) {
 dout(5) << "_expired not expiring " << ls->offset << ", last one and 
!capped" << dendl;
-  } else if (ls->end > journaler->get_write_safe_pos()) {
-dout(5) << "_expired not expiring " << ls->offset << ", not fully flushed 
yet, safe "
-   << journaler->get_write_safe_pos() << " < end " << ls->end << dendl;
   } else {
 // expired.
 expired_segments.insert(ls);
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: clock syncronisation


Am 08.11.2012 13:00, schrieb Wido den Hollander:



On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:

Hello list,

is there any prefered way to use clock syncronisation?

I've tried running openntpd and ntpd on all servers but i'm still
getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped
0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped
0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped
0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped
0.063360s in the future, clocks not synchronized



What NTP server are you using? Network latency might cause the clocks
not to be synchronised.


pool.ntp.org

But i've now switched to debian chrony instead of ntp and that seems to 
work fine.


Haven't seen any messages again.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: clock syncronisation

2012-11-08 Thread Andrey Korolyov

On Thu, Nov 8, 2012 at 4:00 PM, Wido den Hollander  wrote:
>
>
> On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:
>>
>> Hello list,
>>
>> is there any prefered way to use clock syncronisation?
>>
>> I've tried running openntpd and ntpd on all servers but i'm still getting:
>> 2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped
>> 0.063136s in the future, clocks not synchronized
>> 2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped
>> 0.063285s in the future, clocks not synchronized
>> 2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped
>> 0.063301s in the future, clocks not synchronized
>> 2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped
>> 0.063360s in the future, clocks not synchronized
>>
>
> What NTP server are you using? Network latency might cause the clocks not to
> be synchronised.
>

There is no real reason to worry about, quorum may suffer only large
desync delays as some seconds or more. If you have unsynchronized
clocks on mon hodes with such big delays, requests which have issued
from cli, e.g. creating new connection may wait as long as delay
itself, depend of clock value of selected monitor node.

Clock drift caused mostly by heavy load, but of course playing with
clocksources may have some effect(since most systems already use HPET
timer, there is only one way, to sync with ntp server as frequent as
you want to prevent drift).


>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: clock syncronisation

2012-11-08 Thread Wido den Hollander




On 08-11-12 10:04, Stefan Priebe - Profihost AG wrote:

Hello list,

is there any prefered way to use clock syncronisation?

I've tried running openntpd and ntpd on all servers but i'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped
0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped
0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped
0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped
0.063360s in the future, clocks not synchronized



What NTP server are you using? Network latency might cause the clocks 
not to be synchronised.


Wido


Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: problem with hanging cluster

2012-11-08 Thread Adam Ochmański


W dniu 08.11.2012 12:14, Adam Ochmański pisze:

Hi,
our test cluster going stuck every time when one of our osd host going
down, when mising osd go to "up" state and recovery go to 100% cluster
still not working propertly.


I forgot add version of ceph i use: 0.53-422-g2d20f3a

--
Best,
blink
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed


Am 08.11.2012 10:05, schrieb Alexandre DERUMIER:

Do you have tried to compare virtio-blk and virtio-scsi ?

How to change? Right now i'm using the PVE defaults => scsi-hd.


(virtio-blk is "classic" virtio ;)


Do you have tried directly from the host with the rbd kernel module ?
No don't know how to use ;-)

http://ceph.com/docs/master/rbd/rbd-ko/
#modprobe rbd
#sudo rbd map {image-name} --pool {pool-name} --id {user-name}


this gives me also 8000 iops on the host with 3.6 Ghz. So this is the 
same like in KVM.


Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed

2012-11-08 Thread Alexandre DERUMIER

> Do you have tried to compare virtio-blk and virtio-scsi ? 
>>How to change? Right now i'm using the PVE defaults => scsi-hd. 

(virtio-blk is "classic" virtio ;)

>> Do you have tried directly from the host with the rbd kernel module ? 
>>No don't know how to use ;-) 
http://ceph.com/docs/master/rbd/rbd-ko/
#modprobe rbd
#sudo rbd map {image-name} --pool {pool-name} --id {user-name}

(then you'll have a /dev/rbd1)




- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Alexandre DERUMIER"  
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org, 
"Mark Nelson"  
Envoyé: Jeudi 8 Novembre 2012 10:02:23 
Objet: Re: less cores more iops / speed 

Am 08.11.2012 09:58, schrieb Alexandre DERUMIER: 
>>> What do you mean by that? I'm talking about the KVM guest not about the 
>>> ceph nodes. 
> 
> Do you have tried to compare virtio-blk and virtio-scsi ? 
How to change? Right now i'm using the PVE defaults => scsi-hd. 

> Do you have tried directly from the host with the rbd kernel module ? 
No don't know how to use ;-) 

Stefan 


> - Mail original - 
> 
> De: "Stefan Priebe - Profihost AG"  
> À: "Mark Nelson"  
> Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org 
> Envoyé: Jeudi 8 Novembre 2012 09:45:17 
> Objet: Re: less cores more iops / speed 
> 
> Am 08.11.2012 01:59, schrieb Mark Nelson: 
>> There's also the context switching overhead. It'd be interesting to 
>> know how much the writer processes were shifting around on cores. 
> What do you mean by that? I'm talking about the KVM guest not about the 
> ceph nodes. 
> 
>> Stefan, what tool were you using to do writes? 
> as always: fio ;-) 
> 
> Stefan 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

clock syncronisation


Hello list,

is there any prefered way to use clock syncronisation?

I've tried running openntpd and ntpd on all servers but i'm still getting:
2012-11-08 09:55:38.255928 mon.0 [WRN] message from mon.2 was stamped 
0.063136s in the future, clocks not synchronized
2012-11-08 09:55:39.328639 mon.0 [WRN] message from mon.2 was stamped 
0.063285s in the future, clocks not synchronized
2012-11-08 09:55:39.328833 mon.0 [WRN] message from mon.2 was stamped 
0.063301s in the future, clocks not synchronized
2012-11-08 09:55:40.819975 mon.0 [WRN] message from mon.2 was stamped 
0.063360s in the future, clocks not synchronized


Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed


Am 08.11.2012 09:58, schrieb Alexandre DERUMIER:

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


Do you have tried to compare virtio-blk and virtio-scsi ?

How to change? Right now i'm using the PVE defaults => scsi-hd.


Do you have tried directly from the host with the rbd kernel module ?

No don't know how to use ;-)

Stefan



- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Mark Nelson" 
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org
Envoyé: Jeudi 8 Novembre 2012 09:45:17
Objet: Re: less cores more iops / speed

Am 08.11.2012 01:59, schrieb Mark Nelson:

There's also the context switching overhead. It'd be interesting to
know how much the writer processes were shifting around on cores.

What do you mean by that? I'm talking about the KVM guest not about the
ceph nodes.


Stefan, what tool were you using to do writes?

as always: fio ;-)

Stefan


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed

2012-11-08 Thread Alexandre DERUMIER

>>What do you mean by that? I'm talking about the KVM guest not about the 
>>ceph nodes. 

Do you have tried to compare virtio-blk and virtio-scsi ?

Do you have tried directly from the host with the rbd kernel module ?



- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Mark Nelson"  
Cc: "Joao Eduardo Luis" , ceph-devel@vger.kernel.org 
Envoyé: Jeudi 8 Novembre 2012 09:45:17 
Objet: Re: less cores more iops / speed 

Am 08.11.2012 01:59, schrieb Mark Nelson: 
> There's also the context switching overhead. It'd be interesting to 
> know how much the writer processes were shifting around on cores. 
What do you mean by that? I'm talking about the KVM guest not about the 
ceph nodes. 

> Stefan, what tool were you using to do writes? 
as always: fio ;-) 

Stefan 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: syncfs slower than without syncfs


done:
http://tracker.newdream.net/issues/3461
Am 08.11.2012 04:09, schrieb Josh Durgin:

On 11/07/2012 08:26 AM, Stefan Priebe wrote:

Am 07.11.2012 16:04, schrieb Mark Nelson:

Whew, glad you found the problem Stefan!  I was starting to wonder what
was going on. :)  Do you mind filling a bug about the control
dependencies?


Sure where should i fill it in?


http://www.tracker.newdream.net/projects/ceph/issues/new


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: less cores more iops / speed