Re: [ceph-users] Scaling RBD module

2013-09-24 Thread Somnath Roy
Hi Sage,
Thanks for your input. I will try those. Please see my response inline.

Thanks & Regards
Somnath

-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Tuesday, September 24, 2013 3:47 PM
To: Somnath Roy
Cc: Travis Rhoden; Josh Durgin; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Scaling RBD module

Hi Somnath!

On Tue, 24 Sep 2013, Somnath Roy wrote:
>
> Hi Sage,
>
> We did quite a few experiments to see how Ceph read performance can scale up.
> Here is the summary.
>
>
>
> 1.
>
> First we tried to see how far a single-node cluster with one OSD can
> scale up. We started with the Cuttlefish release, with the entire OSD file
> system on an SSD. What we saw is that with 4K-sized objects and a single
> rados client on a dedicated 10G network, throughput can't go beyond a certain
> point.

Are you using 'rados bench' to generate this load or something else?
We've noticed that individual rados bench commands do not scale beyond a point 
but have never looked into it; the problem may be in the bench code and not in 
librados or SimpleMessenger.

[Somnath] Yes, we are using 'rados bench' to generate the load and measure 
performance at the rados level.
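Roughly like the following, in case it matters -- the pool name and concurrency 
here are just examples of what we used for the 4K runs, and depending on the 
ceph version the write pass may need --no-cleanup so the seq pass has objects 
to read back:

rados bench -p testpool 120 write -b 4096 -t 32 --no-cleanup
rados bench -p testpool 120 seq -t 32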

> We dug through the code and found out that SimpleMessenger opens a
> single socket connection (per client) to talk to the OSD. Also, we saw
> there is only one dispatch queue (Dispatch thread) per SimpleMessenger to
> carry these requests to the OSD. We started adding more dispatcher threads
> to the dispatch queue and rearranged several locks in Pipe.cc to identify the
> bottleneck. What we ended up discovering is that there are bottlenecks
> both upstream and downstream at the OSD level, and changing the locking
> scheme in the IO path will affect a lot of other code (that we don't even
> know about).
>
> So, we stopped that activity and started working around the upstream
> bottleneck by introducing more clients to the single OSD. What we saw is that
> a single OSD does scale, but with a lot of CPU utilization. To produce ~40K
> iops (4K) it is taking almost 12 cores of CPU.

Just to make sure I understand: the single OSD dispatch queue does not become a 
problem with multiple clients?

[Somnath] We saw that with a single client/single OSD, increasing the dispatch 
threads up to 3 gave some improvement, but not beyond 3.
This is what we were also wondering! Looking at the architecture, it seems that 
if the upstream bottleneck is removed, this might be the next bottleneck. The 
next IO request will not reach the OSD worker queue until OSD::ms_dispatch() 
completes, and there is a lot of stuff happening in that function.
At the top of the function it takes the OSD-level lock, so increasing the 
threads is not helping; I think rearranging the locks will help here.

Possibilities that come to mind:

- DispatchQueue is doing some funny stuff to keep individual clients'
messages ordered but to fairly process requests from multiple clients.
There could easily be a problem with the per-client queue portion of this.

- Pipe's use of MSG_MORE is making the TCP stream efficient... you might try 
setting 'ms tcp nodelay = false'.

[Somnath] I will try this.
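Just to confirm, I would add something like the following to ceph.conf on the 
client node and restart the client, right? (Putting it under [global] is just 
my guess; it may belong in a [client] section instead.)

[global]
        ms tcp nodelay = false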

- The message encode is happening in the thread that sends messages over the 
wire.  Maybe doing it in send_message() instead of writer() will keep that on a 
separate core from the thread that's shoveling data into the socket.

[Somnath] Are you suggesting that we move the following code snippet from 
Pipe::writer() to SimpleMessenger::_send_message()?

// encode and copy out of *m
m->encode(connection_state->get_features(), 
!msgr->cct->_conf->ms_nocrc);

> Another point: I didn't see this single OSD scale with the Dumpling
> release with multiple clients!! Something changed...

What is it with dumpling?

[Somnath] We tried to compare, but there were a lot of changes, so we gave up 
:-(... But I think eventually, if we want to increase the overall throughput, we 
need to make the individual OSD efficient (in both CPU and performance). So, we 
will definitely come back to this.

> 2.   After that, we set up a proper cluster with 3 high-performing
> nodes and 30 OSDs in total. Here also, we are seeing that a single rados bench
> client as well as a single rbd client instance does not scale beyond a certain
> limit. It is not able to generate much load, as node cpu utilization
> remains very low. But running multiple client instances, the performance
> scales until it hits the cpu limit.
>
> So, it is pretty clear we are not able to saturate anything with a
> single client, and that's why the 'noshare' option was very helpful for
> measuring the rbd performance benchmark. I have single-OSD/single-client
> callgrind data attached here.

Something from perf that shows a call graph would be more helpful to identify 
where things are waiting.  We haven't done much optimizing at this level at 
all, so these results aren't entirely surprising.
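
For example, something along these lines (assuming a single ceph-osd on the box 
so the pidof lookup is unambiguous; for the client side, attach to the rados 
bench or fio pid instead):

perf record -g -p $(pidof ceph-osd) -- sleep 30
perf report --stdio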

Re: [ceph-users] Scaling RBD module

2013-09-24 Thread Sage Weil
Hi Somnath!

On Tue, 24 Sep 2013, Somnath Roy wrote:
> 
> Hi Sage,
> 
> We did quite a few experiments to see how Ceph read performance can scale up.
> Here is the summary.
> 
>  
> 
> 1.
> 
> First we tried to see how far a single-node cluster with one OSD can scale
> up. We started with the Cuttlefish release, with the entire OSD file system on
> an SSD. What we saw is that with 4K-sized objects and a single rados client on
> a dedicated 10G network, throughput can't go beyond a certain point.

Are you using 'rados bench' to generate this load or something else?  
We've noticed that individual rados bench commands do not scale beyond a 
point but have never looked into it; the problem may be in the bench code 
and not in librados or SimpleMessenger.

> We dug through the code and found out that SimpleMessenger opens a single
> socket connection (per client) to talk to the OSD. Also, we saw there is only
> one dispatch queue (Dispatch thread) per SimpleMessenger to carry these
> requests to the OSD. We started adding more dispatcher threads to the dispatch
> queue and rearranged several locks in Pipe.cc to identify the bottleneck. What
> we ended up discovering is that there are bottlenecks both upstream and
> downstream at the OSD level, and changing the locking scheme in the IO path
> will affect a lot of other code (that we don't even know about).
> 
> So, we stopped that activity and started working around the upstream
> bottleneck by introducing more clients to the single OSD. What we saw is that
> a single OSD does scale, but with a lot of CPU utilization. To produce ~40K
> iops (4K) it is taking almost 12 cores of CPU.

Just to make sure I understand: the single OSD dispatch queue does not 
become a problem with multiple clients?

Possibilities that come to mind:

- DispatchQueue is doing some funny stuff to keep individual clients' 
messages ordered but to fairly process requests from multiple clients.  
There could easily be a problem with the per-client queue portion of this.

- Pipe's use of MSG_MORE is making the TCP stream efficient... you might 
try setting 'ms tcp nodelay = false'.

- The message encode is happening in the thread that sends messages over 
the wire.  Maybe doing it in send_message() instead of writer() will keep 
that on a separate core from the thread that's shoveling data into the 
socket.

> Another point: I didn't see this single OSD scale with the Dumpling release
> with multiple clients!! Something changed...

What is it with dumpling?

> 2.   After that, we set up a proper cluster with 3 high-performing nodes and
> 30 OSDs in total. Here also, we are seeing that a single rados bench client as
> well as a single rbd client instance does not scale beyond a certain limit. It
> is not able to generate much load, as node cpu utilization remains very low.
> But running multiple client instances, the performance scales until it hits
> the cpu limit.
> 
> So, it is pretty clear we are not able to saturate anything with a single
> client, and that's why the 'noshare' option was very helpful for measuring the
> rbd performance benchmark. I have single-OSD/single-client callgrind data
> attached here.

Something from perf that shows a call graph would be more helpful to 
identify where things are waiting.  We haven't done much optimizing at 
this level at all, so these results aren't entirely surprising.

> Now, I am doing the benchmark for radosgw and I think I am stuck with a
> similar bottleneck here. Could you please confirm whether radosgw also
> opens a single client instance to the cluster?
>  

It is: each radosgw has a single librados client instance.

> If so, is there any option similar to 'noshare' in this case? Here also,
> when creating multiple radosgw instances on separate nodes, the performance
> is scaling.

No, but

> BTW, is there a way to run multiple radosgw instances on a single node, or
> does it have to be one per node?

yes.  You just need to make sure they have different fastcgi sockets they 
listen on and probably set up a separate web server in front of each one.
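
For example, something like this in ceph.conf on the gateway node -- the 
instance names and paths are just placeholders, and each instance also needs 
its own keyring and its own web server/vhost pointing at its fastcgi socket:

[client.radosgw.gw1]
        rgw socket path = /var/run/ceph/radosgw.gw1.sock
        log file = /var/log/ceph/radosgw.gw1.log

[client.radosgw.gw2]
        rgw socket path = /var/run/ceph/radosgw.gw2.sock
        log file = /var/log/ceph/radosgw.gw2.log

Then start them separately:

radosgw -n client.radosgw.gw1
radosgw -n client.radosgw.gw2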

I think the next step to understanding what is going on is getting the 
right profiling tools in place so we can see where the client threads are 
spending their (non-idle and idle) time...

sage


> 
>  
> 
> Thanks & Regards
> 
> Somnath
> 
>  
> 
>    
> 
>  
> 
> -Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, September 24, 2013 2:16 PM
> To: Travis Rhoden
> Cc: Josh Durgin; ceph-de...@vger.kernel.org; Anirban Ray;
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Scaling RBD module
> 
>  

Re: [ceph-users] Scaling RBD module

2013-09-24 Thread Somnath Roy
Hi Sage,

We did quite a few experiments to see how Ceph read performance can scale up. 
Here is the summary.



1.

First we tried to see how far a single-node cluster with one OSD can scale up. 
We started with the Cuttlefish release, with the entire OSD file system on an 
SSD. What we saw is that with 4K-sized objects and a single rados client on a 
dedicated 10G network, throughput can't go beyond a certain point.

We dug through the code and found out that SimpleMessenger opens a single 
socket connection (per client) to talk to the OSD. Also, we saw there is only 
one dispatch queue (Dispatch thread) per SimpleMessenger to carry these 
requests to the OSD. We started adding more dispatcher threads to the dispatch 
queue and rearranged several locks in Pipe.cc to identify the bottleneck. What 
we ended up discovering is that there are bottlenecks both upstream and 
downstream at the OSD level, and changing the locking scheme in the IO path 
will affect a lot of other code (that we don't even know about).

So, we stopped that activity and started working around the upstream bottleneck 
by introducing more clients to the single OSD. What we saw is that a single OSD 
does scale, but with a lot of CPU utilization. To produce ~40K iops (4K) it is 
taking almost 12 cores of CPU.

Another point: I didn't see this single OSD scale with the Dumpling release 
with multiple clients!! Something changed...



2.   After that, we set up a proper cluster with 3 high-performing nodes and 
30 OSDs in total. Here also, we are seeing that a single rados bench client as 
well as a single rbd client instance does not scale beyond a certain limit. It 
is not able to generate much load, as node cpu utilization remains very low. 
But running multiple client instances, the performance scales until it hits the 
cpu limit.



So, it is pretty clear we are not able to saturate anything with a single 
client, and that's why the 'noshare' option was very helpful for measuring the 
rbd performance benchmark. I have single-OSD/single-client callgrind data, but 
the attachment is not going through the list I guess, and that's why I can't 
send it to you.



Now, I am doing the benchmark for radosgw and I think I am stuck with a similar 
bottleneck here. Could you please confirm whether radosgw also opens a single 
client instance to the cluster?

If so, is there any option similar to 'noshare' in this case? Here also, when 
creating multiple radosgw instances on separate nodes, the performance is 
scaling.

BTW, is there a way to run multiple radosgw instances on a single node, or does 
it have to be one per node?





Thanks & Regards

Somnath







-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Tuesday, September 24, 2013 2:16 PM
To: Travis Rhoden
Cc: Josh Durgin; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module



On Tue, 24 Sep 2013, Travis Rhoden wrote:

> This "noshare" option may have just helped me a ton -- I sure wish I

> would have asked similar questions sooner, because I have seen the

> same failure to scale.  =)

>

> One question -- when using the "noshare" option (or really, even

> without it) are there any practical limits on the number of RBDs that

> can be mounted?  I have servers with ~100 RBDs on them each, and am

> wondering if I switch them all over to using "noshare" if anything is

> going to blow up, use a ton more memory, etc.  Even without noshare,

> are there any known limits to how many RBDs can be mapped?



With noshare each mapped image will appear as a separate client instance, which 
means it will have its own session with the monitors and its own TCP 
connections to the OSDs.  It may be a viable workaround for now but in general 
I would not recommend it.



I'm very curious what the scaling issue is with the shared client.  Do you have 
a working perf that can capture callgraph information on this machine?



sage



>

> Thanks!

>

>  - Travis

>

>

> On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy <somnath@sandisk.com>

> wrote:

>   Thanks Josh !

>   I am able to successfully add this noshare option in the image

>   mapping now. Looking at dmesg output, I found that was indeed

>   the secret key problem. Block performance is scaling now.

>

>   Regards

>   Somnath

>

>   -Original Message-

>   From: 
> ceph-devel-ow...@vger.kernel.org

>   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Josh

>   Durgin

>   Sent: Thursday, September 19, 2013 12:24 PM

>       To: Somnath Roy

>   Cc: S

Re: [ceph-users] Scaling RBD module

2013-09-24 Thread Travis Rhoden
On Tue, Sep 24, 2013 at 5:16 PM, Sage Weil  wrote:
> On Tue, 24 Sep 2013, Travis Rhoden wrote:
>> This "noshare" option may have just helped me a ton -- I sure wish I would
>> have asked similar questions sooner, because I have seen the same failure to
>> scale.  =)
>>
>> One question -- when using the "noshare" option (or really, even without it)
>> are there any practical limits on the number of RBDs that can be mounted?  I
>> have servers with ~100 RBDs on them each, and am wondering if I switch them
>> all over to using "noshare" if anything is going to blow up, use a ton more
>> memory, etc.  Even without noshare, are there any known limits to how many
>> RBDs can be mapped?
>
> With noshare each mapped image will appear as a separate client instance,
> which means it will have its own session with the monitors and its own TCP
> connections to the OSDs.  It may be a viable workaround for now but in
> general I would not recommend it.

Good to know.  We are still playing with CephFS as our ultimate
solution, but in the meantime this may indeed be a good workaround for
me.

>
> I'm very curious what the scaling issue is with the shared client.  Do you
> have a working perf that can capture callgraph information on this
> machine?

Not currently, but I could certainly work on it.  The issue that we
see is basically what the OP showed -- that there seems to be a finite
amount of bandwidth that I can read/write from a machine, regardless
of how many RBDs are involved.  i.e., if I can get 1GB/sec writes on
one RBD when everything else is idle, running the same test on two
RBDs in parallel *from the same machine* ends up with the sum of the
two at ~1GB/sec, split fairly evenly. However, if I do the same thing
and run the same test on two RBDs, each hosted on a separate machine,
I definitely see increased bandwidth.  Monitoring network traffic and
the Ceph OSD nodes seems to imply that they are not overloaded --
there is more bandwidth to be had, the clients just aren't able to
push the data fast enough.  That's why I'm hoping creating a "new"
client for each RBD will improve things.
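
For reference, the parallel test is nothing fancy -- just the same fio job 
pointed at two devices at once, roughly like this (the numbers are 
illustrative, not my exact job file):

fio --name=rbd1 --filename=/dev/rbd1 --ioengine=libaio --iodepth=32 \
    --rw=read --bs=4M --direct=1 --runtime=60 --time_based &
fio --name=rbd2 --filename=/dev/rbd2 --ioengine=libaio --iodepth=32 \
    --rw=read --bs=4M --direct=1 --runtime=60 --time_based &
wait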

I'm not going to enable this everywhere just yet; we will try it out
on a few RBDs first, and perhaps enable it on some RBDs that are
particularly heavily loaded.

I'll work on the perf capture!

Thanks for the feedback, as always.

 - Travis
>
> sage
>
>>
>> Thanks!
>>
>>  - Travis
>>
>>
>> On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy 
>> wrote:
>>   Thanks Josh !
>>   I am able to successfully add this noshare option in the image
>>   mapping now. Looking at dmesg output, I found that was indeed
>>   the secret key problem. Block performance is scaling now.
>>
>>   Regards
>>   Somnath
>>
>>   -Original Message-
>>   From: ceph-devel-ow...@vger.kernel.org
>>   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Josh
>>   Durgin
>>   Sent: Thursday, September 19, 2013 12:24 PM
>>   To: Somnath Roy
>>   Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray;
>>   ceph-users@lists.ceph.com
>>   Subject: Re: [ceph-users] Scaling RBD module
>>
>>   On 09/19/2013 12:04 PM, Somnath Roy wrote:
>>   > Hi Josh,
>>   > Thanks for the information. I am trying to add the following
>>   but hitting some permission issue.
>>   >
>>   > root@emsclient:/etc# echo
>>   :6789,:6789,:6789
>>   > name=admin,key=client.admin,noshare test_rbd ceph_block_test'
>>   >
>>   > /sys/bus/rbd/add
>>   > -bash: echo: write error: Operation not permitted
>>
>>   If you check dmesg, it will probably show an error trying to
>>   authenticate to the cluster.
>>
>>   Instead of key=client.admin, you can pass the base64 secret
>>   value as shown in 'ceph auth list' with the
>>   secret=X option.
>>
>>   BTW, there's a ticket for adding the noshare option to rbd map
>>   so using the sysfs interface like this is never necessary:
>>
>>   http://tracker.ceph.com/issues/6264
>>
>>   Josh
>>
>>   > Here is the contents of rbd directory..
>>   >
>>   > root@emsclient:/sys/bus/rbd# ll
>>   > total 0
>>   > drwxr-xr-x  4 root root0 Sep 19 11:59 ./
>>   > drwxr-xr-x 30 root root0 Sep 13 11:41 ../
>>   > --w---  1 root root 4096 Sep 19 11:59 add
>>   > drwxr-xr-x  2 root 

Re: [ceph-users] Scaling RBD module

2013-09-24 Thread Sage Weil
On Tue, 24 Sep 2013, Travis Rhoden wrote:
> This "noshare" option may have just helped me a ton -- I sure wish I would
> have asked similar questions sooner, because I have seen the same failure to
> scale.  =)
> 
> One question -- when using the "noshare" option (or really, even without it)
> are there any practical limits on the number of RBDs that can be mounted?  I
> have servers with ~100 RBDs on them each, and am wondering if I switch them
> all over to using "noshare" if anything is going to blow up, use a ton more
> memory, etc.  Even without noshare, are there any known limits to how many
> RBDs can be mapped?

With noshare each mapped image will appear as a separate client instance, 
which means it will have its own session with the monitors and its own TCP 
connections to the OSDs.  It may be a viable workaround for now but in 
general I would not recommend it.

I'm very curious what the scaling issue is with the shared client.  Do you 
have a working perf that can capture callgraph information on this 
machine?

sage

> 
> Thanks!
> 
>  - Travis
> 
> 
> On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy 
> wrote:
>   Thanks Josh !
>   I am able to successfully add this noshare option in the image
>   mapping now. Looking at dmesg output, I found that was indeed
>   the secret key problem. Block performance is scaling now.
> 
>   Regards
>   Somnath
> 
>   -Original Message-
>   From: ceph-devel-ow...@vger.kernel.org
>   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Josh
>   Durgin
>   Sent: Thursday, September 19, 2013 12:24 PM
>       To: Somnath Roy
>       Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray;
>   ceph-users@lists.ceph.com
>   Subject: Re: [ceph-users] Scaling RBD module
> 
>   On 09/19/2013 12:04 PM, Somnath Roy wrote:
>   > Hi Josh,
>   > Thanks for the information. I am trying to add the following
>   but hitting some permission issue.
>   >
>   > root@emsclient:/etc# echo
>   :6789,:6789,:6789
>   > name=admin,key=client.admin,noshare test_rbd ceph_block_test'
>   >
>   > /sys/bus/rbd/add
>   > -bash: echo: write error: Operation not permitted
> 
>   If you check dmesg, it will probably show an error trying to
>   authenticate to the cluster.
> 
>   Instead of key=client.admin, you can pass the base64 secret
>   value as shown in 'ceph auth list' with the
>   secret=X option.
> 
>   BTW, there's a ticket for adding the noshare option to rbd map
>   so using the sysfs interface like this is never necessary:
> 
>   http://tracker.ceph.com/issues/6264
> 
>   Josh
> 
>   > Here is the contents of rbd directory..
>   >
>   > root@emsclient:/sys/bus/rbd# ll
>   > total 0
>   > drwxr-xr-x  4 root root    0 Sep 19 11:59 ./
>   > drwxr-xr-x 30 root root    0 Sep 13 11:41 ../
>   > --w---  1 root root 4096 Sep 19 11:59 add
>   > drwxr-xr-x  2 root root    0 Sep 19 12:03 devices/
>   > drwxr-xr-x  2 root root    0 Sep 19 12:03 drivers/
>   > -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
>   > --w---  1 root root 4096 Sep 19 12:03 drivers_probe
>   > --w---  1 root root 4096 Sep 19 12:03 remove
>   > --w---  1 root root 4096 Sep 19 11:59 uevent
>   >
>   >
>   > I checked even if I am logged in as root , I can't write
>   anything on /sys.
>   >
>   > Here is the Ubuntu version I am using..
>   >
>   > root@emsclient:/etc# lsb_release -a
>   > No LSB modules are available.
>   > Distributor ID: Ubuntu
>   > Description:    Ubuntu 13.04
>   > Release:        13.04
>   > Codename:       raring
>   >
>   > Here is the mount information
>   >
>   > root@emsclient:/etc# mount
>   > /dev/mapper/emsclient--vg-root on / type ext4
>   (rw,errors=remount-ro)
>   > proc on /proc type proc (rw,noexec,nosuid,nodev) sysfs on /sys
>   type
>   > sysfs (rw,noexec,nosuid,nodev) none on /sys/fs/cgroup type
>   tmpfs (rw)
>   > none on /sys/fs/fuse/connections type fusectl (rw) none on
>   > /sys/kernel/debug type debugfs (rw) none on
>   /sys/kernel/security type
>   > securityfs (rw) udev on /dev type devtmpfs (rw,mode=0755)
>   devpts on
>   > /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
>   > tmpfs on /r

Re: [ceph-users] Scaling RBD module

2013-09-24 Thread Travis Rhoden
This "noshare" option may have just helped me a ton -- I sure wish I would
have asked similar questions sooner, because I have seen the same failure
to scale.  =)

One question -- when using the "noshare" option (or really, even without
it) are there any practical limits on the number of RBDs that can be
mounted?  I have servers with ~100 RBDs on them each, and am wondering if I
switch them all over to using "noshare" if anything is going to blow up,
use a ton more memory, etc.  Even without noshare, are there any known
limits to how many RBDs can be mapped?

Thanks!

 - Travis


On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy wrote:

> Thanks Josh !
> I am able to successfully add this noshare option in the image mapping
> now. Looking at dmesg output, I found that was indeed the secret key
> problem. Block performance is scaling now.
>
> Regards
> Somnath
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:
> ceph-devel-ow...@vger.kernel.org] On Behalf Of Josh Durgin
> Sent: Thursday, September 19, 2013 12:24 PM
> To: Somnath Roy
> Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray;
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Scaling RBD module
>
> On 09/19/2013 12:04 PM, Somnath Roy wrote:
> > Hi Josh,
> > Thanks for the information. I am trying to add the following but hitting
> some permission issue.
> >
> > root@emsclient:/etc# echo :6789,:6789,:6789
> > name=admin,key=client.admin,noshare test_rbd ceph_block_test' >
> > /sys/bus/rbd/add
> > -bash: echo: write error: Operation not permitted
>
> If you check dmesg, it will probably show an error trying to authenticate
> to the cluster.
>
> Instead of key=client.admin, you can pass the base64 secret value as shown
> in 'ceph auth list' with the secret=X option.
>
> BTW, there's a ticket for adding the noshare option to rbd map so using
> the sysfs interface like this is never necessary:
>
> http://tracker.ceph.com/issues/6264
>
> Josh
>
> > Here is the contents of rbd directory..
> >
> > root@emsclient:/sys/bus/rbd# ll
> > total 0
> > drwxr-xr-x  4 root root0 Sep 19 11:59 ./
> > drwxr-xr-x 30 root root0 Sep 13 11:41 ../
> > --w---  1 root root 4096 Sep 19 11:59 add
> > drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
> > drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
> > -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
> > --w---  1 root root 4096 Sep 19 12:03 drivers_probe
> > --w---  1 root root 4096 Sep 19 12:03 remove
> > --w---  1 root root 4096 Sep 19 11:59 uevent
> >
> >
> > I checked even if I am logged in as root , I can't write anything on
> /sys.
> >
> > Here is the Ubuntu version I am using..
> >
> > root@emsclient:/etc# lsb_release -a
> > No LSB modules are available.
> > Distributor ID: Ubuntu
> > Description:Ubuntu 13.04
> > Release:13.04
> > Codename:   raring
> >
> > Here is the mount information
> >
> > root@emsclient:/etc# mount
> > /dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
> > proc on /proc type proc (rw,noexec,nosuid,nodev) sysfs on /sys type
> > sysfs (rw,noexec,nosuid,nodev) none on /sys/fs/cgroup type tmpfs (rw)
> > none on /sys/fs/fuse/connections type fusectl (rw) none on
> > /sys/kernel/debug type debugfs (rw) none on /sys/kernel/security type
> > securityfs (rw) udev on /dev type devtmpfs (rw,mode=0755) devpts on
> > /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
> > tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
> > none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
> > none on /run/shm type tmpfs (rw,nosuid,nodev) none on /run/user type
> > tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
> > /dev/sda1 on /boot type ext2 (rw)
> > /dev/mapper/emsclient--vg-home on /home type ext4 (rw)
> >
> >
> > Any idea what went wrong here ?
> >
> > Thanks & Regards
> > Somnath
> >
> > -Original Message-
> > From: Josh Durgin [mailto:josh.dur...@inktank.com]
> > Sent: Wednesday, September 18, 2013 6:10 PM
> > To: Somnath Roy
> > Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray;
> > ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Scaling RBD module
> >
> > On 09/17/2013 03:30 PM, Somnath Roy wrote:
> >> Hi,
> >> I am running Ceph on a 3 node cluster and each of my server node is
> running 10 OSDs, one for each disk. I have one admin node and all the nod

Re: [ceph-users] Scaling RBD module

2013-09-19 Thread Somnath Roy
Thanks Josh !
I am able to successfully add this noshare option in the image mapping now. 
Looking at dmesg output, I found that was indeed the secret key problem. Block 
performance is scaling now.

Regards
Somnath

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Josh Durgin
Sent: Thursday, September 19, 2013 12:24 PM
To: Somnath Roy
Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module

On 09/19/2013 12:04 PM, Somnath Roy wrote:
> Hi Josh,
> Thanks for the information. I am trying to add the following but hitting some 
> permission issue.
>
> root@emsclient:/etc# echo :6789,:6789,:6789 
> name=admin,key=client.admin,noshare test_rbd ceph_block_test' > 
> /sys/bus/rbd/add
> -bash: echo: write error: Operation not permitted

If you check dmesg, it will probably show an error trying to authenticate to 
the cluster.

Instead of key=client.admin, you can pass the base64 secret value as shown in 
'ceph auth list' with the secret=X option.

BTW, there's a ticket for adding the noshare option to rbd map so using the 
sysfs interface like this is never necessary:

http://tracker.ceph.com/issues/6264

Josh

> Here is the contents of rbd directory..
>
> root@emsclient:/sys/bus/rbd# ll
> total 0
> drwxr-xr-x  4 root root0 Sep 19 11:59 ./
> drwxr-xr-x 30 root root0 Sep 13 11:41 ../
> --w---  1 root root 4096 Sep 19 11:59 add
> drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
> drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
> -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
> --w---  1 root root 4096 Sep 19 12:03 drivers_probe
> --w---  1 root root 4096 Sep 19 12:03 remove
> --w---  1 root root 4096 Sep 19 11:59 uevent
>
>
> I checked even if I am logged in as root , I can't write anything on /sys.
>
> Here is the Ubuntu version I am using..
>
> root@emsclient:/etc# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 13.04
> Release:13.04
> Codename:   raring
>
> Here is the mount information
>
> root@emsclient:/etc# mount
> /dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro) 
> proc on /proc type proc (rw,noexec,nosuid,nodev) sysfs on /sys type 
> sysfs (rw,noexec,nosuid,nodev) none on /sys/fs/cgroup type tmpfs (rw) 
> none on /sys/fs/fuse/connections type fusectl (rw) none on 
> /sys/kernel/debug type debugfs (rw) none on /sys/kernel/security type 
> securityfs (rw) udev on /dev type devtmpfs (rw,mode=0755) devpts on 
> /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
> tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
> none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
> none on /run/shm type tmpfs (rw,nosuid,nodev) none on /run/user type 
> tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
> /dev/sda1 on /boot type ext2 (rw)
> /dev/mapper/emsclient--vg-home on /home type ext4 (rw)
>
>
> Any idea what went wrong here ?
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Josh Durgin [mailto:josh.dur...@inktank.com]
> Sent: Wednesday, September 18, 2013 6:10 PM
> To: Somnath Roy
> Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Scaling RBD module
>
> On 09/17/2013 03:30 PM, Somnath Roy wrote:
>> Hi,
>> I am running Ceph on a 3 node cluster and each of my server node is running 
>> 10 OSDs, one for each disk. I have one admin node and all the nodes are 
>> connected with 2 X 10G network. One network is for cluster and other one 
>> configured as public network.
>>
>> Here is the status of my cluster.
>>
>> ~/fio_test# ceph -s
>>
>> cluster b2e0b4db-6342-490e-9c28-0aadf0188023
>>  health HEALTH_WARN clock skew detected on mon. , mon. 
>> 
>>  monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
>> =xxx.xxx.xxx.xxx:6789/0, 
>> =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 
>> ,,
>>  osdmap e391: 30 osds: 30 up, 30 in
>>   pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB 
>> used, 11145 GB / 11172 GB avail
>>  mdsmap e1: 0/0/1 up
>>
>>
>> I started with rados bench command to benchmark the read performance of this 
>> Cluster on a large pool (~10K PGs) and found that each rados client has a 
>> limitation. Each client can only drive up to a certain mark. Each server  
>> node cpu utilization shows it is  around 85-90% idle and the admin node 
>> (from where rados client is r

Re: [ceph-users] Scaling RBD module

2013-09-19 Thread Somnath Roy
Hi Josh,
Thanks for the information. I am trying to add the following but hitting some 
permission issue.

root@emsclient:/etc# echo :6789,:6789,:6789 
name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
-bash: echo: write error: Operation not permitted

Here is the contents of rbd directory..

root@emsclient:/sys/bus/rbd# ll
total 0
drwxr-xr-x  4 root root0 Sep 19 11:59 ./
drwxr-xr-x 30 root root0 Sep 13 11:41 ../
--w---  1 root root 4096 Sep 19 11:59 add
drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
-rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
--w---  1 root root 4096 Sep 19 12:03 drivers_probe
--w---  1 root root 4096 Sep 19 12:03 remove
--w---  1 root root 4096 Sep 19 11:59 uevent


I checked even if I am logged in as root , I can't write anything on /sys.

Here is the Ubuntu version I am using..

root@emsclient:/etc# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 13.04
Release:13.04
Codename:   raring

Here is the mount information

root@emsclient:/etc# mount
/dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/emsclient--vg-home on /home type ext4 (rw)


Any idea what went wrong here ?

Thanks & Regards
Somnath

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, September 18, 2013 6:10 PM
To: Somnath Roy
Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module

On 09/17/2013 03:30 PM, Somnath Roy wrote:
> Hi,
> I am running Ceph on a 3 node cluster and each of my server node is running 
> 10 OSDs, one for each disk. I have one admin node and all the nodes are 
> connected with 2 X 10G network. One network is for cluster and other one 
> configured as public network.
>
> Here is the status of my cluster.
>
> ~/fio_test# ceph -s
>
>cluster b2e0b4db-6342-490e-9c28-0aadf0188023
> health HEALTH_WARN clock skew detected on mon. , mon. 
> 
> monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
> =xxx.xxx.xxx.xxx:6789/0, 
> =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 
> ,,
> osdmap e391: 30 osds: 30 up, 30 in
>  pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 
> 11145 GB / 11172 GB avail
> mdsmap e1: 0/0/1 up
>
>
> I started with rados bench command to benchmark the read performance of this 
> Cluster on a large pool (~10K PGs) and found that each rados client has a 
> limitation. Each client can only drive up to a certain mark. Each server  
> node cpu utilization shows it is  around 85-90% idle and the admin node (from 
> where rados client is running) is around ~80-85% idle. I am trying with 4K 
> object size.

Note that rados bench with 4k objects is different from rbd with 4k-sized I/Os 
- rados bench sends each request to a new object, while rbd objects are 4M by 
default.

> Now, I started running more clients on the admin node and the performance is 
> scaling till it hits the client cpu limit. Server still has the cpu of 30-35% 
> idle. With small object size I must say that the ceph per osd cpu utilization 
> is not promising!
>
> After this, I started testing the rados block interface with kernel rbd 
> module from my admin node.
> I have created 8 images mapped on the pool having around 10K PGs and I am not 
> able to scale up the performance by running fio (either by creating a 
> software raid or running on individual /dev/rbd* instances). For example, 
> running multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2)  
> the performance I am getting is half of what I am getting if running one 
> instance. Here is my fio job script.
>
> [random-reads]
> ioengine=libaio
> iodepth=32
> filename=/dev/rbd1
> rw=randread
> bs=4k
> direct=1
> size=2G
> numjobs=64
>
> Let me know if I am following the proper procedure or not.
>
> But, If my understanding is correct, kernel rbd module is acting as a client 
> to the cluster and in one admin node I can run only

Re: [ceph-users] Scaling RBD module

2013-09-19 Thread Josh Durgin

On 09/19/2013 12:04 PM, Somnath Roy wrote:

Hi Josh,
Thanks for the information. I am trying to add the following but hitting some 
permission issue.

root@emsclient:/etc# echo :6789,:6789,:6789 
name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
-bash: echo: write error: Operation not permitted


If you check dmesg, it will probably show an error trying to
authenticate to the cluster.

Instead of key=client.admin, you can pass the base64 secret value as
shown in 'ceph auth list' with the secret=X option.
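
For example (the monitor address, pool, and image name are placeholders, and 
the secret is whatever base64 string 'ceph auth list' shows for that user):

echo '1.2.3.4:6789 name=admin,secret=AQBexampleBase64Key==,noshare poolname 
imagename' > /sys/bus/rbd/add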

BTW, there's a ticket for adding the noshare option to rbd map so using
the sysfs interface like this is never necessary:

http://tracker.ceph.com/issues/6264

Josh


Here is the contents of rbd directory..

root@emsclient:/sys/bus/rbd# ll
total 0
drwxr-xr-x  4 root root0 Sep 19 11:59 ./
drwxr-xr-x 30 root root0 Sep 13 11:41 ../
--w---  1 root root 4096 Sep 19 11:59 add
drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
-rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
--w---  1 root root 4096 Sep 19 12:03 drivers_probe
--w---  1 root root 4096 Sep 19 12:03 remove
--w---  1 root root 4096 Sep 19 11:59 uevent


I checked even if I am logged in as root , I can't write anything on /sys.

Here is the Ubuntu version I am using..

root@emsclient:/etc# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 13.04
Release:13.04
Codename:   raring

Here is the mount information

root@emsclient:/etc# mount
/dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/emsclient--vg-home on /home type ext4 (rw)


Any idea what went wrong here ?

Thanks & Regards
Somnath

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, September 18, 2013 6:10 PM
To: Somnath Roy
Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module

On 09/17/2013 03:30 PM, Somnath Roy wrote:

Hi,
I am running Ceph on a 3 node cluster and each of my server node is running 10 
OSDs, one for each disk. I have one admin node and all the nodes are connected 
with 2 X 10G network. One network is for cluster and other one configured as 
public network.

Here is the status of my cluster.

~/fio_test# ceph -s

cluster b2e0b4db-6342-490e-9c28-0aadf0188023
 health HEALTH_WARN clock skew detected on mon. , mon. 

 monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
=xxx.xxx.xxx.xxx:6789/0, =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, 
quorum 0,1,2 ,,
 osdmap e391: 30 osds: 30 up, 30 in
  pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 
11145 GB / 11172 GB avail
 mdsmap e1: 0/0/1 up


I started with rados bench command to benchmark the read performance of this 
Cluster on a large pool (~10K PGs) and found that each rados client has a 
limitation. Each client can only drive up to a certain mark. Each server  node 
cpu utilization shows it is  around 85-90% idle and the admin node (from where 
rados client is running) is around ~80-85% idle. I am trying with 4K object 
size.


Note that rados bench with 4k objects is different from rbd with 4k-sized I/Os 
- rados bench sends each request to a new object, while rbd objects are 4M by 
default.


Now, I started running more clients on the admin node and the performance is 
scaling till it hits the client cpu limit. Server still has the cpu of 30-35% 
idle. With small object size I must say that the ceph per osd cpu utilization 
is not promising!

After this, I started testing the rados block interface with kernel rbd module 
from my admin node.
I have created 8 images mapped on the pool having around 10K PGs and I am not 
able to scale up the performance by running fio (either by creating a software 
raid or running on individual /dev/rbd* instances). For example, running 
multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2)  the 
performance I am getting is half of what I am getting if running one instance. 
Here is my fio job script.

[random-reads]
ioengine=libaio
iodepth=32
filename=/dev/rbd1
rw=randread
bs=4k
direct=1
size=2G

Re: [ceph-users] Scaling RBD module

2013-09-18 Thread Josh Durgin

On 09/17/2013 03:30 PM, Somnath Roy wrote:

Hi,
I am running Ceph on a 3 node cluster and each of my server node is running 10 
OSDs, one for each disk. I have one admin node and all the nodes are connected 
with 2 X 10G network. One network is for cluster and other one configured as 
public network.

Here is the status of my cluster.

~/fio_test# ceph -s

   cluster b2e0b4db-6342-490e-9c28-0aadf0188023
health HEALTH_WARN clock skew detected on mon. , mon. 

monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
=xxx.xxx.xxx.xxx:6789/0, =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, 
quorum 0,1,2 ,,
osdmap e391: 30 osds: 30 up, 30 in
 pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 
11145 GB / 11172 GB avail
mdsmap e1: 0/0/1 up


I started with rados bench command to benchmark the read performance of this 
Cluster on a large pool (~10K PGs) and found that each rados client has a 
limitation. Each client can only drive up to a certain mark. Each server  node 
cpu utilization shows it is  around 85-90% idle and the admin node (from where 
rados client is running) is around ~80-85% idle. I am trying with 4K object 
size.


Note that rados bench with 4k objects is different from rbd with
4k-sized I/Os - rados bench sends each request to a new object,
while rbd objects are 4M by default.
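
You can check a given image's object size with 'rbd info' and look at the 
'order' line (the image name here is just an example):

rbd info testimage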


Now, I started running more clients on the admin node and the performance is 
scaling till it hits the client cpu limit. Server still has the cpu of 30-35% 
idle. With small object size I must say that the ceph per osd cpu utilization 
is not promising!

After this, I started testing the rados block interface with kernel rbd module 
from my admin node.
I have created 8 images mapped on the pool having around 10K PGs and I am not 
able to scale up the performance by running fio (either by creating a software 
raid or running on individual /dev/rbd* instances). For example, running 
multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2)  the 
performance I am getting is half of what I am getting if running one instance. 
Here is my fio job script.

[random-reads]
ioengine=libaio
iodepth=32
filename=/dev/rbd1
rw=randread
bs=4k
direct=1
size=2G
numjobs=64

Let me know if I am following the proper procedure or not.

But, if my understanding is correct, the kernel rbd module is acting as a 
client to the cluster, and on one admin node I can run only one such kernel 
instance.
If so, I am then limited to the client bottleneck that I stated earlier. The 
cpu utilization on the server side is around 85-90% idle, so it is clear that 
the client is not driving enough load.

My question is: is there any way to hit the cluster with more clients from a 
single box while testing the rbd module?


You can run multiple librbd instances easily (for example with
multiple runs of the rbd bench-write command).
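
For example, two runs in parallel against different images, something like this 
(pool and image names are placeholders; check 'rbd --help' for the exact 
bench-write option names in your version):

rbd -p testpool bench-write image1 --io-size 4096 --io-threads 16 &
rbd -p testpool bench-write image2 --io-size 4096 --io-threads 16 &
wait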

The kernel rbd driver uses the same rados client instance for multiple
block devices by default. There's an option (noshare) to use a new
rados client instance for a newly mapped device, but it's not exposed
by the rbd cli. You need to use the sysfs interface that 'rbd map' uses
instead.

Once you've used rbd map once on a machine, the kernel will already
have the auth key stored, and you can use:

echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname 
imagename' > /sys/bus/rbd/add


Where 1.2.3.4:6789 is the address of a monitor, and you're connecting
as client.admin.

You can use 'rbd unmap' as usual.
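
For example, to see what is currently mapped and tear one mapping down (the 
device name is just an example):

rbd showmapped
rbd unmap /dev/rbd1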

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com