Re: speedup ceph / scaling / find the bottleneck

2012-07-09 Thread Stefan Priebe

On 06.07.2012 20:17, Gregory Farnum wrote:

On 06.07.2012 at 19:11, Gregory Farnum wrote:

I'm interested in figuring out why we aren't getting useful data out
of the admin socket, and for that I need the actual configuration
files. It wouldn't surprise me if there are several layers to this
issue but I'd like to start at the client's endpoint. :)


While I'm on holiday I can't send you my ceph.conf, but it doesn't contain anything other than the daemon locations, 'journal dio = false' for the tmpfs journal, and the admin socket /var/run/ceph_$name.sock.


Is that socket in the global area?

Yes

> Does the KVM process have permission to access that directory?

Yes. It is also created if I skip $name and set it to /var/run/ceph.sock.
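
For readers following along, a minimal sketch of what such a ceph.conf might look like (section layout and exact option spellings are my assumptions, not Stefan's actual file):

[global]
        # admin socket in the global section, as discussed above
        admin socket = /var/run/ceph_$name.sock

[osd]
        # journal on tmpfs, which has no O_DIRECT support
        journal dio = false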


Regarding the random IO, you shouldn't overestimate your storage.
Under plenty of scenarios your drives are lucky to do more than 2k
IO/s, which is about what you're seeing
http://techreport.com/articles.x/22415/9

That's true if the ceph workload is the same as the iometer file server workload; I don't know whether it is. I've measured the raw random 4k workload directly. I've also tested adding another OSD and the speed still doesn't change, although with a test size of 200GB I should be hitting several OSD servers.

Okay — just wanted to point it out.


Thanks. Also, with sheepdog I can get 40,000 IOPS.

Stefan



Re: speedup ceph / scaling / find the bottleneck

2012-07-06 Thread Gregory Farnum
On Fri, Jul 6, 2012 at 11:09 AM, Stefan Priebe - Profihost AG
 wrote:
> Am 06.07.2012 um 19:11 schrieb Gregory Farnum :
>
>> On Thu, Jul 5, 2012 at 8:50 PM, Alexandre DERUMIER  
>> wrote:
>>> Hi,
>>> Stefan is on vacation for the moment,I don't know if he can reply you.
>>>
>>> But I can reoly for him for the kvm part (as we do same tests together in 
>>> parallel).
>>>
>>> - kvm is 1.1
>>> - rbd 0.48
>>> - drive option 
>>> rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
>>> -using writeback
>>>
>>> writeback tuning in ceph.conf on the kvm host
>>>
>>> rbd_cache_size = 33554432
>>> rbd_cache_max_age = 2.0
>>>
>>> benchmark use in kvm guest:
>>> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G 
>>> --numjobs=50 --runtime=90 --group_reporting --name=file1
>>>
>>> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
>>> so it doesn't scale
>>>
>>> (bench is with directio, so maybe writeback cache don't help)
>>>
>>> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 
>>> 4io/s randwrite locally)
>>
>> I'm interested in figuring out why we aren't getting useful data out
>> of the admin socket, and for that I need the actual configuration
>> files. It wouldn't surprise me if there are several layers to this
>> issue but I'd like to start at the client's endpoint. :)
>
> While I'm on holiday I can't send you my ceph.conf but it doesn't contain 
> anything else than the locations and journal dio false for tmpfs and 
> /var/run/ceph_$name.sock

Is that socket in the global area? Does the KVM process have
permission to access that directory? If you enable logging can you get
any outputs that reference errors opening that file? (I realize you're
on holiday; these are just the questions we'll need answered to get it
working.)
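
For anyone reproducing this, a few illustrative checks along those lines (paths and the process name are assumptions):

# does the KVM process have rights on the socket path?
ls -l /var/run/ceph_*.sock
ps -o user,pid,cmd -C kvm        # adjust the process name if qemu runs under another one

# client-side logging can be enabled in the [client] section of ceph.conf, e.g.:
#   log file = /var/log/ceph/qemu-client.log
#   debug rbd = 20
# then restart the guest and grep the log for errors about opening the admin socket.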

>
>>
>> Regarding the random IO, you shouldn't overestimate your storage.
>> Under plenty of scenarios your drives are lucky to do more than 2k
>> IO/s, which is about what you're seeing
>> http://techreport.com/articles.x/22415/9
> You're fine if the ceph workload is the same as the iometer file server 
> workload. I don't know. I've measured the raw random 4k workload. Also I've 
> tested adding another osd and speed still doesn't change but with a size of 
> 200gb I should hit several osd servers.
Okay — just wanted to point it out.


Re: speedup ceph / scaling / find the bottleneck

2012-07-06 Thread Stefan Priebe - Profihost AG
On 06.07.2012 at 19:11, Gregory Farnum wrote:

> On Thu, Jul 5, 2012 at 8:50 PM, Alexandre DERUMIER  
> wrote:
>> Hi,
>> Stefan is on vacation for the moment,I don't know if he can reply you.
>> 
>> But I can reoly for him for the kvm part (as we do same tests together in 
>> parallel).
>> 
>> - kvm is 1.1
>> - rbd 0.48
>> - drive option 
>> rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
>> -using writeback
>> 
>> writeback tuning in ceph.conf on the kvm host
>> 
>> rbd_cache_size = 33554432
>> rbd_cache_max_age = 2.0
>> 
>> benchmark use in kvm guest:
>> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G 
>> --numjobs=50 --runtime=90 --group_reporting --name=file1
>> 
>> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
>> so it doesn't scale
>> 
>> (bench is with directio, so maybe writeback cache don't help)
>> 
>> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 
>> 4io/s randwrite locally)
> 
> I'm interested in figuring out why we aren't getting useful data out
> of the admin socket, and for that I need the actual configuration
> files. It wouldn't surprise me if there are several layers to this
> issue but I'd like to start at the client's endpoint. :)

While I'm on holiday I can't send you my ceph.conf, but it doesn't contain anything other than the daemon locations, 'journal dio = false' for the tmpfs journal, and the admin socket /var/run/ceph_$name.sock.

> 
> Regarding the random IO, you shouldn't overestimate your storage.
> Under plenty of scenarios your drives are lucky to do more than 2k
> IO/s, which is about what you're seeing
> http://techreport.com/articles.x/22415/9
That's true if the ceph workload is the same as the iometer file server workload; I don't know whether it is. I've measured the raw random 4k workload directly. I've also tested adding another OSD and the speed still doesn't change, although with a test size of 200GB I should be hitting several OSD servers.
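
For context, a raw-device measurement of that kind might look roughly like the following fio invocation (illustrative only, not the exact command used, and it destroys the data on the target device):

fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=4 --runtime=60 \
    --group_reporting --name=raw4k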

Stefan


> -Greg


Re: speedup ceph / scaling / find the bottleneck

2012-07-06 Thread Gregory Farnum
On Thu, Jul 5, 2012 at 8:50 PM, Alexandre DERUMIER  wrote:
> Hi,
> Stefan is on vacation for the moment,I don't know if he can reply you.
>
> But I can reoly for him for the kvm part (as we do same tests together in 
> parallel).
>
> - kvm is 1.1
> - rbd 0.48
> - drive option 
> rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
> -using writeback
>
> writeback tuning in ceph.conf on the kvm host
>
> rbd_cache_size = 33554432
> rbd_cache_max_age = 2.0
>
> benchmark use in kvm guest:
> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G 
> --numjobs=50 --runtime=90 --group_reporting --name=file1
>
> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
> so it doesn't scale
>
> (bench is with directio, so maybe writeback cache don't help)
>
> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 
> 4io/s randwrite locally)

I'm interested in figuring out why we aren't getting useful data out
of the admin socket, and for that I need the actual configuration
files. It wouldn't surprise me if there are several layers to this
issue but I'd like to start at the client's endpoint. :)

Regarding the random IO, you shouldn't overestimate your storage.
Under plenty of scenarios your drives are lucky to do more than 2k
IO/s, which is about what you're seeing
http://techreport.com/articles.x/22415/9
-Greg


Re: speedup ceph / scaling / find the bottleneck

2012-07-06 Thread Stefan Priebe
On 06.07.2012 at 05:50, Alexandre DERUMIER wrote:

> Hi, 
> Stefan is on vacation for the moment,I don't know if he can reply you.
Thanks!

> 
> But I can reoly for him for the kvm part (as we do same tests together in 
> parallel).
> 
> - kvm is 1.1
> - rbd 0.48
> - drive option 
> rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X";
> -using writeback
> 
> writeback tuning in ceph.conf on the kvm host
> 
> rbd_cache_size = 33554432 
> rbd_cache_max_age = 2.0 
Correct

> 
> benchmark use in kvm guest:
> fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G 
> --numjobs=50 --runtime=90 --group_reporting --name=file1
> 
> results show max 14000io/s with 1 vm, 7000io/s by vm with 2vm,...
> so it doesn't scale
Correct too

> 
> (bench is with directio, so maybe writeback cache don't help)
> 
> hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 
> 4io/s randwrite locally)
3 SSDs per node (not 4), but still enough

Stefan

> - Alexandre
> 
> - Mail original - 
> 
> De: "Gregory Farnum"  
> À: "Stefan Priebe"  
> Cc: ceph-devel@vger.kernel.org, "Sage Weil"  
> Envoyé: Jeudi 5 Juillet 2012 23:33:18 
> Objet: Re: speedup ceph / scaling / find the bottleneck 
> 
> Could you send over the ceph.conf on your KVM host, as well as how 
> you're configuring KVM to use rbd? 
> 
> On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe  wrote: 
>> I'm sorry but this is the KVM Host Machine there is no ceph running on this 
>> machine. 
>> 
>> If i change the admin socket to: 
>> admin_socket=/var/run/ceph_$name.sock 
>> 
>> i don't have any socket at all ;-( 
>> 
>> Am 03.07.2012 17:31, schrieb Sage Weil: 
>> 
>>> On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote: 
>>>> 
>>>> Hello, 
>>>> 
>>>> Am 02.07.2012 22:30, schrieb Josh Durgin: 
>>>>> 
>>>>> If you add admin_socket=/path/to/admin_socket for your client running 
>>>>> qemu (in that client's ceph.conf section or manually in the qemu 
>>>>> command line) you can check that caching is enabled: 
>>>>> 
>>>>> ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache 
>>>>> 
>>>>> And see statistics it generates (look for cache) with: 
>>>>> 
>>>>> ceph --admin-daemon /path/to/admin_socket perfcounters_dump 
>>>> 
>>>> 
>>>> This doesn't work for me: 
>>>> ceph --admin-daemon /var/run/ceph.sock show config 
>>>> read only got 0 bytes of 4 expected for response length; invalid 
>>>> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) 
>>>> AdminSocket: 
>>>> request 'show config' not defined 
>>> 
>>> 
>>> Oh, it's 'config show'. Also, 'help' will list the supported commands. 
>>> 
>>>> Also perfcounters does not show anything: 
>>>> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump 
>>>> {} 
>>> 
>>> 
>>> There may be another daemon that tried to attach to the same socket file. 
>>> You might want to set 'admin socket = /var/run/ceph/$name.sock' or 
>>> something similar, or whatever else is necessary to make it a unique file. 
>>> 
>>>> ~]# ceph -v 
>>>> ceph version 0.48argonaut-2-gb576faa 
>>>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68) 
>>> 
>>> 
>>> Out of curiousity, what patches are you applying on top of the release? 
>>> 
>>> sage 
>>> 
>> 
> 
> 
> 
> --
> Alexandre Derumier
> Systems and Network Engineer
> Phone: 03 20 68 88 85
> Fax: 03 20 68 90 88
> 45 Bvd du Général Leclerc 59100 Roubaix
> 12 rue Marivaux 75002 Paris


Re: speedup ceph / scaling / find the bottleneck

2012-07-05 Thread Alexandre DERUMIER
Hi, 
Stefan is on vacation at the moment; I don't know if he can reply to you.

But I can reply for him on the KVM part (we run the same tests together in parallel).

- kvm is 1.1
- rbd 0.48
- drive option:
  rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X
- using writeback
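
For reference, a qemu -drive line of that shape might look like this (illustrative; pool, image and monitor address are placeholders, and the whole argument is quoted because of the semicolons). The "using writeback" above presumably corresponds to cache=writeback here:

-drive 'file=rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X,if=virtio,cache=writeback'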

Writeback tuning in ceph.conf on the KVM host:

rbd_cache_size = 33554432
rbd_cache_max_age = 2.0

Benchmark used in the KVM guest:
fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1

Results show a maximum of 14,000 IO/s with 1 VM, and 7,000 IO/s per VM with 2 VMs,
so it doesn't scale.

(The benchmark uses direct I/O, so maybe the writeback cache doesn't help.)

Hardware for ceph is 3 nodes with 4 Intel SSDs each. (1 drive can handle
4io/s randwrite locally.)


- Alexandre

- Original message -

From: "Gregory Farnum"
To: "Stefan Priebe"
Cc: ceph-devel@vger.kernel.org, "Sage Weil"
Sent: Thursday, 5 July 2012 23:33:18
Subject: Re: speedup ceph / scaling / find the bottleneck

Could you send over the ceph.conf on your KVM host, as well as how 
you're configuring KVM to use rbd? 

On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe  wrote: 
> I'm sorry but this is the KVM Host Machine there is no ceph running on this 
> machine. 
> 
> If i change the admin socket to: 
> admin_socket=/var/run/ceph_$name.sock 
> 
> i don't have any socket at all ;-( 
> 
> Am 03.07.2012 17:31, schrieb Sage Weil: 
> 
>> On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote: 
>>> 
>>> Hello, 
>>> 
>>> Am 02.07.2012 22:30, schrieb Josh Durgin: 
>>>> 
>>>> If you add admin_socket=/path/to/admin_socket for your client running 
>>>> qemu (in that client's ceph.conf section or manually in the qemu 
>>>> command line) you can check that caching is enabled: 
>>>> 
>>>> ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache 
>>>> 
>>>> And see statistics it generates (look for cache) with: 
>>>> 
>>>> ceph --admin-daemon /path/to/admin_socket perfcounters_dump 
>>> 
>>> 
>>> This doesn't work for me: 
>>> ceph --admin-daemon /var/run/ceph.sock show config 
>>> read only got 0 bytes of 4 expected for response length; invalid 
>>> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) 
>>> AdminSocket: 
>>> request 'show config' not defined 
>> 
>> 
>> Oh, it's 'config show'. Also, 'help' will list the supported commands. 
>> 
>>> Also perfcounters does not show anything: 
>>> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump 
>>> {} 
>> 
>> 
>> There may be another daemon that tried to attach to the same socket file. 
>> You might want to set 'admin socket = /var/run/ceph/$name.sock' or 
>> something similar, or whatever else is necessary to make it a unique file. 
>> 
>>> ~]# ceph -v 
>>> ceph version 0.48argonaut-2-gb576faa 
>>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68) 
>> 
>> 
>> Out of curiousity, what patches are you applying on top of the release? 
>> 
>> sage 
>> 
> 



--
Alexandre Derumier
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris


Re: speedup ceph / scaling / find the bottleneck

2012-07-05 Thread Gregory Farnum
Could you send over the ceph.conf on your KVM host, as well as how
you're configuring KVM to use rbd?

On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe  wrote:
> I'm sorry but this is the KVM Host Machine there is no ceph running on this
> machine.
>
> If i change the admin socket to:
> admin_socket=/var/run/ceph_$name.sock
>
> i don't have any socket at all ;-(
>
> Am 03.07.2012 17:31, schrieb Sage Weil:
>
>> On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:
>>>
>>> Hello,
>>>
>>> Am 02.07.2012 22:30, schrieb Josh Durgin:

 If you add admin_socket=/path/to/admin_socket for your client running
 qemu (in that client's ceph.conf section or manually in the qemu
 command line) you can check that caching is enabled:

 ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache

 And see statistics it generates (look for cache) with:

 ceph --admin-daemon /path/to/admin_socket perfcounters_dump
>>>
>>>
>>> This doesn't work for me:
>>> ceph --admin-daemon /var/run/ceph.sock show config
>>> read only got 0 bytes of 4 expected for response length; invalid
>>> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0)
>>> AdminSocket:
>>> request 'show config' not defined
>>
>>
>> Oh, it's 'config show'.  Also, 'help' will list the supported commands.
>>
>>> Also perfcounters does not show anything:
>>> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
>>> {}
>>
>>
>> There may be another daemon that tried to attach to the same socket file.
>> You might want to set 'admin socket = /var/run/ceph/$name.sock' or
>> something similar, or whatever else is necessary to make it a unique file.
>>
>>> ~]# ceph -v
>>> ceph version 0.48argonaut-2-gb576faa
>>> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)
>>
>>
>> Out of curiousity, what patches are you applying on top of the release?
>>
>> sage
>>
>


Re: speedup ceph / scaling / find the bottleneck

2012-07-03 Thread Stefan Priebe

On 03.07.2012 17:31, Sage Weil wrote:

~]# ceph -v
ceph version 0.48argonaut-2-gb576faa
(commit:b576faa6f24356f4d3ec7205e298d58659e29c68)


Out of curiousity, what patches are you applying on top of the release?

just wip-filestore-min

Stefan


Re: speedup ceph / scaling / find the bottleneck

2012-07-03 Thread Stefan Priebe
I'm sorry, but this is the KVM host machine; there is no ceph daemon running on it.

If I change the admin socket to:
admin_socket=/var/run/ceph_$name.sock

I don't get any socket at all ;-(

On 03.07.2012 17:31, Sage Weil wrote:

On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:

Hello,

On 02.07.2012 22:30, Josh Durgin wrote:

If you add admin_socket=/path/to/admin_socket for your client running
qemu (in that client's ceph.conf section or manually in the qemu
command line) you can check that caching is enabled:

ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache

And see statistics it generates (look for cache) with:

ceph --admin-daemon /path/to/admin_socket perfcounters_dump


This doesn't work for me:
ceph --admin-daemon /var/run/ceph.sock show config
read only got 0 bytes of 4 expected for response length; invalid
command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) AdminSocket:
request 'show config' not defined


Oh, it's 'config show'.  Also, 'help' will list the supported commands.


Also perfcounters does not show anything:
# ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
{}


There may be another daemon that tried to attach to the same socket file.
You might want to set 'admin socket = /var/run/ceph/$name.sock' or
something similar, or whatever else is necessary to make it a unique file.


~]# ceph -v
ceph version 0.48argonaut-2-gb576faa
(commit:b576faa6f24356f4d3ec7205e298d58659e29c68)


Out of curiousity, what patches are you applying on top of the release?

sage





Re: speedup ceph / scaling / find the bottleneck

2012-07-03 Thread Sage Weil
On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> Am 02.07.2012 22:30, schrieb Josh Durgin:
> > If you add admin_socket=/path/to/admin_socket for your client running
> > qemu (in that client's ceph.conf section or manually in the qemu
> > command line) you can check that caching is enabled:
> > 
> > ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache
> > 
> > And see statistics it generates (look for cache) with:
> > 
> > ceph --admin-daemon /path/to/admin_socket perfcounters_dump
> 
> This doesn't work for me:
> ceph --admin-daemon /var/run/ceph.sock show config
> read only got 0 bytes of 4 expected for response length; invalid
> command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) AdminSocket:
> request 'show config' not defined

Oh, it's 'config show'.  Also, 'help' will list the supported commands.
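
Putting that correction together with the earlier instructions, the working invocations look like this (socket path as discussed in the thread):

ceph --admin-daemon /var/run/ceph.sock help
ceph --admin-daemon /var/run/ceph.sock config show | grep rbd_cache
ceph --admin-daemon /var/run/ceph.sock perfcounters_dump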
 
> Also perfcounters does not show anything:
> # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
> {}

There may be another daemon that tried to attach to the same socket file.  
You might want to set 'admin socket = /var/run/ceph/$name.sock' or 
something similar, or whatever else is necessary to make it a unique file.
 
> ~]# ceph -v
> ceph version 0.48argonaut-2-gb576faa
> (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)

Out of curiosity, what patches are you applying on top of the release?

sage


Re: speedup ceph / scaling / find the bottleneck

2012-07-03 Thread Stefan Priebe - Profihost AG

Hello,


On 02.07.2012 22:30, Josh Durgin wrote:

If you add admin_socket=/path/to/admin_socket for your client running
qemu (in that client's ceph.conf section or manually in the qemu
command line) you can check that caching is enabled:

ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache

And see statistics it generates (look for cache) with:

ceph --admin-daemon /path/to/admin_socket perfcounters_dump


This doesn't work for me:
ceph --admin-daemon /var/run/ceph.sock show config
read only got 0 bytes of 4 expected for response length; invalid 
command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) 
AdminSocket: request 'show config' not defined


Also perfcounters does not show anything:
# ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
{}

~]# ceph -v
ceph version 0.48argonaut-2-gb576faa 
(commit:b576faa6f24356f4d3ec7205e298d58659e29c68)


Stefan


Re: speedup ceph / scaling / find the bottleneck

2012-07-02 Thread Alexandre DERUMIER
Stefan,

As the fio benchmark uses direct I/O (--direct), maybe the writeback cache is not working?

perfcounters should give us the answer.

- Original message -

From: "Josh Durgin"
To: "Stefan Priebe"
Cc: "Gregory Farnum", "Alexandre DERUMIER", "Sage Weil", ceph-devel@vger.kernel.org, "Mark Nelson"
Sent: Monday, 2 July 2012 22:30:19
Subject: Re: speedup ceph / scaling / find the bottleneck

On 07/02/2012 12:22 PM, Stefan Priebe wrote: 
> Am 02.07.2012 18:51, schrieb Gregory Farnum: 
>> On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG 
>>  wrote: 
>>> @sage / mark 
>>> How does the aggregation work? Does it work 4MB blockwise or target node 
>>> based? 
>> Aggregation is based on the 4MB blocks, and if you've got caching 
>> enabled then it's also not going to flush them out to disk very often 
>> if you're continuously updating the block — I don't remember all the 
>> conditions, but essentially, you'll run into dirty limits and it will 
>> asynchronously flush out the data based on a combination of how old it 
>> is, and how long it's been since some version of it was stable on 
>> disk. 
> Is there any way to check if rbd caching works correctly? For me the I/O 
> values do not change if i switch writeback on or of and it also doesn't 
> matter how large i set the cache size. 
> 
> ... 

If you add admin_socket=/path/to/admin_socket for your client running 
qemu (in that client's ceph.conf section or manually in the qemu 
command line) you can check that caching is enabled: 

ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache 

And see statistics it generates (look for cache) with: 

ceph --admin-daemon /path/to/admin_socket perfcounters_dump 

Josh 

>>> Ceph: 
>>> 2 VMs: 
>>> write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec 
>>> read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec 
>>> write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec 
>>> read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec 
>>> 
>>> write: io=MB, bw=25275KB/s, iops=6318, runt= 90011msec 
>>> read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec 
>>> write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec 
>>> read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec 
>> 
>> I can't quite tell what's going on here, can you describe the test in 
>> more detail? 
> 
> I've network booted my VM and then run the following command: 
> export DISK=/dev/vda; (fio --filename=$DISK --direct=1 --rw=randwrite 
> --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting 
> --name=file1;fio --filename=$DISK --direct=1 --rw=randread --bs=4k 
> --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1;fio 
> --filename=$DISK --direct=1 --rw=write --bs=4M --size=200G --numjobs=50 
> --runtime=90 --group_reporting --name=file1;fio --filename=$DISK 
> --direct=1 --rw=read --bs=4M --size=200G --numjobs=50 --runtime=90 
> --group_reporting --name=file1 )|egrep " read| write" 
> 
> - write random 4k I/O 
> - read random 4k I/O 
> - write seq 4M I/O 
> - read seq 4M I/O 
> 
> Stefan 




--
Alexandre Derumier
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris





Re: speedup ceph / scaling / find the bottleneck

2012-07-02 Thread Josh Durgin

On 07/02/2012 12:22 PM, Stefan Priebe wrote:

On 02.07.2012 18:51, Gregory Farnum wrote:

On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG
 wrote:

@sage / mark
How does the aggregation work? Does it work 4MB blockwise or target node
based?

Aggregation is based on the 4MB blocks, and if you've got caching
enabled then it's also not going to flush them out to disk very often
if you're continuously updating the block — I don't remember all the
conditions, but essentially, you'll run into dirty limits and it will
asynchronously flush out the data based on a combination of how old it
is, and how long it's been since some version of it was stable on
disk.

Is there any way to check if rbd caching works correctly? For me the I/O
values do not change if i switch writeback on or of and it also doesn't
matter how large i set the cache size.

...


If you add admin_socket=/path/to/admin_socket for your client running
qemu (in that client's ceph.conf section or manually in the qemu
command line) you can check that caching is enabled:

ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache

And see statistics it generates (look for cache) with:

ceph --admin-daemon /path/to/admin_socket perfcounters_dump
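
As an aside on the "manually in the qemu command line" part: qemu's rbd driver passes unrecognised key=value pairs from the rbd: filename through to librados, so something along these lines might work (a hedged sketch, not verified on qemu 1.1; the socket path is a placeholder):

-drive 'file=rbd:pool/volume:admin_socket=/var/run/ceph/qemu-client.asok,if=virtio,cache=writeback'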

Josh


Ceph:
2 VMs:
write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec

write: io=MB, bw=25275KB/s, iops=6318, runt= 90011msec
read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec


I can't quite tell what's going on here, can you describe the test in
more detail?


I've network booted my VM and then run the following command:
export DISK=/dev/vda; (fio --filename=$DISK --direct=1 --rw=randwrite
--bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting
--name=file1;fio --filename=$DISK --direct=1 --rw=randread --bs=4k
--size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1;fio
--filename=$DISK --direct=1 --rw=write --bs=4M --size=200G --numjobs=50
--runtime=90 --group_reporting --name=file1;fio --filename=$DISK
--direct=1 --rw=read --bs=4M --size=200G --numjobs=50 --runtime=90
--group_reporting --name=file1 )|egrep " read| write"

- write random 4k I/O
- read random 4k I/O
- write seq 4M I/O
- read seq 4M I/O

Stefan




Re: speedup ceph / scaling / find the bottleneck

2012-07-02 Thread Stefan Priebe

On 02.07.2012 18:51, Gregory Farnum wrote:

On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG
 wrote:

@sage / mark
How does the aggregation work? Does it work 4MB blockwise or target node
based?

Aggregation is based on the 4MB blocks, and if you've got caching
enabled then it's also not going to flush them out to disk very often
if you're continuously updating the block — I don't remember all the
conditions, but essentially, you'll run into dirty limits and it will
asynchronously flush out the data based on a combination of how old it
is, and how long it's been since some version of it was stable on
disk.
Is there any way to check whether rbd caching works correctly? For me the I/O values do not change when I switch writeback on or off, and it also doesn't matter how large I set the cache size.


...


Ceph:
2 VMs:
   write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
   read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
   write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
   read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec

   write: io=MB, bw=25275KB/s, iops=6318, runt= 90011msec
   read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
   write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
   read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec


I can't quite tell what's going on here, can you describe the test in
more detail?


I've network booted my VM and then run the following command:
export DISK=/dev/vda; (fio --filename=$DISK --direct=1 --rw=randwrite 
--bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting 
--name=file1;fio --filename=$DISK --direct=1 --rw=randread --bs=4k 
--size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1;fio 
--filename=$DISK --direct=1 --rw=write --bs=4M --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1;fio --filename=$DISK 
--direct=1 --rw=read --bs=4M --size=200G --numjobs=50 --runtime=90 
--group_reporting --name=file1 )|egrep " read| write"


- write random 4k I/O
- read random 4k I/O
- write seq 4M I/O
- read seq 4M I/O

Stefan


Re: speedup ceph / scaling / find the bottleneck

2012-07-02 Thread Gregory Farnum
On Sun, Jul 1, 2012 at 11:12 PM, Stefan Priebe - Profihost AG
 wrote:
> Am 02.07.2012 07:02, schrieb Alexandre DERUMIER:
>
>> Hi,
>> my 2cent,
>> maybe with lower range (like 100MB) of random io,
>> you have more chance to aggregate them in 4MB block ?
>
>
> Yes maybe. If you have just a range of 100MB the chance you'll hit the same
> 4MB block again is very high.
>
> @sage / mark
> How does the aggregation work? Does it work 4MB blockwise or target node
> based?
Aggregation is based on the 4MB blocks, and if you've got caching
enabled then it's also not going to flush them out to disk very often
if you're continuously updating the block — I don't remember all the
conditions, but essentially, you'll run into dirty limits and it will
asynchronously flush out the data based on a combination of how old it
is, and how long it's been since some version of it was stable on
disk.
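
For reference, the dirty limits alluded to here are [client]-side librbd options; a hedged sketch with what I believe are the argonaut-era names and defaults (verify them with 'config show' on the admin socket):

[client]
        rbd cache = true
        rbd cache size = 33554432            # total cache bytes
        rbd cache max dirty = 25165824       # writers block above this many dirty bytes
        rbd cache target dirty = 16777216    # background flushing starts here
        rbd cache max dirty age = 1.0        # flush dirty data older than this (seconds)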


On Mon, Jul 2, 2012 at 6:19 AM, Stefan Priebe - Profihost AG
 wrote:
> Hello,
>
> i just want to report back some test results.
>
> Just some results from a sheepdog test using the same hardware.
>
> Sheepdog:
>
> 1 VM:
>   write: io=12544MB, bw=142678KB/s, iops=35669, runt= 90025msec
>   read : io=14519MB, bw=165186KB/s, iops=41296, runt= 90003msec
>   write: io=16520MB, bw=185842KB/s, iops=45, runt= 91026msec
>   read : io=102936MB, bw=1135MB/s, iops=283, runt= 90684msec
>
> 2 VMs:
>   write: io=7042MB, bw=80062KB/s, iops=20015, runt= 90062msec
>   read : io=8672MB, bw=98661KB/s, iops=24665, runt= 90004msec
>   write: io=14008MB, bw=157443KB/s, iops=38, runt= 91107msec
>   read : io=43924MB, bw=498462KB/s, iops=121, runt= 90234msec
>
>   write: io=6048MB, bw=68772KB/s, iops=17192, runt= 90055msec
>   read : io=9151MB, bw=104107KB/s, iops=26026, runt= 90006msec
>   write: io=12716MB, bw=142693KB/s, iops=34, runt= 91253msec
>   read : io=59616MB, bw=675648KB/s, iops=164, runt= 90353msec
>
>
> Ceph:
> 2 VMs:
>   write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
>   read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
>   write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
>   read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec
>
>   write: io=MB, bw=25275KB/s, iops=6318, runt= 90011msec
>   read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
>   write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
>   read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec

I can't quite tell what's going on here, can you describe the test in
more detail?


Re: speedup ceph / scaling / find the bottleneck

2012-07-02 Thread Stefan Priebe - Profihost AG

Hello,

I just want to report back some test results.

Here are results from a sheepdog test using the same hardware:

Sheepdog:

1 VM:
  write: io=12544MB, bw=142678KB/s, iops=35669, runt= 90025msec
  read : io=14519MB, bw=165186KB/s, iops=41296, runt= 90003msec
  write: io=16520MB, bw=185842KB/s, iops=45, runt= 91026msec
  read : io=102936MB, bw=1135MB/s, iops=283, runt= 90684msec

2 VMs:
  write: io=7042MB, bw=80062KB/s, iops=20015, runt= 90062msec
  read : io=8672MB, bw=98661KB/s, iops=24665, runt= 90004msec
  write: io=14008MB, bw=157443KB/s, iops=38, runt= 91107msec
  read : io=43924MB, bw=498462KB/s, iops=121, runt= 90234msec

  write: io=6048MB, bw=68772KB/s, iops=17192, runt= 90055msec
  read : io=9151MB, bw=104107KB/s, iops=26026, runt= 90006msec
  write: io=12716MB, bw=142693KB/s, iops=34, runt= 91253msec
  read : io=59616MB, bw=675648KB/s, iops=164, runt= 90353msec


Ceph:
2 VMs:
  write: io=2234MB, bw=25405KB/s, iops=6351, runt= 90041msec
  read : io=4760MB, bw=54156KB/s, iops=13538, runt= 90007msec
  write: io=56372MB, bw=638402KB/s, iops=155, runt= 90421msec
  read : io=86572MB, bw=981225KB/s, iops=239, runt= 90346msec

  write: io=MB, bw=25275KB/s, iops=6318, runt= 90011msec
  read : io=4747MB, bw=54000KB/s, iops=13500, runt= 90008msec
  write: io=55300MB, bw=626733KB/s, iops=153, runt= 90353msec
  read : io=84992MB, bw=965283KB/s, iops=235, runt= 90162msec

So ceph has pretty good values for sequential stuff but for random I/O 
it would be really cool to improve it.


Right now my test system has a theoretical 4k random I/O bandwidth of 350,000 iops: 14 disks with 25,000 iops each (tested with fio too).


Greets
Stefan


On 01.07.2012 23:01, Stefan Priebe wrote:

Hello list,
  Hello sage,

i've made some further tests.

Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops

Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops

When i make random 4k writes over 100MB: 450% CPU usage of kvm process
and !! 25059 iops !!

Random 4k writes over 1GB: 380% CPU usage of kvm process 14387 iops

So the range where the random I/O happen seem to be important and the
cpu usage just seem to reflect the iops.

So i'm not sure if the problem is really the client rbd driver. Mark i
hope you can make some tests next week.

Greets
Stefan


On 29.06.2012 23:18, Stefan Priebe wrote:

On 29.06.2012 17:28, Sage Weil wrote:

On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:

On 29.06.2012 13:49, Mark Nelson wrote:

I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next
week. If
I recall correctly, in the other thread you said that sequential
writes
are using much less CPU time on your systems?


Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


I take it 'rbd cache = true'?

Yes


It sounds like librbd (or the guest file
system) is coalescing the sequential writes into big writes.  I'm a bit
surprised that the 4k ones have lower CPU utilization, but there are
lots
of opportunity for noise there, so I would



n't read too far into it yet.

90 to 99,7 is OK the 9% goes to flush, kworker and xfs processes. It was
the overall system load. Not just ceph-osd.


  Do you see better scaling in that case?


3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops

<-- this one is WRONG! sorry it is 14100 iops



Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops VM 1
Seq 4k writes: 18500 iops VM 2


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops  <-- 


Can you double-check this number?

Triple checked BUT i see the the Rand 4k writes with 3 osd nodes was
wrong. Sorry.


Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each


With the exception of that one number above, it really sounds like the
bottleneck is in the client (VM or librbd+librados) and not in the
cluster.  Performance won't improve when you add OSDs if the limiting
factor is the clients ability to dispatch/stream/sustatin IOs.  That
also
seems concistent with the fact that limiting the # of CPUs on the OSDs
doesn't affect much.

ACK


Aboe, with 2 VMs, for instance, your total iops for the cluster doubled
(36000 total).  Can you try with 4 VMs and see if it continues to
scale in
that dimension?  At some point you will start to saturate the OSDs,
and at
that point adding more OSDs should show aggregate throughput going up.

 From where did you get that value? It scales to VMs on some points but
it does not scale with OSDs.

Stefan






Re: speedup ceph / scaling / find the bottleneck

2012-07-01 Thread Stefan Priebe - Profihost AG

On 02.07.2012 07:02, Alexandre DERUMIER wrote:

Hi,
my 2cent,
maybe with lower range (like 100MB) of random io,
you have more chance to aggregate them in 4MB block ?


Yes, maybe. If you only have a range of 100MB, the chance that you'll hit the same 4MB block again is very high.


@sage / mark
How does the aggregation work? Does it work 4MB blockwise or target node 
based?


Greets
Stefan


Re: speedup ceph / scaling / find the bottleneck

2012-07-01 Thread Alexandre DERUMIER
Hi,
my 2 cents: maybe with a lower range (like 100MB) of random I/O,
you have a better chance of aggregating the writes within a 4MB block?

I'll do some tests today with my 15K drives

- Original message -

From: "Stefan Priebe"
To: "Mark Nelson"
Cc: "Sage Weil", ceph-devel@vger.kernel.org
Sent: Sunday, 1 July 2012 23:27:30
Subject: Re: speedup ceph / scaling / find the bottleneck

Am 01.07.2012 23:13, schrieb Mark Nelson: 
> On 7/1/12 4:01 PM, Stefan Priebe wrote: 
>> Hello list, 
>> Hello sage, 
>> 
>> i've made some further tests. 
>> 
>> Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops 
>> 
>> Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops 
>> 
>> When i make random 4k writes over 100MB: 450% CPU usage of kvm process 
>> and !! 25059 iops !! 
>> 
> When you say 100MB vs 200GB, do you mean the total amount of data that 
> is written for the test? 

Yes/No, it is the max amount of data written but for random I/O it is 
also the range like random block device position between 0 and X where 
to write the 4K block. 

> Also, are these starting out on a fresh 
> filesystem? 
Yes, 5 Min old in this case ;-) 

Stefan 



--
Alexandre Derumier
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris


Re: speedup ceph / scaling / find the bottleneck

2012-07-01 Thread Stefan Priebe

On 01.07.2012 23:13, Mark Nelson wrote:

On 7/1/12 4:01 PM, Stefan Priebe wrote:

Hello list,
Hello sage,

i've made some further tests.

Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops

Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops

When i make random 4k writes over 100MB: 450% CPU usage of kvm process
and !! 25059 iops !!


When you say 100MB vs 200GB, do you mean the total amount of data that
is written for the test?


Yes and no: it is the maximum amount of data written, but for random I/O it is also the range, i.e. the block device positions between 0 and X at which the random 4k writes land.


> Also, are these starting out on a fresh filesystem?

Yes, 5 minutes old in this case ;-)

Stefan


Re: speedup ceph / scaling / find the bottleneck

2012-07-01 Thread Mark Nelson

On 7/1/12 4:01 PM, Stefan Priebe wrote:

Hello list,
Hello sage,

i've made some further tests.

Sequential 4k writes over 200GB: 300% CPU usage of kvm process 34712 iops

Random 4k writes over 200GB: 170% CPU usage of kvm process 5500 iops

When i make random 4k writes over 100MB: 450% CPU usage of kvm process
and !! 25059 iops !!



When you say 100MB vs 200GB, do you mean the total amount of data that 
is written for the test?  Also, are these starting out on a fresh 
filesystem?  Recently I've been working on tracking down an issue where 
small write performance is degrading as data is written.  The tests I've 
done have been for sequential writes, but I wonder if the problem may be 
significantly worse with random writes.



Random 4k writes over 1GB: 380% CPU usage of kvm process 14387 iops

So the range where the random I/O happen seem to be important and the
cpu usage just seem to reflect the iops.

So i'm not sure if the problem is really the client rbd driver. Mark i
hope you can make some tests next week.


I need to get perf set up on our test boxes, but once I do that I'm hoping to follow up on this.




Greets
Stefan



Mark


Re: speedup ceph / scaling / find the bottleneck

2012-07-01 Thread Stefan Priebe

Hello list,
 Hello sage,

I've made some further tests.

Sequential 4k writes over 200GB: 300% CPU usage of the kvm process, 34,712 iops

Random 4k writes over 200GB: 170% CPU usage of the kvm process, 5,500 iops

Random 4k writes over 100MB: 450% CPU usage of the kvm process, and !! 25,059 iops !!

Random 4k writes over 1GB: 380% CPU usage of the kvm process, 14,387 iops

So the range in which the random I/O happens seems to be important, and the CPU usage just seems to reflect the iops.
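
A rough back-of-envelope on the object counts involved, assuming the 4MB block size discussed elsewhere in the thread:

  100MB range / 4MB =     25 objects  -> 50 jobs keep re-hitting the same few hot objects
    1GB range / 4MB =    256 objects
  200GB range / 4MB = 51,200 objects  -> nearly every 4k write lands on a different, cold object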


So I'm not sure the problem is really the client rbd driver. Mark, I hope you can run some tests next week.


Greets
Stefan


On 29.06.2012 23:18, Stefan Priebe wrote:

On 29.06.2012 17:28, Sage Weil wrote:

On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:

On 29.06.2012 13:49, Mark Nelson wrote:

I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next
week. If
I recall correctly, in the other thread you said that sequential writes
are using much less CPU time on your systems?


Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


I take it 'rbd cache = true'?

Yes


It sounds like librbd (or the guest file
system) is coalescing the sequential writes into big writes.  I'm a bit
surprised that the 4k ones have lower CPU utilization, but there are lots
of opportunity for noise there, so I would



n't read too far into it yet.

90 to 99,7 is OK the 9% goes to flush, kworker and xfs processes. It was
the overall system load. Not just ceph-osd.


  Do you see better scaling in that case?


3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops

<-- this one is WRONG! sorry it is 14100 iops



Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops VM 1
Seq 4k writes: 18500 iops VM 2


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops  <-- 


Can you double-check this number?

Triple checked BUT i see the the Rand 4k writes with 3 osd nodes was
wrong. Sorry.


Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each


With the exception of that one number above, it really sounds like the
bottleneck is in the client (VM or librbd+librados) and not in the
cluster.  Performance won't improve when you add OSDs if the limiting
factor is the clients ability to dispatch/stream/sustatin IOs.  That also
seems concistent with the fact that limiting the # of CPUs on the OSDs
doesn't affect much.

ACK


Aboe, with 2 VMs, for instance, your total iops for the cluster doubled
(36000 total).  Can you try with 4 VMs and see if it continues to
scale in
that dimension?  At some point you will start to saturate the OSDs,
and at
that point adding more OSDs should show aggregate throughput going up.

 From where did you get that value? It scales to VMs on some points but
it does not scale with OSDs.

Stefan




Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe

On 29.06.2012 17:28, Sage Weil wrote:

On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:

On 29.06.2012 13:49, Mark Nelson wrote:

I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next week. If
I recall correctly, in the other thread you said that sequential writes
are using much less CPU time on your systems?


Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


I take it 'rbd cache = true'?

Yes


It sounds like librbd (or the guest file
system) is coalescing the sequential writes into big writes.  I'm a bit
surprised that the 4k ones have lower CPU utilization, but there are lots
of opportunity for noise there, so I wouldn't read too far into it yet.
90 to 99,7 is OK; the 9% goes to flush, kworker and xfs processes. It was the overall system load, not just ceph-osd.



  Do you see better scaling in that case?


3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops

<-- this one is WRONG! sorry it is 14100 iops



Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops VM 1
Seq 4k writes: 18500 iops VM 2


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops  <-- 


Can you double-check this number?
Triple checked, BUT I see that the Rand 4k writes figure with 3 osd nodes was wrong. Sorry.



Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each


With the exception of that one number above, it really sounds like the
bottleneck is in the client (VM or librbd+librados) and not in the
cluster.  Performance won't improve when you add OSDs if the limiting
factor is the clients ability to dispatch/stream/sustatin IOs.  That also
seems concistent with the fact that limiting the # of CPUs on the OSDs
doesn't affect much.

ACK


Aboe, with 2 VMs, for instance, your total iops for the cluster doubled
(36000 total).  Can you try with 4 VMs and see if it continues to scale in
that dimension?  At some point you will start to saturate the OSDs, and at
that point adding more OSDs should show aggregate throughput going up.
Where did you get that value from? It scales with VMs at some points, but it does not scale with OSDs.


Stefan


Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Sage Weil
On Fri, 29 Jun 2012, Stefan Priebe - Profihost AG wrote:
> Am 29.06.2012 13:49, schrieb Mark Nelson:
> > I'll try to replicate your findings in house.  I've got some other
> > things I have to do today, but hopefully I can take a look next week. If
> > I recall correctly, in the other thread you said that sequential writes
> > are using much less CPU time on your systems?
> 
> Random 4k writes: 10% idle
> Seq 4k writes: !! 99,7% !! idle
> Seq 4M writes: 90% idle

I take it 'rbd cache = true'?  It sounds like librbd (or the guest file 
system) is coalescing the sequential writes into big writes.  I'm a bit 
surprised that the 4k ones have lower CPU utilization, but there is lots 
of opportunity for noise there, so I wouldn't read too far into it yet.

> >  Do you see better scaling in that case?
> 
> 3 osd nodes:
> 1 VM:
> Rand 4k writes: 7000 iops
> Seq 4k writes: 19900 iops
> 
> 2 VMs:
> Rand 4k writes: 6000 iops each
> Seq 4k writes: 4000 iops each VM 1
> Seq 4k writes: 18500 iops each VM 2
> 
> 
> 4 osd nodes:
> 1 VM:
> Rand 4k writes: 14400 iops  <-- 

Can you double-check this number?

> Seq 4k writes: 19000 iops
> 
> 2 VMs:
> Rand 4k writes: 7000 iops each
> Seq 4k writes: 18000 iops each

With the exception of that one number above, it really sounds like the 
bottleneck is in the client (VM or librbd+librados) and not in the 
cluster.  Performance won't improve when you add OSDs if the limiting 
factor is the client's ability to dispatch/stream/sustain IOs.  That also 
seems consistent with the fact that limiting the # of CPUs on the OSDs 
doesn't affect much.

Above, with 2 VMs, for instance, your total iops for the cluster doubled 
(36000 total).  Can you try with 4 VMs and see if it continues to scale in 
that dimension?  At some point you will start to saturate the OSDs, and at 
that point adding more OSDs should show aggregate throughput going up.

I think the typical way to approach this is to first scale the client side 
independently to get the iops-per-osd figure, then pick a reasonable ratio 
between the two, then scale both the client and server side proportional 
to make sure the load distribution and network infrastructure scales 
properly.
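
One way to get that client-independent figure might be rados bench run straight against a pool, bypassing qemu and librbd (illustrative invocation; check the flags against the 0.48 binaries):

rados -p rbd bench 60 write -b 4096 -t 64    # 60 seconds of 4k writes, 64 concurrent ops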

sage



> 
> 
> 
> > To figure out where CPU is being used, you could try various options:
> > oprofile, perf, valgrind, strace.  Each has it's own advantages.
> > 
> > Here's how you can create a simple callgraph with perf:
> > 
> > http://lwn.net/Articles/340010/
> 10s perf data output while doing random 4k writes:
> https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt
> 
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG


iostat output via iostat -x -t 5 while 4k random writes


06/29/2012 03:20:55 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31,63    0,00   52,64    0,78    0,00   14,95

Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00   690,40    0,00  3143,60     0,00  33958,80    10,80     2,68   0,85   0,08  24,08
sdc        0,00  1069,80    0,00  5151,60     0,00  54693,00    10,62     8,31   1,61   0,06  29,68
sdd        0,00   581,00    0,00  2762,80     0,00  27809,00    10,07     2,45   0,89   0,08  21,12
sde        0,00   820,00    0,00  4208,20     0,00  43457,40    10,33     4,00   0,95   0,07  28,56
sda        0,00     0,00    0,00     0,40     0,00      9,60    24,00     0,00   0,00   0,00   0,00


06/29/2012 03:21:00 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,68    0,00   52,89    0,98    0,00   16,45

Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00  1046,60    0,00  5544,20     0,00  57938,00    10,45     6,08   1,10   0,06  32,08
sdc        0,00   115,60    0,00  3483,60     0,00  29368,00     8,43     3,45   0,99   0,06  21,36
sdd        0,00  1143,20    0,00  5991,00     0,00  62607,40    10,45     6,03   1,01   0,06  35,20
sde        0,00  1070,00    0,00  5561,60     0,00  58207,20    10,47     5,76   1,04   0,07  38,08
sda        0,00     0,00    0,00     0,00     0,00      0,00     0,00     0,00   0,00   0,00   0,00


06/29/2012 03:21:05 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,69    0,00   53,06    0,60    0,00   16,65

Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00   199,60    0,00  4484,40     0,00  41338,20     9,22     1,96   0,44   0,07  30,56
sdc        0,00   766,60    0,00  3616,20     0,00  38829,00    10,74     3,62   1,00   0,07  25,68
sdd        0,00   149,20    0,00  5066,60     0,00  45793,60     9,04     4,48   0,89   0,06  28,48
sde        0,00   150,00    0,00  4328,80     0,00  36496,00     8,43     2,96   0,68   0,07  32,40
sda        0,00     0,00    0,00     0,40     0,00     35,20    88,00     0,00   0,00   0,00   0,00


06/29/2012 03:21:10 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,11    0,00   46,58    0,50    0,00   23,81

Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00   881,20    0,00  3077,20     0,00  33382,80    10,85     3,44   1,12   0,06  18,16
sdc        0,00   867,60    0,00  5098,40     0,00  52056,20    10,21     5,65   1,11   0,05  24,32
sdd        0,00   864,40    0,00  2759,00     0,00  30321,60    10,99     3,39   1,23   0,06  17,36
sde        0,00   846,20    0,00  3193,40     0,00  36795,60    11,52     3,48   1,09   0,06  19,92
sda        0,00     0,00    0,00     1,40     0,00     11,20     8,00     0,01   4,57   2,29   0,32



Am 29.06.2012 15:16, schrieb Stefan Priebe - Profihost AG:

Big sorry. ceph was scrubbing during my last test. Didn't recognize this.

When I redo the test I see writes between 20MB/s and 100MB/s. That is
OK. Sorry.

Stefan

Am 29.06.2012 15:11, schrieb Stefan Priebe - Profihost AG:

Another BIG hint.

While doing random 4k I/O from one VM I achieve 14k IO/s. This is
around 54MB/s. But EACH ceph-osd machine is writing between 500MB/s and
750MB/s. What are they writing?!

Just an idea?:
Do they completely rewrite EACH 4MB block for each 4k write?

Stefan

Am 29.06.2012 15:02, schrieb Stefan Priebe - Profihost AG:

Am 29.06.2012 13:49, schrieb Mark Nelson:

I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next
week. If
I recall correctly, in the other thread you said that sequential writes
are using much less CPU time on your systems?


Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


 >  Do you see better scaling in that case?

3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops
Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops each VM 1
Seq 4k writes: 18500 iops each VM 2


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops
Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each




To figure out where CPU is being used, you could try various options:
oprofile, perf, valgrind, strace.  Each has its own advantages.

Here's how you can create a simple callgraph with perf:

http://lwn.net/Articles/340010/

10s perf data output while doing random 4k writes:
https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt




Stefan






--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG

Big sorry. ceph was scrubbing during my last test. Didn't recognize this.

When I redo the test I see writes between 20MB/s and 100MB/s. That is 
OK. Sorry.


Stefan

Am 29.06.2012 15:11, schrieb Stefan Priebe - Profihost AG:

Another BIG hint.

While doing random 4k I/O from one VM I achieve 14k IO/s. This is
around 54MB/s. But EACH ceph-osd machine is writing between 500MB/s and
750MB/s. What are they writing?!

Just an idea?:
Do they completely rewrite EACH 4MB block for each 4k write?

Stefan

Am 29.06.2012 15:02, schrieb Stefan Priebe - Profihost AG:

Am 29.06.2012 13:49, schrieb Mark Nelson:

I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next week. If
I recall correctly, in the other thread you said that sequential writes
are using much less CPU time on your systems?


Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


 >  Do you see better scaling in that case?

3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops
Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops each VM 1
Seq 4k writes: 18500 iops each VM 2


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops
Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each




To figure out where CPU is being used, you could try various options:
oprofile, perf, valgrind, strace.  Each has its own advantages.

Here's how you can create a simple callgraph with perf:

http://lwn.net/Articles/340010/

10s perf data output while doing random 4k writes:
https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt



Stefan




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG

Another BIG hint.

While doing random 4k I/O from one VM I achieve 14k IO/s. This is 
around 54MB/s. But EACH ceph-osd machine is writing between 500MB/s and 
750MB/s. What are they writing?!


Just an idea?:
Do they completely rewrite EACH 4MB block for each 4k write?

Stefan

Am 29.06.2012 15:02, schrieb Stefan Priebe - Profihost AG:

Am 29.06.2012 13:49, schrieb Mark Nelson:

I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next week. If
I recall correctly, in the other thread you said that sequential writes
are using much less CPU time on your systems?


Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


 >  Do you see better scaling in that case?

3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops
Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops each VM 1
Seq 4k writes: 18500 iops each VM 2


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops
Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each




To figure out where CPU is being used, you could try various options:
oprofile, perf, valgrind, strace.  Each has its own advantages.

Here's how you can create a simple callgraph with perf:

http://lwn.net/Articles/340010/

10s perf data output while doing random 4k writes:
https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt


Stefan


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG

Am 29.06.2012 13:49, schrieb Mark Nelson:

I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next week. If
I recall correctly, in the other thread you said that sequential writes
are using much less CPU time on your systems?


Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


>  Do you see better scaling in that case?

3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops
Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops each VM 1
Seq 4k writes: 18500 iops each VM 2


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops
Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each




To figure out where CPU is being used, you could try various options:
oprofile, perf, valgrind, strace.  Each has its own advantages.

Here's how you can create a simple callgraph with perf:

http://lwn.net/Articles/340010/

10s perf data output while doing random 4k writes:
https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG

Some more testing / results:

= lowering CPU cores =
1.) Disabling CPU cores 4-7 via
echo 0 > /sys/devices/system/cpu/cpuX/online
does not change anything (a loop sketch follows below this list)

2.) When only 50% of the CPUs are available, each ceph-osd process takes 
only half of the CPU load it uses when all of them are usable.


3.) Even then iops stay at 14k
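
(The CPU offlining from point 1, written out as a loop; just a sketch, to be
run as root, with the core numbers used in the test:)

for c in 4 5 6 7; do
    echo 0 > /sys/devices/system/cpu/cpu$c/online   # take the core offline
done
# and to bring the cores back afterwards:
for c in 4 5 6 7; do
    echo 1 > /sys/devices/system/cpu/cpu$c/online
done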

= changing replication level =

1.) Even changing the replication level from 2 to 1 still results in only 
14k iops
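
(For reference, a sketch of how such a replication change is typically done,
assuming the default 'rbd' pool; CLI syntax may differ slightly between
versions:)

ceph osd pool set rbd size 1
ceph osd dump | grep 'rep size'    # verify the pool's replication setting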


= change random ios to sequential ios =

1.) When I do a sequential write test with 4k blocks instead of randwrite, 
I get values jumping from 13k to 30k; the average is 18k


2.) The interesting thing here is that the ceph-osd processes then take 
just 1% CPU load


= direct io to disk =

1.) When I write directly to the OSD disk from the system itself, I 
achieve around 25000 iops


2.) Since ceph should spread the load across several disks, I would expect 
to see higher, not lower, iops, even with the network involved


Stefan

Am 29.06.2012 12:46, schrieb Stefan Priebe - Profihost AG:

Hello list,

I've done some further testing and have the problem that ceph doesn't
scale for me. I added a 4th osd server to my existing 3 node osd
cluster. I also reformatted everything to be able to start with a clean system.

While doing random 4k writes from two VMs i see about 8% idle on the osd
servers (Single Intel Xeon E5 8 cores 3,6Ghz). I believe that this is
the limiting factor and also the reason why i don't see any improvement
by adding osd servers.

3 nodes: 2VMS: 7000 IOp/s 4k writes osds: 7-15% idle
4 nodes: 2VMS: 7500 IOp/s 4k writes osds: 7-15% idle

Even if the CPU is not the limiting factor, I think it would be really
important to lower the CPU usage while doing 4k writes. The CPU is only
used by the ceph-osd process. I see nearly no usage by other processes
(only 5% by kworker and 5% flush).

Could somebody recommend a way to debug this, so we know where all
this CPU usage goes?

Stefan


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Mark Nelson

On 6/29/12 5:46 AM, Stefan Priebe - Profihost AG wrote:

Hello list,

I've done some further testing and have the problem that ceph doesn't
scale for me. I added a 4th osd server to my existing 3 node osd
cluster. I also reformatted everything to be able to start with a clean system.

While doing random 4k writes from two VMs i see about 8% idle on the osd
servers (Single Intel Xeon E5 8 cores 3,6Ghz). I believe that this is
the limiting factor and also the reason why i don't see any improvement
by adding osd servers.

3 nodes: 2VMS: 7000 IOp/s 4k writes osds: 7-15% idle
4 nodes: 2VMS: 7500 IOp/s 4k writes osds: 7-15% idle

Even if the CPU is not the limiting factor, I think it would be really
important to lower the CPU usage while doing 4k writes. The CPU is only
used by the ceph-osd process. I see nearly no usage by other processes
(only 5% by kworker and 5% flush).

Could somebody recommend a way to debug this, so we know where all
this CPU usage goes?


Hi Stefan,

I'll try to replicate your findings in house.  I've got some other 
things I have to do today, but hopefully I can take a look next week. 
If I recall correctly, in the other thread you said that sequential 
writes are using much less CPU time on your systems?  Do you see better 
scaling in that case?


To figure out where CPU is being used, you could try various options: 
oprofile, perf, valgrind, strace.  Each has its own advantages.


Here's how you can create a simple callgraph with perf:

http://lwn.net/Articles/340010/
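
A minimal sketch of what that looks like in practice (assuming perf is 
installed and a ceph-osd process is running locally; pidof -s picks just one 
of them, and the 10 second window is arbitrary):

perf record -g -p $(pidof -s ceph-osd) -- sleep 10   # sample one OSD with call graphs
perf report                                          # browse where the cycles went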

A more general tutorial is here:

https://perf.wiki.kernel.org/index.php/Tutorial

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Alexandre DERUMIER
I see something strange with my tests:

3 nodes (8 cores, E5420 @ 2.50GHz), 5 osds (xfs) per node with 15k drives, 
journal on tmpfs


kvm guest, with cache=writeback or cache=none (same result):

random write test with 4k blocks: 5000 iop/s, cpu idle 20%
sequential write test with 4k blocks: 2iop/s, cpu idle 80% (I'm saturating 
my gigabit link)


So what's the difference on the osd side between random and sequential 
writes if the blocks have the same size?



- Original Message - 

From: "Stefan Priebe - Profihost AG"  
To: ceph-devel@vger.kernel.org 
Sent: Friday, 29 June 2012 12:46:42 
Subject: speedup ceph / scaling / find the bottleneck 

Hello list, 

I've done some further testing and have the problem that ceph doesn't 
scale for me. I added a 4th osd server to my existing 3 node osd 
cluster. I also reformatted everything to be able to start with a clean system. 

While doing random 4k writes from two VMs i see about 8% idle on the osd 
servers (Single Intel Xeon E5 8 cores 3,6Ghz). I believe that this is 
the limiting factor and also the reason why i don't see any improvement 
by adding osd servers. 

3 nodes: 2VMS: 7000 IOp/s 4k writes osds: 7-15% idle 
4 nodes: 2VMS: 7500 IOp/s 4k writes osds: 7-15% idle 

Even if the CPU is not the limiting factor, I think it would be really 
important to lower the CPU usage while doing 4k writes. The CPU is only 
used by the ceph-osd process. I see nearly no usage by other processes 
(only 5% by kworker and 5% flush). 

Could somebody recommend a way to debug this, so we know where all 
this CPU usage goes? 

Stefan 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 



-- 
Alexandre Derumier 
Ingénieur Systèmes et Réseaux 
Fixe : 03 20 68 88 85 
Fax : 03 20 68 90 88 
45 Bvd du Général Leclerc 59100 Roubaix 
12 rue Marivaux 75002 Paris 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


speedup ceph / scaling / find the bottleneck

2012-06-29 Thread Stefan Priebe - Profihost AG

Hello list,

I've done some further testing and have the problem that ceph doesn't 
scale for me. I added a 4th osd server to my existing 3 node osd 
cluster. I also reformatted everything to be able to start with a clean system.


While doing random 4k writes from two VMs i see about 8% idle on the osd 
servers (Single Intel Xeon E5 8 cores 3,6Ghz). I believe that this is 
the limiting factor and also the reason why i don't see any improvement 
by adding osd servers.


3 nodes: 2VMS: 7000 IOp/s 4k writes osds: 7-15% idle
4 nodes: 2VMS: 7500 IOp/s 4k writes osds: 7-15% idle

Even if the CPU is not the limiting factor, I think it would be really 
important to lower the CPU usage while doing 4k writes. The CPU is only 
used by the ceph-osd process. I see nearly no usage by other processes 
(only 5% by kworker and 5% flush).


Could somebody recommend a way to debug this, so we know where all 
this CPU usage goes?
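
(A quick, hedged first step before reaching for a profiler: per-thread CPU 
usage of one ceph-osd process, assuming procps top; pidof -s picks a single 
PID:)

top -H -p $(pidof -s ceph-osd)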


Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html