[ceph-users] RBD snapshot - time and consistent

2013-05-11 Thread Timofey Koolin
Does the time needed to take a snapshot depend on the image size?
Does a snapshot capture a consistent state of the image as of the moment the snapshot starts?

For example, if I have a filesystem on the image and do not stop IO before starting the snapshot,
is the result any worse than cutting the power during IO?
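
For context, the usual way to avoid depending purely on crash consistency is to quiesce the
filesystem around the snapshot; a minimal sketch (mountpoint, pool and image names are
placeholders, not from the original mail):

fsfreeze -f /mnt/rbdfs              # flush dirty data and block new writes
rbd snap create rbd/myimage@snap1   # point-in-time snapshot of the image
fsfreeze -u /mnt/rbdfs              # resume IO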

-- 
Blog: www.rekby.ru
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-11 Thread Greg

On 11/05/2013 02:52, Mark Nelson wrote:

On 05/10/2013 07:20 PM, Greg wrote:

On 11/05/2013 00:56, Mark Nelson wrote:

On 05/10/2013 12:16 PM, Greg wrote:

Hello folks,

I'm in the process of testing CEPH and RBD. I have set up a small
cluster of hosts, each running a MON and an OSD with both journal and
data on the same SSD (OK, this is stupid, but it is enough to verify that
the disks are not the bottleneck for 1 client). All nodes are connected on a
1Gb network (no dedicated network for OSDs, shame on me :).

Summary: the RBD performance is poor compared to the benchmark.

A 5 second seq read benchmark shows something like this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712

Total time run:        4.165747
Total reads made:      95
Read size:             4194304
Bandwidth (MB/sec):    91.220

Average Latency:       0.678901
Max latency:           1.80038
Min latency:           0.104719


91 MB/s read performance, quite good!

Now the RBD performance:

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s


There is a 3x performance gap (same for write: ~60 MB/s in the benchmark, ~20 MB/s with
dd on the block device).

The network is OK, and the CPU is also OK on all OSDs.
Ceph is Bobtail 0.56.4; Linux is 3.8.1 ARM (vanilla release + some
patches for the SoC being used).

Can you show me a starting point for digging into this?


Hi Greg, First things first, are you doing kernel rbd or qemu/kvm?  If
you are doing qemu/kvm, make sure you are using virtio disks. This
can have a pretty big performance impact. Next, are you using RBD
cache? With 0.56.4 there are some performance issues with large
sequential writes if cache is on, but it does provide benefit for
small sequential writes.  In general RBD cache behaviour has improved
with Cuttlefish.

Beyond that, are the pools being targeted by RBD and rados bench set up
the same way?  Same number of PGs?  Same replication?

Mark, thanks for your prompt reply.

I'm doing kernel RBD, so I have not enabled the cache (default
setting?).
Sorry, I forgot to mention the pool used for bench and RBD is the same.
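
For reference, the pool parameters Mark asked about can be checked directly; a minimal
sketch, assuming the pool is simply named 'rbd':

ceph osd pool get rbd pg_num    # placement group count
ceph osd pool get rbd size      # replication factor
ceph osd dump | grep pool       # quick overview of all pools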


Interesting.  Does your rados bench performance change if you run a 
longer test?  So far I've been seeing about a 20-30% performance 
overhead for kernel RBD, but 3x is excessive!  It might be worth 
watching the underlying IO sizes to the OSDs in each case with 
something like collectl -sD -oT to see if there are any significant
differences.

Mark,

I'll gather some more data with collectl; meanwhile, I realized a
difference: the benchmark performs 16 concurrent reads while RBD only
does 1. That shouldn't be a problem, but they are still two different usage
patterns.
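
For reference, the two access patterns can be brought closer together for a fairer
comparison; a rough sketch (pool name and readahead value are examples, and exact
rados bench flags can vary slightly between Ceph versions):

rados -p rbd bench 30 write --no-cleanup   # leave objects behind for the read test
rados -p rbd bench 30 seq -t 1             # single in-flight read, closer to what dd does
echo 4096 > /sys/block/rbd1/queue/read_ahead_kb   # larger readahead for the kernel client
dd if=/dev/rbd1 of=/dev/null bs=4M count=100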


Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Maximums for Ceph architectures

2013-05-11 Thread Igor Laskovy
Hi all,

Does anybody know where to learn about the maximums for Ceph architectures?
For example, I'm trying to find out the maximum size of an rbd image and of a
cephfs file. I would also like to know the maximum size of a RADOS Gateway
object (meaning a file to be uploaded).

-- 
Igor Laskovy
facebook.com/igor.laskovy
studiogrizzly.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-11 Thread Mike Kelly
(Sorry for sending this twice... Forgot to reply to the list)

Is rbd caching safe to enable when you may need to do a live migration of
the guest later on? It was my understanding that it wasn't, and that
libvirt prevented you from doing the migration if it knew about the caching
setting.

If it isn't, is there anything else that could help performance? For example, some
tuning of block size parameters for the rbd image or the qemu
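
Not an authoritative answer, but two knobs that come up often, shown here only as a
sketch (image name, size and order value are made-up examples):

rbd create rbd/vmdisk --size 20480 --order 23   # object size is 2^order bytes; 22 = 4 MB default
# librbd cache for qemu guests is controlled on the client side in ceph.conf:
#   [client]
#   rbd cache = true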
On May 10, 2013 8:57 PM, Mark Nelson mark.nel...@inktank.com wrote:

 On 05/10/2013 07:21 PM, Yun Mao wrote:

 Hi Mark,

 Given the same hardware, optimal configuration (I have no idea what that
 means exactly but feel free to specify), which is supposed to perform
 better, kernel rbd or qemu/kvm? Thanks,

 Yun


 Hi Yun,

 I'm in the process of actually running some tests right now.

 In previous testing, it looked like kernel rbd and qemu/kvm performed
 about the same with cache off.  With cache on (in cuttlefish), small
 sequential write performance improved pretty dramatically vs without cache.
  Large write performance seemed to take more concurrency to reach peak
 performance, but ultimately aggregate throughput was about the same.

 Hopefully I should have some new results published in the near future.

 Mark



 On Fri, May 10, 2013 at 6:56 PM, Mark Nelson mark.nel...@inktank.com wrote:

 On 05/10/2013 12:16 PM, Greg wrote:

 Hello folks,

 I'm in the process of testing CEPH and RBD, I have set up a small
 cluster of  hosts running each a MON and an OSD with both
 journal and
 data on the same SSD (ok this is stupid but this is simple to
 verify the
 disks are not the bottleneck for 1 client). All nodes are
 connected on a
 1Gb network (no dedicated network for OSDs, shame on me :).

 Summary : the RBD performance is poor compared to benchmark

 A 5 seconds seq read benchmark shows something like this :

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
  Total time run:        4.165747
  Total reads made:      95
  Read size:             4194304
  Bandwidth (MB/sec):    91.220

  Average Latency:       0.678901
  Max latency:           1.80038
  Min latency:           0.104719


 91MB read performance, quite good !

 Now the RBD performance :

 root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
 100+0 records in
 100+0 records out
 419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s


 There is a 3x performance factor (same for write: ~60M
 benchmark, ~20M
 dd on block device)

 The network is ok, the CPU is also ok on all OSDs.
 CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
 patches for the SoC being used)

 Can you show me the starting point for digging into this ?


 Hi Greg, First things first, are you doing kernel rbd or qemu/kvm?
   If you are doing qemu/kvm, make sure you are using virtio disks.
   This can have a pretty big performance impact.  Next, are you
 using RBD cache? With 0.56.4 there are some performance issues with
 large sequential writes if cache is on, but it does provide benefit
 for small sequential writes.  In general RBD cache behaviour has
 improved with Cuttlefish.

 Beyond that, are the pools being targeted by RBD and rados bench
 setup the same way?  Same number of Pgs?  Same replication?



 Thanks!
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-11 Thread Michael Lowe
I believe this is fixed in the most recent versions of libvirt; sheepdog and rbd
were erroneously marked as unsafe.

http://libvirt.org/git/?p=libvirt.git;a=commit;h=78290b1641e95304c862062ee0aca95395c5926c

Sent from my iPad

On May 11, 2013, at 8:36 AM, Mike Kelly pi...@pioto.org wrote:

 (Sorry for sending this twice... Forgot to reply to the list)
 
 Is rbd caching safe to enable when you may need to do a live migration of the
 guest later on? It was my understanding that it wasn't, and that libvirt
 prevented you from doing the migration if it knew about the caching setting.
 
 If it isn't, is there anything else that could help performance? Like, some 
 tuning of block size parameters for the rbd image or the qemu
 
 On May 10, 2013 8:57 PM, Mark Nelson mark.nel...@inktank.com wrote:
 On 05/10/2013 07:21 PM, Yun Mao wrote:
 Hi Mark,
 
 Given the same hardware, optimal configuration (I have no idea what that
 means exactly but feel free to specify), which is supposed to perform
 better, kernel rbd or qemu/kvm? Thanks,
 
 Yun
 
 Hi Yun,
 
 I'm in the process of actually running some tests right now.
 
 In previous testing, it looked like kernel rbd and qemu/kvm performed about 
 the same with cache off.  With cache on (in cuttlefish), small sequential 
 write performance improved pretty dramatically vs without cache.  Large 
 write performance seemed to take more concurrency to reach peak performance, 
 but ultimately aggregate throughput was about the same.
 
 Hopefully I should have some new results published in the near future.
 
 Mark
 
 
 
 On Fri, May 10, 2013 at 6:56 PM, Mark Nelson mark.nel...@inktank.com wrote:
 
 On 05/10/2013 12:16 PM, Greg wrote:
 
 Hello folks,
 
 I'm in the process of testing CEPH and RBD, I have set up a small
 cluster of  hosts running each a MON and an OSD with both
 journal and
 data on the same SSD (ok this is stupid but this is simple to
 verify the
 disks are not the bottleneck for 1 client). All nodes are
 connected on a
 1Gb network (no dedicated network for OSDs, shame on me :).
 
 Summary : the RBD performance is poor compared to benchmark
 
 A 5 seconds seq read benchmark shows something like this :
 
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
  Total time run:        4.165747
  Total reads made:      95
  Read size:             4194304
  Bandwidth (MB/sec):    91.220

  Average Latency:       0.678901
  Max latency:           1.80038
  Min latency:           0.104719
 
 
 91MB read performance, quite good !
 
 Now the RBD performance :
 
 root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
 100+0 records in
 100+0 records out
 419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s
 
 
 There is a 3x performance factor (same for write: ~60M
 benchmark, ~20M
 dd on block device)
 
 The network is ok, the CPU is also ok on all OSDs.
 CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
 patches for the SoC being used)
 
 Can you show me the starting point for digging into this ?
 
 
 Hi Greg, First things first, are you doing kernel rbd or qemu/kvm?
   If you are doing qemu/kvm, make sure you are using virtio disks.
   This can have a pretty big performance impact.  Next, are you
 using RBD cache? With 0.56.4 there are some performance issues with
 large sequential writes if cache is on, but it does provide benefit
 for small sequential writes.  In general RBD cache behaviour has
 improved with Cuttlefish.
 
 Beyond that, are the pools being targeted by RBD and rados bench
 setup the same way?  Same number of Pgs?  Same replication?
 
 
 
 Thanks!
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com

[ceph-users] Hardware recommendation / calculation for large cluster

2013-05-11 Thread Tim Mohlmann
Hi,

First of all, I am new to Ceph and this mailing list. At the moment I am
looking into the possibility of getting involved in the storage business. I am
trying to get an estimate of the costs, and after that I will work out how to
generate sufficient income.

First I will describe my case; at the bottom you will find my questions.


GENERAL LAYOUT:

Part of this cost calculation is of course hardware. For the larger part I've
already figured it out. In my plans I will be leasing a full rack (46U).
Depending on other needs I will be using 36 or 40U for OSD storage
servers. (I will assume 36U from here on, to keep a solid value for the
calculation and have enough spare space for extra devices.)

Each OSD server uses 4U and can take 36x 3.5" drives. So in 36U I can put
36/4=9 OSD servers, containing 9*36=324 HDDs.


HARD DISK DRIVES

I have been looking at the WD RE and WD Red series. RE is more expensive per
GB, but has a larger MTBF and offers a 4TB model. Red is (really) cheap per GB,
but only goes up to 3TB.

In my current calculations it does not matter much whether I use the expensive WD
RE 4TB disks or the cheaper WD Red 3TB: the price per GB over the complete cluster
expense and 3 years of running costs (including AFR) is almost the same.

So basically, if I can reduce the costs of all the other components used in
the cluster, I will go for the 3TB disks, and if the costs end up higher than
my first calculation, I will use the 4TB disks.

Let's assume 4TB from now on. So, 4*324=1296TB. Let's go petabyte ;).
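
For what it's worth, raw capacity is not usable capacity; a quick back-of-the-envelope
check, assuming 3x replication (the replication factor is an assumption, not from the
original mail):

echo $((9 * 36))     # 324 disks / OSDs
echo $((324 * 4))    # 1296 TB raw
echo $((1296 / 3))   # ~432 TB usable with 3 replicas, before filesystem overhead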


NETWORK

I will use a redundant 2x 10GbE network connection for each node. Two
independent 10GbE switches will be used, and I will use bonding between the
interfaces on each node. (Thanks to the person in the #ceph IRC channel for pointing
this option out.) I will use VLANs to split the front-side, back-side and Internet
networks.
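
For reference, the front/back split maps onto Ceph's public and cluster networks; a
minimal ceph.conf sketch (the subnets are made-up examples):

[global]
    public network  = 192.168.10.0/24   # client-facing traffic
    cluster network = 192.168.20.0/24   # OSD replication and recovery traffic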


OSD SERVER

SuperMicro based, 36 HDD hot-swap bays. Dual-socket mainboard with 16 DIMM sockets;
it is advertised to take up to 512GB of RAM. I will install 2x Intel Xeon
E5620 2.40GHz processors, each having 4 cores and 8 threads. For the RAM I am
in doubt (see below). I am looking into running 1 OSD per disk.


MON AND MDS SERVERS

Now comes the big question. What specs are required? At first I had planned to
use 4 SuperMicro superservers with 4-socket mainboards that can take up to
the new 16-core AMD processors and up to 1TB of RAM.

I want all 4 of the servers to run a MON service, an MDS service and customer /
public services. I would probably use VMs (kvm) to separate them. I will
compile my own kernel to enable Kernel Samepage Merging, hugepage support and
memory compaction to make RAM use more efficient. The requirements for my public
services will be added on top, once I know what I need for MON and MDS.


RAM FOR ALL SERVERS

So what would you estimate the RAM usage to be? See
http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations

That sounds OK for the OSD part: 500 MB per daemon would put the minimum RAM
requirement for my OSD servers at 18GB, so 32GB should be more than enough.
Although I would like to know whether it is possible to use btrfs compression; in
that case I'd need more RAM in there.

What I really want to know is: how much RAM do I need for the MON and MDS servers?
1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive!

In my case I would need at least 324 GB of RAM for each of them. Initially I
was planning to use 4 servers, each of them running both. Adding in the other
duties those systems have to perform, I would need the full 1TB of RAM. I would
need to use 32GB modules, which are really expensive per GB and difficult to find
(not many server hardware vendors in the Netherlands have them).


QUESTIONS

Question 1: Is it really the number of OSDs that determines MON and MDS RAM
usage, or the size of the object store?

Question 2: Can I do it with less RAM? Any statistics, or better, a
calculation? I can imagine memory pages becoming redundant as the cluster
grows, so less memory would be required per OSD.

Question 3: If it is the number of OSDs that counts, would it be beneficial to
combine disks into a RAID 0 (lvm or btrfs) array?

Question 4: Is it safe / possible to store the MON files inside the cluster
itself? The 10GB-per-daemon requirement would mean I need 3240GB of storage
for each MON, meaning I would need some huge disks and an (lvm) RAID 1 array
for redundancy, while I already have a huge redundant file system at hand.

Question 5: Is it possible to enable btrfs compression? I know btrfs is not
stable for production yet, but it would be nice if compression were supported in
the future, once it does become stable.

If the RAM requirement is not so steep, I am thinking about the possibility of
running the MON service on 4 of the OSD servers. Upgrading them to 16x 16GB of RAM
would give me 256GB of RAM. (Again, 32GB modules are too expensive and not an
option.)

Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-11 Thread Leen Besselink
Hi,

Someone is going to correct me if I'm wrong, but I think you misread something.

The Mon-daemon doesn't need that much RAM:

The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon.

The same for disk-space.

You should read this page again:

http://ceph.com/docs/master/install/hardware-recommendations/

Some of the other questions are answered there as well.

Like how much memory an OSD daemon needs, and why/when.
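
As a rough sanity check using the rule-of-thumb figures from that page (illustrative
numbers only; actual usage rises during recovery and backfill):

echo $((36 * 1))   # OSD host: 36 daemons at ~1 GB each under load => ~36 GB
echo $((3 * 1))    # mon hosts: ~1 GB per mon daemon, i.e. a few GB total, not 324 GB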



On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
 Hi,
 
 First of all I am new to ceph and this mailing list. At this moment I am 
 looking into the possibilities to get involved in the storage business. I am 
 trying to get an estimate about costs and after that I will start to 
 determine 
 how to get sufficient income.
 
 First I will describe my case, at the bottom you will find my questions.
 
 
 GENERAL LAYOUT:
 
 Part of this cost calculation is of course hardware. For the larger part I've 
 already figured it out. In my plans I will be leasing a full rack (46U). 
 Depending on the domestic needs I will be using 36 or 40U for ODS storage 
 servers. (I will assume 36U from here on, to keep a solid value for 
 calculation and have enough spare space for extra devices).
 
 Each OSD server uses 4U and can take 36x3.5 drives. So in 36U I can put 
 36/4=9 OSD servers, containing 9*36=324 HDDs.
 
 
 HARD DISK DRIVES
 
 I have been looking for WD digital RE and RED series. RE is more expensive 
 per 
 GB, but has a larger MTBF and offers a 4TB model. RED is (real) cheap per GB, 
 but only goes as far a 3TB.
 
 At my current calculations it does not matter much if I would put expensive 
 WD 
 RE 4TB disks or cheaper WD RED 3TB, the price per GB over the complete 
 cluster 
 expense and 3 years of running costs (including AFR) is almost the same.
 
 So basically, if I could reduce the costs of all the other components used in 
 the cluster, I would go for the 3TB disk and if the costs will be higher then 
 my first calculation, I would use the 4TB disk.
 
 Let's assume 4TB from now on. So, 4*324=1296TB. So lets go Peta-byte ;).
 
 
 NETWORK
 
 I will use a redundant 2x10Gbe network connection for each node. Two 
 independent 10Gbe switches will be used and I will use bonding between the 
 interfaces on each node. (Thanks some guy in the #Ceph irc for pointing this 
 option out). I will use VLAN's to split front-side, backside and Internet 
 networks.
 
 
 OSD SERVER
 
 SuperMicro based, 36 HDD hotswap. Dual socket mainboard. 16x DIMM sockets. It 
 is advertised they can take up to 512GB of RAM. I will install 2 x Intel Xeon 
 E5620 2.40ghz processor, having 4 cores and 8 threads each. For the RAM I am 
 in doubt (see below). I am looking into running 1 OSD per disk.
 
 
 MON AND MDS SERVERS
 
 Now comes the big question. What specs are required? It first I had the plan 
 to 
 use 4 SuperMicro superservers, with a 4 socket mainboards that contain up to 
 the new 16core AMD processors and up to 1TB of RAM.
 
 I want all 4 of the servers to run a MON service, MDS service and costumer / 
 public services. Probably I would use VM's (kvm) to separate them. I will 
 compile my own kernel to enable Kernel Samepage Merge, Hugepage support and 
 memory compaction to make RAM use more efficient. The requirements for my 
 public 
 services will be added up, once I know what I need for MON and MDS.
 
 
 RAM FOR ALL SERVERS
 
 So what would you estimate to be the ram usage?
 http://ceph.com/docs/master/install/hardware-recommendations/#minimum-
 hardware-recommendations.
 
 Sounds OK for the OSD part. 500 MB per daemon, would put the minimum RAM 
 requirement for my OSD server to 18GB. 32GB should be more then enough. 
 Although I would like to see if it is possible to use btrfs compression? In 
 that case I'd need more RAM in there.
 
 What I really want to know: how many RAM do I need for MON and MDS servers? 
 1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive!
 
 In my case I would need at least 324 GB of ram for each of them. Initially I 
 was planning to use 4 servers and each of them running both. Joining those in 
 a single system, with the other duties the system has to perform I would need 
 the full 1TB of RAM. I would need to use 32GB modules witch are really 
 expensive per GB and difficult to find. (not may server hardware vendors in 
 the 
 Netherlands have them).
 
 
 QUESTIONS
 
 Question 1: Is it really the amount for OSD's that counts for MON and MDS RAM 
 usage, or the size of the object store?
 
 Question 2: can I do it with less RAM? Any statistics, or better: a 
 calculation? I can imagine memory pages becoming redundant if the cluster 
 grows, so less memory required per OSD.
 
 Question 3: If it is the amount of OSDs that counts, would it be beneficial 
 to 
 combine disks in a RAID 0 (lvm or btrfs) array?
 
 Question 4: Is it safe / possible to store MON files inside of the cluster 
 itself? The 10GB per daemon requirement would 

Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-11 Thread w sun
The reference Mike provided is not valid to me.  Anyone else has the same 
problem? --weiguo

From: j.michael.l...@gmail.com
Date: Sat, 11 May 2013 08:45:41 -0400
To: pi...@pioto.org
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RBD vs RADOS benchmark performance

I believe that this is fixed in the most recent versions of libvirt, sheepdog 
and rbd were marked erroneously as unsafe.
http://libvirt.org/git/?p=libvirt.git;a=commit;h=78290b1641e95304c862062ee0aca95395c5926c

Sent from my iPad
On May 11, 2013, at 8:36 AM, Mike Kelly pi...@pioto.org wrote:

(Sorry for sending this twice... Forgot to reply to the list)
Is rbd caching safe to enable when you may need to do a live migration of the
guest later on? It was my understanding that it wasn't, and that libvirt
prevented you from doing the migration if it knew about the caching setting.

If it isn't, is there anything else that could help performance? Like, some 
tuning of block size parameters for the rbd image or the qemu 
On May 10, 2013 8:57 PM, Mark Nelson mark.nel...@inktank.com wrote:

On 05/10/2013 07:21 PM, Yun Mao wrote:

Hi Mark,

Given the same hardware, optimal configuration (I have no idea what that
means exactly but feel free to specify), which is supposed to perform
better, kernel rbd or qemu/kvm? Thanks,

Yun


Hi Yun,

I'm in the process of actually running some tests right now.

In previous testing, it looked like kernel rbd and qemu/kvm performed about the
same with cache off.  With cache on (in cuttlefish), small sequential write
performance improved pretty dramatically vs without cache.  Large write
performance seemed to take more concurrency to reach peak performance, but
ultimately aggregate throughput was about the same.

Hopefully I should have some new results published in the near future.

Mark


On Fri, May 10, 2013 at 6:56 PM, Mark Nelson mark.nel...@inktank.com wrote:

On 05/10/2013 12:16 PM, Greg wrote:

Hello folks,

I'm in the process of testing CEPH and RBD, I have set up a small
cluster of hosts running each a MON and an OSD with both journal and
data on the same SSD (ok this is stupid but this is simple to verify the
disks are not the bottleneck for 1 client). All nodes are connected on a
1Gb network (no dedicated network for OSDs, shame on me :).

Summary : the RBD performance is poor compared to benchmark

A 5 seconds seq read benchmark shows something like this :

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
  Total time run:        4.165747
Total reads made:      95
Read size:             4194304
Bandwidth (MB/sec):    91.220

Average Latency:       0.678901
Max latency:           1.80038
Min latency:           0.104719


91MB read performance, quite good !

Now the RBD performance :

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s


There is a 3x performance factor (same for write: ~60M benchmark, ~20M
dd on block device)

The network is ok, the CPU is also ok on all OSDs.
CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
patches for the SoC being used)

Can you show me the starting point for digging into this ?


Hi Greg, First things first, are you doing kernel rbd or qemu/kvm?
If you are doing qemu/kvm, make sure you are using virtio disks.
This can have a pretty big performance impact.  Next, are you
using RBD cache? With 0.56.4 there are some performance issues with
large sequential writes if cache is on, but it does provide benefit
for small sequential writes.  In general RBD cache behaviour has
improved with Cuttlefish.

Beyond that, are the pools being targeted by RBD and rados bench
setup the same way?  Same number of Pgs?  Same replication?


Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com