[ceph-users] RBD snapshot - time and consistent
Does snapshot time depend on the image size? Does a snapshot capture a consistent state of the image as of the moment the snapshot starts? For example, if I have a file system on the image and don't stop IO before starting the snapshot - is that worse than a power cut during IO?

-- Blog: www.rekby.ru
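For what it's worth, a common way to avoid relying on crash consistency is to quiesce the filesystem around the snapshot. A minimal sketch, assuming a kernel-mapped image mounted at /mnt/rbd (the pool, image and snapshot names here are placeholders):

  # flush dirty data and freeze the filesystem so no IO is in flight
  fsfreeze -f /mnt/rbd
  # take the snapshot while the filesystem is quiescent
  rbd snap create rbd/myimage@before-upgrade
  # resume IO
  fsfreeze -u /mnt/rbd

Without the freeze, the snapshot is only crash-consistent: roughly comparable to a sudden power cut, so a journaling filesystem should recover, but in-flight writes may be lost.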
Re: [ceph-users] RBD vs RADOS benchmark performance
On 11/05/2013 02:52, Mark Nelson wrote:
On 05/10/2013 07:20 PM, Greg wrote:
On 11/05/2013 00:56, Mark Nelson wrote:
On 05/10/2013 12:16 PM, Greg wrote:

Hello folks,

I'm in the process of testing Ceph and RBD. I have set up a small cluster of hosts, each running a MON and an OSD with both journal and data on the same SSD (OK, this is stupid, but it is simple to verify the disks are not the bottleneck for 1 client). All nodes are connected on a 1Gb network (no dedicated network for OSDs, shame on me :).

Summary: the RBD performance is poor compared to the benchmark.

A 5 second seq read benchmark shows something like this:

  sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0        0        0         0         0         0         -         0
    1       16       39        23   91.9586        92  0.966117  0.431249
    2       16       64        48   95.9602       100  0.513435   0.53849
    3       16       90        74   98.6317       104   0.25631   0.55494
    4       11       95        84   83.9735        40   1.80038   0.58712
  Total time run:       4.165747
  Total reads made:     95
  Read size:            4194304
  Bandwidth (MB/sec):   91.220
  Average Latency:      0.678901
  Max latency:          1.80038
  Min latency:          0.104719

91MB/s read performance, quite good!

Now the RBD performance:

  root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
  100+0 records in
  100+0 records out
  419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60MB/s benchmark, ~20MB/s dd on the block device). The network is OK, and the CPU is also OK on all OSDs. Ceph is Bobtail 0.56.4, Linux is 3.8.1 arm (vanilla release + some patches for the SoC being used). Can you show me the starting point for digging into this?

Hi Greg,

First things first, are you doing kernel RBD or qemu/kvm? If you are doing qemu/kvm, make sure you are using virtio disks. This can have a pretty big performance impact. Next, are you using RBD cache? With 0.56.4 there are some performance issues with large sequential writes if cache is on, but it does provide benefit for small sequential writes. In general RBD cache behaviour has improved with Cuttlefish. Beyond that, are the pools being targeted by RBD and rados bench set up the same way? Same number of PGs? Same replication?

Mark, thanks for your prompt reply. I'm doing kernel RBD, and so I have not enabled the cache (default setting?). Sorry, I forgot to mention that the pool used for the bench and for RBD is the same.

Interesting. Does your rados bench performance change if you run a longer test? So far I've been seeing about a 20-30% performance overhead for kernel RBD, but 3x is excessive! It might be worth watching the underlying IO sizes to the OSDs in each case with something like collectl -sD -oT to see if there are any significant differences.

Mark, I'll gather you some more data with collectl. Meanwhile I realized a difference: the benchmark performs 16 concurrent reads while RBD only does 1. It shouldn't be a problem, but still, these are two different usage patterns.

Cheers,
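To see the concurrency difference Greg mentions, the two paths can be compared side by side. A sketch, assuming the default 'rbd' pool (a seq read bench needs data left behind by a prior write run):

  # 16 concurrent 4MB reads (the default -t 16), as in the benchmark above
  rados bench -p rbd 5 seq -t 16
  # a single outstanding read at a time, much closer to what dd does
  rados bench -p rbd 5 seq -t 1
  # the kernel RBD path: one sequential reader
  dd if=/dev/rbd1 of=/dev/null bs=4M count=100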
[ceph-users] Maximums for Ceph architectures
Hi all,

Does anybody know where to learn about the maximums for Ceph architectures? For example, I'm trying to find out the maximum size of an RBD image and of a CephFS file. Additionally, I want to know the maximum size of a RADOS Gateway object (meaning a file for uploading).

-- Igor Laskovy facebook.com/igor.laskovy studiogrizzly.com
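One data point while waiting for hard limits: the rbd tool itself takes the image size in megabytes (in this era), so very large images can at least be expressed directly. A sketch (pool and image names are placeholders):

  # create a 10 TB image; --size is in MB here
  rbd create --size $((10 * 1024 * 1024)) --pool rbd bigimage
  rbd info rbd/bigimage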
Re: [ceph-users] RBD vs RADOS benchmark performance
(Sorry for sending this twice... Forgot to reply to the list)

Is rbd caching safe to enable when you may need to do a live migration of the guest later on? It was my understanding that it wasn't, and that libvirt prevented you from doing the migration if it knew about the caching setting. If it isn't, is there anything else that could help performance? Like some tuning of block size parameters for the rbd image or the qemu

On May 10, 2013 8:57 PM, Mark Nelson mark.nel...@inktank.com wrote:

On 05/10/2013 07:21 PM, Yun Mao wrote:
Hi Mark, given the same hardware and an optimal configuration (I have no idea what that means exactly, but feel free to specify), which is supposed to perform better, kernel rbd or qemu/kvm? Thanks, Yun

Hi Yun,

I'm in the process of actually running some tests right now. In previous testing, it looked like kernel RBD and qemu/kvm performed about the same with cache off. With cache on (in Cuttlefish), small sequential write performance improved pretty dramatically versus without cache. Large write performance seemed to take more concurrency to reach peak performance, but ultimately aggregate throughput was about the same. Hopefully I should have some new results published in the near future.

Mark

On Fri, May 10, 2013 at 6:56 PM, Mark Nelson mark.nel...@inktank.com wrote:
[...]
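For the qemu/kvm case, enabling the cache Mark refers to is a client-side setting. A minimal sketch, assuming qemu is linked against librbd and the disk is given on the command line (the pool/image name is a placeholder, and the cache size shown is just the commonly cited 32 MB default):

  # in ceph.conf on the hypervisor:
  #   [client]
  #   rbd cache = true
  #   rbd cache size = 33554432
  # then tell qemu the device is write-back cached:
  qemu-system-x86_64 ... -drive format=rbd,file=rbd:rbd/vmdisk,cache=writeback,if=virtio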
Re: [ceph-users] RBD vs RADOS benchmark performance
I believe that this is fixed in the most recent versions of libvirt; sheepdog and rbd were marked erroneously as unsafe. http://libvirt.org/git/?p=libvirt.git;a=commit;h=78290b1641e95304c862062ee0aca95395c5926c

Sent from my iPad

On May 11, 2013, at 8:36 AM, Mike Kelly pi...@pioto.org wrote:
[...]
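To check where a given setup stands before attempting this, a sketch (the domain name 'myguest' and destination host are placeholders):

  # see which cache mode libvirt thinks the rbd disk uses
  virsh dumpxml myguest | grep -A2 "driver name='qemu'"
  # with a libvirt containing the fix above, a live migration should no
  # longer be refused outright just because the rbd disk is cached:
  virsh migrate --live myguest qemu+ssh://otherhost/system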
[ceph-users] Hardware recommendation / calculation for large cluster
Hi,

First of all, I am new to Ceph and this mailing list. At this moment I am looking into the possibilities of getting involved in the storage business. I am trying to get an estimate of the costs, and after that I will start to determine how to get sufficient income. First I will describe my case; at the bottom you will find my questions.

GENERAL LAYOUT
Part of this cost calculation is of course hardware. For the larger part I've already figured it out. In my plans I will be leasing a full rack (46U). Depending on the domestic needs I will be using 36 or 40U for OSD storage servers. (I will assume 36U from here on, to keep a solid value for calculation and have enough spare space for extra devices.) Each OSD server uses 4U and can take 36x 3.5" drives. So in 36U I can put 36/4=9 OSD servers, containing 9*36=324 HDDs.

HARD DISK DRIVES
I have been looking at the WD RE and WD Red series. RE is more expensive per GB, but has a larger MTBF and offers a 4TB model. Red is (really) cheap per GB, but only goes as far as 3TB. At my current calculations it does not matter much whether I put in expensive WD RE 4TB disks or cheaper WD Red 3TB disks: the price per GB over the complete cluster expense and 3 years of running costs (including AFR) is almost the same. So basically, if I could reduce the costs of all the other components used in the cluster, I would go for the 3TB disk, and if the costs turn out higher than my first calculation, I would use the 4TB disk. Let's assume 4TB from now on. So, 4*324=1296TB. So let's go petabyte ;).

NETWORK
I will use a redundant 2x 10GbE network connection for each node. Two independent 10GbE switches will be used and I will use bonding between the interfaces on each node. (Thanks to some guy on the #ceph IRC for pointing this option out.) I will use VLANs to split the front-side, back-side and Internet networks.

OSD SERVER
SuperMicro based, 36 HDD hotswap. Dual-socket mainboard, 16x DIMM sockets. It is advertised they can take up to 512GB of RAM. I will install 2x Intel Xeon E5620 2.40GHz processors, having 4 cores and 8 threads each. For the RAM I am in doubt (see below). I am looking into running 1 OSD per disk.

MON AND MDS SERVERS
Now comes the big question: what specs are required? At first I had the plan to use 4 SuperMicro superservers, with 4-socket mainboards that can take up to the new 16-core AMD processors and up to 1TB of RAM. I want all 4 of the servers to run a MON service, an MDS service and customer / public services. Probably I would use VMs (KVM) to separate them. I will compile my own kernel to enable Kernel Samepage Merging, hugepage support and memory compaction to make RAM use more efficient. The requirements for my public services will be added up once I know what I need for MON and MDS.

RAM FOR ALL SERVERS
So what would you estimate the RAM usage to be? http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations sounds OK for the OSD part: 500 MB per daemon would put the minimum RAM requirement for my OSD servers at 18GB, so 32GB should be more than enough. Although I would like to know whether it is possible to use btrfs compression? In that case I'd need more RAM in there.

What I really want to know: how much RAM do I need for the MON and MDS servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive! In my case I would need at least 324 GB of RAM for each of them. Initially I was planning to use 4 servers, each of them running both. Joining those in a single system, with the other duties the system has to perform, I would need the full 1TB of RAM. I would need to use 32GB modules, which are really expensive per GB and difficult to find (not many server hardware vendors in the Netherlands have them).

QUESTIONS
Question 1: Is it really the number of OSDs that counts for MON and MDS RAM usage, or the size of the object store?
Question 2: Can I do it with less RAM? Any statistics, or better: a calculation? I can imagine memory pages becoming redundant as the cluster grows, so less memory required per OSD.
Question 3: If it is the number of OSDs that counts, would it be beneficial to combine disks in a RAID 0 (LVM or btrfs) array?
Question 4: Is it safe / possible to store the MON files inside the cluster itself? The 10GB per daemon requirement would mean I need 3240GB of storage for each MON, meaning I need to get some huge disks and an (LVM) RAID 1 array for redundancy, while I have a huge redundant file system at hand already.
Question 5: Is it possible to enable btrfs compression? I know btrfs is not stable for production yet, but it would be nice if compression is supported in the future, when it does become stable. (See the mount sketch below.)

If the RAM requirement is not so steep, I am thinking about the possibility of running the MON service from 4 of the OSD servers. Upgrading them to 16x 16GB of RAM would give me 256GB of RAM. (Again, 32GB modules are too expensive and not an option.)
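On question 5: btrfs compression is a mount-time option, so nothing Ceph-specific is needed to experiment with it; whether the OSD workload behaves well on top of it is a separate question. A sketch (device and mount point are placeholders):

  # mount an OSD's btrfs data disk with transparent lzo compression
  mount -o compress=lzo /dev/sdb /var/lib/ceph/osd/ceph-0
  # or persistently via /etc/fstab:
  # /dev/sdb  /var/lib/ceph/osd/ceph-0  btrfs  compress=lzo  0 2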
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi,

Someone is going to correct me if I'm wrong, but I think you misread something. The MON daemon doesn't need that much RAM: the 'RAM: 1 GB per daemon' is per MON daemon, not per OSD daemon. The same goes for disk space. You should read this page again: http://ceph.com/docs/master/install/hardware-recommendations/ Some of the other questions are answered there as well, like how much memory an OSD daemon needs and why/when.

On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
[...]
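Put concretely, a back-of-the-envelope sketch using only the figures quoted above (500 MB-1 GB of RAM per OSD daemon, 1 GB of RAM and 10 GB of disk per MON daemon) against the 36-disk chassis from the original mail; the MON count of 3-5 is an assumption, not from the thread:

  OSD server RAM:  36 OSDs x 500 MB = 18 GB minimum; 36 OSDs x 1 GB = 36 GB comfortable
  MON RAM:         1 GB x (number of MON daemons, e.g. 3-5) = 3-5 GB in total,
                   not 1 GB x 324 OSDs = 324 GB
  MON disk:        10 GB x (number of MON daemons) -- a small mirrored disk suffices,
                   not 3240 GB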
Re: [ceph-users] RBD vs RADOS benchmark performance
The reference Mike provided is not valid for me. Does anyone else have the same problem? --weiguo

From: j.michael.l...@gmail.com
Date: Sat, 11 May 2013 08:45:41 -0400
To: pi...@pioto.org
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RBD vs RADOS benchmark performance

I believe that this is fixed in the most recent versions of libvirt; sheepdog and rbd were marked erroneously as unsafe. http://libvirt.org/git/?p=libvirt.git;a=commit;h=78290b1641e95304c862062ee0aca95395c5926c

Sent from my iPad

On May 11, 2013, at 8:36 AM, Mike Kelly pi...@pioto.org wrote:
[...]
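On Mark's recurring last question in this thread, checking whether the pool used by rados bench and the pool backing the RBD image have the same PG count and replication is quick. A sketch (the image name is a placeholder):

  # pg_num, pgp_num and replication size for every pool in the cluster
  ceph osd dump | grep '^pool'
  # confirm which pool/image the RBD device was created from
  rbd info rbd/myimage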