Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver
Hi,

Our fio tests against qemu-kvm on RBD look quite promising, details here:
https://docs.google.com/spreadsheet/ccc?key=0AoB4ekP8AM3RdGlDaHhoSV81MDhUS25EUVZxdmN6WHc&usp=drive_web#gid=0

tl;dr: rbd with caching enabled is (1) at least 2x faster than the local instance storage, and (2) reaches the hypervisor's GbE network limit in ~all cases except very small random writes.

BTW, currently we have ~10 VMs running those fio tests in a loop, and we're seeing ~25,000 op/s sustained in the ceph logs. Not bad IMHO.

Cheers, Dan
CERN IT/DSS

On Thu, Dec 19, 2013 at 4:00 PM, Peder Jansson pe...@portlane.com wrote:

Hi,

I'm testing Ceph with the RBD/QEMU driver through libvirt to store my VM images on. Installation and configuration all went very well with the ceph-deploy tool. I have set up cephx authentication in libvirt and that works like a charm too. However, when it comes to performance I have big issues getting the expected results inside the hosted VM. I see high latency and bad write performance, down to 20MB/s in the VM.

My setup:
3x Dell R410, 2x Xeon X5650, 48 GB RAM, 2x SATA RAID1 for system, 2x 250GB Samsung Evo SSD for OSDs (with XFS on each one)
ceph version 0.72.1 (4d923861868f6a15dcb33fef7f50f674997322de)
Linux server1 3.11.0-14-generic #21-Ubuntu SMP Tue Nov 12 17:04:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 13.10

In total: 6 OSDs, 1 MON, 3 MDS

So, the question is: is there anyone out there who has experience of running the RBD/QEMU driver in production, and gets good performance inside the VM? I suspect the main performance issue is caused by high latency, since it all feels quite high when running the bonnie++ tests below.
(bonnie++ -s 4096 -r 2048 -u root -d X -m BenchClient)
Inside VPS running on a native image in the RBD pool (bonnie++ 1.96, 4G file; throughput in K/sec, %CP in parentheses):

Cache mode            Chr wr      Block wr    Rewrite     Chr rd      Block rd    Seeks/s
Without any cache     733 (96)    64919 (8)   20271 (3)   3013 (97)   30770 (3)   2887 (82)
Writeback (QEMU)      872 (96)    67327 (8)   22424 (3)   2516 (94)   32013 (3)   2800 (82)
Writethrough (QEMU)   833 (95)    27469 (3)   6520 (1)    2743 (93)   33003 (3)   1912 (61)
Writeback (Ceph)      785 (95)    67573 (8)   19906 (3)   2777 (96)   32681 (3)   2764 (80)

Latency (same column order):
Without any cache     17425us     1093ms      894ms       16789us     19390us     89203us
Writeback (QEMU)      16196us     657ms       843ms       37889us     19207us     85407us
Writethrough (QEMU)   17330us     2388ms      1165ms      48442us     19577us     91228us
Writeback (Ceph)      17410us     729ms       737ms       15103us     22802us     88876us

File metadata, 16 files, ops/sec with %CP ("+++++" = too fast for bonnie++ to measure):
Cache mode            Seq create  Seq read  Seq delete  Rnd create  Rnd read  Rnd delete
Without any cache     27951 (52)  +++++     +++++       24921 (45)  +++++     22535 (29)
Writeback (QEMU)      27225 (51)  +++++     +++++       27325 (47)  +++++     21645 (28)
Writethrough (QEMU)   16378 (31)  +++++     18864 (24)  18024 (33)  +++++     14734 (19)
Writeback (Ceph)      (truncated in the original message)

Metadata latency (same column order):
Without any cache     1986us  826us  1065us  216us  41us  611us
Writeback (QEMU)      1986us  852us  874us   252us  34us  595us
Writethrough (QEMU)   2028us  761us  1188us  271us  36us  567us
Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver
Hello,

On Fri, 20 Dec 2013 09:20:48 +0100 Dan van der Ster wrote:

Hi, Our fio tests against qemu-kvm on RBD look quite promising, details here: https://docs.google.com/spreadsheet/ccc?key=0AoB4ekP8AM3RdGlDaHhoSV81MDhUS25EUVZxdmN6WHc&usp=drive_web#gid=0

That data is very interesting and welcome, however it would be a lot more relevant if it included information about your setup (though it is relatively easy to create a Ceph cluster that can saturate GbE ^.^) and your configuration. For example I assume you're using the native QEMU RBD interface. How did you configure caching, just turned it on and left it at the default values?

tl;dr: rbd with caching enabled is (1) at least 2x faster than the local instance storage, and (2) reaches the hypervisor's GbE network limit in ~all cases except very small random writes. BTW, currently we have ~10 VMs running those fio tests in a loop, and we're seeing ~25,000 op/s sustained in the ceph logs. Not bad IMHO.

Given the feedback I got from my Sanity Check mail, I'm even more interested in the actual setup you're using now. Given your workplace, I expect to be impressed. ^o^

Cheers, Dan
CERN IT/DSS

[snip]

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph at the University
Hi Ceph,

Just wanted to share Yann Dupont's talk about his experience using Ceph at the University. He goes beyond telling his own story and it can probably be a source of inspiration for various use cases in the academic world.

http://video.renater.fr/jres/2013/index.php?play=jres2013_article_48_720p.mp4

It was recorded this month during JRES 2013: https://2013.jres.org/

Yann also wrote a paper but I'm not sure if it's publicly available.

Cheers
--
Loïc Dachary, Artisan Logiciel Libre
[ceph-users] Need Java bindings for librados....
Hi,

I need Java bindings for librados, and I'm also new to using Java bindings. Could you please help me find the best way to use librados from a Java program? And what problems will we face if we use the Java bindings? Are there any alternatives?

Thanks & Regards,
Upendra Yadav
Re: [ceph-users] Need Java bindings for librados....
On 12/20/2013 12:15 PM, upendrayadav.u wrote:

Hi, I need Java bindings for librados. And also i'm new to use Java bindings. Could you please help me get a best way to use librados with java program. And what is the problem we will face, if we will use Java bindings. Is there any alternatives...

Java bindings for librados are available at: https://github.com/ceph/rados-java

A Maven repository is available at: http://ceph.com/maven/

Examples can be found in the unit test code for the Java bindings.

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
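For what it's worth, a first connection with rados-java looks roughly like this. This is an untested sketch: it needs the rados-java jar on the classpath, a reachable cluster, and a readable ceph.conf/keyring, and the pool name "data" is an assumption; check the exact method names against the project's unit tests mentioned above.

```java
import java.io.File;
import com.ceph.rados.Rados;
import com.ceph.rados.IoCTX;

public class RadosExample {
    public static void main(String[] args) throws Exception {
        // Connect as client.admin using the local ceph.conf
        Rados rados = new Rados("admin");
        rados.confReadFile(new File("/etc/ceph/ceph.conf"));
        rados.connect();

        // Open an I/O context on a pool (assumed to exist) and write an object
        IoCTX io = rados.ioCtxCreate("data");
        io.write("greeting", "hello from java".getBytes());
        rados.ioCtxDestroy(io);
    }
}
```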
Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver
----- Original Message -----
From: Wido den Hollander w...@42on.com
To: ceph-users@lists.ceph.com
Sent: Friday, December 20, 2013 8:04:09 AM
Subject: Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

Hi,

Hi, I'm testing Ceph with the RBD/QEMU driver through libvirt to store my VM images on. Installation and configuration all went very well with the ceph-deploy tool. I have set up cephx authentication in libvirt and that works like a charm too. However, when it comes to performance I have big issues getting the expected results inside the hosted VM. I see high latency and bad write performance, down to 20MB/s in the VM.

Have you tried running rados bench to see the throughput that it is getting?

Yes, I have tried it:

rados bench -p vm_system 50 write
...
Total time run:         50.578626
Total writes made:      1363
Write size:             4194304
Bandwidth (MB/sec):     107.793
Stddev Bandwidth:       19.8729
Max bandwidth (MB/sec): 136
Min bandwidth (MB/sec): 0
Average Latency:        0.59249
Stddev Latency:         0.341871
Max latency:            2.08384
Min latency:            0.14101

My setup: 3x Dell R410, 2x Xeon X5650, 48 GB RAM, 2x SATA RAID1 for system, 2x 250GB Samsung Evo SSD for OSDs (with XFS on each one)

So you are running the journal on the same system? With XFS that means that you will do three writes for one write coming in to the OSD.

We are running the journal on all the XFS disks, but our tests show there is only a problem when run in qemu VMs. I have tested turning off the journal on ext4 inside the qemu image, with no effect.

ceph version 0.72.1 (4d923861868f6a15dcb33fef7f50f674997322de)
Linux server1 3.11.0-14-generic #21-Ubuntu SMP Tue Nov 12 17:04:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 13.10

Which Qemu version do you use? I suggest to use at least Qemu 1.5 and enable the RBD write cache.

We are running: QEMU emulator version 1.5.0 (Debian 1.5.0+dfsg-3ubuntu5.1)

In total: 6 OSDs, 1 MON, 3 MDS

For RBD the MDS is not required.
So, question is; is there anyone out there that have experience of running the RBD/QEMU driver in production, and getting any good performance inside the VM? I suspect the main performance issue to be caused by high latency, since it all feels quite high when running those tests below with bonnie++.

[bonnie++ results snipped; quoted in full in the original message]
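As an aside (not from the thread): the rados bench summary quoted in this exchange is easy to parse if you want to track runs over time, e.g. with a small Python sketch like this:

```python
import re

def parse_rados_bench(text: str) -> dict:
    """Extract a few numeric fields from a `rados bench` summary block."""
    fields = {
        "bandwidth_mb_s": r"Bandwidth \(MB/sec\):\s*([\d.]+)",
        "avg_latency_s":  r"Average Latency:\s*([\d.]+)",
        "max_latency_s":  r"Max latency:\s*([\d.]+)",
    }
    # re.search tolerates the missing spaces in the pasted output
    return {name: float(re.search(pattern, text).group(1))
            for name, pattern in fields.items()}

summary = """
Total time run:         50.578626
Total writes made:      1363
Write size:             4194304
Bandwidth (MB/sec):     107.793
Stddev Bandwidth:       19.8729
Average Latency:        0.59249
Max latency:            2.08384
Min latency:            0.14101
"""

print(parse_rados_bench(summary))
```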
Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver
On Fri, Dec 20, 2013 at 9:44 AM, Christian Balzer ch...@gol.com wrote:

Hello, On Fri, 20 Dec 2013 09:20:48 +0100 Dan van der Ster wrote: Hi, Our fio tests against qemu-kvm on RBD look quite promising, details here: https://docs.google.com/spreadsheet/ccc?key=0AoB4ekP8AM3RdGlDaHhoSV81MDhUS25EUVZxdmN6WHc&usp=drive_web#gid=0

That data is very interesting and welcome, however it would be a lot more relevant if it included information about your setup (though it is relatively easy to create a Ceph cluster that can saturate GbE ^.^) and your configuration. For example I assume you're using the native QEMU RBD interface. How did you configure caching, just turned it on and left it at the default values?

It's all RedHat 6.5, qemu-kvm-rhev-0.12.1.2-2.415.el6_5.3 on the HVs, ceph 0.67.4 on the servers. Caching is enabled with the usual

rbd cache = true
rbd cache writethrough until flush = true

(otherwise defaults)

The hardware is 47 OSD servers with 24 OSDs each, a single 10GbE NIC per server, no SSDs, and the write journal as a file on the OSD partition (which is a baaad idea for small write latency, so we are slowly reinstalling everything to put the journal on a separate partition).

Cheers, Dan

tl;dr: rbd with caching enabled is (1) at least 2x faster than the local instance storage, and (2) reaches the hypervisor's GbE network limit in ~all cases except very small random writes. BTW, currently we have ~10 VMs running those fio tests in a loop, and we're seeing ~25,000 op/s sustained in the ceph logs. Not bad IMHO.

Given the feedback I got from my Sanity Check mail, I'm even more interested in the actual setup you're using now. Given your workplace, I expect to be impressed.
^o^

Cheers, Dan
CERN IT/DSS

[snip]

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver
Hello Dan,

On Fri, 20 Dec 2013 14:01:04 +0100 Dan van der Ster wrote:

[snip]

It's all RedHat 6.5, qemu-kvm-rhev-0.12.1.2-2.415.el6_5.3 on the HVs, ceph 0.67.4 on the servers. Caching is enabled with the usual

rbd cache = true
rbd cache writethrough until flush = true

(otherwise defaults)

That's a good data point, I'll probably play with those defaults eventually. One thinks that the same amount of cache as a consumer HD can be improved upon, given memory prices and all. ^o^

The hardware is 47 OSD servers with 24 OSDs each, single 10GbE NIC per server, no SSDs, write journal as a file on the OSD partition (which is a baaad idea for small write latency, so we are slowly reinstalling everything to put the journal on a separate partition)

Ah yes, there is the impressive bit: 47 times 24 OSDs should easily give you that amount of IOPS, even with the journal not optimized.

Regards,

Christian

[snip]
[snip]

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
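For reference, the cache defaults Christian mentions wanting to tune correspond to ceph.conf settings along these lines. This is a sketch: the values shown are my understanding of the librbd defaults in that era (verify against your release's config reference before relying on them):

```
[client]
    rbd cache = true
    rbd cache writethrough until flush = true
    # librbd defaults, i.e. the starting points to tune from:
    rbd cache size = 33554432          # 32 MB of cache per image
    rbd cache max dirty = 25165824     # 24 MB dirty before writeback forced
    rbd cache target dirty = 16777216  # 16 MB dirty before writeback starts
    rbd cache max dirty age = 1.0      # seconds dirty data may age
```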
[ceph-users] cephx and auth for rbd image
Hi all,

I've tested authentication on the client side for pools, no problem so far. Now I'm testing granularity down to the rbd image; I've seen in the doc that we can limit caps to an object prefix, so possibly to an rbd image: http://ceph.com/docs/master/man/8/ceph-authtool/#osd-capabilities

I've got the following key:

client.test01
    key: ...
    caps: [mon] allow r
    caps: [osd] allow * object_prefix rbd_data.108374b0dc51

The object_prefix is from the rbd info image command: block_name_prefix: rbd_data.108374b0dc51

And on my client I get the following error using this key:

rbd --id test01 --keyfile test01 map pool/image
rbd: add failed: (34) Numerical result out of range

However I get no error when I use the caps [osd] allow rwx pool. I would say it's my object_prefix declaration that is wrong. I'm puzzled; has anyone managed to implement this granularity?

Regards,
Laurent Durnez
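One likely cause, offered as an assumption rather than a verified diagnosis: for a format-2 image the rbd_data.* prefix covers only the data objects, while the client also has to read the image's rbd_id.<name> and rbd_header.<id> metadata objects, so caps restricted to the data prefix alone fail before any data I/O happens. A sketch of caps covering all three prefixes, using the image name "image" and the prefix id from the post (both assumptions to adapt):

```
# Hypothetical example: image "image", block_name_prefix rbd_data.108374b0dc51
ceph auth get-or-create client.test01 \
    mon 'allow r' \
    osd 'allow r object_prefix rbd_id.image, allow r object_prefix rbd_header.108374b0dc51, allow rwx object_prefix rbd_data.108374b0dc51'
```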
Re: [ceph-users] ceph-deploy issues with initial mons that aren't up
This makes sense. So if other mons come up that are *not* defined as initial mons, then they will not be in service until the initial mons are up and ready? At which point they can form their quorum and operate?

-----Original Message-----
From: Gregory Farnum [mailto:g...@inktank.com]
Sent: Thursday, December 19, 2013 10:19 PM
To: Don Talton (dotalton)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-deploy issues with initial mons that aren't up

mon initial members is a race prevention mechanism whose purpose is to prevent your monitors from forming separate quorums when they're brought up by automated software provisioning systems (by not allowing monitors to form a quorum unless everybody in the list is a member). If you want to add other monitors at a later time you can do so by specifying them elsewhere (including in mon hosts or whatever, so other daemons will attempt to contact them.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Dec 19, 2013 at 9:13 PM, Don Talton (dotalton) dotal...@cisco.com wrote:

I just realized my email is not clear. If the first mon is up and the additional initial mons are not, then the process fails.

-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Talton (dotalton)
Sent: Thursday, December 19, 2013 2:44 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] ceph-deploy issues with initial mons that aren't up

Hi all,

I've been working on some ceph-deploy automation and think I've stumbled on an interesting behavior. I create a new cluster, and specify 3 machines. If all 3 are not up and able to be ssh'd into with the account I created for ceph-deploy, then the mon create process will fail and the cluster is not properly set up with keys, etc.

This seems odd to me, since I may want to specify initial mons that may not yet be up (say they are waiting for cobbler to finish loading them, for example), but I want them as part of the initial cluster.

Donald Talton
Cloud Systems Development
Cisco Systems
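Greg's explanation maps onto ceph.conf settings along these lines (a sketch; the hostnames and addresses are made up):

```
[global]
    # Per Greg: no quorum forms unless every monitor listed here is a
    # member, which prevents provisioning races at cluster creation.
    mon initial members = mon-a, mon-b, mon-c

    # Daemons and clients use this list to find monitors, including
    # any monitors added after the initial bootstrap.
    mon host = 10.0.0.11, 10.0.0.12, 10.0.0.13
```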
Re: [ceph-users] ceph-deploy issues with initial mons that aren't up
Yeah. This is less of a problem when you're listing them all explicitly ahead of time (we could just make them wait for any majority), but some systems don't want to specify even the monitor count that way, so we give the admins mon initial members as a big hammer.
-Greg

On Fri, Dec 20, 2013 at 8:17 AM, Don Talton (dotalton) dotal...@cisco.com wrote:

This makes sense. So if other mons come up that are *not* defined as initial mons, then they will not be in service until the initial mon is up and ready? At which point they can form their quorum and operate?

[snip]
Re: [ceph-users] ceph-deploy issues with initial mons that aren't up
I guess I should add: what if I add OSDs to a mon in this scenario? Do they get up and in, and will the crush map from the non-initial mons get merged with the initial one when it's online?

[snip]
Re: [ceph-users] rebooting nodes in a ceph cluster
David Clarke writes:

Not directly related to Ceph, but you may want to investigate kexec[0] ('kexec-tools' package in Debian derived distributions) in order to get your machines rebooting quicker. It essentially re-loads the kernel as the last step of the shutdown procedure, skipping over the lengthy BIOS/UEFI/controller firmware etc. boot stages.

[0]: http://en.wikipedia.org/wiki/Kexec

I'd like to second that recommendation - I only discovered this recently, and on systems with long BIOS initialization this cuts down the time to reboot *dramatically*, from around 5 minutes to 1.

--
Simon.
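A minimal manual kexec sequence looks something like this. It is a sketch only: it must run as root, the kernel/initrd paths are assumptions for a Debian-style layout, and `kexec -e` jumps into the new kernel immediately without running shutdown scripts, which is why the kexec-tools init integration is usually the safer route:

```
# Load the currently running kernel and initrd, reusing the current cmdline
kexec -l /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initrd.img-$(uname -r) \
      --reuse-cmdline

# Either jump straight into it (skips clean shutdown!):
#   kexec -e
# ...or, on Debian/Ubuntu with kexec-tools configured, a plain
# "reboot" will use the loaded kernel instead of the BIOS path.
```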
Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver
fio --size=100m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=10 --rw=read --name=fiojob --blocksize_range=4K-512k --iodepth=16

Since size=100m, reads would be entirely cached and, if the hypervisor is write-back, potentially many writes would never make it to the cluster either? Sorry if I've misunderstood :)
Re: [ceph-users] Ceph network topology with redundant switches
The area I'm currently investigating is how to configure the networking. To avoid a SPOF I'd like to have redundant switches for both the public network and the internal network, most likely running at 10Gb. I'm considering splitting the nodes in to two separate racks and connecting each half to its own switch, and then trunk the switches together to allow the two halves of the cluster to see each other. The idea being that if a single switch fails I'd only lose half of the cluster.

This is fine if you are using a replication factor of 2; you would need 2/3 of the cluster surviving if using a replication factor of 3 with osd pool default min size set to 2.

My question is about configuring the public network. If it's all one subnet then the clients consuming the Ceph resources can't have both links active, so they'd be configured in an active/standby role. But this results in quite heavy usage of the trunk between the two switches when a client accesses nodes on the other switch than the one they're actively connected to.

The Linux bonding driver supports several strategies for teaming network adapters on L2 networks.

So, can I configure multiple public networks? I think so, based on the documentation, but I'm not completely sure. Can I have one half of the cluster on one subnet, and the other half on another? And then the client machine can have interfaces in different subnets and do the right thing with both interfaces to talk to all the nodes. This seems like a fairly simple solution that avoids a SPOF in Ceph or the network layer.

You can have multiple networks for both the public and cluster networks; the only restriction is that all subnets of a given type be within the same supernet.
For example:

10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2

Or maybe I'm missing an alternative that would be better? I'm aiming for something that keeps things as simple as possible while meeting the redundancy requirements. As an aside, there's a similar issue on the cluster network side with heavy traffic on the trunk between the two cluster switches. But I can't see how that's avoidable, and presumably it's something people just have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in each rack, which means every write is going to cross the trunk. Other traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/monitor traffic in the case where an OSD is connected to a monitor in the adjacent rack (map updates, heartbeats)
* OSD/client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR) then your cluster level bandwidth is oversubscribed 4:1. To lower oversubscription you are going to have to steal some of the other 48 ports: 12 for 2:1 and 24 for a non-blocking fabric. Given the number of nodes you have/plan to have, you will be utilizing 6-12 links per switch, leaving you with 12-18 links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1.

--
Kyle
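The supernet layout described above maps to a ceph.conf fragment like this (a sketch using the example ranges from this thread; only the supernets appear in the config, the per-rack /24s are plain routing):

```
[global]
    # All public-side subnets (10.0.1.0/24, 10.0.2.0/24, ...) must fall
    # within this supernet
    public network = 10.0.0.0/16

    # All cluster-side (replication/heartbeat) subnets within this one
    cluster network = 10.1.0.0/16
```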
Re: [ceph-users] Ceph network topology with redundant switches
Hi Wido,

Thanks for the reply.

On Fri, Dec 20, 2013 at 08:14:13AM +0100, Wido den Hollander wrote: On 12/18/2013 09:39 PM, Tim Bishop wrote:

I'm investigating and planning a new Ceph cluster starting with 6 nodes with currently planned growth to 12 nodes over a few years. Each node will probably contain 4 OSDs, maybe 6. The area I'm currently investigating is how to configure the networking. To avoid a SPOF I'd like to have redundant switches for both the public network and the internal network, most likely running at 10Gb. I'm considering splitting the nodes in to two separate racks and connecting each half to its own switch, and then trunking the switches together to allow the two halves of the cluster to see each other. The idea being that if a single switch fails I'd only lose half of the cluster.

Why not three switches in total and use VLANs on the switches to separate public/cluster traffic? This way you can configure the CRUSH map to have one replica go to each switch so that when you lose a switch you still have two replicas available. Saves you a lot of switches and makes the network simpler.

I was planning to use VLANs to separate the public and cluster traffic on the same switches. Two switches cost less than three switches :-) I think on a slightly larger scale cluster it might make more sense to go up to three (or even more) switches, but I'm not sure the extra cost is worth it at this level. I was planning two switches, using VLANs to separate the public and cluster traffic, and connecting half of the cluster to each switch. (I'm not touching on the required third MON in a separate location and the CRUSH rules to make sure data is correctly replicated - I'm happy with the setup there.) To allow consumers of Ceph to see the full cluster they'd be directly connected to both switches. I could have another layer of switches for them and interlinks between them, but I'm not sure it's worth it on this sort of scale.
> > My question is about configuring the public network. If it's all one subnet then the clients consuming the Ceph resources can't have both links active, so they'd be configured in an active/standby role. But this results in quite heavy usage of the trunk between the two switches when a client accesses nodes on the other switch than the one they're actively connected to.
>
> Why can't the clients have both links active? You could use LACP? Some switches support MLAG to span LACP trunks over two switches. Or use some intelligent bonding mode in the Linux kernel.

I've only ever used LACP to a single switch, and I hadn't realised there were options for spanning LACP links across multiple switches. Thanks for the information there.

> > So, can I configure multiple public networks? I think so, based on the documentation, but I'm not completely sure. Can I have one half of the cluster on one subnet, and the other half on another? And then the client machine can have interfaces in different subnets and do the right thing with both interfaces to talk to all the nodes. This seems like a fairly simple solution that avoids a SPOF in Ceph or the network layer.
>
> There is no restriction on the IPs of the OSDs. All they need is a Layer 3 route to the WHOLE cluster and monitors. It doesn't have to be a Layer 2 network; everything can simply be Layer 3. You just have to make sure all the nodes can reach each other.

Thanks, that makes sense and makes planning simpler. I suppose it's logical really... in a HUGE cluster you'd probably have a whole manner of networks spread around the datacenter.

> > Or maybe I'm missing an alternative that would be better? I'm aiming for something that keeps things as simple as possible while meeting the redundancy requirements.
>
>             client
>               |
>          core switch
>         /     |     \
>    switch1 switch2 switch3
>       |       |       |
>      OSD     OSD     OSD
>
> You could build something like that. That would be fairly simple.

Isn't the core switch in that diagram a SPOF?
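On the "multiple public networks" question: these settings live in ceph.conf, and multiple subnets are typically listed comma-separated. A sketch with illustrative subnets (check the exact syntax against the network configuration docs for your Ceph version):

```
# ceph.conf sketch (subnets are illustrative). Daemons bind to
# whichever local interface falls inside these ranges; what actually
# matters is Layer 3 reachability between all nodes.
[global]
public network  = 192.168.10.0/24, 192.168.20.0/24
cluster network = 192.168.110.0/24, 192.168.120.0/24
```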
Or is it presumed to already be a redundant setup?

> Keep in mind that you can always lose a switch and still keep I/O going.
>
> Wido

Thanks for your help. You answered my main point about IP addressing on the public side, and gave me some other stuff to think about.

As an aside, there's a similar issue on the cluster network side, with heavy traffic on the trunk between the two cluster switches. But I can't see that that's avoidable, and presumably it's something people just have to deal with in larger Ceph installations?

Finally, this is all theoretical planning to try and avoid designing in bottlenecks at the outset. I don't have any concrete ideas of loading so in practice none of it

Tim.
Re: [ceph-users] Performance questions (how original, I know)
On 20/12/2013 at 03:51, Christian Balzer wrote:
> Hello Mark,
>
> On Thu, 19 Dec 2013 17:18:01 -0600 Mark Nelson wrote:
> > On 12/16/2013 02:42 AM, Christian Balzer wrote:
> > > Hello, new to Ceph, not new to replicated storage. Simple test cluster with 2 identical nodes running Debian Jessie, thus ceph 0.48. And yes, I very much prefer a distro supported package.
> >
> > Hi Christian! I know you'd like to use the distro package, but 0.48 is positively ancient at this point. There's been a *lot* of fixes/changes since then. If it makes you feel better, our current professionally supported release is based on dumpling.
>
> Oh well, I assume 0.48 was picked due to the long term support title (and thus one would hope it received a steady stream of backported fixes at least ^o^). There is 0.72 in unstable, so for testing I will just push that test cluster to sid and see what happens, as well as poke the Debian maintainer for a wheezy backport if possible; if not, I'll use the source package to roll my own binary packages.

In this case, why don't you want to use the ceph repository, which has packages for Debian wheezy?

Repository: http://ceph.com/debian/
Documentation: http://ceph.com/docs/master/start/quick-start-preflight/#advanced-package-tool-apt

[...]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rebooting nodes in a ceph cluster
On 12/19/13, 7:51 PM, Sage Weil wrote:
> > If it takes 15 minutes for one of my servers to reboot, is there a risk that some sort of needless automatic processing will begin?
>
> By default, we start rebalancing data after 5 minutes. You can adjust this (to, say, 15 minutes) with
>
>     mon osd down out interval = 900
>
> in ceph.conf.

Will Ceph detect if the OSDs come back while it is re-balancing, and stop?

--
Derek T. Yarnell
University of Maryland Institute for Advanced Computer Studies
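Sage's setting, as it would appear in ceph.conf (the value is in seconds, so 900 = 15 minutes):

```
# ceph.conf fragment: wait 15 minutes before marking a down OSD "out"
# and starting rebalancing (the default is 300 seconds).
[mon]
mon osd down out interval = 900
```

For a planned reboot there is also the cluster-wide noout flag: `ceph osd set noout` before maintenance and `ceph osd unset noout` afterwards suppress the out-marking entirely, so the cluster simply waits for the OSDs to return.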
Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver
On Fri, Dec 20, 2013 at 6:19 PM, James Pearce ja...@peacon.co.uk wrote:
> fio --size=100m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=10 --rw=read --name=fiojob --blocksize_range=4K-512k --iodepth=16
>
> Since size=100m, the reads would be entirely cached.

--invalidate=1 drops the cache, no? Our results of that particular fio test are consistently just under 1Gb/s on varied VMs running on varied HVs.

BTW, look what happens when you don't drop the cache:

# fio --size=100m --ioengine=libaio --invalidate=0 --direct=0 --numjobs=10 --rw=read --name=fiojob --blocksize_range=4K-512k | grep READ
   READ: io=1000.0MB, aggrb=4065.5MB/s, minb=416260KB/s, maxb=572067KB/s, mint=179msec, maxt=246msec

> And, if the hypervisor is write-back, potentially many writes would never make it to the cluster as well?

Maybe you're right, but only if fio in randwrite mode overwrites the same address many times (does it??), and the rbd cache discards overwritten writes (does it??). By observation, I can say for certain that when we have those 10 VMs running these benchmarks in a while-1 loop, our cluster becomes quite busy.

Cheers, Dan

> Sorry if I've misunderstood :)
[ceph-users] Best way to replace a failed drive?
Hello Guys,

I wonder what's the best way to replace a failed OSD, instead of removing it from CRUSH and adding a new one in. As I have the OSD numbers assigned in ceph.conf, adding a new OSD might require revising the config file and reloading all the ceph instances.

BTW, any suggestions for my ceph.conf? I have a feeling that assigning an IP address to each OSD is not a smart approach. Please advise. Thank you.

[global]
fsid = 638d7b4a-e5f1-4cfd-9c83-25177d5a8d3f
mon_initial_members = mon01,ceph01,ceph02
mon_host = 10.123.11.91,10.123.11.111,10.123.11.112
auth_supported = cephx
filestore_xattr_use_omap = true
max_open_files = 131072
public_network = 10.123.11.0/24
cluster_network = 10.234.11.0/24

[osd.0]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111
[osd.1]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111
[osd.2]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111
[osd.3]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111
[osd.4]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111
[osd.5]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111
[osd.6]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112
[osd.7]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112
[osd.8]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112
[osd.9]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112
[osd.10]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112
[osd.11]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112
[osd.12]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113
[osd.13]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113
[osd.14]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113
[osd.15]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113
[osd.16]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113
[osd.17]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113
[osd.18]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114
[osd.19]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114
[osd.20]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114
[osd.21]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114
[osd.22]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114
[osd.23]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114

--
Howie C.
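On the IP-per-OSD question, a minimal sketch of the same ceph.conf, assuming the standard behaviour that each OSD automatically binds to whichever local interface falls inside public_network/cluster_network (so the per-OSD address sections can simply be dropped):

```
[global]
fsid = 638d7b4a-e5f1-4cfd-9c83-25177d5a8d3f
mon_initial_members = mon01,ceph01,ceph02
mon_host = 10.123.11.91,10.123.11.111,10.123.11.112
auth_supported = cephx
filestore_xattr_use_omap = true
max_open_files = 131072
public_network = 10.123.11.0/24
cluster_network = 10.234.11.0/24
# No [osd.N] sections needed: each OSD picks its public and cluster
# address from the local interfaces inside the two subnets above.
```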
Re: [ceph-users] Performance questions (how original, I know)
Hello Gilles,

On Fri, 20 Dec 2013 21:04:45 +0100 Gilles Mocellin wrote:
> On 20/12/2013 at 03:51, Christian Balzer wrote:
> > Hello Mark,
> >
> > On Thu, 19 Dec 2013 17:18:01 -0600 Mark Nelson wrote:
> > > On 12/16/2013 02:42 AM, Christian Balzer wrote:
> > > > Hello, new to Ceph, not new to replicated storage. Simple test cluster with 2 identical nodes running Debian Jessie, thus ceph 0.48. And yes, I very much prefer a distro supported package.
> > >
> > > Hi Christian! I know you'd like to use the distro package, but 0.48 is positively ancient at this point. There's been a *lot* of fixes/changes since then. If it makes you feel better, our current professionally supported release is based on dumpling.
> >
> > Oh well, I assume 0.48 was picked due to the long term support title (and thus one would hope it received a steady stream of backported fixes at least ^o^). There is 0.72 in unstable, so for testing I will just push that test cluster to sid and see what happens, as well as poke the Debian maintainer for a wheezy backport if possible; if not, I'll use the source package to roll my own binary packages.
>
> In this case, why don't you want to use the ceph repository, which has packages for Debian wheezy?

Ahahaha, now that is another bit of welcome information. I of course searched for this, but the only search result (the top one as well) for ceph debian packages that resides on the ceph.com site is the broken (looping back to itself) link at:
http://ceph.com/uncategorized/debian-packages/

> Repository: http://ceph.com/debian/
> Documentation: http://ceph.com/docs/master/start/quick-start-preflight/#advanced-package-tool-apt

Thanks a lot,
Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
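For completeness, the linked preflight documentation boils down to roughly the following on wheezy (a sketch only, shown for the then-current "emperor" release; treat the exact key URL and release name as assumptions to verify against the doc):

```shell
# Add the ceph.com release key and repository, then install.
# Requires network access and root privileges.
wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
echo "deb http://ceph.com/debian-emperor/ $(lsb_release -sc) main" | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get install ceph
```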
[ceph-users] Ceph RAM Requirement?
Hello,

We have boxes with 24 drives, 2 TB each, and want to run one OSD per drive. What would be the ideal memory requirement for the system, keeping in mind OSD rebalancing and failure/replication of, say, 10-15 TB of data?

-Hemant
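A back-of-envelope way to budget this, using the commonly cited (but unofficial) rule of thumb of roughly 1 GB of RAM per OSD daemon in normal operation, rising to about 2 GB per OSD during recovery/backfill. The figures below are illustrative assumptions, not Ceph requirements:

```shell
# RAM budget sketch for a 24-OSD node: ~2 GB per OSD to survive
# recovery/backfill, plus ~4 GB headroom for the OS and page cache.
OSDS=24
PER_OSD_GB=2
OS_GB=4
echo "$(( OSDS * PER_OSD_GB + OS_GB )) GB"   # prints "52 GB"
```

By that estimate the 48 GB boxes mentioned earlier in this digest would be roughly at the edge for 24 OSDs under heavy recovery; sizing should be validated against the memory recommendations for your Ceph release.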
[ceph-users] OSD-hierachy and crush
Hi,

Yesterday I expanded our 3-node ceph cluster with a fourth node (an additional 13 OSDs; all OSDs have the same size, 4 TB). I used the same command as before to add the OSDs and change the weight:

ceph osd crush set 44 0.2 pool=default rack=unknownrack host=ceph-04

But ceph osd tree shows the new OSDs not below unknownrack, and the weighting seems to be different (with a weight of 0.8 the OSDs were almost full, so I switched back to 0.6):

root@ceph-04:~# ceph osd tree
# id    weight  type name               up/down reweight
-1      46.8    root default
-3      39              rack unknownrack
-2      13                      host ceph-01
0       1                               osd.0   up      1
1       1                               osd.1   up      1
...
27      1                               osd.27  up      1
28      1                               osd.28  up      1
-4      13                      host ceph-02
10      1                               osd.10  up      1
11      1                               osd.11  up      1
...
32      1                               osd.32  up      1
33      1                               osd.33  up      1
-5      13                      host ceph-03
16      1                               osd.16  up      1
18      1                               osd.18  up      1
...
37      1                               osd.37  up      1
38      1                               osd.38  up      1
-6      7.8             host ceph-04
39      0.6                     osd.39  up      1
40      0.6                     osd.40  up      1
...
50      0.6                     osd.50  up      1
51      0.6                     osd.51  up      1

How can I change ceph-04 to be part of rack unknownrack? If I change that, would the content of the OSDs on ceph-04 stay roughly the same, or would the whole content move again?

Thanks for feedback!

regards
Udo
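Two ways this is usually handled (a sketch only; try it on a test cluster first, and check that your Ceph version supports the bucket-move form):

```shell
# Option 1: move the whole host bucket under the rack in one step.
ceph osd crush move ceph-04 rack=unknownrack

# Option 2: re-set each OSD with the full CRUSH location, matching the
# command already used when adding them (repeat for osd.39..osd.51).
ceph osd crush set 39 0.6 pool=default rack=unknownrack host=ceph-04
```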
[ceph-users] radosgw public url
Hi All,

Does radosgw support a public URL for static content? I wish to share a file publicly without giving out usernames/passwords etc. I noticed that http://ceph.com/docs/master/radosgw/swift/ says Static Websites aren't supported, which I assume is talking about this feature; I'm just not 100% sure.

Cheers,
Quenten