Re: [ceph-users] Ceph with VMWare / XenServer
On Mon, May 12, 2014 at 03:45:43PM +0200, Uwe Grohnwaldt wrote: Hi, yes, we use it in production. I can stop/kill the tgt on one server and XenServer fails over to the second one. We enabled multipathing in XenServer. In our setup we don't have multiple IP ranges, so we scan/log in to the second target on XenServer startup with iscsiadm in rc.local. That's based on history - we used Dell EqualLogic before Ceph came in and there was no need to use multipathing (only LACP channels). Now we have enabled multipathing and use tgt, but without different IP ranges.

I assume you connected the machines to the same switch? Normal LACP doesn't work across multiple switches. Is that correct? It wasn't that I needed different IP ranges in my setup; it just makes things simpler/more predictable.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: Uwe Grohnwaldt u...@grohnwaldt.eu Cc: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 14:48:58 Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Uwe, thanks for your quick reply. Do you run the XenServer setup in a production environment, and have you tried to test some failover scenarios to see if the XenServer guest VMs keep working during the failover of storage servers? Also, how did you set up the XenServer iSCSI? Have you used the multipath option to set up the LUNs? Cheers

----- Original Message ----- From: Uwe Grohnwaldt u...@grohnwaldt.eu To: ceph-users@lists.ceph.com Sent: Monday, 12 May, 2014 12:57:48 PM Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Hi, at the moment we are using tgt with the RBD backend compiled from source on Ubuntu 12.04 and 14.04 LTS. We have two machines within two IP ranges (e.g. 192.168.1.0/24 and 192.168.2.0/24): one machine in 192.168.1.0/24 and one machine in 192.168.2.0/24. The config for tgt is the same on both machines; they export the same rbd image. This works well for XenServer. For VMWare you have to disable VAAI to use it with tgt (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665). If you don't disable it, ESXi becomes very slow and unresponsive. I think the problem is the iSCSI Write Same support, but I haven't tested which of the VAAI settings is responsible for this behavior.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 12:00:48 Subject: [ceph-users] Ceph with VMWare / XenServer

Hello guys, I am currently running a Ceph cluster for running VMs with qemu + rbd. It works pretty well and provides a good degree of failover. I am able to run maintenance tasks on the Ceph nodes without interrupting VM IO. I would like to do the same with the VMWare / XenServer hypervisors, but I am not really sure how to achieve this. Initially I thought of using iSCSI multipathing; however, as it turns out, multipathing is more for load balancing and NIC/switch failure. It does not allow me to perform maintenance on the iSCSI target without interrupting service to the VMs. Has anyone done either a PoC or, better, a production environment where they've used Ceph as backend storage with VMWare / XenServer?
The important element for me is the ability to perform maintenance tasks, and resilience to failures, without interrupting IO to the VMs. Are there any recommendations or howtos on how this could be achieved? Many thanks Andrei

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
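For reference, a minimal sketch of the tgt configuration described above (two gateways exporting the same RBD image) could look like the following. The IQN, pool and image names are only examples; the same file would be deployed on both gateway hosts:

  # /etc/tgt/conf.d/rbd.conf (identical on both iSCSI gateways)
  <target iqn.2014-05.com.example:rbd.vmstore>
      driver iscsi
      bs-type rbd                  # needs a tgt build with RBD support
      backing-store rbd/vmstore    # <pool>/<image>
  </target>

After editing, something like tgt-admin --update ALL should apply the configuration, and tgtadm --lld iscsi --mode target --op show can be used to confirm the LUN is exported.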
Re: [ceph-users] Ceph with VMWare / XenServer
On Mon, May 12, 2014 at 07:01:46PM +0200, Leen Besselink wrote: On Mon, May 12, 2014 at 03:45:43PM +0200, Uwe Grohnwaldt wrote: Hi, yes, we use it in production. I can stop/kill the tgt on one server and XenServer fails over to the second one. We enabled multipathing in XenServer. In our setup we don't have multiple IP ranges, so we scan/log in to the second target on XenServer startup with iscsiadm in rc.local. That's based on history - we used Dell EqualLogic before Ceph came in and there was no need to use multipathing (only LACP channels). Now we have enabled multipathing and use tgt, but without different IP ranges.

I assume you connected the machines to the same switch? Normal LACP doesn't work across multiple switches. Is that correct? Or maybe you used a stack, or you have Cisco switches with vPC? It wasn't that I needed different IP ranges in my setup; it just makes things simpler/more predictable.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: Uwe Grohnwaldt u...@grohnwaldt.eu Cc: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 14:48:58 Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Uwe, thanks for your quick reply. Do you run the XenServer setup in a production environment, and have you tried to test some failover scenarios to see if the XenServer guest VMs keep working during the failover of storage servers? Also, how did you set up the XenServer iSCSI? Have you used the multipath option to set up the LUNs? Cheers

----- Original Message ----- From: Uwe Grohnwaldt u...@grohnwaldt.eu To: ceph-users@lists.ceph.com Sent: Monday, 12 May, 2014 12:57:48 PM Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Hi, at the moment we are using tgt with the RBD backend compiled from source on Ubuntu 12.04 and 14.04 LTS. We have two machines within two IP ranges (e.g. 192.168.1.0/24 and 192.168.2.0/24): one machine in 192.168.1.0/24 and one machine in 192.168.2.0/24. The config for tgt is the same on both machines; they export the same rbd image. This works well for XenServer. For VMWare you have to disable VAAI to use it with tgt (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665). If you don't disable it, ESXi becomes very slow and unresponsive. I think the problem is the iSCSI Write Same support, but I haven't tested which of the VAAI settings is responsible for this behavior.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 12:00:48 Subject: [ceph-users] Ceph with VMWare / XenServer

Hello guys, I am currently running a Ceph cluster for running VMs with qemu + rbd. It works pretty well and provides a good degree of failover. I am able to run maintenance tasks on the Ceph nodes without interrupting VM IO. I would like to do the same with the VMWare / XenServer hypervisors, but I am not really sure how to achieve this. Initially I thought of using iSCSI multipathing; however, as it turns out, multipathing is more for load balancing and NIC/switch failure. It does not allow me to perform maintenance on the iSCSI target without interrupting service to the VMs.
Has anyone done either a PoC or, better, a production environment where they've used Ceph as backend storage with VMWare / XenServer? The important element for me is the ability to perform maintenance tasks, and resilience to failures, without interrupting IO to the VMs. Are there any recommendations or howtos on how this could be achieved? Many thanks Andrei

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
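As a rough illustration of the "scan/login the second target with iscsiadm in rc.local" approach and the VAAI workaround mentioned in this thread, the commands might look like this; the portal address, IQN and ESXi option names should be checked against your own setup and the referenced KB article:

  # on the XenServer host, e.g. from rc.local
  iscsiadm -m discovery -t sendtargets -p 192.168.1.11:3260
  iscsiadm -m node -T iqn.2014-05.com.example:rbd.vmstore -p 192.168.1.11:3260 --login

  # on the ESXi host, disabling the VAAI primitives
  esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
  esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
  esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking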
Re: [ceph-users] NFS over CEPH - best practice
On Mon, May 12, 2014 at 10:52:33AM +0100, Andrei Mikhailovsky wrote: Leen, thanks for explaining things. It does make sense now. Unfortunately, it does look like this technology would not fulfill my requirements, as I do need the ability to perform maintenance without shutting down VMs.

Sorry for being cautious. I've seen certain iSCSI initiators act that way. I do not know if that is representative of other iSCSI initiators, so I don't know if that applies to VMWare. During failover, reads/writes would be stalled of course. When properly configured, failover of the target could be done in seconds.

I will open another topic to discuss possible solutions. Thanks for all your help Andrei

----- Original Message ----- From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Cc: Andrei Mikhailovsky and...@arhont.com Sent: Sunday, 11 May, 2014 11:41:08 PM Subject: Re: [ceph-users] NFS over CEPH - best practice

On Sun, May 11, 2014 at 09:24:30PM +0100, Andrei Mikhailovsky wrote: Sorry if these questions will sound stupid, but I was not able to find an answer by googling. As the Australians say: no worries, mate. It's fine.

1. Does the iSCSI protocol support having multiple target servers serve the same disk/block device? No, I don't think so. What does work is active/standby failover. I suggest having some kind of clustering, because as far as I can see, you never want to have 2 target servers active if they don't share state (as far as I know there is no Linux iSCSI target server which can share state between 2 targets). When there is a failure there is time to have all targets offline for a brief moment, before the second target comes online. The initiators should be able to handle short interruptions.

In the case of Ceph, the same rbd disk image. I was hoping to have multiple servers mount the same rbd disk and serve it as an iSCSI LUN. This LUN would be used as VM image storage on VMWare / XenServer. You'd have one server which handles a LUN; when it goes down, another should take over the target IP address and handle requests for that LUN.

2. Does iSCSI multipathing provide failover/HA capability only on the initiator side? The docs that I came across all mention multipathing on the client side, like using two different NICs. I did not find anything about having multiple NICs on the initiator connecting to multiple iSCSI target servers. Multipathing for iSCSI, as I see it, only does one thing: it can be used to create multiple network paths between the initiator and the target. They can be used for resilience (read: failover) or for load balancing when you need more bandwidth. The way I would do it is to have 2 switches and connect each initiator and each target to both switches. Also you would have 2 IP subnets, so both the target and the initiator would have 2 IP addresses, one from each subnet. So for example: the target would have 10.0.1.1 and 10.0.2.1, and the initiator 10.0.1.11 and 10.0.2.11. Then you run the IP traffic for 10.0.1.x on switch 1 and the 10.0.2.x traffic on switch 2. Thus, you have created a resilient setup: the target has multiple connections to the network, the initiator has multiple connections to the network, and you can also handle a switch failure.

I was hoping to have a resilient solution on the storage side so that I can perform upgrades and maintenance without needing to shut down VMs running on VMWare/XenServer. Is this possible with iSCSI?
The failover setup is mostly there to handle failures; it is not really great for maintenance because it does give a short interruption in service, like 30 seconds or so of no writing to the LUN. That might not be a problem for you, I don't know, but it is at least something to be aware of. And also something you should test when you've built the setup. Cheers Hope that helps. Andrei

----- Original Message ----- From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Sent: Saturday, 10 May, 2014 8:31:02 AM Subject: Re: [ceph-users] NFS over CEPH - best practice

On Fri, May 09, 2014 at 12:37:57PM +0100, Andrei Mikhailovsky wrote: Ideally I would like to have a setup with 2+ iSCSI servers, so that I can perform maintenance if necessary without shutting down the VMs running on the servers. I guess multipathing is what I need. Also I will need to have more than one XenServer/VMWare host server, so the iSCSI LUNs will be mounted on several servers. So you have multiple machines talking to the same LUN at the same time? You'll have to co-ordinate how changes are written to the backing store; normally you'd have the virtualization servers use some kind of protocol. When it's SCSI there are the older Reserve/Release commands
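To make the two-subnet multipath layout described above concrete, the initiator-side commands could look roughly like this (the addresses are the illustrative ones from the mail):

  # log in to the same target over both subnets/switches
  iscsiadm -m discovery -t sendtargets -p 10.0.1.1:3260
  iscsiadm -m discovery -t sendtargets -p 10.0.2.1:3260
  iscsiadm -m node --login
  # dm-multipath should then show two paths to the same LUN
  multipath -ll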
Re: [ceph-users] NFS over CEPH - best practice
On Mon, May 12, 2014 at 12:08:24PM -0500, Dimitri Maziuk wrote: PS. (now that I looked) see e.g. http://blogs.mindspew-age.com/2012/04/05/adventures-in-high-availability-ha-iscsi-with-drbd-iscsi-and-pacemaker/ Dima Didn't you say you wanted multiple servers to write to the same LUN ? I think this set up won't work. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] NFS over CEPH - best practice
On Sun, May 11, 2014 at 09:24:30PM +0100, Andrei Mikhailovsky wrote: Sorry if these questions will sound stupid, but I was not able to find an answer by googling. As the Astralians say: no worries, mate. It's fine. 1. Does iSCSI protocol support having multiple target servers to serve the same disk/block device? No, I don't think so. What does work is active/standby failover. I suggest to have some kind of clustering, because as far as I can see, you never want to have 2 target servers active if they don't share state (as far as I know there is no Linux iSCSI-target server which can share state between 2 targets). When there is a failure there is time to have all targets offline for a brief moment, before the second target comes online. The initiators should be able to handle short interruptions. In case of ceph, the same rbd disk image. I was hoping to have multiple servers to mount the same rbd disk and serve it as an iscsi LUN. This LUN would be used as a vm image storage on vmware / xenserver. You'd have one server which handles a LUN, with it goes down, an other should take over the target IP-address and handle requests for that LUN. 2.Does iscsi multipathing provide failover/HA capability only on the initiator side? The docs that i came across all mention multipathing on the client side, like using two different nics. I did not find anything about having multiple nics on the initiator connecting to multiple iscsi target servers. Multipathing for iSCSI, as I see it, only does one thing: it can be used to create multiple network paths between the initiator and the target. They can be used for resiliance (read: failover) or for loadbalancing when you need more bandwidth. The way I would do it is to have 2 switches and connect each initiator and each target to both switches. Also you would have 2 IP-subnets. So both the target and initiator would have 2 IP-addresses, one from each subnet. So for example: the target would have: 10.0.1.1 and 10.0.2.1 and the initiator: 10.0.1.11 and 10.0.2.11 Then you run the IP-traffic for 10.0.1.x on switch 1 and the 10.0.2.x traffic on switch 2. Thus, you have created a resilient set up: The target has multiple connections to the network, the initiator has multiple connections to the network and you can also handle a switch failover. I was hoping to have resilient solution on the storage side so that I can perform upgrades and maintenance without needing to shutdown vms running on vmware/xenserver. Is this possible with iscsi? The failover set up is mostly to handle failures, not really great for maintenance because it does give a short interruption in service. Like 30 seconds or so of no writing to the LUN. That might not be a problem for you, I don't know, but it is at least something to be aware of. And also something you should test when you've build the setup. Cheers Hope that helps. Andrei - Original Message - From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Sent: Saturday, 10 May, 2014 8:31:02 AM Subject: Re: [ceph-users] NFS over CEPH - best practice On Fri, May 09, 2014 at 12:37:57PM +0100, Andrei Mikhailovsky wrote: Ideally I would like to have a setup with 2+ iscsi servers, so that I can perform maintenance if necessary without shutting down the vms running on the servers. I guess multipathing is what I need. Also I will need to have more than one xenserver/vmware host servers, so the iscsi LUNs will be mounted on several servers. So you have multiple machines talking to the same LUN at the same time ? 
You'll have to co-ordinate how changes are written to the backing store, normally you'd have the virtualization servers use some kind of protocol. When it's SCSI there are the older Reserve/Release commands and the newer SCSI-3 Persistent Reservation commands. (i)SCSI allows multiple changes to be in-flight, without coordination things will go wrong. Below it was mentioned that you can disable the cache for rbd, if you have no coordination protocol you'll need to do the same on the iSCSI-side. I believe when you do that it will be slower, but it might work. Would the suggested setup not work for my requirements? It depends on VMWare if they allow such a setup. Then there is an other thing. How do the VMWare machines coordinate which VM they should be running ? I don't know VMWare but usually if you have some kind of clustering setup you'll need to have a 'quorum'. A lot of times the quorum is handled by a quorum disk with the SCSI coordiation protocols mentioned above. An other way to have a quorum is to have a majority voting system with an un-even number of machines talking over the network. This is what Ceph monitor nodes do. As an example of a clustering system that allows it to be used without a quorum disk with only 2 machines
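As a sketch of the two-node Pacemaker idea mentioned above (no quorum disk, STONITH used to fence the peer), the configuration could look something like this; node names and IPMI details are placeholders:

  crm configure property no-quorum-policy=ignore
  crm configure property stonith-enabled=true
  crm configure primitive fence-node2 stonith:external/ipmi \
      params hostname=node2 ipaddr=10.0.0.102 userid=admin passwd=secret interface=lan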
Re: [ceph-users] NFS over CEPH - best practice
On Fri, May 09, 2014 at 12:37:57PM +0100, Andrei Mikhailovsky wrote: Ideally I would like to have a setup with 2+ iSCSI servers, so that I can perform maintenance if necessary without shutting down the VMs running on the servers. I guess multipathing is what I need. Also I will need to have more than one XenServer/VMWare host server, so the iSCSI LUNs will be mounted on several servers.

So you have multiple machines talking to the same LUN at the same time? You'll have to co-ordinate how changes are written to the backing store; normally you'd have the virtualization servers use some kind of protocol. When it's SCSI there are the older Reserve/Release commands and the newer SCSI-3 Persistent Reservation commands. (i)SCSI allows multiple changes to be in flight; without coordination things will go wrong. Below it was mentioned that you can disable the cache for rbd; if you have no coordination protocol you'll need to do the same on the iSCSI side. I believe when you do that it will be slower, but it might work.

Would the suggested setup not work for my requirements? It depends on whether VMWare allows such a setup. Then there is another thing: how do the VMWare machines coordinate which VMs they should be running? I don't know VMWare, but usually if you have some kind of clustering setup you'll need to have a 'quorum'. A lot of times the quorum is handled by a quorum disk with the SCSI coordination protocols mentioned above. Another way to have a quorum is to have a majority voting system with an uneven number of machines talking over the network. This is what Ceph monitor nodes do. An example of a clustering system that can be used without a quorum disk, with only 2 machines talking over the network, is Linux Pacemaker. When something bad happens, one machine will just turn off the power of the other machine to prevent things going wrong (this is called STONITH). Andrei

----- Original Message ----- From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Sent: Thursday, 8 May, 2014 9:35:21 PM Subject: Re: [ceph-users] NFS over CEPH - best practice

On Thu, May 08, 2014 at 01:24:17AM +0200, Gilles Mocellin wrote: On 07/05/2014 15:23, Vlad Gorbunov wrote: It's easy to install tgtd with ceph support. Ubuntu 12.04 for example: Connect the ceph-extras repo: echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list Install tgtd with rbd support: apt-get update apt-get install tgt It's important to disable the rbd cache on the tgtd host. Set in /etc/ceph/ceph.conf: [client] rbd_cache = false [...]

Hello, Hi, Without cache on the tgtd side, it should be possible to have failover and load balancing (active/active) multipathing. Have you tested multipath load balancing in this scenario? If it's reliable, it opens a new way for me to do HA storage with iSCSI! I have a question: what is your use case? Do you need SCSI-3 persistent reservations so multiple machines can use the same LUN at the same time? Because in that case I think tgtd won't help you. Have a good day, Leen.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
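If SCSI-3 persistent reservations matter for a setup like the one discussed above, sg_persist from sg3_utils can show what a given target actually reports; /dev/sdX stands for whatever device the iSCSI LUN appears as on the initiator:

  sg_persist --in --report-capabilities /dev/sdX   # does the target report PR support?
  sg_persist --in --read-keys /dev/sdX             # list currently registered keys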
Re: [ceph-users] NFS over CEPH - best practice
On Thu, May 08, 2014 at 01:24:17AM +0200, Gilles Mocellin wrote: On 07/05/2014 15:23, Vlad Gorbunov wrote: It's easy to install tgtd with ceph support. Ubuntu 12.04 for example: Connect the ceph-extras repo: echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list Install tgtd with rbd support: apt-get update apt-get install tgt It's important to disable the rbd cache on the tgtd host. Set in /etc/ceph/ceph.conf: [client] rbd_cache = false [...]

Hello, Hi, Without cache on the tgtd side, it should be possible to have failover and load balancing (active/active) multipathing. Have you tested multipath load balancing in this scenario? If it's reliable, it opens a new way for me to do HA storage with iSCSI! I have a question: what is your use case? Do you need SCSI-3 persistent reservations so multiple machines can use the same LUN at the same time? Because in that case I think tgtd won't help you. Have a good day, Leen.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
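To check that a tgt build like the one above really has the RBD backing store compiled in, and to attach an image by hand, something along these lines should work (the target IQN and pool/image names are examples):

  tgtadm --lld iscsi --mode system --op show          # look for rbd under "Backing stores"
  tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2014-05.com.example:rbd.test
  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --bstype rbd --backing-store rbd/test
  tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL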
Re: [ceph-users] ceph uses too much disk space!!
On Sun, Oct 06, 2013 at 10:00:48AM +0300, Linux Chips wrote: maybe it's worth mentioning that my OSDs are formatted as btrfs. I don't think that btrfs has 13% overhead. Or does it?

I would suggest you look at btrfs df, not df (never use df with btrfs), and btrfs subvolume list to see what btrfs is doing. If I'm not mistaken, Ceph with btrfs uses snapshots as a way to do transactions instead of using a journal. Who knows, maybe something failed and they didn't get cleaned up, or something like that; I've never had a look at how it is handled, so I don't know what it looks like normally. But post some information on the list if you see something unusual; someone probably knows.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
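A hedged example of the btrfs-specific commands suggested above, assuming the OSD data directory is /var/lib/ceph/osd/ceph-0:

  btrfs filesystem df /var/lib/ceph/osd/ceph-0    # real allocation per data/metadata/system
  btrfs subvolume list /var/lib/ceph/osd/ceph-0   # any snapshot subvolumes Ceph created show up here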
Re: [ceph-users] SSD recommendations for OSD journals
On Mon, Jul 22, 2013 at 08:45:07AM +1100, Mikaël Cluseau wrote: On 22/07/2013 08:03, Charles 'Boyo wrote: Counting on the kernel's cache, it appears I will be best served purchasing write-optimized SSDs? Can you share any information on the SSD you are using, is it PCIe connected? We are on a standard SAS bus, so any SSD going to 500MB/s and being stable in the long run will do (we use 60G Intel 520); you do not need a lot of space for the journal (5G per drive is far enough on commodity hardware).

Another question, since the intention of this storage cluster is relatively cheap storage on commodity hardware: what's the balance between cheap SSDs and reliability, since journal failure might result in data loss, or will such an event just 'down' the affected OSDs?

When you do a write to Ceph, one OSD (I believe this is the master for a certain part of the data, an object) receives the write and distributes the copies to other OSDs (as many as are configured, like: min size=2, size=3); when the writes are done on all those OSDs it will confirm the write to the client. So if one OSD fails, other OSDs will have that data. The master will have to make sure another copy is created somewhere else. So I don't see a reason for data loss if you lose one journal. There will be a lot of copying of data though, which will slow things down. A journal failure will fail your OSDs (from what I've understood, you'll have to rebuild them). But SSDs are very deterministic, so monitor them: # smartctl -A /dev/sdd [..] ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE [..] 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 093 093 000 Old_age Always - 0 And don't put too many OSDs on one SSD (I set a rule to not go over 4 for 1). When the SSD is large enough and your journals don't take up all the space, you can also leave part of the SSD unpartitioned. This will allow the SSD to fail much later.

On a similar note, I am using XFS on the OSDs, which also journals; does this affect performance in any way? You want this journal for consistency ;) I don't know exactly the impact, but since we use spinning drives, the most important factor is that Ceph, with a journal on SSD, does a lot of sequential writes, avoiding most seeks.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
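To illustrate the "5G per journal, no more than about 4 OSDs per SSD" guidance above, here is a sketch of how a journal SSD could be carved up and referenced in ceph.conf; device names and sizes are examples only:

  # four 5 GB journal partitions on one SSD
  sgdisk -n 1:0:+5G -n 2:0:+5G -n 3:0:+5G -n 4:0:+5G /dev/sdd

  # ceph.conf, pointing each OSD at its own journal partition
  [osd.0]
      osd journal = /dev/sdd1
  [osd.1]
      osd journal = /dev/sdd2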
Re: [ceph-users] How to change the journal size at run time?
On Fri, Jun 21, 2013 at 12:11:23PM +0800, Da Chun wrote: Hi List, The default journal size is 1G, which I think is too small for my Gb network. I want to extend all the journal partitions to 2 or 4G. How can I do that? The osds were all created by commands like ceph-deploy osd create ceph-node0:/dev/sdb. The journal partition is on the same disk together with the corresponding data partition. I notice there is an attribute osd journal size which value is 1024. I guess this is why the command ceph-deploy osd create set the journal partition size as 1G. I want to do this job using steps as below: 1. Change the osd journal size in the ceph.conf to 4G 2. Remove the osd 3. Readd the osd 4. Repeat 2 and 3 steps for all the osds. This needs lots of manual work and is time consuming. Are there better ways to do that? Thanks! Have a look at these commands: http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--flush-journal http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--mkjournal And this setting: http://ceph.com/docs/master/rados/configuration/osd-config-ref/#index-2 If I'm not mistaken that is a per-machine global or per-osd setting in /etc/ceph/ceph.conf ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to change the journal size at run time?
On Fri, Jun 21, 2013 at 10:39:05AM +0200, Leen Besselink wrote: On Fri, Jun 21, 2013 at 12:11:23PM +0800, Da Chun wrote: Hi List, The default journal size is 1G, which I think is too small for my Gb network. I want to extend all the journal partitions to 2 or 4G. How can I do that? The osds were all created by commands like ceph-deploy osd create ceph-node0:/dev/sdb. The journal partition is on the same disk together with the corresponding data partition. I notice there is an attribute osd journal size which value is 1024. I guess this is why the command ceph-deploy osd create set the journal partition size as 1G. I want to do this job using steps as below: 1. Change the osd journal size in the ceph.conf to 4G 2. Remove the osd 3. Readd the osd 4. Repeat 2 and 3 steps for all the osds. This needs lots of manual work and is time consuming. Are there better ways to do that? Thanks! Have a look at these commands: http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--flush-journal http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--mkjournal Actually, I'm slightly mistaken. I don't think you need the mkjournal. If you stop the osd, flush the journal, change the setting, remove the journal and start the osd. I think it would create a new journal automatically. I hope you have a test-environment or maybe someone with more knowledge of these things can confirm or deny what I mentioned. And this setting: http://ceph.com/docs/master/rados/configuration/osd-config-ref/#index-2 If I'm not mistaken that is a per-machine global or per-osd setting in /etc/ceph/ceph.conf ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
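Putting the steps above together, the per-OSD procedure could look roughly like this, assuming a file-based journal (if the journal is a separate partition, you would have to repartition instead of just removing a file):

  service ceph stop osd.0
  ceph-osd -i 0 --flush-journal        # write out everything still in the journal
  # set "osd journal size = 4096" in /etc/ceph/ceph.conf
  rm /var/lib/ceph/osd/ceph-0/journal
  ceph-osd -i 0 --mkjournal            # create the new, larger journal
  service ceph start osd.0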
Re: [ceph-users] ceph iscsi questions
On Tue, Jun 18, 2013 at 09:52:53AM +0200, Kurt Bauer wrote: Hi, Da Chun wrote: Hi List, I want to deploy a ceph cluster with the latest cuttlefish, and export it with an iSCSI interface to my applications. Some questions here: 1. Which Linux distro and release would you recommend? I used Ubuntu 13.04 for testing purposes before. For the ceph cluster or the iSCSI gateway? We use Ubuntu 12.04 LTS for the cluster and the iSCSI gateway, but tested Debian wheezy as iSCSI gateway too. Both work flawlessly. 2. Which iscsi target is better? LIO, SCST, or others? Have you read http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/ ? That's what we do and it works without problems so far. 3. The system for the iscsi target will be a single point of failure. How to eliminate it and make good use of ceph's nature of distribution? That's a question we asked ourselves too. In theory one can set up 2 iSCSI gateways and use multipath, but what does that do to the cluster? Will something break if 2 iSCSI targets use the same rbd image in the cluster? Even if I use failover mode only? Has someone already tried this and is willing to share their knowledge?

Let's see. You mentioned HA and multipath. You don't really need multipath for an HA iSCSI target. Multipath allows you to use multiple paths, multiple connections/networks/switches, but you don't want to connect an iSCSI initiator to multiple iSCSI targets (for the same LUN). That is asking for trouble. So multipath just gives you extra paths. When you have multiple iSCSI targets, you use failover. Most iSCSI initiators can deal with at least up to 30 seconds of no responses from the iSCSI target. No response means no response; an error response is the wrong response, of course. So when using failover, a virtual IP address is probably what you want, probably combined with something like Pacemaker to make sure multiple machines do not claim to have the same IP address.

You'll need even more if you have multiple iSCSI initiators that want to connect to the same rbd, like some Windows or VMWare cluster. And I guess Linux clustering filesystems like OCFS2 probably need it too. It's called SPC-3 Persistent Reservation. As I understand Persistent Reservation, the iSCSI target just needs to keep state for the connected initiators. On failover it isn't a problem if there is no state, so there is no state that needs to be replicated between multiple gateways, as long as all initiators are connected to the same target. When different initiators are connected to different targets, your data will get corrupted on write.

Now implementations: - stgt does have some support for SPC-3, but not enough. - LIO supports SPC-3 Persist; it is the one in current Linux kernels. - SCST seemed too much of a pain to set up to even try, but I might be wrong. - IET: iSCSI Enterprise Target, seems to support SPC-3 Persist; it's a DKMS package on Ubuntu. - I later found out there is another implementation: http://www.peach.ne.jp/archives/istgt/ It too supports SPC-3 Persist. It is from the FreeBSD camp and a package is available for Debian and Ubuntu with the Linux kernel, not just kFreeBSD. But I haven't tried it. So I haven't tried them all yet. I have used LIO.

Another small tip: if you don't understand iSCSI, you might end up configuring it the wrong way at first and it will be slow. You might need to spend time to figure out how to tune it. Now you know what I know. Best regards, Kurt Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
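As a sketch of the "virtual IP plus Pacemaker" failover mentioned above, the resources could be modelled roughly like this; the IP address and resource names are placeholders, and the tgt init script name may differ per distribution:

  crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
      params ip=192.168.1.100 cidr_netmask=24 op monitor interval=10s
  crm configure primitive p_tgtd lsb:tgt op monitor interval=30s
  crm configure group g_iscsi p_vip p_tgtd    # keep the VIP and the target together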
Re: [ceph-users] ceph iscsi questions
On Tue, Jun 18, 2013 at 11:13:15AM +0200, Leen Besselink wrote: On Tue, Jun 18, 2013 at 09:52:53AM +0200, Kurt Bauer wrote: Hi, Da Chun schrieb: Hi List, I want to deploy a ceph cluster with latest cuttlefish, and export it with iscsi interface to my applications. Some questions here: 1. Which Linux distro and release would you recommend? I used Ubuntu 13.04 for testing purpose before. For the ceph-cluster or the iSCSI-GW? We use Ubuntu 12.04 LTS for the cluster and the iSCSI-GW, but tested Debian wheezy as iSCSI-GW too. Both work flawless. 2. Which iscsi target is better? LIO, SCST, or others? Have you read http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/ ? That's what we do and it works without problems so far. 3. The system for the iscsi target will be a single point of failure. How to eliminate it and make good use of ceph's nature of distribution? That's a question we asked aourselves too. In theory one can set up 2 iSCSI-GW and use multipath but what does that do to the cluster? Will smth. break if 2 iSCSI targets use the same rbd image in the cluster? Even if I use failover-mode only? Has someone already tried this and is willing to share their knowledge? Let's see. You mentioned HA and multipath. You don't really need multipath for a HA iSCSI-target. Multipath allows you to use multiple paths, multiple connections/networks/switches, but you don't want to connect an iSCSI-initiator to multiple iSCSI-targets (for the same LUN). That is asking for trouble. Probably I should add why you might want to use multipath because it does add resiliance and also performance if one connection on the target is not enough. I have a feeling when using multipath it is easiest to use multiple subnets. So multi-path just gives you extra paths. When you have multiple iSCSI-targets, you use failover. Most iSCSI-initiators can deal with at least up to 30 seconds of no responses from the iSCSI-target. No response, means no response. An error response is the wrong response of course. So when using failover, a virtual IP-address is probably what you want. Probably combined with something like Pacemaker to make sure multiple machines do not claim to have the same IP-address. You'll need even more if you have multiple iSCSI-initiators that want to connect to the same rbd, like some Windows or VMWare cluster. And I guess Linux clustering filesystem like with OCFS2 probably need it too. It's called SPC-3 Persistent Reservation. As I understand Persistent Reservation, the iSCSI-target just needs to keep state for the connected initiators. On failover it isn't a problem if there is no state. So there is no state that needs to be replicated between multiple gateways. As long as all initiators are connected to the same target. When different initiators are connected to different targets, your data will get corrupted on write. Now implementations: - stgt does have some support for SPC-3, but not enough. - LIO supports SPC-3 Persist, it is the one in the current Linux kernels. - SCST seemed to much of a pain to set up to even try, but I might be wrong. - IET: iSCSI Enterprise Target, seems to support SPC-3 Persist, it's a DKMS package on Ubuntu - I later found out there is an implementation: http://www.peach.ne.jp/archives/istgt/ It too supports SPC-3 Persist. It is from the FreeBSD-camp and a package is available for Debian and Ubuntu with Linux-kernel and not just kFreeBSD. But I haven't tried it. So I haven't tried them all yet. I have used LIO. 
An other small tip: if you don't understand iSCSI, you'll might end up configure it the wrong way at first and it will be slow. You might need to spend time to figure out how to tune it. Now you know what I know. Best regards, Kurt Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another osd is filled too full and taken off after manually taking one osd out
On Tue, Jun 18, 2013 at 08:13:39PM +0800, Da Chun wrote: Hi List,My ceph cluster has two osds on each node. One has 15g capacity, and the other 10g. It's interesting that, after I took the 15g osd out of the cluster, the cluster started to rebalance, and finally the 10g osd on the same node was finally full and taken off, and failed to start again with the following error in the osd log file: 2013-06-18 19:51:20.799756 7f6805ee07c0 -1 filestore(/var/lib/ceph/osd/ceph-1) Extended attributes don't appear to work. Got error (28) No space left on device. If you are using ext3 or ext4, be sure to mount the underlying file system with the 'user_xattr' option. 2013-06-18 19:51:20.800258 7f6805ee07c0 -1 ^[[0;31m ** ERROR: error converting store /var/lib/ceph/osd/ceph-1: (95) Operation not supported^[[0m I guess the 10g osd was chosen by the cluster to be the container for the extra objects. My questions here: 1. How are the extra objects spread in the cluster after an osd is taken out? Only spread to one of the osds? 2. Is there no mechanism to prevent the osds from being filled too full and taken off? As far I understand it. Each OSD has the same weight by default, you can give them a different weight to force it to be used less. The reason to do so could be because it has less space or because it is slower. Thanks for your time! This is the ceph log: 2013-06-18 19:26:41.567607 mon.0 172.18.46.34:6789/0 1599 : [INF] pgmap v14182: 456 pgs: 453 active+clean, 3 active+remapped+backfilling; 16874 MB data, 40220 MB used, 36513 MB / 76733 MB avail; 379/9761 degraded (3.883%); recovering 19 o/s, 77608KB/s 2013-06-18 19:26:42.649139 mon.0 172.18.46.34:6789/0 1600 : [INF] pgmap v14183: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 309/9745 degraded (3.171%); recovering 41 o/s, 162MB/s 2013-06-18 19:26:46.566721 mon.0 172.18.46.34:6789/0 1601 : [INF] pgmap v14184: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 250/9745 degraded (2.565%); recovering 25 o/s, 101450KB/s 2013-06-18 19:26:39.858833 osd.1 172.18.46.35:6801/10730 88 : [WRN] OSD near full (91%) 2013-06-18 19:26:48.548076 mon.0 172.18.46.34:6789/0 1602 : [INF] pgmap v14185: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 200/9745 degraded (2.052%); recovering 18 o/s, 72359KB/s 2013-06-18 19:26:51.898811 mon.0 172.18.46.34:6789/0 1603 : [INF] pgmap v14186: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 155/9745 degraded (1.591%); recovering 17 o/s, 71823KB/s 2013-06-18 19:26:53.947739 mon.0 172.18.46.34:6789/0 1604 : [INF] pgmap v14187: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 113/9745 degraded (1.160%); recovering 16 o/s, 65041KB/s 2013-06-18 19:26:57.293713 mon.0 172.18.46.34:6789/0 1605 : [INF] pgmap v14188: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 103/9745 degraded (1.057%); recovering 9 o/s, 37353KB/s 2013-06-18 19:27:03.861124 mon.0 172.18.46.34:6789/0 1606 : [INF] pgmap v14189: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 35598 MB used, 41134 MB / 76733 MB avail; 103/9745 degraded (1.057%); recovering 1 o/s, 3532KB/s 2013-06-18 19:27:13.732263 mon.0 172.18.46.34:6789/0 1607 : [DBG] osd.1 
172.18.46.35:6801/10730 reported failed by osd.0 172.18.46.34:6804/1506 2013-06-18 19:27:15.949395 mon.0 172.18.46.34:6789/0 1608 : [DBG] osd.1 172.18.46.35:6801/10730 reported failed by osd.3 172.18.46.34:6807/11743 2013-06-18 19:27:17.239206 mon.0 172.18.46.34:6789/0 1609 : [DBG] osd.1 172.18.46.35:6801/10730 reported failed by osd.5 172.18.46.36:6806/7436 2013-06-18 19:27:17.239404 mon.0 172.18.46.34:6789/0 1610 : [INF] osd.1 172.18.46.35:6801/10730 failed (3 reports from 3 peers after 2013-06-18 19:27:38.239157 = grace 20.00) 2013-06-18 19:27:17.306958 mon.0 172.18.46.34:6789/0 1611 : [INF] osdmap e647: 6 osds: 5 up, 5 in 2013-06-18 19:27:17.387311 mon.0 172.18.46.34:6789/0 1612 : [INF] pgmap v14190: 456 pgs: 335 active+clean, 119 stale+active+clean, 2 active+remapped+backfilling; 16874 MB data, 35598 MB used, 41134 MB / 76733 MB avail; 103/9745 degraded (1.057%) 2013-06-18 19:27:18.308209 mon.0 172.18.46.34:6789/0 1613 : [INF] osdmap e648: 6 osds: 5 up, 5 in 2013-06-18 19:27:18.316487 mon.0 172.18.46.34:6789/0 1614 : [INF] pgmap v14191: 456 pgs: 335 active+clean, 119 stale+active+clean, 2 active+remapped+backfilling; 16874 MB data, 35598 MB used, 41134 MB / 76733 MB avail; 103/9745 degraded (1.057%) 2013-06-18 19:27:22.676915 mon.0 172.18.46.34:6789/0 1615 : [INF] pgmap
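One way to reflect the different disk sizes in this thread, so the 10G OSD receives proportionally less data, is to give it a smaller CRUSH weight; the numbers below are only an example, and the nearfull/full thresholds can also be tuned in ceph.conf:

  ceph osd crush reweight osd.1 0.67    # ~10G disk next to 15G ones

  # ceph.conf
  [mon]
      mon osd nearfull ratio = .85
      mon osd full ratio = .95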
Re: [ceph-users] ceph iscsi questions
On Tue, Jun 18, 2013 at 02:38:19PM +0200, Kurt Bauer wrote: Da Chun wrote: Thanks for sharing! Kurt. Yes, I have read the article you mentioned. But I also read another one: http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices. It uses LIO, which is the current standard Linux kernel SCSI target. That has a major disadvantage: you have to use the kernel rbd module, which is not feature-equivalent to the Ceph userland code, at least in the kernel versions shipped with recent distributions.

Yes, that is why I like that some Ceph developers added rbd to tgt. It's just a lot easier to do upgrades. I don't expect it to be much slower either. I believe I read somewhere that LIO in the kernel has its limits, specifically the number of threads...? But I could be wrong. The disadvantage of tgt is that the clustering support does not work (yet?).

There is another doc on the ceph site: http://ceph.com/w/index.php?title=ISCSI&redirect=no Quite outdated I think; the last update was nearly 3 years ago, and I don't understand what the box in the middle should depict. I don't quite understand how the multipath works here. Are the two iSCSI targets on the same system or two different ones? Has anybody tried this already? Leen has illustrated that quite well.

----- Original Message ----- From: Kurt Bauer kurt.ba...@univie.ac.at Date: Tue, Jun 18, 2013 03:52 PM To: Da Chun ng...@qq.com Cc: ceph-users ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph iscsi questions

Hi, Da Chun wrote: Hi List, I want to deploy a ceph cluster with the latest cuttlefish, and export it with an iSCSI interface to my applications. Some questions here: 1. Which Linux distro and release would you recommend? I used Ubuntu 13.04 for testing purposes before. For the ceph cluster or the iSCSI gateway? We use Ubuntu 12.04 LTS for the cluster and the iSCSI gateway, but tested Debian wheezy as iSCSI gateway too. Both work flawlessly. 2. Which iscsi target is better? LIO, SCST, or others? Have you read http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/ ? That's what we do and it works without problems so far. 3. The system for the iscsi target will be a single point of failure. How to eliminate it and make good use of ceph's nature of distribution? That's a question we asked ourselves too. In theory one can set up 2 iSCSI gateways and use multipath, but what does that do to the cluster? Will something break if 2 iSCSI targets use the same rbd image in the cluster? Even if I use failover mode only? Has someone already tried this and is willing to share their knowledge? Best regards, Kurt Thanks!

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Single Cluster / Reduced Failure Domains
On Tue, Jun 18, 2013 at 09:02:12AM -0700, Gregory Farnum wrote: On Tuesday, June 18, 2013, harri wrote: Hi, I wondered what best practice is recommended for reducing failure domains for a virtual server platform. If I wanted to run multiple virtual server clusters, would it be feasible to serve storage from 1 large Ceph cluster? I'm a bit confused by your question here. Normally you want as many defined failure domains as possible to best tolerate those failures without data loss. I am concerned that, in the unlikely event the whole Ceph cluster fails, then ALL my VMs would be offline. Well, yes? Is there any way to ring-fence failure domains within a logical Ceph cluster, or would you instead look to build multiple Ceph clusters (but then that defeats the object of the technology, doesn't it?)? You can separate your OSDs into different CRUSH buckets and then assign different pools to draw from those buckets if you're trying to split up your storage somehow. But I'm still a little confused about what you're after. :) -Greg

I think I know what he means, because this is what I've been thinking: the software (of the monitors) is the single point of failure. For example, when you do an upgrade of Ceph and your monitors fail because of the upgrade, you will have downtime. Obviously, it isn't every day I upgrade the software of our SAN either. But one of the reasons people seem to be moving to software more than 'hardware' is flexibility. So they want to be able to update it. I've had Ceph test installations fail an upgrade, and I've had a 3-monitor setup lose 1 monitor and followed the wrong procedure to get it back up and running. I've seen others on the mailing list asking for help after upgrade problems. This is exactly why RBD incremental backup makes me happy, because it should be easier to keep up-to-date copies/snapshots on multiple Ceph installations.

-- Software Engineer #42 @ http://inktank.com | http://ceph.com

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
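A rough sketch of the "different CRUSH buckets per pool" idea Greg describes, using hypothetical bucket and pool names; depending on the Ceph version you may instead edit a decompiled CRUSH map to create the rule that only draws from the new root:

  ceph osd crush add-bucket clusterA root     # a separate root for one VM cluster
  ceph osd crush move host1 root=clusterA     # move that cluster's hosts/OSDs under it
  # after creating a CRUSH rule restricted to clusterA, point a pool at it:
  ceph osd pool set vm-pool-a crush_ruleset 1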
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On Sun, May 12, 2013 at 03:14:15PM +0200, Tim Mohlmann wrote: Hi, On Saturday 11 May 2013 16:04:27 Leen Besselink wrote: Someone is going to correct me if I'm wrong, but I think you misread something. The Mon daemon doesn't need that much RAM: the 'RAM: 1 GB per daemon' is per Mon daemon, not per OSD daemon. Gosh, I feel embarrassed. This actually was my main concern / bottleneck. Thanks for pointing this out. Seems Ceph really rocks in deploying affordable data clusters.

I did see you mentioned you wanted to have many disks in the same machine, not just machines with, let's say, 12 disks for example. Did you know you need the CPU power of a 1 GHz Xeon core per OSD for the times when recovery is happening?

Regards, Tim

On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote: Hi, First of all I am new to ceph and this mailing list. At this moment I am looking into the possibilities to get involved in the storage business. I am trying to get an estimate of the costs and after that I will start to determine how to get sufficient income. First I will describe my case; at the bottom you will find my questions.

GENERAL LAYOUT: Part of this cost calculation is of course hardware. For the larger part I've already figured it out. In my plans I will be leasing a full rack (46U). Depending on the domestic needs I will be using 36 or 40U for OSD storage servers. (I will assume 36U from here on, to keep a solid value for calculation and have enough spare space for extra devices). Each OSD server uses 4U and can take 36x3.5" drives. So in 36U I can put 36/4=9 OSD servers, containing 9*36=324 HDDs.

HARD DISK DRIVES: I have been looking at the WD RE and RED series. RE is more expensive per GB, but has a larger MTBF and offers a 4TB model. RED is (really) cheap per GB, but only goes as far as 3TB. At my current calculations it does not matter much if I put in expensive WD RE 4TB disks or cheaper WD RED 3TB; the price per GB over the complete cluster expense and 3 years of running costs (including AFR) is almost the same. So basically, if I could reduce the costs of all the other components used in the cluster, I would go for the 3TB disk, and if the costs turn out higher than my first calculation, I would use the 4TB disk. Let's assume 4TB from now on. So, 4*324=1296TB. So let's go petabyte ;).

NETWORK: I will use a redundant 2x10GbE network connection for each node. Two independent 10GbE switches will be used and I will use bonding between the interfaces on each node. (Thanks to some guy in the #Ceph IRC for pointing this option out). I will use VLANs to split front-side, back-side and Internet networks.

OSD SERVER: SuperMicro based, 36 HDD hotswap. Dual-socket mainboard, 16x DIMM sockets. It is advertised they can take up to 512GB of RAM. I will install 2 x Intel Xeon E5620 2.40GHz processors, having 4 cores and 8 threads each. For the RAM I am in doubt (see below). I am looking into running 1 OSD per disk.

MON AND MDS SERVERS: Now comes the big question: what specs are required? At first I had the plan to use 4 SuperMicro superservers, with 4-socket mainboards that can take up to the new 16-core AMD processors and up to 1TB of RAM. I want all 4 of the servers to run a MON service, MDS service and customer / public services. Probably I would use VMs (kvm) to separate them. I will compile my own kernel to enable Kernel Samepage Merging, hugepage support and memory compaction to make RAM use more efficient. The requirements for my public services will be added up once I know what I need for MON and MDS.
RAM FOR ALL SERVERS: So what would you estimate to be the RAM usage? http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations. Sounds OK for the OSD part. 500 MB per daemon would put the minimum RAM requirement for my OSD server at 18GB. 32GB should be more than enough. Although I would like to see if it is possible to use btrfs compression? In that case I'd need more RAM in there. What I really want to know: how much RAM do I need for the MON and MDS servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive! In my case I would need at least 324 GB of RAM for each of them. Initially I was planning to use 4 servers, each of them running both. Joining those in a single system, with the other duties the system has to perform, I would need the full 1TB of RAM. I would need to use 32GB modules, which are really expensive per GB and difficult to find (not many server hardware vendors in the Netherlands have them). QUESTIONS: Question 1: Is it really the amount of OSDs that counts for MON and MDS RAM usage
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On Sun, May 12, 2013 at 10:22:10PM +0200, Tim Mohlmann wrote: Hi, On Sunday 12 May 2013 18:05:16 Leen Besselink wrote: I did see you mentioned you wanted to have, many disks in the same machine. Not just machines with let's say 12 disks for example. Did you know you need the CPU-power of a 1Ghz Xeon core per OSD for the times when recovery is happening ? Nope, did not know it. The current intent is to install 2x 2.4 Ghz xeon CPU, handeling 8 threads each. So, 2*8*2.4=38.4 for max OSD's. It should be fine. If I would go for the 72 disk option, I have to consider doubling that power. The current max I can select from the dealer I am looking at, for the socket housed in the supermicro 72x 3.5 version are 2x a Xeon x5680. Utilizing 12 threads each, at 3.33Ghz. So, 2*12*3.33=79.79 for max OSD's. Also this should be fine. What will happen if the CPU is maxed out anyway? Slowing things or crashing things? In my opinion it is not a bad thing if a system is maxed out in such a massive migration, which should not occur on a daily base. Sure, a disk that fails every two weeks, no prob. What are we talking about? 0.3% of the complete storage cluster. Even 0.15% if I would take the 72x3.5 servers. Even if one disk/OSD fails, it would need to recheck where each placement groups should be stored and move stuff around if needed. If during this action your CPUs are maxed out, you might start to lose connections between OSDs and the process will need to start over. At least that is how I understand it, I've done a few test installations, but not yet deployed it in production. The Inktank people said in the presentations I've seen (and looking at the picture in the video from DreamHost I have a feeling that is what they've deployed): 12 HDD == 12 OSD per machine is ideal, maybe with 2 or 3 SSD for journaling if you want more performance. If a complete server stops working, that is something else. But as I said in a different split of this thread: if that happens I have got different things to worry about, than a slow migration of data. As long as there is no data lost, I don't really care it takes a bit longer. Thanks for the advise. Tim ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi, Someone is going to correct me if I'm wrong, but I think you misread something. The Mon-daemon doesn't need that much RAM: The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon. The same for disk-space. You should read this page again: http://ceph.com/docs/master/install/hardware-recommendations/ Some of the other questions are answered there as well. Like how much memory does a OSD-daemon need and why/when. On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote: Hi, First of all I am new to ceph and this mailing list. At this moment I am looking into the possibilities to get involved in the storage business. I am trying to get an estimate about costs and after that I will start to determine how to get sufficient income. First I will describe my case, at the bottom you will find my questions. GENERAL LAYOUT: Part of this cost calculation is of course hardware. For the larger part I've already figured it out. In my plans I will be leasing a full rack (46U). Depending on the domestic needs I will be using 36 or 40U for ODS storage servers. (I will assume 36U from here on, to keep a solid value for calculation and have enough spare space for extra devices). Each OSD server uses 4U and can take 36x3.5 drives. So in 36U I can put 36/4=9 OSD servers, containing 9*36=324 HDDs. HARD DISK DRIVES I have been looking for WD digital RE and RED series. RE is more expensive per GB, but has a larger MTBF and offers a 4TB model. RED is (real) cheap per GB, but only goes as far a 3TB. At my current calculations it does not matter much if I would put expensive WD RE 4TB disks or cheaper WD RED 3TB, the price per GB over the complete cluster expense and 3 years of running costs (including AFR) is almost the same. So basically, if I could reduce the costs of all the other components used in the cluster, I would go for the 3TB disk and if the costs will be higher then my first calculation, I would use the 4TB disk. Let's assume 4TB from now on. So, 4*324=1296TB. So lets go Peta-byte ;). NETWORK I will use a redundant 2x10Gbe network connection for each node. Two independent 10Gbe switches will be used and I will use bonding between the interfaces on each node. (Thanks some guy in the #Ceph irc for pointing this option out). I will use VLAN's to split front-side, backside and Internet networks. OSD SERVER SuperMicro based, 36 HDD hotswap. Dual socket mainboard. 16x DIMM sockets. It is advertised they can take up to 512GB of RAM. I will install 2 x Intel Xeon E5620 2.40ghz processor, having 4 cores and 8 threads each. For the RAM I am in doubt (see below). I am looking into running 1 OSD per disk. MON AND MDS SERVERS Now comes the big question. What specs are required? It first I had the plan to use 4 SuperMicro superservers, with a 4 socket mainboards that contain up to the new 16core AMD processors and up to 1TB of RAM. I want all 4 of the servers to run a MON service, MDS service and costumer / public services. Probably I would use VM's (kvm) to separate them. I will compile my own kernel to enable Kernel Samepage Merge, Hugepage support and memory compaction to make RAM use more efficient. The requirements for my public services will be added up, once I know what I need for MON and MDS. RAM FOR ALL SERVERS So what would you estimate to be the ram usage? http://ceph.com/docs/master/install/hardware-recommendations/#minimum- hardware-recommendations. Sounds OK for the OSD part. 500 MB per daemon, would put the minimum RAM requirement for my OSD server to 18GB. 
32GB should be more than enough, although I would like to see if it is possible to use btrfs compression? In that case I'd need more RAM in there. What I really want to know: how much RAM do I need for the MON and MDS servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive! In my case I would need at least 324 GB of RAM for each of them. Initially I was planning to use 4 servers, each of them running both. Joining those in a single system, with the other duties the system has to perform, I would need the full 1TB of RAM. I would need to use 32GB modules, which are really expensive per GB and difficult to find (not many server hardware vendors in the Netherlands have them).

QUESTIONS
Question 1: Is it really the number of OSDs that counts for MON and MDS RAM usage, or the size of the object store?
Question 2: Can I do it with less RAM? Any statistics, or better, a calculation? I can imagine memory pages becoming redundant as the cluster grows, so less memory would be required per OSD.
Question 3: If it is the number of OSDs that counts, would it be beneficial to combine disks in a RAID 0 (lvm or btrfs) array?
Question 4: Is it safe / possible to store MON files inside of the cluster itself? The 10GB per daemon requirement would
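For what it's worth, a rough back-of-the-envelope version of the RAM sums above, using the per-daemon figures from the hardware-recommendations page already linked in this thread (roughly 500 MB-1 GB per OSD daemon, about 1 GB per MON daemon) and the node counts from the proposed layout (9 servers x 36 OSDs), comes out to something like:

    Per OSD server:  36 OSD daemons x 0.5-1 GB  =  ~18-36 GB RAM (more is better, especially during recovery)
    Cluster total:   9 servers x 36 OSDs        =  324 OSD daemons
    Per MON host:    1 MON daemon x ~1 GB       =  ~1 GB RAM for the MON itself, plus whatever the OS and any co-located services need

The 324 GB figure only appears if the per-MON-daemon recommendation is multiplied by the number of OSDs in the cluster, which is exactly the misreading pointed out at the top of this reply; a monitor host needs RAM on the order of gigabytes, not hundreds of gigabytes.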
Re: [ceph-users] Using Ceph as Storage for VMware
On Thu, May 09, 2013 at 11:51:32PM +0100, Neil Levine wrote: Jared, As Weiguo says, you will need to use a gateway to present a Ceph block device (RBD) in a format VMware understands. We've contributed the relevant code to the TGT iSCSI target (see blog: http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/) and though we haven't done a massive amount of testing on it, I'd love to get some feedback on it. We will be putting more effort into it this cycle (including producing a package).

We also have a legacy virtualization setup we are thinking of using with Ceph and iSCSI. We however also ended up at LIO, because LIO supports the iSCSI extensions which are needed for clustering. stgt doesn't yet support all the needed extensions as far as I can see. There seems to be exactly one person sporadically working on improving stgt in this area.

If you have a VMware account rep, be sure to ask him to file support for Ceph as a customer request with the product teams while we continue to knock on VMware's door :-) Neil

On Thu, May 9, 2013 at 11:30 PM, w sun ws...@hotmail.com wrote: RBD is not supported by VMware/vSphere. You will need to build an NFS/iSCSI/FC GW to support VMware. Here is a post from someone who has been trying it; you may have to contact them directly for status: http://ceph.com/community/ceph-over-fibre-for-vmware/ --weiguo

To: ceph-users@lists.ceph.com From: jaredda...@shelterinsurance.com Date: Thu, 9 May 2013 17:25:02 -0500 Subject: [ceph-users] Using Ceph as Storage for VMware I am investigating using Ceph as a storage target for virtual servers in VMware. We have 3 servers packed with hard drives ready for the proof of concept. I am looking for some direction. Is this a valid use for Ceph? If so, has anybody accomplished this? Are there any documents on how to set this up? Should I use RBD, NFS, etc.? Any help would be greatly appreciated. Thank You, JD
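For anyone who wants to try the tgt/stgt RBD backend mentioned above, a minimal sketch of what a target definition can look like in /etc/tgt/targets.conf (or a file under /etc/tgt/conf.d/) on a gateway where tgt was built with RBD support is below. The IQN and the pool/image name (rbd/vmware-lun0) are made-up placeholders, and the exact directives may vary with the tgt version, so treat this as a starting point rather than a tested recipe:

    <target iqn.2013-05.com.example:rbd-gateway>
        driver iscsi
        # use the RBD backing-store type instead of a local file or block device
        bs-type rbd
        # pool/image of the RBD image to export as a LUN
        backing-store rbd/vmware-lun0
        # open the target to any initiator; tighten this in a real setup
        initiator-address ALL
    </target>

After restarting tgt (or re-running tgt-admin), the exported LUN should show up in the output of tgt-admin --show, and the initiator side is configured like any other iSCSI target.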
Re: [ceph-users] Using Ceph as Storage for VMware
On Fri, May 10, 2013 at 12:12:45AM +0100, Neil Levine wrote: Leen, Do you mean you got LIO working with RBD directly? Or are you just re-exporting a kernel-mounted volume? Neil

Yes, re-exporting a kernel-mounted volume on separate gateway machines.
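For anyone wanting to reproduce the re-export setup described above, here is a rough sketch of what the commands on such a gateway machine might look like. The image name (rbd/vmware-lun0), the target IQN and the initiator IQN are placeholders, this assumes the gateway already has a working ceph.conf and keyring, and the exact targetcli syntax differs a bit between versions:

    # map the RBD image with the kernel client; it typically appears as /dev/rbd0
    rbd map rbd/vmware-lun0

    # export the mapped block device through LIO as an iSCSI LUN
    targetcli /backstores/block create name=rbd0 dev=/dev/rbd0
    targetcli /iscsi create iqn.2013-05.com.example:gw1
    targetcli /iscsi/iqn.2013-05.com.example:gw1/tpg1/luns create /backstores/block/rbd0
    targetcli /iscsi/iqn.2013-05.com.example:gw1/tpg1/acls create iqn.1998-01.com.vmware:esxi-host
    # some targetcli versions also need a portal created explicitly under tpg1/portals
    targetcli saveconfig

A second gateway can export the same image the same way; whether initiators can safely use both targets at once depends on the clustering-related iSCSI extensions discussed above, which is why LIO was preferred over stgt in this setup.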