Re: [ceph-users] Ceph with VMWare / XenServer
On Mon, May 12, 2014 at 03:45:43PM +0200, Uwe Grohnwaldt wrote: Hi, yes, we use it in production. I can stop/kill the tgt on one server and XenServer fails over to the second one. We enabled multipathing in XenServer. In our setup we don't have multiple IP ranges, so we scan/log in to the second target on XenServer startup with iscsiadm in rc.local. That's based on history - we used Dell EqualLogic before Ceph came in and there was no need to use multipathing (only LACP channels). Now we have enabled multipathing and use tgt, but without different IP ranges.

I assume you connected the machines to the same switch? Normal LACP doesn't work across multiple switches. Is that correct? It wasn't that I needed different IP ranges in my setup; it just makes things simpler/more predictable.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: Uwe Grohnwaldt u...@grohnwaldt.eu Cc: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 14:48:58 Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Uwe, thanks for your quick reply. Do you run the XenServer setup in a production environment, and have you tried to test some failover scenarios to see if the XenServer guest VMs keep working during the failover of storage servers? Also, how did you set up the XenServer iSCSI? Have you used the multipath option to set up the LUNs? Cheers

----- Original Message ----- From: Uwe Grohnwaldt u...@grohnwaldt.eu To: ceph-users@lists.ceph.com Sent: Monday, 12 May, 2014 12:57:48 PM Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Hi, at the moment we are using tgt with the RBD backend compiled from source on Ubuntu 12.04 and 14.04 LTS. We have two machines within two IP ranges (e.g. 192.168.1.0/24 and 192.168.2.0/24): one machine in 192.168.1.0/24 and one machine in 192.168.2.0/24. The config for tgt is the same on both machines; they export the same rbd image. This works well for XenServer. For VMWare you have to disable VAAI to use it with tgt (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665). If you don't disable it, ESXi becomes very slow and unresponsive. I think the problem is the iSCSI Write Same support, but I haven't tested which of the VAAI settings is responsible for this behavior.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 12:00:48 Subject: [ceph-users] Ceph with VMWare / XenServer

Hello guys, I am currently running a Ceph cluster for running VMs with qemu + rbd. It works pretty well and provides a good degree of failover. I am able to run maintenance tasks on the Ceph nodes without interrupting VM IO. I would like to do the same with the VMWare / XenServer hypervisors, but I am not really sure how to achieve this. Initially I thought of using iSCSI multipathing; however, as it turns out, multipathing is more for load balancing and NIC/switch failure. It does not allow me to perform maintenance on the iSCSI target without interrupting service to the VMs. Has anyone done either a PoC or, better, a production environment where they've used Ceph as backend storage with VMWare / XenServer?
The important element for me is the ability to perform maintenance tasks, and resilience to failures, without interrupting IO to the VMs. Are there any recommendations or howtos on how this could be achieved? Many thanks Andrei

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
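For reference, a minimal sketch of the tgt configuration described above (two gateways exporting the same RBD image) could look like the following. The IQN, pool and image names are only examples; the same file would be deployed on both gateway hosts:

  # /etc/tgt/conf.d/rbd.conf (identical on both iSCSI gateways)
  <target iqn.2014-05.com.example:rbd.vmstore>
      driver iscsi
      bs-type rbd                  # needs a tgt build with RBD support
      backing-store rbd/vmstore    # <pool>/<image>
  </target>

After editing, something like tgt-admin --update ALL should apply the configuration, and tgtadm --lld iscsi --mode target --op show can be used to confirm the LUN is exported.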
Re: [ceph-users] Ceph with VMWare / XenServer
On Mon, May 12, 2014 at 07:01:46PM +0200, Leen Besselink wrote: On Mon, May 12, 2014 at 03:45:43PM +0200, Uwe Grohnwaldt wrote: Hi, yes, we use it in production. I can stop/kill the tgt on one server and XenServer fails over to the second one. We enabled multipathing in XenServer. In our setup we don't have multiple IP ranges, so we scan/log in to the second target on XenServer startup with iscsiadm in rc.local. That's based on history - we used Dell EqualLogic before Ceph came in and there was no need to use multipathing (only LACP channels). Now we have enabled multipathing and use tgt, but without different IP ranges.

I assume you connected the machines to the same switch? Normal LACP doesn't work across multiple switches. Is that correct? Or maybe you used a stack, or you have Cisco switches with vPC? It wasn't that I needed different IP ranges in my setup; it just makes things simpler/more predictable.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: Uwe Grohnwaldt u...@grohnwaldt.eu Cc: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 14:48:58 Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Uwe, thanks for your quick reply. Do you run the XenServer setup in a production environment, and have you tried to test some failover scenarios to see if the XenServer guest VMs keep working during the failover of storage servers? Also, how did you set up the XenServer iSCSI? Have you used the multipath option to set up the LUNs? Cheers

----- Original Message ----- From: Uwe Grohnwaldt u...@grohnwaldt.eu To: ceph-users@lists.ceph.com Sent: Monday, 12 May, 2014 12:57:48 PM Subject: Re: [ceph-users] Ceph with VMWare / XenServer

Hi, at the moment we are using tgt with the RBD backend compiled from source on Ubuntu 12.04 and 14.04 LTS. We have two machines within two IP ranges (e.g. 192.168.1.0/24 and 192.168.2.0/24): one machine in 192.168.1.0/24 and one machine in 192.168.2.0/24. The config for tgt is the same on both machines; they export the same rbd image. This works well for XenServer. For VMWare you have to disable VAAI to use it with tgt (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665). If you don't disable it, ESXi becomes very slow and unresponsive. I think the problem is the iSCSI Write Same support, but I haven't tested which of the VAAI settings is responsible for this behavior.

Mit freundlichen Grüßen / Best Regards, -- Consultant Dipl.-Inf. Uwe Grohnwaldt Gutleutstr. 351 60327 Frankfurt a. M. eMail: u...@grohnwaldt.eu Telefon: +49-69-34878906 Mobil: +49-172-3209285 Fax: +49-69-348789069

----- Original Message ----- From: Andrei Mikhailovsky and...@arhont.com To: ceph-users@lists.ceph.com Sent: Monday, 12 May 2014 12:00:48 Subject: [ceph-users] Ceph with VMWare / XenServer

Hello guys, I am currently running a Ceph cluster for running VMs with qemu + rbd. It works pretty well and provides a good degree of failover. I am able to run maintenance tasks on the Ceph nodes without interrupting VM IO. I would like to do the same with the VMWare / XenServer hypervisors, but I am not really sure how to achieve this. Initially I thought of using iSCSI multipathing; however, as it turns out, multipathing is more for load balancing and NIC/switch failure. It does not allow me to perform maintenance on the iSCSI target without interrupting service to the VMs.
Has anyone done either a PoC or, better, a production environment where they've used Ceph as backend storage with VMWare / XenServer? The important element for me is the ability to perform maintenance tasks, and resilience to failures, without interrupting IO to the VMs. Are there any recommendations or howtos on how this could be achieved? Many thanks Andrei

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
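As a rough illustration of the "scan/login the second target with iscsiadm in rc.local" approach and the VAAI workaround mentioned in this thread, the commands might look like this; the portal address, IQN and ESXi option names should be checked against your own setup and the referenced KB article:

  # on the XenServer host, e.g. from rc.local
  iscsiadm -m discovery -t sendtargets -p 192.168.1.11:3260
  iscsiadm -m node -T iqn.2014-05.com.example:rbd.vmstore -p 192.168.1.11:3260 --login

  # on the ESXi host, disabling the VAAI primitives
  esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
  esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
  esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking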
Re: [ceph-users] NFS over CEPH - best practice
On Mon, May 12, 2014 at 10:52:33AM +0100, Andrei Mikhailovsky wrote: Leen, thanks for explaining things. It does make sense now. Unfortunately, it does look like this technology would not fulfill my requirements, as I do need the ability to perform maintenance without shutting down VMs.

Sorry for being cautious. I've seen certain iSCSI initiators act that way. I do not know if that is representative of other iSCSI initiators, so I don't know if that applies to VMWare. During failover, reads/writes would be stalled of course. When properly configured, failover of the target could be done in seconds.

I will open another topic to discuss possible solutions. Thanks for all your help Andrei

----- Original Message ----- From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Cc: Andrei Mikhailovsky and...@arhont.com Sent: Sunday, 11 May, 2014 11:41:08 PM Subject: Re: [ceph-users] NFS over CEPH - best practice

On Sun, May 11, 2014 at 09:24:30PM +0100, Andrei Mikhailovsky wrote: Sorry if these questions will sound stupid, but I was not able to find an answer by googling. As the Australians say: no worries, mate. It's fine.

1. Does the iSCSI protocol support having multiple target servers serve the same disk/block device? No, I don't think so. What does work is active/standby failover. I suggest having some kind of clustering, because as far as I can see, you never want to have 2 target servers active if they don't share state (as far as I know there is no Linux iSCSI target server which can share state between 2 targets). When there is a failure there is time to have all targets offline for a brief moment, before the second target comes online. The initiators should be able to handle short interruptions.

In the case of Ceph, the same rbd disk image. I was hoping to have multiple servers mount the same rbd disk and serve it as an iSCSI LUN. This LUN would be used as VM image storage on VMWare / XenServer. You'd have one server which handles a LUN; when it goes down, another should take over the target IP address and handle requests for that LUN.

2. Does iSCSI multipathing provide failover/HA capability only on the initiator side? The docs that I came across all mention multipathing on the client side, like using two different NICs. I did not find anything about having multiple NICs on the initiator connecting to multiple iSCSI target servers. Multipathing for iSCSI, as I see it, only does one thing: it can be used to create multiple network paths between the initiator and the target. They can be used for resilience (read: failover) or for load balancing when you need more bandwidth. The way I would do it is to have 2 switches and connect each initiator and each target to both switches. Also you would have 2 IP subnets, so both the target and the initiator would have 2 IP addresses, one from each subnet. So for example: the target would have 10.0.1.1 and 10.0.2.1, and the initiator 10.0.1.11 and 10.0.2.11. Then you run the IP traffic for 10.0.1.x on switch 1 and the 10.0.2.x traffic on switch 2. Thus, you have created a resilient setup: the target has multiple connections to the network, the initiator has multiple connections to the network, and you can also handle a switch failure.

I was hoping to have a resilient solution on the storage side so that I can perform upgrades and maintenance without needing to shut down VMs running on VMWare/XenServer. Is this possible with iSCSI?
The failover setup is mostly there to handle failures; it is not really great for maintenance because it does give a short interruption in service, like 30 seconds or so of no writing to the LUN. That might not be a problem for you, I don't know, but it is at least something to be aware of. And also something you should test when you've built the setup. Cheers Hope that helps. Andrei

----- Original Message ----- From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Sent: Saturday, 10 May, 2014 8:31:02 AM Subject: Re: [ceph-users] NFS over CEPH - best practice

On Fri, May 09, 2014 at 12:37:57PM +0100, Andrei Mikhailovsky wrote: Ideally I would like to have a setup with 2+ iSCSI servers, so that I can perform maintenance if necessary without shutting down the VMs running on the servers. I guess multipathing is what I need. Also I will need to have more than one XenServer/VMWare host server, so the iSCSI LUNs will be mounted on several servers. So you have multiple machines talking to the same LUN at the same time? You'll have to co-ordinate how changes are written to the backing store; normally you'd have the virtualization servers use some kind of protocol. When it's SCSI there are the older Reserve/Release commands
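To make the two-subnet multipath layout described above concrete, the initiator-side commands could look roughly like this (the addresses are the illustrative ones from the mail):

  # log in to the same target over both subnets/switches
  iscsiadm -m discovery -t sendtargets -p 10.0.1.1:3260
  iscsiadm -m discovery -t sendtargets -p 10.0.2.1:3260
  iscsiadm -m node --login
  # dm-multipath should then show two paths to the same LUN
  multipath -ll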
Re: [ceph-users] NFS over CEPH - best practice
On Mon, May 12, 2014 at 12:08:24PM -0500, Dimitri Maziuk wrote: PS. (now that I looked) see e.g. http://blogs.mindspew-age.com/2012/04/05/adventures-in-high-availability-ha-iscsi-with-drbd-iscsi-and-pacemaker/ Dima Didn't you say you wanted multiple servers to write to the same LUN ? I think this set up won't work. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] NFS over CEPH - best practice
On Sun, May 11, 2014 at 09:24:30PM +0100, Andrei Mikhailovsky wrote: Sorry if these questions will sound stupid, but I was not able to find an answer by googling. As the Astralians say: no worries, mate. It's fine. 1. Does iSCSI protocol support having multiple target servers to serve the same disk/block device? No, I don't think so. What does work is active/standby failover. I suggest to have some kind of clustering, because as far as I can see, you never want to have 2 target servers active if they don't share state (as far as I know there is no Linux iSCSI-target server which can share state between 2 targets). When there is a failure there is time to have all targets offline for a brief moment, before the second target comes online. The initiators should be able to handle short interruptions. In case of ceph, the same rbd disk image. I was hoping to have multiple servers to mount the same rbd disk and serve it as an iscsi LUN. This LUN would be used as a vm image storage on vmware / xenserver. You'd have one server which handles a LUN, with it goes down, an other should take over the target IP-address and handle requests for that LUN. 2.Does iscsi multipathing provide failover/HA capability only on the initiator side? The docs that i came across all mention multipathing on the client side, like using two different nics. I did not find anything about having multiple nics on the initiator connecting to multiple iscsi target servers. Multipathing for iSCSI, as I see it, only does one thing: it can be used to create multiple network paths between the initiator and the target. They can be used for resiliance (read: failover) or for loadbalancing when you need more bandwidth. The way I would do it is to have 2 switches and connect each initiator and each target to both switches. Also you would have 2 IP-subnets. So both the target and initiator would have 2 IP-addresses, one from each subnet. So for example: the target would have: 10.0.1.1 and 10.0.2.1 and the initiator: 10.0.1.11 and 10.0.2.11 Then you run the IP-traffic for 10.0.1.x on switch 1 and the 10.0.2.x traffic on switch 2. Thus, you have created a resilient set up: The target has multiple connections to the network, the initiator has multiple connections to the network and you can also handle a switch failover. I was hoping to have resilient solution on the storage side so that I can perform upgrades and maintenance without needing to shutdown vms running on vmware/xenserver. Is this possible with iscsi? The failover set up is mostly to handle failures, not really great for maintenance because it does give a short interruption in service. Like 30 seconds or so of no writing to the LUN. That might not be a problem for you, I don't know, but it is at least something to be aware of. And also something you should test when you've build the setup. Cheers Hope that helps. Andrei - Original Message - From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Sent: Saturday, 10 May, 2014 8:31:02 AM Subject: Re: [ceph-users] NFS over CEPH - best practice On Fri, May 09, 2014 at 12:37:57PM +0100, Andrei Mikhailovsky wrote: Ideally I would like to have a setup with 2+ iscsi servers, so that I can perform maintenance if necessary without shutting down the vms running on the servers. I guess multipathing is what I need. Also I will need to have more than one xenserver/vmware host servers, so the iscsi LUNs will be mounted on several servers. So you have multiple machines talking to the same LUN at the same time ? 
You'll have to co-ordinate how changes are written to the backing store, normally you'd have the virtualization servers use some kind of protocol. When it's SCSI there are the older Reserve/Release commands and the newer SCSI-3 Persistent Reservation commands. (i)SCSI allows multiple changes to be in-flight, without coordination things will go wrong. Below it was mentioned that you can disable the cache for rbd, if you have no coordination protocol you'll need to do the same on the iSCSI-side. I believe when you do that it will be slower, but it might work. Would the suggested setup not work for my requirements? It depends on VMWare if they allow such a setup. Then there is an other thing. How do the VMWare machines coordinate which VM they should be running ? I don't know VMWare but usually if you have some kind of clustering setup you'll need to have a 'quorum'. A lot of times the quorum is handled by a quorum disk with the SCSI coordiation protocols mentioned above. An other way to have a quorum is to have a majority voting system with an un-even number of machines talking over the network. This is what Ceph monitor nodes do. As an example of a clustering system that allows it to be used without a quorum disk with only 2 machines
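As a sketch of the two-node Pacemaker idea mentioned above (no quorum disk, STONITH used to fence the peer), the configuration could look something like this; node names and IPMI details are placeholders:

  crm configure property no-quorum-policy=ignore
  crm configure property stonith-enabled=true
  crm configure primitive fence-node2 stonith:external/ipmi \
      params hostname=node2 ipaddr=10.0.0.102 userid=admin passwd=secret interface=lan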
Re: [ceph-users] NFS over CEPH - best practice
On Fri, May 09, 2014 at 12:37:57PM +0100, Andrei Mikhailovsky wrote: Ideally I would like to have a setup with 2+ iSCSI servers, so that I can perform maintenance if necessary without shutting down the VMs running on the servers. I guess multipathing is what I need. Also I will need to have more than one XenServer/VMWare host server, so the iSCSI LUNs will be mounted on several servers.

So you have multiple machines talking to the same LUN at the same time? You'll have to co-ordinate how changes are written to the backing store; normally you'd have the virtualization servers use some kind of protocol. When it's SCSI there are the older Reserve/Release commands and the newer SCSI-3 Persistent Reservation commands. (i)SCSI allows multiple changes to be in flight; without coordination things will go wrong. Below it was mentioned that you can disable the cache for rbd; if you have no coordination protocol you'll need to do the same on the iSCSI side. I believe when you do that it will be slower, but it might work.

Would the suggested setup not work for my requirements? It depends on whether VMWare allows such a setup. Then there is another thing: how do the VMWare machines coordinate which VMs they should be running? I don't know VMWare, but usually if you have some kind of clustering setup you'll need to have a 'quorum'. A lot of times the quorum is handled by a quorum disk with the SCSI coordination protocols mentioned above. Another way to have a quorum is to have a majority voting system with an uneven number of machines talking over the network. This is what Ceph monitor nodes do. An example of a clustering system that can be used without a quorum disk, with only 2 machines talking over the network, is Linux Pacemaker. When something bad happens, one machine will just turn off the power of the other machine to prevent things going wrong (this is called STONITH). Andrei

----- Original Message ----- From: Leen Besselink l...@consolejunkie.net To: ceph-users@lists.ceph.com Sent: Thursday, 8 May, 2014 9:35:21 PM Subject: Re: [ceph-users] NFS over CEPH - best practice

On Thu, May 08, 2014 at 01:24:17AM +0200, Gilles Mocellin wrote: On 07/05/2014 15:23, Vlad Gorbunov wrote: It's easy to install tgtd with ceph support. Ubuntu 12.04 for example: Connect the ceph-extras repo: echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list Install tgtd with rbd support: apt-get update apt-get install tgt It's important to disable the rbd cache on the tgtd host. Set in /etc/ceph/ceph.conf: [client] rbd_cache = false [...]

Hello, Hi, Without cache on the tgtd side, it should be possible to have failover and load balancing (active/active) multipathing. Have you tested multipath load balancing in this scenario? If it's reliable, it opens a new way for me to do HA storage with iSCSI! I have a question: what is your use case? Do you need SCSI-3 persistent reservations so multiple machines can use the same LUN at the same time? Because in that case I think tgtd won't help you. Have a good day, Leen.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
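If SCSI-3 persistent reservations matter for a setup like the one discussed above, sg_persist from sg3_utils can show what a given target actually reports; /dev/sdX stands for whatever device the iSCSI LUN appears as on the initiator:

  sg_persist --in --report-capabilities /dev/sdX   # does the target report PR support?
  sg_persist --in --read-keys /dev/sdX             # list currently registered keys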
Re: [ceph-users] NFS over CEPH - best practice
On Thu, May 08, 2014 at 01:24:17AM +0200, Gilles Mocellin wrote: On 07/05/2014 15:23, Vlad Gorbunov wrote: It's easy to install tgtd with ceph support. Ubuntu 12.04 for example: Connect the ceph-extras repo: echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list Install tgtd with rbd support: apt-get update apt-get install tgt It's important to disable the rbd cache on the tgtd host. Set in /etc/ceph/ceph.conf: [client] rbd_cache = false [...]

Hello, Hi, Without cache on the tgtd side, it should be possible to have failover and load balancing (active/active) multipathing. Have you tested multipath load balancing in this scenario? If it's reliable, it opens a new way for me to do HA storage with iSCSI! I have a question: what is your use case? Do you need SCSI-3 persistent reservations so multiple machines can use the same LUN at the same time? Because in that case I think tgtd won't help you. Have a good day, Leen.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
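To check that a tgt build like the one above really has the RBD backing store compiled in, and to attach an image by hand, something along these lines should work (the target IQN and pool/image names are examples):

  tgtadm --lld iscsi --mode system --op show          # look for rbd under "Backing stores"
  tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2014-05.com.example:rbd.test
  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --bstype rbd --backing-store rbd/test
  tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL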
Re: [ceph-users] ceph uses too much disk space!!
On Sun, Oct 06, 2013 at 10:00:48AM +0300, Linux Chips wrote: maybe it's worth mentioning that my OSDs are formatted as btrfs. I don't think that btrfs has 13% overhead. Or does it?

I would suggest you look at btrfs df, not df (never use df with btrfs), and btrfs subvolume list to see what btrfs is doing. If I'm not mistaken, Ceph with btrfs uses snapshots as a way to do transactions instead of using a journal. Who knows, maybe something failed and they didn't get cleaned up, or something like that; I've never had a look at how it is handled, so I don't know what it looks like normally. But post some information on the list if you see something unusual; someone probably knows.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
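A hedged example of the btrfs-specific commands suggested above, assuming the OSD data directory is /var/lib/ceph/osd/ceph-0:

  btrfs filesystem df /var/lib/ceph/osd/ceph-0    # real allocation per data/metadata/system
  btrfs subvolume list /var/lib/ceph/osd/ceph-0   # any snapshot subvolumes Ceph created show up here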
Re: [ceph-users] SSD recommendations for OSD journals
On Mon, Jul 22, 2013 at 08:45:07AM +1100, Mikaël Cluseau wrote: On 22/07/2013 08:03, Charles 'Boyo wrote: Counting on the kernel's cache, it appears I will be best served purchasing write-optimized SSDs? Can you share any information on the SSD you are using, is it PCIe connected? We are on a standard SAS bus, so any SSD going to 500MB/s and being stable in the long run will do (we use 60G Intel 520); you do not need a lot of space for the journal (5G per drive is far enough on commodity hardware).

Another question, since the intention of this storage cluster is relatively cheap storage on commodity hardware: what's the balance between cheap SSDs and reliability, since journal failure might result in data loss, or will such an event just 'down' the affected OSDs?

When you do a write to Ceph, one OSD (I believe this is the master for a certain part of the data, an object) receives the write and distributes the copies to other OSDs (as many as are configured, like: min size=2, size=3); when the writes are done on all those OSDs it will confirm the write to the client. So if one OSD fails, other OSDs will have that data. The master will have to make sure another copy is created somewhere else. So I don't see a reason for data loss if you lose one journal. There will be a lot of copying of data though, which will slow things down. A journal failure will fail your OSDs (from what I've understood, you'll have to rebuild them). But SSDs are very deterministic, so monitor them: # smartctl -A /dev/sdd [..] ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE [..] 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 093 093 000 Old_age Always - 0 And don't put too many OSDs on one SSD (I set a rule to not go over 4 for 1). When the SSD is large enough and your journals don't take up all the space, you can also leave part of the SSD unpartitioned. This will allow the SSD to fail much later.

On a similar note, I am using XFS on the OSDs, which also journals; does this affect performance in any way? You want this journal for consistency ;) I don't know exactly the impact, but since we use spinning drives, the most important factor is that Ceph, with a journal on SSD, does a lot of sequential writes, avoiding most seeks.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
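To illustrate the "5G per journal, no more than about 4 OSDs per SSD" guidance above, here is a sketch of how a journal SSD could be carved up and referenced in ceph.conf; device names and sizes are examples only:

  # four 5 GB journal partitions on one SSD
  sgdisk -n 1:0:+5G -n 2:0:+5G -n 3:0:+5G -n 4:0:+5G /dev/sdd

  # ceph.conf, pointing each OSD at its own journal partition
  [osd.0]
      osd journal = /dev/sdd1
  [osd.1]
      osd journal = /dev/sdd2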
Re: [ceph-users] How to change the journal size at run time?
On Fri, Jun 21, 2013 at 12:11:23PM +0800, Da Chun wrote: Hi List, The default journal size is 1G, which I think is too small for my Gb network. I want to extend all the journal partitions to 2 or 4G. How can I do that? The osds were all created by commands like ceph-deploy osd create ceph-node0:/dev/sdb. The journal partition is on the same disk together with the corresponding data partition. I notice there is an attribute osd journal size which value is 1024. I guess this is why the command ceph-deploy osd create set the journal partition size as 1G. I want to do this job using steps as below: 1. Change the osd journal size in the ceph.conf to 4G 2. Remove the osd 3. Readd the osd 4. Repeat 2 and 3 steps for all the osds. This needs lots of manual work and is time consuming. Are there better ways to do that? Thanks! Have a look at these commands: http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--flush-journal http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--mkjournal And this setting: http://ceph.com/docs/master/rados/configuration/osd-config-ref/#index-2 If I'm not mistaken that is a per-machine global or per-osd setting in /etc/ceph/ceph.conf ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to change the journal size at run time?
On Fri, Jun 21, 2013 at 10:39:05AM +0200, Leen Besselink wrote: On Fri, Jun 21, 2013 at 12:11:23PM +0800, Da Chun wrote: Hi List, The default journal size is 1G, which I think is too small for my Gb network. I want to extend all the journal partitions to 2 or 4G. How can I do that? The osds were all created by commands like ceph-deploy osd create ceph-node0:/dev/sdb. The journal partition is on the same disk together with the corresponding data partition. I notice there is an attribute osd journal size which value is 1024. I guess this is why the command ceph-deploy osd create set the journal partition size as 1G. I want to do this job using steps as below: 1. Change the osd journal size in the ceph.conf to 4G 2. Remove the osd 3. Readd the osd 4. Repeat 2 and 3 steps for all the osds. This needs lots of manual work and is time consuming. Are there better ways to do that? Thanks! Have a look at these commands: http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--flush-journal http://ceph.com/docs/master/man/8/ceph-osd/#cmdoption-ceph-osd--mkjournal Actually, I'm slightly mistaken. I don't think you need the mkjournal. If you stop the osd, flush the journal, change the setting, remove the journal and start the osd. I think it would create a new journal automatically. I hope you have a test-environment or maybe someone with more knowledge of these things can confirm or deny what I mentioned. And this setting: http://ceph.com/docs/master/rados/configuration/osd-config-ref/#index-2 If I'm not mistaken that is a per-machine global or per-osd setting in /etc/ceph/ceph.conf ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
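Putting the steps above together, the per-OSD procedure could look roughly like this, assuming a file-based journal (if the journal is a separate partition, you would have to repartition instead of just removing a file):

  service ceph stop osd.0
  ceph-osd -i 0 --flush-journal        # write out everything still in the journal
  # set "osd journal size = 4096" in /etc/ceph/ceph.conf
  rm /var/lib/ceph/osd/ceph-0/journal
  ceph-osd -i 0 --mkjournal            # create the new, larger journal
  service ceph start osd.0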
Re: [ceph-users] ceph iscsi questions
On Tue, Jun 18, 2013 at 09:52:53AM +0200, Kurt Bauer wrote: Hi, Da Chun wrote: Hi List, I want to deploy a ceph cluster with the latest cuttlefish, and export it with an iSCSI interface to my applications. Some questions here: 1. Which Linux distro and release would you recommend? I used Ubuntu 13.04 for testing purposes before. For the ceph cluster or the iSCSI gateway? We use Ubuntu 12.04 LTS for the cluster and the iSCSI gateway, but tested Debian wheezy as iSCSI gateway too. Both work flawlessly. 2. Which iscsi target is better? LIO, SCST, or others? Have you read http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/ ? That's what we do and it works without problems so far. 3. The system for the iscsi target will be a single point of failure. How to eliminate it and make good use of ceph's nature of distribution? That's a question we asked ourselves too. In theory one can set up 2 iSCSI gateways and use multipath, but what does that do to the cluster? Will something break if 2 iSCSI targets use the same rbd image in the cluster? Even if I use failover mode only? Has someone already tried this and is willing to share their knowledge?

Let's see. You mentioned HA and multipath. You don't really need multipath for an HA iSCSI target. Multipath allows you to use multiple paths, multiple connections/networks/switches, but you don't want to connect an iSCSI initiator to multiple iSCSI targets (for the same LUN). That is asking for trouble. So multipath just gives you extra paths. When you have multiple iSCSI targets, you use failover. Most iSCSI initiators can deal with at least up to 30 seconds of no responses from the iSCSI target. No response means no response; an error response is the wrong response, of course. So when using failover, a virtual IP address is probably what you want, probably combined with something like Pacemaker to make sure multiple machines do not claim to have the same IP address.

You'll need even more if you have multiple iSCSI initiators that want to connect to the same rbd, like some Windows or VMWare cluster. And I guess Linux clustering filesystems like OCFS2 probably need it too. It's called SPC-3 Persistent Reservation. As I understand Persistent Reservation, the iSCSI target just needs to keep state for the connected initiators. On failover it isn't a problem if there is no state, so there is no state that needs to be replicated between multiple gateways, as long as all initiators are connected to the same target. When different initiators are connected to different targets, your data will get corrupted on write.

Now implementations: - stgt does have some support for SPC-3, but not enough. - LIO supports SPC-3 Persist; it is the one in current Linux kernels. - SCST seemed too much of a pain to set up to even try, but I might be wrong. - IET: iSCSI Enterprise Target, seems to support SPC-3 Persist; it's a DKMS package on Ubuntu. - I later found out there is another implementation: http://www.peach.ne.jp/archives/istgt/ It too supports SPC-3 Persist. It is from the FreeBSD camp and a package is available for Debian and Ubuntu with the Linux kernel, not just kFreeBSD. But I haven't tried it. So I haven't tried them all yet. I have used LIO.

Another small tip: if you don't understand iSCSI, you might end up configuring it the wrong way at first and it will be slow. You might need to spend time to figure out how to tune it. Now you know what I know. Best regards, Kurt Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
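As a sketch of the "virtual IP plus Pacemaker" failover mentioned above, the resources could be modelled roughly like this; the IP address and resource names are placeholders, and the tgt init script name may differ per distribution:

  crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
      params ip=192.168.1.100 cidr_netmask=24 op monitor interval=10s
  crm configure primitive p_tgtd lsb:tgt op monitor interval=30s
  crm configure group g_iscsi p_vip p_tgtd    # keep the VIP and the target together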
Re: [ceph-users] ceph iscsi questions
On Tue, Jun 18, 2013 at 11:13:15AM +0200, Leen Besselink wrote: On Tue, Jun 18, 2013 at 09:52:53AM +0200, Kurt Bauer wrote: Hi, Da Chun schrieb: Hi List, I want to deploy a ceph cluster with latest cuttlefish, and export it with iscsi interface to my applications. Some questions here: 1. Which Linux distro and release would you recommend? I used Ubuntu 13.04 for testing purpose before. For the ceph-cluster or the iSCSI-GW? We use Ubuntu 12.04 LTS for the cluster and the iSCSI-GW, but tested Debian wheezy as iSCSI-GW too. Both work flawless. 2. Which iscsi target is better? LIO, SCST, or others? Have you read http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/ ? That's what we do and it works without problems so far. 3. The system for the iscsi target will be a single point of failure. How to eliminate it and make good use of ceph's nature of distribution? That's a question we asked aourselves too. In theory one can set up 2 iSCSI-GW and use multipath but what does that do to the cluster? Will smth. break if 2 iSCSI targets use the same rbd image in the cluster? Even if I use failover-mode only? Has someone already tried this and is willing to share their knowledge? Let's see. You mentioned HA and multipath. You don't really need multipath for a HA iSCSI-target. Multipath allows you to use multiple paths, multiple connections/networks/switches, but you don't want to connect an iSCSI-initiator to multiple iSCSI-targets (for the same LUN). That is asking for trouble. Probably I should add why you might want to use multipath because it does add resiliance and also performance if one connection on the target is not enough. I have a feeling when using multipath it is easiest to use multiple subnets. So multi-path just gives you extra paths. When you have multiple iSCSI-targets, you use failover. Most iSCSI-initiators can deal with at least up to 30 seconds of no responses from the iSCSI-target. No response, means no response. An error response is the wrong response of course. So when using failover, a virtual IP-address is probably what you want. Probably combined with something like Pacemaker to make sure multiple machines do not claim to have the same IP-address. You'll need even more if you have multiple iSCSI-initiators that want to connect to the same rbd, like some Windows or VMWare cluster. And I guess Linux clustering filesystem like with OCFS2 probably need it too. It's called SPC-3 Persistent Reservation. As I understand Persistent Reservation, the iSCSI-target just needs to keep state for the connected initiators. On failover it isn't a problem if there is no state. So there is no state that needs to be replicated between multiple gateways. As long as all initiators are connected to the same target. When different initiators are connected to different targets, your data will get corrupted on write. Now implementations: - stgt does have some support for SPC-3, but not enough. - LIO supports SPC-3 Persist, it is the one in the current Linux kernels. - SCST seemed to much of a pain to set up to even try, but I might be wrong. - IET: iSCSI Enterprise Target, seems to support SPC-3 Persist, it's a DKMS package on Ubuntu - I later found out there is an implementation: http://www.peach.ne.jp/archives/istgt/ It too supports SPC-3 Persist. It is from the FreeBSD-camp and a package is available for Debian and Ubuntu with Linux-kernel and not just kFreeBSD. But I haven't tried it. So I haven't tried them all yet. I have used LIO. 
An other small tip: if you don't understand iSCSI, you'll might end up configure it the wrong way at first and it will be slow. You might need to spend time to figure out how to tune it. Now you know what I know. Best regards, Kurt Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another osd is filled too full and taken off after manually taking one osd out
On Tue, Jun 18, 2013 at 08:13:39PM +0800, Da Chun wrote: Hi List,My ceph cluster has two osds on each node. One has 15g capacity, and the other 10g. It's interesting that, after I took the 15g osd out of the cluster, the cluster started to rebalance, and finally the 10g osd on the same node was finally full and taken off, and failed to start again with the following error in the osd log file: 2013-06-18 19:51:20.799756 7f6805ee07c0 -1 filestore(/var/lib/ceph/osd/ceph-1) Extended attributes don't appear to work. Got error (28) No space left on device. If you are using ext3 or ext4, be sure to mount the underlying file system with the 'user_xattr' option. 2013-06-18 19:51:20.800258 7f6805ee07c0 -1 ^[[0;31m ** ERROR: error converting store /var/lib/ceph/osd/ceph-1: (95) Operation not supported^[[0m I guess the 10g osd was chosen by the cluster to be the container for the extra objects. My questions here: 1. How are the extra objects spread in the cluster after an osd is taken out? Only spread to one of the osds? 2. Is there no mechanism to prevent the osds from being filled too full and taken off? As far I understand it. Each OSD has the same weight by default, you can give them a different weight to force it to be used less. The reason to do so could be because it has less space or because it is slower. Thanks for your time! This is the ceph log: 2013-06-18 19:26:41.567607 mon.0 172.18.46.34:6789/0 1599 : [INF] pgmap v14182: 456 pgs: 453 active+clean, 3 active+remapped+backfilling; 16874 MB data, 40220 MB used, 36513 MB / 76733 MB avail; 379/9761 degraded (3.883%); recovering 19 o/s, 77608KB/s 2013-06-18 19:26:42.649139 mon.0 172.18.46.34:6789/0 1600 : [INF] pgmap v14183: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 309/9745 degraded (3.171%); recovering 41 o/s, 162MB/s 2013-06-18 19:26:46.566721 mon.0 172.18.46.34:6789/0 1601 : [INF] pgmap v14184: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 250/9745 degraded (2.565%); recovering 25 o/s, 101450KB/s 2013-06-18 19:26:39.858833 osd.1 172.18.46.35:6801/10730 88 : [WRN] OSD near full (91%) 2013-06-18 19:26:48.548076 mon.0 172.18.46.34:6789/0 1602 : [INF] pgmap v14185: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 200/9745 degraded (2.052%); recovering 18 o/s, 72359KB/s 2013-06-18 19:26:51.898811 mon.0 172.18.46.34:6789/0 1603 : [INF] pgmap v14186: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 155/9745 degraded (1.591%); recovering 17 o/s, 71823KB/s 2013-06-18 19:26:53.947739 mon.0 172.18.46.34:6789/0 1604 : [INF] pgmap v14187: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 113/9745 degraded (1.160%); recovering 16 o/s, 65041KB/s 2013-06-18 19:26:57.293713 mon.0 172.18.46.34:6789/0 1605 : [INF] pgmap v14188: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 40222 MB used, 36511 MB / 76733 MB avail; 103/9745 degraded (1.057%); recovering 9 o/s, 37353KB/s 2013-06-18 19:27:03.861124 mon.0 172.18.46.34:6789/0 1606 : [INF] pgmap v14189: 456 pgs: 454 active+clean, 2 active+remapped+backfilling; 16874 MB data, 35598 MB used, 41134 MB / 76733 MB avail; 103/9745 degraded (1.057%); recovering 1 o/s, 3532KB/s 2013-06-18 19:27:13.732263 mon.0 172.18.46.34:6789/0 1607 : [DBG] osd.1 
172.18.46.35:6801/10730 reported failed by osd.0 172.18.46.34:6804/1506 2013-06-18 19:27:15.949395 mon.0 172.18.46.34:6789/0 1608 : [DBG] osd.1 172.18.46.35:6801/10730 reported failed by osd.3 172.18.46.34:6807/11743 2013-06-18 19:27:17.239206 mon.0 172.18.46.34:6789/0 1609 : [DBG] osd.1 172.18.46.35:6801/10730 reported failed by osd.5 172.18.46.36:6806/7436 2013-06-18 19:27:17.239404 mon.0 172.18.46.34:6789/0 1610 : [INF] osd.1 172.18.46.35:6801/10730 failed (3 reports from 3 peers after 2013-06-18 19:27:38.239157 = grace 20.00) 2013-06-18 19:27:17.306958 mon.0 172.18.46.34:6789/0 1611 : [INF] osdmap e647: 6 osds: 5 up, 5 in 2013-06-18 19:27:17.387311 mon.0 172.18.46.34:6789/0 1612 : [INF] pgmap v14190: 456 pgs: 335 active+clean, 119 stale+active+clean, 2 active+remapped+backfilling; 16874 MB data, 35598 MB used, 41134 MB / 76733 MB avail; 103/9745 degraded (1.057%) 2013-06-18 19:27:18.308209 mon.0 172.18.46.34:6789/0 1613 : [INF] osdmap e648: 6 osds: 5 up, 5 in 2013-06-18 19:27:18.316487 mon.0 172.18.46.34:6789/0 1614 : [INF] pgmap v14191: 456 pgs: 335 active+clean, 119 stale+active+clean, 2 active+remapped+backfilling; 16874 MB data, 35598 MB used, 41134 MB / 76733 MB avail; 103/9745 degraded (1.057%) 2013-06-18 19:27:22.676915 mon.0 172.18.46.34:6789/0 1615 : [INF] pgmap
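One way to reflect the different disk sizes in this thread, so the 10G OSD receives proportionally less data, is to give it a smaller CRUSH weight; the numbers below are only an example, and the nearfull/full thresholds can also be tuned in ceph.conf:

  ceph osd crush reweight osd.1 0.67    # ~10G disk next to 15G ones

  # ceph.conf
  [mon]
      mon osd nearfull ratio = .85
      mon osd full ratio = .95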
Re: [ceph-users] ceph iscsi questions
On Tue, Jun 18, 2013 at 02:38:19PM +0200, Kurt Bauer wrote: Da Chun wrote: Thanks for sharing! Kurt. Yes, I have read the article you mentioned. But I also read another one: http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices. It uses LIO, which is the current standard Linux kernel SCSI target. That has a major disadvantage: you have to use the kernel rbd module, which is not feature-equivalent to the Ceph userland code, at least in the kernel versions shipped with recent distributions.

Yes, that is why I like that some Ceph developers added rbd to tgt. It's just a lot easier to do upgrades. I don't expect it to be much slower either. I believe I read somewhere that LIO in the kernel has its limits, specifically the number of threads...? But I could be wrong. The disadvantage of tgt is that the clustering support does not work (yet?).

There is another doc on the ceph site: http://ceph.com/w/index.php?title=ISCSI&redirect=no Quite outdated I think; the last update was nearly 3 years ago, and I don't understand what the box in the middle should depict. I don't quite understand how the multipath works here. Are the two iSCSI targets on the same system or two different ones? Has anybody tried this already? Leen has illustrated that quite well.

----- Original Message ----- From: Kurt Bauer kurt.ba...@univie.ac.at Date: Tue, Jun 18, 2013 03:52 PM To: Da Chun ng...@qq.com Cc: ceph-users ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph iscsi questions

Hi, Da Chun wrote: Hi List, I want to deploy a ceph cluster with the latest cuttlefish, and export it with an iSCSI interface to my applications. Some questions here: 1. Which Linux distro and release would you recommend? I used Ubuntu 13.04 for testing purposes before. For the ceph cluster or the iSCSI gateway? We use Ubuntu 12.04 LTS for the cluster and the iSCSI gateway, but tested Debian wheezy as iSCSI gateway too. Both work flawlessly. 2. Which iscsi target is better? LIO, SCST, or others? Have you read http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/ ? That's what we do and it works without problems so far. 3. The system for the iscsi target will be a single point of failure. How to eliminate it and make good use of ceph's nature of distribution? That's a question we asked ourselves too. In theory one can set up 2 iSCSI gateways and use multipath, but what does that do to the cluster? Will something break if 2 iSCSI targets use the same rbd image in the cluster? Even if I use failover mode only? Has someone already tried this and is willing to share their knowledge? Best regards, Kurt Thanks!

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Single Cluster / Reduced Failure Domains
On Tue, Jun 18, 2013 at 09:02:12AM -0700, Gregory Farnum wrote: On Tuesday, June 18, 2013, harri wrote: Hi, I wondered what best practice is recommended for reducing failure domains for a virtual server platform. If I wanted to run multiple virtual server clusters, would it be feasible to serve storage from 1 large Ceph cluster? I'm a bit confused by your question here. Normally you want as many defined failure domains as possible to best tolerate those failures without data loss. I am concerned that, in the unlikely event the whole Ceph cluster fails, then ALL my VMs would be offline. Well, yes? Is there any way to ring-fence failure domains within a logical Ceph cluster, or would you instead look to build multiple Ceph clusters (but then that defeats the object of the technology, doesn't it?)? You can separate your OSDs into different CRUSH buckets and then assign different pools to draw from those buckets if you're trying to split up your storage somehow. But I'm still a little confused about what you're after. :) -Greg

I think I know what he means, because this is what I've been thinking: the software (of the monitors) is the single point of failure. For example, when you do an upgrade of Ceph and your monitors fail because of the upgrade, you will have downtime. Obviously, it isn't every day I upgrade the software of our SAN either. But one of the reasons people seem to be moving to software more than 'hardware' is flexibility. So they want to be able to update it. I've had Ceph test installations fail an upgrade, and I've had a 3-monitor setup lose 1 monitor and followed the wrong procedure to get it back up and running. I've seen others on the mailing list asking for help after upgrade problems. This is exactly why RBD incremental backup makes me happy, because it should be easier to keep up-to-date copies/snapshots on multiple Ceph installations.

-- Software Engineer #42 @ http://inktank.com | http://ceph.com

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
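A rough sketch of the "different CRUSH buckets per pool" idea Greg describes, using hypothetical bucket and pool names; depending on the Ceph version you may instead edit a decompiled CRUSH map to create the rule that only draws from the new root:

  ceph osd crush add-bucket clusterA root     # a separate root for one VM cluster
  ceph osd crush move host1 root=clusterA     # move that cluster's hosts/OSDs under it
  # after creating a CRUSH rule restricted to clusterA, point a pool at it:
  ceph osd pool set vm-pool-a crush_ruleset 1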
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On Sun, May 12, 2013 at 03:14:15PM +0200, Tim Mohlmann wrote: Hi, On Saturday 11 May 2013 16:04:27 Leen Besselink wrote: Someone is going to correct me if I'm wrong, but I think you misread something. The Mon daemon doesn't need that much RAM: the 'RAM: 1 GB per daemon' is per Mon daemon, not per OSD daemon. Gosh, I feel embarrassed. This actually was my main concern / bottleneck. Thanks for pointing this out. Seems Ceph really rocks in deploying affordable data clusters.

I did see you mentioned you wanted to have many disks in the same machine, not just machines with, let's say, 12 disks for example. Did you know you need the CPU power of a 1 GHz Xeon core per OSD for the times when recovery is happening?

Regards, Tim

On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote: Hi, First of all I am new to ceph and this mailing list. At this moment I am looking into the possibilities to get involved in the storage business. I am trying to get an estimate of the costs and after that I will start to determine how to get sufficient income. First I will describe my case; at the bottom you will find my questions.

GENERAL LAYOUT: Part of this cost calculation is of course hardware. For the larger part I've already figured it out. In my plans I will be leasing a full rack (46U). Depending on the domestic needs I will be using 36 or 40U for OSD storage servers. (I will assume 36U from here on, to keep a solid value for calculation and have enough spare space for extra devices). Each OSD server uses 4U and can take 36x3.5" drives. So in 36U I can put 36/4=9 OSD servers, containing 9*36=324 HDDs.

HARD DISK DRIVES: I have been looking at the WD RE and RED series. RE is more expensive per GB, but has a larger MTBF and offers a 4TB model. RED is (really) cheap per GB, but only goes as far as 3TB. At my current calculations it does not matter much if I put in expensive WD RE 4TB disks or cheaper WD RED 3TB; the price per GB over the complete cluster expense and 3 years of running costs (including AFR) is almost the same. So basically, if I could reduce the costs of all the other components used in the cluster, I would go for the 3TB disk, and if the costs turn out higher than my first calculation, I would use the 4TB disk. Let's assume 4TB from now on. So, 4*324=1296TB. So let's go petabyte ;).

NETWORK: I will use a redundant 2x10GbE network connection for each node. Two independent 10GbE switches will be used and I will use bonding between the interfaces on each node. (Thanks to some guy in the #Ceph IRC for pointing this option out). I will use VLANs to split front-side, back-side and Internet networks.

OSD SERVER: SuperMicro based, 36 HDD hotswap. Dual-socket mainboard, 16x DIMM sockets. It is advertised they can take up to 512GB of RAM. I will install 2 x Intel Xeon E5620 2.40GHz processors, having 4 cores and 8 threads each. For the RAM I am in doubt (see below). I am looking into running 1 OSD per disk.

MON AND MDS SERVERS: Now comes the big question: what specs are required? At first I had the plan to use 4 SuperMicro superservers, with 4-socket mainboards that can take up to the new 16-core AMD processors and up to 1TB of RAM. I want all 4 of the servers to run a MON service, MDS service and customer / public services. Probably I would use VMs (kvm) to separate them. I will compile my own kernel to enable Kernel Samepage Merging, hugepage support and memory compaction to make RAM use more efficient. The requirements for my public services will be added up once I know what I need for MON and MDS.
RAM FOR ALL SERVERS: So what would you estimate to be the RAM usage? http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations. Sounds OK for the OSD part. 500 MB per daemon would put the minimum RAM requirement for my OSD server at 18GB. 32GB should be more than enough. Although I would like to see if it is possible to use btrfs compression? In that case I'd need more RAM in there. What I really want to know: how much RAM do I need for the MON and MDS servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive! In my case I would need at least 324 GB of RAM for each of them. Initially I was planning to use 4 servers, each of them running both. Joining those in a single system, with the other duties the system has to perform, I would need the full 1TB of RAM. I would need to use 32GB modules, which are really expensive per GB and difficult to find (not many server hardware vendors in the Netherlands have them). QUESTIONS: Question 1: Is it really the amount of OSDs that counts for MON and MDS RAM usage
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On Sun, May 12, 2013 at 10:22:10PM +0200, Tim Mohlmann wrote: Hi, On Sunday 12 May 2013 18:05:16 Leen Besselink wrote: I did see you mentioned you wanted to have, many disks in the same machine. Not just machines with let's say 12 disks for example. Did you know you need the CPU-power of a 1Ghz Xeon core per OSD for the times when recovery is happening ? Nope, did not know it. The current intent is to install 2x 2.4 Ghz xeon CPU, handeling 8 threads each. So, 2*8*2.4=38.4 for max OSD's. It should be fine. If I would go for the 72 disk option, I have to consider doubling that power. The current max I can select from the dealer I am looking at, for the socket housed in the supermicro 72x 3.5 version are 2x a Xeon x5680. Utilizing 12 threads each, at 3.33Ghz. So, 2*12*3.33=79.79 for max OSD's. Also this should be fine. What will happen if the CPU is maxed out anyway? Slowing things or crashing things? In my opinion it is not a bad thing if a system is maxed out in such a massive migration, which should not occur on a daily base. Sure, a disk that fails every two weeks, no prob. What are we talking about? 0.3% of the complete storage cluster. Even 0.15% if I would take the 72x3.5 servers. Even if one disk/OSD fails, it would need to recheck where each placement groups should be stored and move stuff around if needed. If during this action your CPUs are maxed out, you might start to lose connections between OSDs and the process will need to start over. At least that is how I understand it, I've done a few test installations, but not yet deployed it in production. The Inktank people said in the presentations I've seen (and looking at the picture in the video from DreamHost I have a feeling that is what they've deployed): 12 HDD == 12 OSD per machine is ideal, maybe with 2 or 3 SSD for journaling if you want more performance. If a complete server stops working, that is something else. But as I said in a different split of this thread: if that happens I have got different things to worry about, than a slow migration of data. As long as there is no data lost, I don't really care it takes a bit longer. Thanks for the advise. Tim ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi, Someone is going to correct me if I'm wrong, but I think you misread something. The Mon-daemon doesn't need that much RAM: The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon. The same for disk-space. You should read this page again: http://ceph.com/docs/master/install/hardware-recommendations/ Some of the other questions are answered there as well. Like how much memory does a OSD-daemon need and why/when. On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote: Hi, First of all I am new to ceph and this mailing list. At this moment I am looking into the possibilities to get involved in the storage business. I am trying to get an estimate about costs and after that I will start to determine how to get sufficient income. First I will describe my case, at the bottom you will find my questions. GENERAL LAYOUT: Part of this cost calculation is of course hardware. For the larger part I've already figured it out. In my plans I will be leasing a full rack (46U). Depending on the domestic needs I will be using 36 or 40U for ODS storage servers. (I will assume 36U from here on, to keep a solid value for calculation and have enough spare space for extra devices). Each OSD server uses 4U and can take 36x3.5 drives. So in 36U I can put 36/4=9 OSD servers, containing 9*36=324 HDDs. HARD DISK DRIVES I have been looking for WD digital RE and RED series. RE is more expensive per GB, but has a larger MTBF and offers a 4TB model. RED is (real) cheap per GB, but only goes as far a 3TB. At my current calculations it does not matter much if I would put expensive WD RE 4TB disks or cheaper WD RED 3TB, the price per GB over the complete cluster expense and 3 years of running costs (including AFR) is almost the same. So basically, if I could reduce the costs of all the other components used in the cluster, I would go for the 3TB disk and if the costs will be higher then my first calculation, I would use the 4TB disk. Let's assume 4TB from now on. So, 4*324=1296TB. So lets go Peta-byte ;). NETWORK I will use a redundant 2x10Gbe network connection for each node. Two independent 10Gbe switches will be used and I will use bonding between the interfaces on each node. (Thanks some guy in the #Ceph irc for pointing this option out). I will use VLAN's to split front-side, backside and Internet networks. OSD SERVER SuperMicro based, 36 HDD hotswap. Dual socket mainboard. 16x DIMM sockets. It is advertised they can take up to 512GB of RAM. I will install 2 x Intel Xeon E5620 2.40ghz processor, having 4 cores and 8 threads each. For the RAM I am in doubt (see below). I am looking into running 1 OSD per disk. MON AND MDS SERVERS Now comes the big question. What specs are required? It first I had the plan to use 4 SuperMicro superservers, with a 4 socket mainboards that contain up to the new 16core AMD processors and up to 1TB of RAM. I want all 4 of the servers to run a MON service, MDS service and costumer / public services. Probably I would use VM's (kvm) to separate them. I will compile my own kernel to enable Kernel Samepage Merge, Hugepage support and memory compaction to make RAM use more efficient. The requirements for my public services will be added up, once I know what I need for MON and MDS. RAM FOR ALL SERVERS So what would you estimate to be the ram usage? http://ceph.com/docs/master/install/hardware-recommendations/#minimum- hardware-recommendations. Sounds OK for the OSD part. 500 MB per daemon, would put the minimum RAM requirement for my OSD server to 18GB. 
32GB should be more than enough, although I would like to see if it is possible to use btrfs compression? In that case I'd need more RAM in there. What I really want to know: how much RAM do I need for the MON and MDS servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive! In my case I would need at least 324 GB of RAM for each of them. Initially I was planning to use 4 servers, each of them running both. Joining those in a single system, with the other duties the system has to perform, I would need the full 1TB of RAM. I would need to use 32GB modules, which are really expensive per GB and difficult to find (not many server hardware vendors in the Netherlands have them).

QUESTIONS
Question 1: Is it really the number of OSDs that counts for MON and MDS RAM usage, or the size of the object store?
Question 2: Can I do it with less RAM? Any statistics, or better, a calculation? I can imagine memory pages becoming redundant as the cluster grows, so less memory would be required per OSD.
Question 3: If it is the number of OSDs that counts, would it be beneficial to combine disks in a RAID 0 (lvm or btrfs) array?
Question 4: Is it safe / possible to store MON files inside of the cluster itself? The 10GB per daemon requirement would
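For what it's worth, a rough back-of-the-envelope version of the RAM sums above, using the per-daemon figures from the hardware-recommendations page already linked in this thread (roughly 500 MB-1 GB per OSD daemon, about 1 GB per MON daemon) and the node counts from the proposed layout (9 servers x 36 OSDs), comes out to something like:

    Per OSD server:  36 OSD daemons x 0.5-1 GB  =  ~18-36 GB RAM (more is better, especially during recovery)
    Cluster total:   9 servers x 36 OSDs        =  324 OSD daemons
    Per MON host:    1 MON daemon x ~1 GB       =  ~1 GB RAM for the MON itself, plus whatever the OS and any co-located services need

The 324 GB figure only appears if the per-MON-daemon recommendation is multiplied by the number of OSDs in the cluster, which is exactly the misreading pointed out at the top of this reply; a monitor host needs RAM on the order of gigabytes, not hundreds of gigabytes.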
Re: [ceph-users] Using Ceph as Storage for VMware
On Thu, May 09, 2013 at 11:51:32PM +0100, Neil Levine wrote: Jared, As Weiguo says, you will need to use a gateway to present a Ceph block device (RBD) in a format VMware understands. We've contributed the relevant code to the TGT iSCSI target (see blog: http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/) and though we haven't done a massive amount of testing on it, I'd love to get some feedback on it. We will be putting more effort into it this cycle (including producing a package).

We also have a legacy virtualization setup we are thinking of using with Ceph and iSCSI. We however also ended up at LIO, because LIO supports the iSCSI extensions which are needed for clustering. stgt doesn't yet support all the needed extensions as far as I can see. There seems to be exactly one person sporadically working on improving stgt in this area.

If you have a VMware account rep, be sure to ask him to file support for Ceph as a customer request with the product teams while we continue to knock on VMware's door :-) Neil

On Thu, May 9, 2013 at 11:30 PM, w sun ws...@hotmail.com wrote: RBD is not supported by VMware/vSphere. You will need to build an NFS/iSCSI/FC GW to support VMware. Here is a post from someone who has been trying it; you may have to contact them directly for status: http://ceph.com/community/ceph-over-fibre-for-vmware/ --weiguo

To: ceph-users@lists.ceph.com From: jaredda...@shelterinsurance.com Date: Thu, 9 May 2013 17:25:02 -0500 Subject: [ceph-users] Using Ceph as Storage for VMware I am investigating using Ceph as a storage target for virtual servers in VMware. We have 3 servers packed with hard drives ready for the proof of concept. I am looking for some direction. Is this a valid use for Ceph? If so, has anybody accomplished this? Are there any documents on how to set this up? Should I use RBD, NFS, etc.? Any help would be greatly appreciated. Thank You, JD
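For anyone who wants to try the tgt/stgt RBD backend mentioned above, a minimal sketch of what a target definition can look like in /etc/tgt/targets.conf (or a file under /etc/tgt/conf.d/) on a gateway where tgt was built with RBD support is below. The IQN and the pool/image name (rbd/vmware-lun0) are made-up placeholders, and the exact directives may vary with the tgt version, so treat this as a starting point rather than a tested recipe:

    <target iqn.2013-05.com.example:rbd-gateway>
        driver iscsi
        # use the RBD backing-store type instead of a local file or block device
        bs-type rbd
        # pool/image of the RBD image to export as a LUN
        backing-store rbd/vmware-lun0
        # open the target to any initiator; tighten this in a real setup
        initiator-address ALL
    </target>

After restarting tgt (or re-running tgt-admin), the exported LUN should show up in the output of tgt-admin --show, and the initiator side is configured like any other iSCSI target.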
Re: [ceph-users] Using Ceph as Storage for VMware
On Fri, May 10, 2013 at 12:12:45AM +0100, Neil Levine wrote: Leen, Do you mean you got LIO working with RBD directly? Or are you just re-exporting a kernel-mounted volume? Neil

Yes, re-exporting a kernel-mounted volume on separate gateway machines.
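For anyone wanting to reproduce the re-export setup described above, here is a rough sketch of what the commands on such a gateway machine might look like. The image name (rbd/vmware-lun0), the target IQN and the initiator IQN are placeholders, this assumes the gateway already has a working ceph.conf and keyring, and the exact targetcli syntax differs a bit between versions:

    # map the RBD image with the kernel client; it typically appears as /dev/rbd0
    rbd map rbd/vmware-lun0

    # export the mapped block device through LIO as an iSCSI LUN
    targetcli /backstores/block create name=rbd0 dev=/dev/rbd0
    targetcli /iscsi create iqn.2013-05.com.example:gw1
    targetcli /iscsi/iqn.2013-05.com.example:gw1/tpg1/luns create /backstores/block/rbd0
    targetcli /iscsi/iqn.2013-05.com.example:gw1/tpg1/acls create iqn.1998-01.com.vmware:esxi-host
    # some targetcli versions also need a portal created explicitly under tpg1/portals
    targetcli saveconfig

A second gateway can export the same image the same way; whether initiators can safely use both targets at once depends on the clustering-related iSCSI extensions discussed above, which is why LIO was preferred over stgt in this setup.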