Re: [ceph-users] The project of ceph client file system porting from Linux to AIX

2015-03-04 Thread McNamara, Bradley
I'd like to see a Solaris client.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dennis 
Chen
Sent: Wednesday, March 04, 2015 2:00 AM
To: ceph-devel; ceph-users; Sage Weil; Loic Dachary
Subject: [ceph-users] The project of ceph client file system porting from Linux 
to AIX

Hello,

The Ceph cluster can currently only be used from Linux systems AFAICT, so I planned to 
port the Ceph client file system from Linux to AIX as a tiered-storage solution on that 
platform. Below is the source code repository I've been working on, which is still in 
progress. It has three important modules:

1. aixker: maintains a uniform kernel API between Linux and AIX
2. net: acts as the data-transfer layer between the client and the cluster
3. fs: acts as an adaptor so that AIX can recognize the Linux file system.

https://github.com/Dennis-Chen1977/aix-cephfs

Any comments or suggestions are welcome...

--
Den
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Double-mounting of RBD

2014-12-17 Thread McNamara, Bradley
I have a somewhat interesting scenario.  I have an RBD of 17TB formatted using 
XFS.  I would like it accessible from two different hosts, one mapped/mounted 
read-only, and one mapped/mounted as read-write.  Both are shared using Samba 
4.x.  One Samba server gives read-only access to the world for the data.  The 
other gives read-write access to a very limited set of users who occasionally 
need to add data.

However, when testing this, when changes are made to the read-write Samba 
server the changes don't seem to be seen by the read-only Samba server.  Is 
there some file system caching going on that will eventually be flushed?

Am I living dangerously doing what I have set up?  I thought I would avoid 
most/all potential file system corruption by making sure there is only one 
read-write access method.  Thanks for any answers.

Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] work with share disk

2014-10-31 Thread McNamara, Bradley
CephFS, yes, but it's not considered production-ready.

You can also use an RBD volume and place OCFS2 on it and share it that way, too.
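Roughly, that second option looks like this (just a sketch; the image name and size are 
only examples, and the OCFS2/o2cb cluster configuration that has to exist on every host 
first is not shown):

rbd create shared-disk --size 102400                    # 100 GB image in the default 'rbd' pool
sudo rbd map shared-disk                                # run on every host that will share it
sudo mkfs.ocfs2 -L shared /dev/rbd/rbd/shared-disk      # format once, from a single host
sudo mount -t ocfs2 /dev/rbd/rbd/shared-disk /mnt/shared

The important part is that the file system on top of the shared RBD is cluster-aware; a 
plain ext4/XFS mounted from two hosts at once will get corrupted.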

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
yang.bi...@zte.com.cn
Sent: Friday, October 31, 2014 12:22 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] work with share disk

Hi

Can multiple Ceph nodes work on one single shared disk, just like Red Hat Global File 
System or Oracle OCFS2?

ZTE Information Security Notice: The information contained in this mail (and 
any attachment transmitted herewith) is privileged and confidential and is 
intended for the exclusive use of the addressee(s).  If you are not an intended 
recipient, any disclosure, reproduction, distribution or other dissemination or 
use of the information contained is strictly prohibited.  If you have received 
this mail in error, please delete it and notify us immediately.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgraded now MDS won't start

2014-09-11 Thread McNamara, Bradley
That portion of the log confused me, too.  However, I had run the same upgrade process 
on the MDS as on all the other cluster components.  Firefly was actually installed on the 
MDS even though the log mentions 0.72.2.

At any rate, I ended up stopping the MDS and using 'newfs' on the metadata and 
data pools to eliminate the HEALTH_WARN issue.
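
For reference, the rough shape of that procedure (a sketch only; the pool IDs are 
placeholders, look yours up first, and note that newfs wipes any existing CephFS metadata):

sudo service ceph stop mds                 # make sure no MDS is running
ceph osd lspools                           # note the IDs of the metadata and data pools
ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it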

-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com] 
Sent: Thursday, September 11, 2014 2:09 PM
To: McNamara, Bradley
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Upgraded now MDS won't start

On Wed, Sep 10, 2014 at 4:24 PM, McNamara, Bradley 
 wrote:
> Hello,
>
> This is my first real issue since running Ceph for several months.  Here's 
> the situation:
>
> I've been running an Emperor cluster for several months.  All was good.  I 
> decided to upgrade since I'm running Ubuntu 13.10 and 0.72.2.  I decided to 
> first upgrade Ceph to 0.80.4, which was the last version in the apt 
> repository for 13.10.  I upgraded the MON's, then the OSD servers to 0.80.4; 
> all went as expected with no issues.  The last thing I did was upgrade the 
> MDS using the same process, but now the MDS won't start.  I've tried to 
> manually start the MDS with debugging on, and I have attached the file.  It 
> complains that it's looking for "mds.0.20  need osdmap epoch 3602, have 3601".
>
> Anyway, I don't really use CephFS or RGW, so I don't need the MDS, but I'd 
> like to have it.  Can someone tell me how to fix it, or delete it, so I can 
> start over when I do need it?  Right now my cluster is HEALTH_WARN because of 
> it.

Uh, the log is from an MDS running Emperor. That one looks like it's 
complaining because the mds data formats got updated for Firefly. ;) You'll 
need to run debugging from a Firefly mds to try and get something useful.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgraded now MDS won't start

2014-09-10 Thread McNamara, Bradley
Hello,

This is my first real issue since running Ceph for several months.  Here's the 
situation:

I've been running an Emperor cluster for several months.  All was good.  I 
decided to upgrade since I'm running Ubuntu 13.10 and 0.72.2.  I decided to 
first upgrade Ceph to 0.80.4, which was the last version in the apt repository 
for 13.10.  I upgraded the MON's, then the OSD servers to 0.80.4; all went as 
expected with no issues.  The last thing I did was upgrade the MDS using the 
same process, but now the MDS won't start.  I've tried to manually start the 
MDS with debugging on, and I have attached the file.  It complains that it's 
looking for "mds.0.20  need osdmap epoch 3602, have 3601".

Anyway, I don't really use CephFS or RGW, so I don't need the MDS, but I'd 
like to have it.  Can someone tell me how to fix it, or delete it, so I can 
start over when I do need it?  Right now my cluster is HEALTH_WARN because of 
it.

Thanks!

Brad

2014-09-10 15:48:13.830787 7fae3c48e7c0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 3166
2014-09-10 15:48:13.834336 7fae3c48e7c0 10 mds.-1.0 168	MDSCacheObject
2014-09-10 15:48:13.834349 7fae3c48e7c0 10 mds.-1.0 2168	CInode
2014-09-10 15:48:13.834355 7fae3c48e7c0 10 mds.-1.0 16	 elist<>::item   *7=112
2014-09-10 15:48:13.834359 7fae3c48e7c0 10 mds.-1.0 392	 inode_t 
2014-09-10 15:48:13.834361 7fae3c48e7c0 10 mds.-1.0 56	  nest_info_t 
2014-09-10 15:48:13.834364 7fae3c48e7c0 10 mds.-1.0 32	  frag_info_t 
2014-09-10 15:48:13.834370 7fae3c48e7c0 10 mds.-1.0 40	 SimpleLock   *5=200
2014-09-10 15:48:13.834373 7fae3c48e7c0 10 mds.-1.0 48	 ScatterLock  *3=144
2014-09-10 15:48:13.834377 7fae3c48e7c0 10 mds.-1.0 488	CDentry
2014-09-10 15:48:13.834379 7fae3c48e7c0 10 mds.-1.0 16	 elist<>::item
2014-09-10 15:48:13.834383 7fae3c48e7c0 10 mds.-1.0 40	 SimpleLock
2014-09-10 15:48:13.834385 7fae3c48e7c0 10 mds.-1.0 1024	CDir 
2014-09-10 15:48:13.834387 7fae3c48e7c0 10 mds.-1.0 16	 elist<>::item   *2=32
2014-09-10 15:48:13.834390 7fae3c48e7c0 10 mds.-1.0 192	 fnode_t 
2014-09-10 15:48:13.834392 7fae3c48e7c0 10 mds.-1.0 56	  nest_info_t *2
2014-09-10 15:48:13.834394 7fae3c48e7c0 10 mds.-1.0 32	  frag_info_t *2
2014-09-10 15:48:13.834399 7fae3c48e7c0 10 mds.-1.0 168	Capability 
2014-09-10 15:48:13.834402 7fae3c48e7c0 10 mds.-1.0 32	 xlist<>::item   *2=64
2014-09-10 15:48:13.835815 7fae3c486700 10 mds.-1.0 MDS::ms_get_authorizer type=mon
2014-09-10 15:48:13.836113 7fae37292700  5 mds.-1.0 ms_handle_connect on 156.74.237.50:6789/0
2014-09-10 15:48:13.839873 7fae3c48e7c0 10 mds.-1.0 beacon_send up:boot seq 1 (currently up:boot)
2014-09-10 15:48:13.840110 7fae3c48e7c0 10 mds.-1.0 create_logger
2014-09-10 15:48:13.867040 7fae37292700  5 mds.-1.0 handle_mds_map epoch 149 from mon.0
2014-09-10 15:48:13.867109 7fae37292700 10 mds.-1.0  my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:13.867122 7fae37292700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:13.867136 7fae37292700 10 mds.-1.-1 map says i am 156.74.237.56:6800/3166 mds.-1.-1 state down:dne
2014-09-10 15:48:13.867151 7fae37292700 10 mds.-1.-1 not in map yet
2014-09-10 15:48:14.164620 7fae37292700  5 mds.-1.-1 handle_mds_map epoch 150 from mon.0
2014-09-10 15:48:14.164706 7fae37292700 10 mds.-1.-1  my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.164716 7fae37292700 10 mds.-1.-1  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.164727 7fae37292700 10 mds.-1.0 map says i am 156.74.237.56:6800/3166 mds.-1.0 state up:standby
2014-09-10 15:48:14.164739 7fae37292700 10 mds.-1.0  peer mds gid 5192121 removed from map
2014-09-10 15:48:14.164757 7fae37292700  1 mds.-1.0 handle_mds_map standby
2014-09-10 15:48:14.237027 7fae37292700  5 mds.-1.0 handle_mds_map epoch 151 from mon.0
2014-09-10 15:48:14.237060 7fae37292700 10 mds.-1.0  my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.237070 7fae37292700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.237079 7fae37292700 10 mds.0.20 map says i am 156.74.237.56:6800/3166 mds.0.20 state up:replay
2014-09-10 15:48:14.237091 7fae37292700  1 mds.0.20 handle_mds

Re: [ceph-users] osd pool default pg num problem

2014-05-23 Thread McNamara, Bradley
The other thing to note, too, is that it appears you're trying to decrease the 
PG/PGP_num parameters, which is not supported.  In order to decrease those 
settings, you'll need to delete and recreate the pools.  All new pools created 
will use the settings defined in the ceph.conf file.
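
In rough command form (a sketch; the pool name and PG counts are only examples, and 
deleting a pool destroys its data):

ceph osd pool delete data data --yes-i-really-really-mean-it
ceph osd pool create data 375 375          # pg_num and pgp_num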

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John 
Spray
Sent: Friday, May 23, 2014 6:38 AM
To: Cao, Buddy
Cc: ceph-users@lists.ceph.com; ceph-u...@ceph.com
Subject: Re: [ceph-users] osd pool default pg num problem

Those settings are applied when creating new pools with "osd pool create", but 
not to the pools that are created automatically during cluster setup.

We've had the same question before
(http://comments.gmane.org/gmane.comp.file-systems.ceph.user/8150), so maybe 
it's worth opening a ticket to do something about it.

Cheers,
John

On Fri, May 23, 2014 at 2:01 PM, Cao, Buddy  wrote:
> In Firefly, I added below lines to [global] section in ceph.conf, 
> however, after creating the cluster, the default pool 
> “metadata/data/rbd”’s pg num is still over 900 but not 375.  Any suggestion?
>
>
>
>
>
> osd pool default pg num = 375
>
> osd pool default pgp num = 375
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH placement groups and pool sizes

2014-05-12 Thread McNamara, Bradley
The formula was designed to be used on a per-pool basis.  Having said that, 
though, when looking at the number of PG's from a system-wide perspective, one 
does not want too many total PG's.  So, it's a balancing act, and it has been 
suggested that it's better to have slightly more PG's than you need, but not 
too many.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Pieter 
Koorts
Sent: Monday, May 12, 2014 5:21 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] CEPH placement groups and pool sizes

Hi,

Been doing some reading on the CEPH documentation and just wanted to clarify if 
anyone knows the (approximate) correct PG's for CEPH.

What I mean is lets say I have created one pool with 4096 placement groups.
Now instead of one pool I want two so if I were to create 2 pools instead would 
it be still 4096 placement groups per pool or would I divide it between the 
pools (e.g. 2048 pg per pool)

On a side note, per pool is that a recommended maximum of data before turning 
over to a new pool?

Regards

Pieter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-12 Thread McNamara, Bradley
The underlying file system on the RBD needs to be a clustered file system, like 
OCFS2, GFS2, etc., and a cluster between the two, or more, iSCSI target servers 
needs to be created to manage the clustered file system.
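
If the targets are tgt built with RBD support (as discussed elsewhere in this thread), the 
export itself is roughly this on each target server (a sketch only; the IQN and image name 
are made up, and the clustering/multipath pieces above still apply):

# /etc/tgt/conf.d/rbd-shared.conf
<target iqn.2014-05.net.example:shared-lun>
    driver iscsi
    bs-type rbd
    backing-store rbd/shared-disk          # pool/image
</target>

Then reload the target daemon, e.g. 'tgt-admin --update ALL'.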

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrei 
Mikhailovsky
Sent: Sunday, May 11, 2014 1:25 PM
To: l...@consolejunkie.net
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] NFS over CEPH - best practice

Sorry if these questions will sound stupid, but I was not able to find an 
answer by googling.

1. Does the iSCSI protocol support having multiple target servers serving the same 
disk/block device?

In the case of Ceph, the same RBD disk image. I was hoping to have multiple servers mount 
the same RBD disk and serve it as an iSCSI LUN. This LUN would be used as VM image storage 
on VMware/XenServer.

2. Does iSCSI multipathing provide failover/HA capability only on the initiator side? The 
docs that I came across all mention multipathing on the client side, like using two 
different NICs. I did not find anything about having multiple NICs on the initiator 
connecting to multiple iSCSI target servers.

I was hoping to have a resilient solution on the storage side so that I can perform 
upgrades and maintenance without needing to shut down the VMs running on VMware/XenServer. 
Is this possible with iSCSI?

Cheers

Andrei

From: "Leen Besselink" mailto:l...@consolejunkie.net>>
To: ceph-users@lists.ceph.com
Sent: Saturday, 10 May, 2014 8:31:02 AM
Subject: Re: [ceph-users] NFS over CEPH - best practice

On Fri, May 09, 2014 at 12:37:57PM +0100, Andrei Mikhailovsky wrote:
> Ideally I would like to have a setup with 2+ iscsi servers, so that I can 
> perform maintenance if necessary without shutting down the vms running on the 
> servers. I guess multipathing is what I need.
>
> Also I will need to have more than one xenserver/vmware host servers, so the 
> iscsi LUNs will be mounted on several servers.
>

So you have multiple machines talking to the same LUN at the same time ?

You'll have to co-ordinate how changes are written to the backing store, 
normally you'd have the virtualization servers use some kind of protocol.

When it's SCSI there are the older Reserve/Release commands and the newer 
SCSI-3 Persistent Reservation commands.

(i)SCSI allows multiple changes to be in-flight, without coordination things 
will go wrong.

Below it was mentioned that you can disable the cache for rbd, if you have no 
coordination protocol you'll need to do the same on the iSCSI-side.

I believe when you do that it will be slower, but it might work.

> Would the suggested setup not work for my requirements?
>

It depends on VMWare if they allow such a setup.

Then there is another thing: how do the VMware machines coordinate which VMs they should 
be running?

I don't know VMWare but usually if you have some kind of clustering setup 
you'll need to have a 'quorum'.

A lot of times the quorum is handled by a quorum disk with the SCSI coordination 
protocols mentioned above.

Another way to have a quorum is to have a majority voting system with an uneven number of 
machines talking over the network. This is what Ceph monitor nodes do.

An example of a clustering system that can be used without a quorum disk, with only 2 
machines talking over the network, is Linux Pacemaker. When something bad happens, one 
machine will just turn off the power of the other machine to prevent things going wrong 
(this is called STONITH).

> Andrei
> - Original Message -
>
> From: "Leen Besselink" mailto:l...@consolejunkie.net>>
> To: ceph-users@lists.ceph.com
> Sent: Thursday, 8 May, 2014 9:35:21 PM
> Subject: Re: [ceph-users] NFS over CEPH - best practice
>
> On Thu, May 08, 2014 at 01:24:17AM +0200, Gilles Mocellin wrote:
> > Le 07/05/2014 15:23, Vlad Gorbunov a écrit :
> > >It's easy to install tgtd with ceph support. ubuntu 12.04 for example:
> > >
> > >Connect ceph-extras repo:
> > >echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release
> > >-sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list
> > >
> > >Install tgtd with rbd support:
> > >apt-get update
> > >apt-get install tgt
> > >
> > >It's important to disable the rbd cache on tgtd host. Set in
> > >/etc/ceph/ceph.conf:
> > >[client]
> > >rbd_cache = false
> > [...]
> >
> > Hello,
> >
>
> Hi,
>
> > Without cache on the tgtd side, it should be possible to have
> > failover and load balancing (active/avtive) multipathing.
> > Have you tested multipath load balancing in this scenario ?
> >
> > If it's reliable, it opens a new way for me to do HA storage with iSCSI !
> >
>
> I have a question, what is your use case ?
>
> Do you need SCSI-3 persistent reservations so multiple machines can use the 
> same LUN at the same time ?
>
> Because in that case I think tgtd won't help you.
>
> Have a good day,

Re: [ceph-users] RBD Cloning

2014-04-24 Thread McNamara, Bradley
I believe any kernel greater than 3.9 supports format 2 RBD's.  I'm sure 
someone will correct me if this is a misstatement.
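
A quick way to check (a sketch; the image name is hypothetical, and on older releases the 
flag is spelled --format 2):

uname -r                                          # kernel version on the client
rbd create --image-format 2 --size 10240 rbd/test-img
rbd info rbd/test-img                             # should report format: 2
sudo rbd map rbd/test-img                         # fails on kernels without format 2 support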

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dyweni - Ceph-Users
Sent: Thursday, April 24, 2014 10:08 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] RBD Cloning

Hi,

Per the docs, I see that cloning is only supported in format 2, and that the 
kernel rbd module does not support format 2.

Is there another way to be able to mount a format 2 rbd image on a physical 
host without using the kernel rbd module?

One idea I had (not tested) is to use rbd-fuse to expose the rbd images and 
then use kpartx/device-mapper to mount from there...
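
Untested, but that idea would look roughly like this (the image and mount points are 
hypothetical; expect fuse to be noticeably slower than the kernel client):

mkdir -p /mnt/rbd-images
rbd-fuse -p rbd /mnt/rbd-images                   # each image in the pool appears as a file
LOOP=$(losetup -f --show /mnt/rbd-images/myimage)
kpartx -av "$LOOP"                                # creates /dev/mapper/loop0pN per partition
mount /dev/mapper/loop0p1 /mnt/data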

Thanks,
Dyweni

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] cluster_network ignored

2014-04-24 Thread McNamara, Bradley
Do you have all of the cluster IP's defined in the hosts file on each OSD server?  As I 
understand it, the mons do not use a cluster network, only the OSD servers.
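
One way to make the binding explicit (a hedged sketch; the addresses and OSD ID are only 
examples) is to pin each OSD's addresses in ceph.conf alongside the network-wide settings:

[osd.0]
  host = osd1
  public addr = 192.168.0.1
  cluster addr = 10.0.0.1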

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gandalf Corvotempesta
Sent: Thursday, April 24, 2014 8:54 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] cluster_network ignored

I'm trying to configure a small ceph cluster with both public and cluster 
networks.
This is my conf:

[global]
  public_network = 192.168.0/24
  cluster_network = 10.0.0.0/24
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  fsid = 004baba0-74dc-4429-84ec-1e376fb7bcad
  osd pool default pg num = 8192
  osd pool default pgp num = 8192
  osd pool default size = 3

[mon]
  mon osd down out interval = 600
  mon osd mon down reporters = 7
  [mon.osd1]
host = osd1
mon addr = 192.168.0.1
  [mon.osd2]
host = osd2
mon addr = 192.168.0.2
  [mon.osd3]
host = osd3
mon addr = 192.168.0.3

[osd]
  osd mkfs type = xfs
  osd journal size = 16384
  osd mon heartbeat interval = 30
  filestore merge threshold = 40
  filestore split multiple = 8
  osd op threads = 8
  osd recovery max active = 5
  osd max backfills = 2
  osd recovery op priority = 2


On each node I have bond0 bound to 192.168.0.x and bond1 bound to 10.0.0.x.  When Ceph is 
doing recovery, I can see replication traffic through bond0 (the public interface) and 
nothing via bond1 (the cluster interface).

Should I configure anything else ?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Live database files on Ceph

2014-04-04 Thread McNamara, Bradley
Take a look at ProxmoxVE.  Has full support for Ceph, is supported, and uses 
KVM/QEMU.

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brian Candler
Sent: Friday, April 04, 2014 1:44 AM
To: Brian Beverage; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Live database files on Ceph

On 03/04/2014 23:43, Brian Beverage wrote:
> Here is some info on what I am trying to accomplish. My goal here is 
> to find the least expensive way to get into Virtualization and storage 
> without the cost of a SAN and Proprietary software
...
> I have been
> tasked with taking a new start up project and basically trying to 
> incrementally move us into a VM environment without the use of a SAN.
Given those objectives, I'd suggest you also have a look at ganeti.

This won't give you the hyperscale storage of ceph, nor the remote 
S3/block/filesystem access. What it does give you is a clustered VM manager 
which can configure per-VM DRBD disk replication between pairs of nodes. So you 
basically get compute nodes with local storage, but with the ability to 
live-migrate VMs to their nominated secondary node, with no SAN or shared 
filesystem required. It's what Google use to run their internal office 
infrastructure, and is very actively developed and supported.

Of course, you still need to test it with your workload. If you're tuning, have 
a look at drbd >= 8.4.3:
http://blogs.linbit.com/p/469/843-random-writes-faster/

Ganeti can also manage VMs using the Ceph RBD backend (I haven't tried that yet). 
The not-yet-released ganeti 2.12 has significantly reworked this to use KVM's 
direct rbd protocol access, rather than going via the kernel rbd driver.
http://docs.ganeti.org/ganeti/master/html/design-ceph-ganeti-support.html

So even if you do go with ceph for the storage, I think you'll still find 
ganeti interesting as a way to manage the lifecycle of the VMs themselves.
http://www.slideshare.net/gpaterno1/comparing-iaas-vmware-vs-openstack-vs-googles-ganeti-28016375

Apologies if this is OT for this list.

Regards,

Brian.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Pool Count incrementing on each create even though I removed the pool each time

2014-03-18 Thread McNamara, Bradley
What you are seeing is expected behavior.  Pool numbers do not get reused; they increment 
up.  Pool names can be reused once they are deleted.  One note, though: if you delete and 
recreate the data pool, and want to use CephFS, you'll need to run 'ceph mds newfs 
<metadata pool id> <data pool id> --yes-i-really-mean-it' before mounting it.

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of matt.lat...@hgst.com
Sent: Tuesday, March 18, 2014 11:53 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Pool Count incrementing on each create even though I 
removed the pool each time


I am a novice Ceph user creating a simple 4-OSD default cluster (initially) and 
experimenting with RADOS BENCH to understand basic HDD (OSD) performance. For each 
iteration of 'rados bench -p data' I want the cluster OSDs in their initial state, i.e. 
0 objects. I assumed the easiest way was to remove and re-create the data pool each time.

While this appears to work, when I run 'ceph -s' it shows me the pool count is 
incrementing each time:

matt@redstar9:~$ sudo ceph -s
cluster c677f4c3-46a5-4ae1-b8aa-b070326c3b24
 health HEALTH_WARN clock skew detected on mon.redstar10, mon.redstar11
 monmap e1: 3 mons at
{redstar10=192.168.5.40:6789/0,redstar11=192.168.5.41:6789/0,redstar9=192.168.5.39:6789/0},
 election epoch 6, quorum 0,1,2 redstar10,redstar11,redstar9
 osdmap e52: 4 osds: 4 up, 4 in
  pgmap v5240: 136 pgs, 14 pools, 768 MB data, 194 objects
1697 MB used, 14875 GB / 14876 GB avail
 136 active+clean


even though lspools still only shows me the 3 default pools (metadata, rbd,
data)

Is this a bug, AND/OR, is there a better way to zero my cluster for these 
experiments?

Thanks,

Matt Latter

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] PG Calculations

2014-03-13 Thread McNamara, Bradley
There was a very recent thread discussing PG calculations, and it made me doubt 
my cluster setup.  So, Inktank, please provide some clarification.

I followed the documentation, and interpreted that documentation to mean that 
PG and PGP calculation was based upon a per-pool calculation.  The recent 
discussion introduced a slightly different formula adding in the total number 
of pools:

# OSD * 100 / 3

vs.

# OSD's * 100 / (3 * # pools)

My current cluster has 24 OSD's, replica size of 3, and the standard three 
pools, RBD, DATA, and METADATA.  My current total PG's is 3072, which by the 
second formula is way too many.  So, do I have too many?  Does it need to be 
addressed, or can it wait until I add more OSD's, which will bring the ratio 
closer to ideal?  I'm currently using only RBD and CephFS, no RadosGW.
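
Plugging the numbers above into the two formulas (just the arithmetic):

Per-cluster:  24 * 100 / 3        = 800   total PG's
Per-pool:     24 * 100 / (3 * 3)  ≈ 267   per pool, so roughly 800 across three pools

Either way, that is well below the 3072 currently in place (about 1024 per pool if split 
evenly).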

Thank you!

Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Scaling

2014-03-12 Thread McNamara, Bradley
Most things will cause data movement...

If you are going to have different failure zones within your crush map, I would edit your 
crush map and define those failure zones/buckets first.  This will immediately cause data 
movement when you inject the new crush map into the cluster.

Once the data movement from the new crush map is done, then I would change the 
number of placement groups.  This will immediately cause data movement, too.

If you have a cluster network defined and in use, this shouldn't materially 
affect the running cluster.  Response times may be exaggerated, but the cluster 
will be completely functional.
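
The crush map round trip itself looks roughly like this (a sketch; file names are 
arbitrary):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: add the rack/row buckets and move the hosts into them
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
ceph -w                                    # watch the resulting data movement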

Brad

From: Karol Kozubal [mailto:karol.kozu...@elits.com]
Sent: Wednesday, March 12, 2014 1:52 PM
To: McNamara, Bradley; ceph-users@lists.ceph.com
Subject: Re: PG Scaling

Thank you for your response.

The number of replicas is already set to 3. So if I simply increase the number of PGs, 
will they also start to move, or is that only triggered by size alterations? I suppose, 
since this will generate movement on the cluster network, it is ideal to do this operation 
while the cluster isn't as busy.

Karol


From: McNamara, Bradley <bradley.mcnam...@seattle.gov>
Date: Wednesday, March 12, 2014 at 1:54 PM
To: Karol Kozubal <karol.kozu...@elits.com>, ceph-users@lists.ceph.com
Subject: RE: PG Scaling

Round up your pg_num and pgp_num to the next power of 2, 2048.

Ceph will start moving data as soon as you implement the new 'size 3', so I 
would increase the pg_num and pgp_num, first, then increase the size.  It will 
start creating the new PG's immediately.  You can see all this going on using 
'ceph -w'.

Once the data is finished moving, you may need to  run 'ceph osd crush tunables 
optimal'.  This should take care of any unclean PG's that may be hanging around.

It is NOT possible to decrease the PG's.  One would need to  delete the pool 
and recreate it.

Brad

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Karol Kozubal
Sent: Wednesday, March 12, 2014 9:08 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG Scaling

Correction: Sorry min_size is at 1 everywhere.


Thank you.

Karol Kozubal

From: Karol Kozubal <karol.kozu...@elits.com>
Date: Wednesday, March 12, 2014 at 12:06 PM
To: ceph-users@lists.ceph.com
Subject: PG Scaling

Hi Everyone,

I am deploying OpenStack with Fuel 4.1 and have a 20-node Ceph deployment of C6220s with 
3 OSDs and 1 journaling disk per node. When first deployed, each storage pool is 
configured with the correct size and min_size attributes; however, Fuel doesn't seem to 
apply the correct number of PGs to the pools based on the number of OSDs that we actually 
have.

I make the adjustments using the following

(20 nodes * 3 OSDs)*100 / 3 replicas = 2000

ceph osd pool volumes set size 3
ceph osd pool volumes set min_size 3
ceph osd pool volumes set pg_num 2000
ceph osd pool volumes set pgp_num 2000

ceph osd pool images set size 3
ceph osd pool images set min_size 3
ceph osd pool images set pg_num 2000
ceph osd pool images set pgp_num 2000

ceph osd pool compute set size 3
ceph osd pool compute set min_size 3
ceph osd pool compute set pg_num 2000
ceph osd pool compute set pgp_num 2000

Here are the questions I am left with concerning these changes:

 1.  How long does it take for ceph to apply the changes and recalculate the 
pg's?
 2.  When is it safe to do this type of operation? before any data is written 
to the pools or is doing this while pools are used acceptable?
 3.  Is it possible to scale down the number of pg's ?
Thank you for your input.

Karol Kozubal
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Scaling

2014-03-12 Thread McNamara, Bradley
Round up your pg_num and pgp_num to the next power of 2, 2048.

Ceph will start moving data as soon as you implement the new 'size 3', so I 
would increase the pg_num and pgp_num, first, then increase the size.  It will 
start creating the new PG's immediately.  You can see all this going on using 
'ceph -w'.

Once the data is finished moving, you may need to  run 'ceph osd crush tunables 
optimal'.  This should take care of any unclean PG's that may be hanging around.
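
In command form, the sequence would look roughly like this (a sketch for the 'volumes' 
pool; note the argument order is 'ceph osd pool set <pool> <key> <value>'):

ceph osd pool set volumes pg_num 2048
ceph osd pool set volumes pgp_num 2048
ceph osd pool set volumes size 3
ceph -w                                    # watch the data movement
ceph osd crush tunables optimal            # only if unclean PG's remain afterwards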

It is NOT possible to decrease the PG's.  One would need to  delete the pool 
and recreate it.

Brad

From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Karol Kozubal
Sent: Wednesday, March 12, 2014 9:08 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG Scaling

Correction: Sorry min_size is at 1 everywhere.


Thank you.

Karol Kozubal

From: Karol Kozubal <karol.kozu...@elits.com>
Date: Wednesday, March 12, 2014 at 12:06 PM
To: ceph-users@lists.ceph.com
Subject: PG Scaling

Hi Everyone,

I am deploying OpenStack with Fuel 4.1 and have a 20-node Ceph deployment of C6220s with 
3 OSDs and 1 journaling disk per node. When first deployed, each storage pool is 
configured with the correct size and min_size attributes; however, Fuel doesn't seem to 
apply the correct number of PGs to the pools based on the number of OSDs that we actually 
have.

I make the adjustments using the following

(20 nodes * 3 OSDs)*100 / 3 replicas = 2000

ceph osd pool volumes set size 3
ceph osd pool volumes set min_size 3
ceph osd pool volumes set pg_num 2000
ceph osd pool volumes set pgp_num 2000

ceph osd pool images set size 3
ceph osd pool images set min_size 3
ceph osd pool images set pg_num 2000
ceph osd pool images set pgp_num 2000

ceph osd pool compute set size 3
ceph osd pool compute set min_size 3
ceph osd pool compute set pg_num 2000
ceph osd pool compute set pgp_num 2000

Here are the questions I am left with concerning these changes:

 1.  How long does it take for ceph to apply the changes and recalculate the 
pg's?
 2.  When is it safe to do this type of operation? before any data is written 
to the pools or is doing this while pools are used acceptable?
 3.  Is it possible to scale down the number of pg's ?
Thank you for your input.

Karol Kozubal
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon servers

2014-03-06 Thread McNamara, Bradley
I'm confused...

The bug tracker says this was resolved ten days ago.  Also, I actually used ceph-deploy on 
2/12/2014 to add two monitors to my cluster, and it worked, and the documentation says it 
can be done.  However, I believe I added the new mons to the 'mon_initial_members' line in 
ceph.conf before I deployed them.  Maybe this is the reason it worked?

Maybe I'm misunderstanding the issue?

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jonathan Gowar
Sent: Thursday, March 06, 2014 9:27 AM
To: Alfredo Deza
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mon servers

On Thu, 2014-03-06 at 09:02 -0500, Alfredo Deza wrote:
> > From the admin node:-
> > http://pastebin.com/AYKgevyF
> 
> Ah you added a monitor with ceph-deploy but that is not something that 
> is supported (yet)
> 
> See: http://tracker.ceph.com/issues/6638
> 
> This should be released in the upcoming ceph-deploy version.
> 
> But what it means is that you kind of deployed monitors that have no 
> idea how to communicate with the ones that were deployed before.

Fortunately the cluster is not production, so I actually was able to laugh at 
this :)

What's the best way to resolve this then?

Regards,
Jon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] Crush Maps

2014-02-06 Thread McNamara, Bradley
I have a test cluster that is up and running.  It consists of three mons, and 
three OSD servers, with each OSD server having eight OSD's and two SSD's for 
journals.  I'd like to move from the flat crushmap to a crushmap with typical 
depth using most of the predefined types.  I have the current crushmap 
decompiled and have edited it to add the additional depth of failure zones.

Questions:


1)  Do the ID's of the bucket types need to be consecutive, or can I make 
them up as long as they are negative in value and unique?

2)  Is there any way that I can control the assignment of the bucket type 
ID's if I were to update the crushmap on a running system using the CLI?

3)  Is there any harm in adding bucket types that are not currently used, 
but assigning them a weight of 0, so they aren't used (a row defined, with 
racks, but the racks have no hosts defined)?

4)  Can I have a bucket type with no "item" lines in it, or does each 
bucket type need at least one item declaration to be valid?

Example:
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host spucosds01 {
id -2   # do not change unnecessarily
# weight 29.120
alg straw
hash 0  # rjenkins1
item osd.0 weight 3.640
item osd.1 weight 3.640
item osd.2 weight 3.640
item osd.3 weight 3.640
item osd.4 weight 3.640
item osd.5 weight 3.640
item osd.6 weight 3.640
item osd.7 weight 3.640
}
host spucosds02 {
id -3   # do not change unnecessarily
# weight 29.120
alg straw
hash 0  # rjenkins1
item osd.8 weight 3.640
item osd.9 weight 3.640
item osd.10 weight 3.640
item osd.11 weight 3.640
item osd.12 weight 3.640
item osd.13 weight 3.640
item osd.14 weight 3.640
item osd.15 weight 3.640
}
host spucosds03 {
id -4   # do not change unnecessarily
# weight 29.120
alg straw
hash 0  # rjenkins1
item osd.16 weight 3.640
item osd.17 weight 3.640
item osd.18 weight 3.640
item osd.19 weight 3.640
item osd.20 weight 3.640
item osd.21 weight 3.640
item osd.22 weight 3.640
item osd.23 weight 3.640
}
rack rack2-2 {
id -220
alg straw
hash 0
item spucosds01 weight 29.12
}
rack rack3-2 {
id -230
alg straw
hash 0
item spucosds02 weight 29.12
}
rack rack4-2 {
id -240
alg straw
hash 0
item spucosds03 weight 29.12
}
row row1 {
id -100
alg straw
hash 0
}
row row2 {
id -200
alg straw
hash 0
item rack2-2 weight 29.12
item rack3-2 weight 29.12
item rack4-2 weight 29.12
}
datacenter smt {
id -1000
alg straw
hash 0
item row1 weight 0.0
item row2 weight 87.36
}
root default {
id -1   # do not change unnecessarily
# weight 87.360
alg straw
hash 0  # rjenkins1
item smt weight 87.36
}

# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues running vmfs on top of Ceph

2014-02-04 Thread McNamara, Bradley
Just for clarity, since I didn't see it explained: how are you accessing Ceph from ESXi?  
Is it via iSCSI or NFS?  Thanks.

Brad McNamara

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maciej Bonin
Sent: Tuesday, February 04, 2014 11:01 AM
To: Maciej Bonin; Mark Nelson; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

Hello again,

Having said that, we seem to have improved the performance by following 
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
after we figured out there might be a mismatch between reported and actually supported 
capabilities.
Thank you for your time Mark and Neil.

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements ISO 
27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 500 
EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Maciej Bonin
Sent: 04 February 2014 18:21
To: Mark Nelson; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

Hello Mark,

Thanks for getting back to me. We do have a couple of VMs running that were migrated off 
Xen and are fine; performance in rados bench is what one would expect (maxing the 4x 
gigabit bond).
The only other time I've noticed similar issues is when running mkfs.ext[3-4] on new 
images, which took ridiculously long on xen-pv and KVM, and even longer under ESXi. We 
have a VMFS image with configuration files for the guests, and when we try to wget an ISO 
into the shared config volume to install another VM via ESXi we don't get very far (we 
checked the uplink etc., and everything up to the way VMFS works on top of Ceph seems OK). 
My thought is that something in the way VMFS thin-provisions space is causing problems 
with Ceph's own thin provisioning; my colleague is testing different block sizes, with no 
luck so far in getting any sort of improvement.
And yes, we are using iSCSI via tgtd, from the ceph-extras repo I believe (in response to 
a message I just noticed come in while I was typing).

Regards,
Maciej Bonin
Systems Engineer | M247 Limited
M247.com  Connected with our Customers
Contact us today to discuss your hosting and connectivity requirements ISO 
27001 | ISO 9001 | Deloitte Technology Fast 50 | Deloitte Technology Fast 500 
EMEA | Sunday Times Tech Track 100
M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra Court, 
Manchester, M32 0QT
 
ISO 27001 Data Protection Classification: A - Public
 


-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: 04 February 2014 18:11
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Performance issues running vmfs on top of Ceph

On 02/04/2014 11:55 AM, Maciej Bonin wrote:
> Hello guys,
>
> We're testing running an ESXi hypervisor on top of a Ceph backend and we're getting 
> abysmal performance when using VMFS. Has anyone else tried this successfully? Any advice?
> Would be really thankful for any hints.

Hi!

I don't have a ton of experience with ESXi, but if you can do some rados bench or 
smalliobenchfs tests, that might help give you an idea whether the problem is Ceph (or 
lower), or more related to something higher up, closer to ESXi.  Can you describe a little 
more what you are seeing and what you expect?

Thanks,
Mark

>
> Regards,
> Maciej Bonin
> Systems Engineer | M247 Limited
> M247.com  Connected with our Customers Contact us today to discuss 
> your hosting and connectivity requirements ISO 27001 | ISO 9001 | 
> Deloitte Technology Fast 50 | Deloitte Technology Fast 500 EMEA | 
> Sunday Times Tech Track 100
> M247 Ltd, registered in England & Wales #4968341. 1 Ball Green, Cobra 
> Court, Manchester, M32 0QT
>
> ISO 27001 Data Protection Classification: A - Public
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] PG's and Pools

2014-01-29 Thread McNamara, Bradley
Unless I misunderstand this: three OSD servers, each with eight OSD's, for a total of 24 
OSD's.  The formula is (as I understand it):  Total PG's = 100 x 24 / 3.  And, I see an 
error in my simple math!  I should have said 800.  Is that more what you were expecting?

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Peter Matulis
Sent: Wednesday, January 29, 2014 8:11 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG's and Pools

On 01/28/2014 09:46 PM, McNamara, Bradley wrote:
> I finally have my first test cluster up and running.  No data on it, 
> yet.  The config is:  three mons, and three OSDS servers.  Each OSDS 
> server has eight 4TB SAS drives and two SSD journal drives.
> 
>  
> 
> The cluster is healthy, so I started playing with PG and PGP values.  
> By the provided calculation I determined that 600 PG/PGP was the 
> recommended value, so I set both to 600 (replication of 3).

How did get 600 PGs from 24 OSDs and a replication factor of 3?

/pm

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] PG's and Pools

2014-01-28 Thread McNamara, Bradley
I finally have my first test cluster up and running.  No data on it, yet.  The 
config is:  three mons, and three OSDS servers.  Each OSDS server has eight 4TB 
SAS drives and two SSD journal drives.

The cluster is healthy, so I started playing with PG and PGP values.  By the provided 
calculation I determined that 600 PG/PGP was the recommended value, so I set both to 600 
(replication of 3).  I decided that 600 might be too high, so I attempted to lower it to 
512 (the nearest power-of-2 below 600).  I then realized that I couldn't lower it, only 
raise it from the present value.  So, I deleted the pools that I changed (rbd and data).  
Then, I recreated the pools, but noticed that it created the two pools with pool numbers 
that were different from what it started with.

My two questions are:  is there any way to recreate the pools with the original pool 
number value that they had?  Does it really matter that the data pool is now 3, and rbd 
is now 4?

Any other recommendations that the list can provide regarding PG's and PGP's?

Thanks!

Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread McNamara, Bradley
Correct me if I'm wrong, I'm new to this, but I think the distinction between 
the two methods is that using 'qemu-img create -f rbd' creates an RBD for 
either a VM to boot from, or for mounting within a VM.  Whereas, the OP wants a 
single RBD, formatted with a cluster file system, to use as a place for 
multiple VM image files to reside.

I've often contemplated this same scenario, and would be quite interested in 
different ways people have implemented their VM infrastructure using RBD.  I 
guess one of the advantages of using 'qemu-img create -f rbd' is that a 
snapshot of a single RBD would capture just the changed RBD data for that VM, 
whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it, 
would capture changes of all the VM's, not just one.  It might provide more 
administrative agility to use the former.

Also, I guess another question would be: when an RBD is expanded, does the underlying VM 
that is created using 'qemu-img create -f rbd' need to be rebooted to "see" the 
additional space?  My guess would be yes.

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
> 
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
> 
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
> 
> Use :
> 
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same 
filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

-- 
Alex Bligh




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Might Be Spam -RE: Mounting a shared block device on multiple hosts

2013-05-29 Thread McNamara, Bradley
Jon;

For all intents and purposes, an RBD device is treated by the system/OS as a 
physical disk, albeit attached via the network where multiple servers store the 
data (ceph cluster).  Once the RBD device is created, one needs to format the 
device using any one of a number of file systems (ext4, btrfs, xfs, etc.).  If 
one wants/needs to share that RBD device, one will need to use a cluster-aware 
file system in place of ext4, xfs, etc., like gfs, ocfs2, etc.  This, however, 
adds a layer of complexity that you may not be looking for.

If you need to share the file system amongst several hosts, with concurrent 
reading/writing, then you’ll probably need to look into using CephFS if you 
want to continue with the notion of using a Ceph cluster.  CephFS, however, 
isn’t currently considered “production ready”, although many are using it in 
this manner.

Brad

From: Jon [mailto:three1...@gmail.com]
Sent: Wednesday, May 29, 2013 11:47 AM
To: McNamara, Bradley
Cc: ceph-users
Subject: Might Be Spam -RE: [ceph-users] Mounting a shared block device on 
multiple hosts

Hello Bradley,

Please excuse my ignorance, I am new to CEPH and what I thought was a good 
understanding of file systems has clearly been shown to be inadequate.

Maybe I'm asking the question wrong because I keep getting the same answer.

I guess I don't understand what a clustered filesystem is.  I thought CEPH was 
a clustered file system, and the Wikipedia article on clustered file systems 
doesn't offer any disambiguation as CEPH is listed as a clustered filesystem, 
albeit under the distributed heading.  There is a shared disk / storage area 
network section, but, with the exception of GFS, these look like SAN 
technologies, not just file systems.

I guess my question boils down to: "what format should I format my rbd so that 
I can read and write to it from multiple diverse hosts"?

Can I format my rbd as gfs2 and call it a day? Or do I need to use a technology 
like glusterFS, or mount the rbd on one host and export it as an NFS Mount? 
(Would really like to avoid NFS if at all possible, but if that's the solution, 
then that's the solution).

Thanks for your patience with me. I really feel like an idiot asking, but I 
really have no where else to turn.

Thanks,
Jon A
On May 29, 2013 12:01 PM, "McNamara, Bradley" <bradley.mcnam...@seattle.gov> wrote:
Instead of using ext4 for the file system, you need to use a clustered file 
system on the RBD device.

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jon
Sent: Wednesday, May 29, 2013 7:55 AM
To: Igor Laskovy
Cc: ceph-users
Subject: Re: [ceph-users] Mounting a shared block device on multiple hosts


Hello Igor,

Thanks for getting back to me.

> > You can map it to multiple hosts, but before doing  dd if=/dev/zero 
> > of=/media/tmp/test you have created file system, right

Correct, I can't mount /dev/rbd/rbd/test-device without first creating a file 
system on the device.  Now, I am creating an ext4 filesystem with the -m0 flag 
so the metadata is contained within the filesystem. I did not create a 
partition on the device instead opting to format the whole device (this is how 
the rbd guide does it).

>> This file system MUST be distributed, thus multiple hosts can read and write 
>> files on it.

I'm not sure I understand.  Are you implying I need to use some OTHER network 
filesystem (e.g. glusterfs) -on top- of ceph? Or are you saying my ext4 file 
system should be distributed so multiple hosts should be able to read and write 
to/from the ext4 filesystem?

If it's the former, this seems counterintuitive, but if that's what needs to 
happen I guess I'll make it so.  If it's the latter, then something is not 
right as my hosts are not all able to read and write to the ext4 filesystem.  
I'm not sure how else I can test / prove it other than writing a file from each 
of my hosts and that file not being accessible from all of my hosts, can you 
provide some further troubleshooting steps?  Could it be the filesystem type? 
Do I need to use xfs or btrfs if I want to map the block device to multiple 
hosts? Does CEPH not work as a client where it is running as a service?

Thanks for your help,
Jon A
On May 29, 2013 12:47 AM, "Igor Laskovy" <igor.lask...@gmail.com> wrote:
Hi Jon, I already mentioned multiple times here - RBD is just a block device. You can map 
it to multiple hosts, but before doing 'dd if=/dev/zero of=/media/tmp/test' you have 
created a file system, right? This file system MUST be distributed, so that multiple 
hosts can read and write files on it.

On Wed, May 29, 2013 at 4:24 AM, Jon <three1...@gmail.com> wrote:
Hello,

I would like to mount a single RBD on multiple hosts to be able

Re: [ceph-users] Mounting a shared block device on multiple hosts

2013-05-29 Thread McNamara, Bradley
Instead of using ext4 for the file system, you need to use a clustered file 
system on the RBD device.

From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jon
Sent: Wednesday, May 29, 2013 7:55 AM
To: Igor Laskovy
Cc: ceph-users
Subject: Re: [ceph-users] Mounting a shared block device on multiple hosts


Hello Igor,

Thanks for getting back to me.

> > You can map it to multiple hosts, but before doing  dd if=/dev/zero 
> > of=/media/tmp/test you have created file system, right

Correct, I can't mount /dev/rbd/rbd/test-device without first creating a file 
system on the device.  Now, I am creating an ext4 filesystem with the -m0 flag 
so the metadata is contained within the filesystem. I did not create a 
partition on the device instead opting to format the whole device (this is how 
the rbd guide does it).

>> This file system MUST be distributed, thus multiple hosts can read and write 
>> files on it.

I'm not sure I understand.  Are you implying I need to use some OTHER network 
filesystem (e.g. glusterfs) -on top- of ceph? Or are you saying my ext4 file 
system should be distributed so multiple hosts should be able to read and write 
to/from the ext4 filesystem?

If it's the former, this seems counterintuitive, but if that's what needs to 
happen I guess I'll make it so.  If it's the latter, then something is not 
right as my hosts are not all able to read and write to the ext4 filesystem.  
I'm not sure how else I can test / prove it other than writing a file from each 
of my hosts and that file not being accessible from all of my hosts, can you 
provide some further troubleshooting steps?  Could it be the filesystem type? 
Do I need to use xfs or btrfs if I want to map the block device to multiple 
hosts? Does CEPH not work as a client where it is running as a service?

Thanks for your help,
Jon A
On May 29, 2013 12:47 AM, "Igor Laskovy" <igor.lask...@gmail.com> wrote:
Hi Jon, I already mentioned multiple times here - RBD is just a block device. You can map 
it to multiple hosts, but before doing 'dd if=/dev/zero of=/media/tmp/test' you have 
created a file system, right? This file system MUST be distributed, so that multiple 
hosts can read and write files on it.

On Wed, May 29, 2013 at 4:24 AM, Jon <three1...@gmail.com> wrote:
Hello,

I would like to mount a single RBD on multiple hosts to be able to share the 
block device.
Is this possible?  I understand that it's not possible to share data between 
the different interfaces, e.g. CephFS and RBDs, but I don't see anywhere it's 
declared that sharing an RBD between hosts is or is not possible.

I have followed the instructions on the github page of ceph-deploy (I was following the 
5 minute quick start http://ceph.com/docs/next/start/quick-start/ but when I got to the 
step with mkcephfs it erred out and pointed me to the github page).  As I only have three 
servers, I am running the OSDs and monitors on all of the hosts; I realize this isn't 
ideal, but I'm hoping it will work for testing purposes.

This is what my cluster looks like:

>> root@red6:~# ceph -s
>>health HEALTH_OK
>>monmap e2: 3 mons at 
>> {kitt=192.168.0.35:6789/0,red6=192.168.0.40:6789/0,shepard=192.168.0.2:6789/0},
>>  election epoch 10, quorum 0,1,2 kitt,red6,shepard
>>osdmap e29: 5 osds: 5 up, 5 in
>> pgmap v1692: 192 pgs: 192 active+clean; 19935 MB data, 40441 MB used, 
>> 2581 GB / 2620 GB avail; 73B/s rd, 0op/s
>>mdsmap e1: 0/0/1 up

To test, what I have done is created a 20GB RBD mapped it and mounted it to 
/media/tmp on all the hosts in my cluster, so all of the hosts are also clients.

Then I use dd to create a 1MB file named test-$hostname

>> dd if=/dev/zero of=/media/tmp/test-`hostname` bs=1024 count=1024;

after the file is created, I wait for the writes to finish in `ceph -w`, then 
on each host when I list /media/tmp I see the results of 
/media/tmp/test-`hostname`, if I unmount then remount the RBD, I get mixed 
results.  Typically, I see the file that was created on the host that is at the 
front of the line in the quorum. e.g. the test I did while typing this e-mail 
"kitt" is listed first quorum 0,1,2 kitt,red6,shepard, this is the file I see 
created when I unmount then mount the rbd on shepard.

Where this is going is, I would like to use CEPH as my back end storage 
solution for my virtualization cluster.  The general idea is the hypervisors 
will all have a shared mountpoint that holds images and vms so vms can easily 
be migrated between hypervisors.  Actually, I was thinking I would create one 
mountpoint each for images and vms for performance reasons, am I likely to see 
performance gains using more smaller RBDs vs fewer larger RBDs?

Thanks for any feedback,
Jon A

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://li

Re: [ceph-users] Concurrent access to Ceph filesystems

2013-03-01 Thread McNamara, Bradley
I'm new, too, and I guess I just need a little clarification on Greg's 
statement.  The RBD filesystem is mounted to multiple VM servers, say, in a 
Proxmox cluster, and as long as any one VM image file on that filesystem is 
only being accessed from one node of the cluster, everything will work, and 
that's the way shared storage is intended to work within Ceph/RBD.  Correct?

I can understand things blowing up if the same VM image file is being accessed 
from multiple nodes in the cluster, and that's where a clustered filesystem 
comes into play.

I guess in my mind/world I was envisioning a group of VM servers using one 
large RBD volume, that is mounted to each VM server in the group, to store the 
VM images for all the VM's in the group of VM servers.  This way the VM's could 
migrate to any VM server in the group using the RBD volume.

No?

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: Friday, March 01, 2013 2:13 PM
To: Karsten Becker
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Concurrent access to Ceph filesystems

On Fri, Mar 1, 2013 at 1:53 PM, Karsten Becker  
wrote:
> Hi,
>
> I'm new to Ceph. I currently find no answer in the official docs for 
> the following question.
>
> Can Ceph filesystems be used concurrently by clients, both when 
> accessing via RBD and CephFS? Concurrently means in terms of multiple 
> clients accessing and writing on the same Ceph volume (like it is 
> possible with OCFS2) and, in the extreme, the same file at the same time.
> Or is Ceph a "plain" distributed filesystem?

CephFS supports this very nicely, though it is of course not yet production 
ready for most users. RBD provides block device semantics - you can mount it 
from multiple hosts, but if you aren't using cluster-aware software on top of 
it you won't like the results (eg, you could run OCFS2 on top of RBD, but 
running ext4 on top of it will work precisely as well as doing so would with a 
regular hard drive that you somehow managed to plug into two systems at once).
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com