[ceph-users] radosgw public url

2013-12-20 Thread Quenten Grasso
Hi All,

Does radosgw support a "public URL" for static content?

I'd like to share a file publicly without giving out
usernames/passwords etc.

I noticed that http://ceph.com/docs/master/radosgw/swift/ says static
websites aren't supported, which I assume refers to this feature, but I'm
just not 100% sure.
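
For what it's worth, on the S3 API side of radosgw an object can usually
be made world-readable with a public-read ACL and then fetched without
any credentials. A minimal sketch with s3cmd (bucket, object and gateway
names are placeholders, and s3cmd is assumed to already be configured
against the gateway):

s3cmd put report.pdf s3://public-bucket/report.pdf --acl-public
# or, for an object that already exists:
s3cmd setacl s3://public-bucket/report.pdf --acl-public
# the object should then be reachable anonymously at something like:
#   http://<rgw-host>/public-bucket/report.pdf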

Cheers,
Quenten


[ceph-users] OSD-hierarchy and crush

2013-12-20 Thread Udo Lembke
Hi,
yesterday I expanded our 3-node ceph cluster with a fourth node
(13 additional OSDs; all OSDs have the same size (4TB)).

I used the same command as before to add the OSDs and change the weight:
ceph osd crush set 44 0.2 pool=default rack=unknownrack host=ceph-04

But ceph osd tree shows the new OSDs are not below unknownrack, and the
weighting seems to be different (with a weight of 0.8 the OSDs were almost
full, so I switched back to 0.6):
root@ceph-04:~# ceph osd tree
# id    weight  type name               up/down reweight
-1      46.8    root default
-3      39              rack unknownrack
-2      13                      host ceph-01
0       1                               osd.0   up      1
1       1                               osd.1   up      1
...
27      1                               osd.27  up      1
28      1                               osd.28  up      1
-4      13                      host ceph-02
10      1                               osd.10  up      1
11      1                               osd.11  up      1
...
32      1                               osd.32  up      1
33      1                               osd.33  up      1
-5      13                      host ceph-03
16      1                               osd.16  up      1
18      1                               osd.18  up      1
...
37      1                               osd.37  up      1
38      1                               osd.38  up      1
-6      7.8             host ceph-04
39      0.6                     osd.39  up      1
40      0.6                     osd.40  up      1
...
50      0.6                     osd.50  up      1
51      0.6                     osd.51  up      1

How can I make ceph-04 part of rack unknownrack?
If I change that, would the content of the OSDs on ceph-04 stay roughly
the same, or would the whole content move again?
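
For reference, a hedged sketch of how a host bucket is normally re-parented
under a rack (bucket names taken from the tree above; double-check against
your own CRUSH map and Ceph version before running it):

# Move the existing host bucket, together with its OSDs, under the rack:
ceph osd crush move ceph-04 rack=unknownrack
# Verify the new hierarchy:
ceph osd tree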

Thanks for feedback!

regards

Udo






[ceph-users] Ceph RAM Requirement?

2013-12-20 Thread hemant burman
Hello,

We have boxes with 24 drives, 2TB each, and want to run one OSD per drive.
What should the ideal memory requirement for the system be, keeping in mind
OSD rebalancing and the failure/replication of, say, 10-15TB of data?

-Hemant


Re: [ceph-users] Performance questions (how original, I know)

2013-12-20 Thread Christian Balzer

Hello Gilles,

On Fri, 20 Dec 2013 21:04:45 +0100 Gilles Mocellin wrote:

> Le 20/12/2013 03:51, Christian Balzer a écrit :
> > Hello Mark,
> >
> > On Thu, 19 Dec 2013 17:18:01 -0600 Mark Nelson wrote:
> >
> >> On 12/16/2013 02:42 AM, Christian Balzer wrote:
> >>> Hello,
> >> Hi Christian!
> >>
> >>> new to Ceph, not new to replicated storage.
> >>> Simple test cluster with 2 identical nodes running Debian Jessie,
> >>> thus ceph 0.48. And yes, I very much prefer a distro supported
> >>> package.
> >> I know you'd like to use the distro package, but 0.48 is positively
> >> ancient at this point.  There's been a *lot* of fixes/changes since
> >> then.  If it makes you feel better, our current professionally
> >> supported release is based on dumpling.
> >>
> > Oh well, I assume 0.48 was picked due to the "long term support" title
> > (and thus one would hope it received a steady stream of backported
> > fixes at least ^o^).
> > There is 0.72 in unstable, so for testing I will just push that test
> > cluster to sid and see what happens.
> > As well as poke the Debian maintainer for a wheezy backport if
> > possible, if not I'll use the source package to roll my own binary
> > packages.
> In this case, why don't you want to use the ceph repository, which has
> packages for Debian Wheezy?
> 
Ahahaha, now that is another bit of welcome information.

I of course searched for this, but the only search result (top one as
well) for "ceph debian packages" that resides on the ceph.com site is the
broken (looping back to itself) link at:
http://ceph.com/uncategorized/debian-packages/

> Repository : http://ceph.com/debian/
> Documentation : 
> http://ceph.com/docs/master/start/quick-start-preflight/#advanced-package-tool-apt
> 

Thanks a lot,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Failure probability with largish deployments

2013-12-20 Thread Kyle Bader
Using your data as inputs to the Ceph reliability calculator [1]
results in the following:

Disk Modeling Parameters
    size:           3TiB
    FIT rate:       826 (MTBF = 138.1 years)
    NRE rate:       1.0E-16
RAID parameters
    replace:        6 hours
    recovery rate:  500MiB/s (100 minutes)
    NRE model:      fail
    object size:    4MiB
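
As an aside, the MTBF figure follows directly from the FIT rate, assuming
the usual definition (FIT = failures per 10^9 device-hours):

$ echo 'scale=1; 10^9 / 826 / 8766' | bc   # 1e9 / FIT / hours-per-year
138.1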

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage          durability   PL(site)    PL(copies)  PL(NRE)   PL(rep)     loss/PiB
--------------   ----------   ---------   ----------  --------  ---------   ---------
RAID-6: 9+2      6-nines      0.000e+00   2.763e-10   0.11%     0.000e+00   9.317e+07


Disk Modeling Parameters
    size:           3TiB
    FIT rate:       826 (MTBF = 138.1 years)
    NRE rate:       1.0E-16
RADOS parameters
    auto mark-out:  10 minutes
    recovery rate:  50MiB/s (40 seconds/drive)
    osd fullness:   75%
    declustering:   1100 PG/OSD
    NRE model:      fail
    object size:    4MB
    stripe length:  1100

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage          durability   PL(site)    PL(copies)  PL(NRE)     PL(rep)     loss/PiB
--------------   ----------   ---------   ----------  ----------  ---------   ---------
RADOS: 3 cp      10-nines     0.000e+00   5.232e-08   0.000116%   0.000e+00   6.486e+03

[1] https://github.com/ceph/ceph-tools/tree/master/models/reliability

-- 

Kyle


[ceph-users] Best way to replace a failed drive?

2013-12-20 Thread Howie C.
Hello Guys, 

I wonder what the best way is to replace a failed OSD, instead of removing it
from CRUSH and adding a new one in. As I have the OSD numbers assigned in
ceph.conf, adding a new OSD might require revising the config file and
reloading all the ceph instances.

BTW, any suggestions for my ceph.conf? I have a feeling that assigning an IP
address for each OSD is not a smart way to do it. Please advise.
Thank you.

[global]
fsid = 638d7b4a-e5f1-4cfd-9c83-25177d5a8d3f
mon_initial_members = mon01,ceph01,ceph02
mon_host = 10.123.11.91,10.123.11.111,10.123.11.112
auth_supported = cephx
filestore_xattr_use_omap = true
max_open_files = 131072
public_network = 10.123.11.0/24
cluster_network = 10.234.11.0/24

[osd.0]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111

[osd.1]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111

[osd.2]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111

[osd.3]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111

[osd.4]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111

[osd.5]
public_addr = 10.123.11.111
cluster_addr = 10.234.11.111

[osd.6]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112

[osd.7]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112

[osd.8]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112

[osd.9]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112

[osd.10]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112

[osd.11]
public_addr = 10.123.11.112
cluster_addr = 10.234.11.112

[osd.12]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113

[osd.13]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113

[osd.14]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113

[osd.15]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113

[osd.16]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113

[osd.17]
public_addr = 10.123.11.113
cluster_addr = 10.234.11.113

[osd.18]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114

[osd.19]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114

[osd.20]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114

[osd.21]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114

[osd.22]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114

[osd.23]
public_addr = 10.123.11.114
cluster_addr = 10.234.11.114
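
For the ceph.conf question: since public_network and cluster_network are
already set in [global], the per-OSD address sections are generally not
needed; each daemon picks an address inside those subnets on its own. A
hedged sketch of the trimmed file (verify on a test node first):

[global]
fsid = 638d7b4a-e5f1-4cfd-9c83-25177d5a8d3f
mon_initial_members = mon01,ceph01,ceph02
mon_host = 10.123.11.91,10.123.11.111,10.123.11.112
auth_supported = cephx
filestore_xattr_use_omap = true
max_open_files = 131072
public_network = 10.123.11.0/24
cluster_network = 10.234.11.0/24

# No [osd.N] sections: each OSD binds to an address within
# public_network / cluster_network automatically.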


-- 
Howie C.



Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

2013-12-20 Thread Dan van der Ster
On Fri, Dec 20, 2013 at 6:19 PM, James Pearce  wrote:
>
> "fio --size=100m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=10
> --rw=read --name=fiojob --blocksize_range=4K-512k --iodepth=16"
>
> Since size=100m, reads would be entirely cached

--invalidate=1 drops the cache, no? Our results of that particular fio
test are consistently just under 1Gb/s on varied VMs running on varied
HVs.

BTW, look what happens when you don't drop the cache:

# fio --size=100m --ioengine=libaio --invalidate=0 --direct=0
--numjobs=10 --rw=read --name=fiojob --blocksize_range=4K-512k | grep
READ
   READ: io=1000.0MB, aggrb=4065.5MB/s, minb=416260KB/s, maxb=572067KB/s, mint=179msec, maxt=246msec

> and, if hypervisor is
> write-back, potentially many writes would never make it to the cluster as
> well?

Maybe you're right, but only if fio in randwrite mode overwrites the
same address many times (does it??), and the rbd cache discards
overwritten writes (does it??). By observation, I can say for certain
that when we have those 10 VMs running these benchmarks in a while 1
loop, our cluster becomes quite busy.
Cheers, Dan
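
One way to take the write-cache question out of the equation is to force a
flush at the end of the job; a hedged sketch using stock fio options:

fio --size=100m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=10 \
    --rw=randwrite --name=fiojob --blocksize_range=4K-512k --iodepth=16 \
    --end_fsync=1
# --direct=1 bypasses the guest page cache; --end_fsync=1 issues an fsync
# when the job finishes, so writeback-cached data has to reach the cluster
# before the result is reported.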


>
> Sorry if I've misunderstood :)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rebooting nodes in a ceph cluster

2013-12-20 Thread Sage Weil
On Fri, 20 Dec 2013, Derek Yarnell wrote:
> On 12/19/13, 7:51 PM, Sage Weil wrote:
> >> If it takes 15 minutes for one of my servers to reboot is there a risk
> >> that some sort of needless automatic processing will begin?
> > 
> > By default, we start rebalancing data after 5 minutes.  You can adjust 
> > this (to, say, 15 minutes) with
> > 
> >  mon osd down out interval = 900
> > 
> > in ceph.conf.
> > 
> 
> Will Ceph detect if the OSDs come back while it is re-balancing and stop?

Yep!

sage
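
For planned reboots there is also the commonly used option of suppressing
the automatic mark-out entirely for the maintenance window (a quick sketch):

ceph osd set noout     # before rebooting the node
# ... reboot, wait for the OSDs to rejoin ...
ceph osd unset noout   # restore normal mark-out behaviour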


Re: [ceph-users] rebooting nodes in a ceph cluster

2013-12-20 Thread Derek Yarnell
On 12/19/13, 7:51 PM, Sage Weil wrote:
>> If it takes 15 minutes for one of my servers to reboot is there a risk
>> that some sort of needless automatic processing will begin?
> 
> By default, we start rebalancing data after 5 minutes.  You can adjust 
> this (to, say, 15 minutes) with
> 
>  mon osd down out interval = 900
> 
> in ceph.conf.
> 

Will Ceph detect if the OSDs come back while it is re-balancing and stop?

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: [ceph-users] Performance questions (how original, I know)

2013-12-20 Thread Gilles Mocellin

Le 20/12/2013 03:51, Christian Balzer a écrit :

Hello Mark,

On Thu, 19 Dec 2013 17:18:01 -0600 Mark Nelson wrote:


On 12/16/2013 02:42 AM, Christian Balzer wrote:

Hello,

Hi Christian!


new to Ceph, not new to replicated storage.
Simple test cluster with 2 identical nodes running Debian Jessie, thus
ceph 0.48. And yes, I very much prefer a distro supported package.

I know you'd like to use the distro package, but 0.48 is positively
ancient at this point.  There's been a *lot* of fixes/changes since
then.  If it makes you feel better, our current professionally supported
release is based on dumpling.


Oh well, I assume 0.48 was picked due to the "long term support" title
(and thus one would hope it received a steady stream of backported fixes
at least ^o^).
There is 0.72 in unstable, so for testing I will just push that test
cluster to sid and see what happens.
As well as poke the Debian maintainer for a wheezy backport if possible,
if not I'll use the source package to roll my own binary packages.
In this case, why don't you want to use the ceph repository, which has
packages for Debian Wheezy?


Repository : http://ceph.com/debian/
Documentation : 
http://ceph.com/docs/master/start/quick-start-preflight/#advanced-package-tool-apt
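
For completeness, the preflight page above boils down to roughly the
following on wheezy (the release key URL and release name are assumptions;
check the linked documentation for the current form):

wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
echo "deb http://ceph.com/debian/ wheezy main" | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get install ceph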



[...]




Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Tim Bishop
Hi Wido,

Thanks for the reply.

On Fri, Dec 20, 2013 at 08:14:13AM +0100, Wido den Hollander wrote:
> On 12/18/2013 09:39 PM, Tim Bishop wrote:
> > I'm investigating and planning a new Ceph cluster starting with 6
> > nodes with currently planned growth to 12 nodes over a few years. Each
> > node will probably contain 4 OSDs, maybe 6.
> >
> > The area I'm currently investigating is how to configure the
> > networking. To avoid a SPOF I'd like to have redundant switches for
> > both the public network and the internal network, most likely running
> > at 10Gb. I'm considering splitting the nodes in to two separate racks
> > and connecting each half to its own switch, and then trunk the
> > switches together to allow the two halves of the cluster to see each
> > other. The idea being that if a single switch fails I'd only lose half
> > of the cluster.
> 
> Why not three switches in total and use VLANs on the switches to 
> separate public/cluster traffic?
> 
> This way you can configure the CRUSH map to have one replica go to each 
> "switch" so that when you loose a switch you still have two replicas 
> available.
> 
> Saves you a lot of switches and makes the network simpler.

I was planning to use VLANs to separate the public and cluster traffic
on the same switches.

Two switches costs less than three switches :-) I think on a slightly
larger scale cluster it might make more sense to go up to three (or even
more) switches, but I'm not sure the extra cost is worth it at this
level. I was planning two switches, using VLANs to separate the public
and cluster traffic, and connecting half of the cluster to each switch.

> > (I'm not touching on the required third MON in a separate location and
> > the CRUSH rules to make sure data is correctly replicated - I'm happy
> > with the setup there)
> >
> > To allow consumers of Ceph to see the full cluster they'd be directly
> > connected to both switches. I could have another layer of switches for
> > them and interlinks between them, but I'm not sure it's worth it on
> > this sort of scale.
> >
> > My question is about configuring the public network. If it's all one
> > subnet then the clients consuming the Ceph resources can't have both
> > links active, so they'd be configured in an active/standby role. But
> > this results in quite heavy usage of the trunk between the two
> > switches when a client accesses nodes on the other switch than the one
> > they're actively connected to.
> >
> 
> Why can't the clients have both links active? You could use LACP? Some 
> switches support mlag to span LACP trunks over two switches.
> 
> Or use some intelligent bonding mode in the Linux kernel.

I've only ever used LACP to the same switch, and I hadn't realised there
were options for spanning LACP links across multiple switches. Thanks
for the information there.

> > So, can I configure multiple public networks? I think so, based on the
> > documentation, but I'm not completely sure. Can I have one half of the
> > cluster on one subnet, and the other half on another? And then the
> > client machine can have interfaces in different subnets and "do the
> > right thing" with both interfaces to talk to all the nodes. This seems
> > like a fairly simple solution that avoids a SPOF in Ceph or the network
> > layer.
> 
> There is no restriction on the IPs of the OSDs. All they need is a Layer 
> 3 route to the WHOLE cluster and monitors.
> 
> It doesn't have to be a Layer 2 network; everything can simply be
> Layer 3. You just have to make sure all the nodes can reach each other.

Thanks, that makes sense and makes planning simpler. I suppose it's
logical really... in a HUGE cluster you'd probably have a whole manner
of networks spread around the datacenter.

> > Or maybe I'm missing an alternative that would be better? I'm aiming
> > for something that keeps things as simple as possible while meeting
> > the redundancy requirements.
> >
>            client
>              |
>              |
>         core switch
>          /   |   \
>         /    |    \
>        /     |     \
>    switch1 switch2 switch3
>       |       |       |
>      OSD     OSD     OSD
> 
> 
> You could build something like that. That would be fairly simple.

Isn't the core switch in that diagram a SPOF? Or is it presumed to
already be a redundant setup?

> Keep in mind that you can always lose a switch and still keep I/O going.
> 
> Wido

Thanks for your help. You answered my main point about IP addressing on
the public side, and gave me some other stuff to think about.

Tim.

> > As an aside, there's a similar issue on the cluster network side with
> > heavy traffic on the trunk between the two cluster switches. But I
> > can't see that's avoidable, and presumably it's something people just
> > have to deal with in larger Ceph installations?
> >
> > Finally, this is all theoretical planning 

Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Kyle Bader
> The area I'm currently investigating is how to configure the
> networking. To avoid a SPOF I'd like to have redundant switches for
> both the public network and the internal network, most likely running
> at 10Gb. I'm considering splitting the nodes in to two separate racks
> and connecting each half to its own switch, and then trunk the
> switches together to allow the two halves of the cluster to see each
> other. The idea being that if a single switch fails I'd only lose half
> of the cluster.

This is fine if you are using a replication factor of 2; with a replication
factor of 3 and "osd pool default min size" set to 2, you would need 2/3 of
the cluster to survive.

> My question is about configuring the public network. If it's all one
> subnet then the clients consuming the Ceph resources can't have both
> links active, so they'd be configured in an active/standby role. But
> this results in quite heavy usage of the trunk between the two
> switches when a client accesses nodes on the other switch than the one
> they're actively connected to.

The linux bonding driver supports several strategies for teaming network
adapters on L2 networks.

> So, can I configure multiple public networks? I think so, based on the
> documentation, but I'm not completely sure. Can I have one half of the
> cluster on one subnet, and the other half on another? And then the
> client machine can have interfaces in different subnets and "do the
> right thing" with both interfaces to talk to all the nodes. This seems
> like a fairly simple solution that avoids a SPOF in Ceph or the network
> layer.

You can have multiple networks for both the public and cluster networks,
the only restriction is that all subnets for a given type be within the same
supernet. For example

10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2
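
In ceph.conf that corresponds to roughly the following (a sketch using the
example supernets above):

[global]
public_network  = 10.0.0.0/16
cluster_network = 10.1.0.0/16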

> Or maybe I'm missing an alternative that would be better? I'm aiming
> for something that keeps things as simple as possible while meeting
> the redundancy requirements.
>
> As an aside, there's a similar issue on the cluster network side with
> heavy traffic on the trunk between the two cluster switches. But I
> can't see that's avoidable, and presumably it's something people just
> have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in
each rack, this means every write is going to cross the trunk. Other
traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/Monitor traffic in the case where an OSD is connected to a
  monitor connected in the adjacent rack (map updates, heartbeats).
* OSD/Client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR) then your
cluster level bandwidth is oversubscribed 4:1. To lower oversubscription
you are going to have to steal some of the other 48 ports, 12 for 2:1 and
24 for a non-blocking fabric. Given the number of nodes you have/plan to
have you will be utilizing 6-12 links per switch, leaving you with 12-18
links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1.

-- 

Kyle


Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

2013-12-20 Thread James Pearce


"fio --size=100m --ioengine=libaio --invalidate=1 --direct=1 
--numjobs=10 --rw=read --name=fiojob --blocksize_range=4K-512k 
--iodepth=16"


Since size=100m, reads would be entirely cached and, if the hypervisor is
write-back, potentially many writes would never make it to the cluster
as well?


Sorry if I've misunderstood :)



Re: [ceph-users] rebooting nodes in a ceph cluster

2013-12-20 Thread Simon Leinen
David Clarke writes:
> Not directly related to Ceph, but you may want to investigate kexec[0]
> ('kexec-tools' package in Debian derived distributions) in order to
> get your machines rebooting quicker.  It essentially re-loads the
> kernel as the last step of the shutdown procedure, skipping over the
> lengthy BIOS/UEFI/controller firmware etc boot stages.

> [0]: http://en.wikipedia.org/wiki/Kexec

I'd like to second that recommendation - I only discovered this
recently, and on systems with long BIOS initialization, this cuts down
the time to reboot *dramatically*, like from >5 to <1 minute.
-- 
Simon.
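
A minimal sketch of the manual kexec flow on a Debian-ish box (kernel and
initrd paths assume the currently running kernel; kexec-tools can also hook
this into the normal reboot path):

kexec -l /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initrd.img-$(uname -r) --reuse-cmdline
kexec -e   # boots the loaded kernel immediately, skipping BIOS/firmware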



Re: [ceph-users] ceph-deploy issues with initial mons that aren't up

2013-12-20 Thread Don Talton (dotalton)
I guess I should add: what if I add OSDs to a mon in this scenario? Do they get
up and in, and will the crush map from the non-initial mons get merged with the
initial one when it's online?

> -Original Message-
> From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
> boun...@lists.ceph.com] On Behalf Of Don Talton (dotalton)
> Sent: Friday, December 20, 2013 9:17 AM
> To: Gregory Farnum
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph-deploy issues with initial mons that aren't up
> 
> This makes sense. So if other mons come up that are *not* defined as initial
> mons, then they will not be in service until the initial mon is up and ready? 
> At
> which point they can form their quorum and operate?
> 
> 
> > -Original Message-
> > From: Gregory Farnum [mailto:g...@inktank.com]
> > Sent: Thursday, December 19, 2013 10:19 PM
> > To: Don Talton (dotalton)
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] ceph-deploy issues with initial mons that
> > aren't up
> >
> > "mon initial members" is a race prevention mechanism whose purpose is
> > to prevent your monitors from forming separate quorums when they're
> > brought up by automated software provisioning systems (by not allowing
> > monitors to form a quorum unless everybody in the list is a member).
> > If you want to add other monitors at a later time you can do so by
> > specifying them elsewhere (including in mon hosts or whatever, so
> > other daemons will attempt to contact them.) -Greg Software Engineer
> > #42 @ http://inktank.com | http://ceph.com
> >
> >
> > On Thu, Dec 19, 2013 at 9:13 PM, Don Talton (dotalton)
> >  wrote:
> > > I just realized my email is not clear. If the first mon is up and
> > > the additional
> > initials are not, then the process fails.
> > >
> > >> -Original Message-
> > >> From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
> > >> boun...@lists.ceph.com] On Behalf Of Don Talton (dotalton)
> > >> Sent: Thursday, December 19, 2013 2:44 PM
> > >> To: ceph-users@lists.ceph.com
> > >> Subject: [ceph-users] ceph-deploy issues with initial mons that
> > >> aren't up
> > >>
> > >> Hi all,
> > >>
> > >> I've been working in some ceph-deploy automation and think I've
> > >> stumbled on an interesting behavior. I create a new cluster, and
> > >> specify 3 machines. If all 3 are not and unable to be ssh'd into
> > >> with the account I created for ceph- deploy, then the mon create
> > >> process will fail and the cluster is not properly setup with keys, etc.
> > >>
> > >> This seems odd to me, since I may want to specify initial mons that
> > >> may not yet be up (say they are waiting for cobbler to finish
> > >> loading them for example), but I want them as part of the initial 
> > >> cluster.
> > >>
> > >>
> > >> Donald Talton
> > >> Cloud Systems Development
> > >> Cisco Systems
> > >>
> > >>
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy issues with initial mons that aren't up

2013-12-20 Thread Gregory Farnum
Yeah. This is less of a problem when you're listing them all
explicitly ahead of time (we could just make them wait for any
majority), but some systems don't want to specify even the monitor
count that way, so we give the admins "mon initial members" as a big
hammer.
-Greg
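
In ceph.conf terms the two settings look roughly like this (names and
addresses are placeholders): mon_initial_members gates formation of the
first quorum, while mon_host is just the contact list handed to daemons
and clients.

[global]
mon_initial_members = mon01, mon02, mon03
mon_host = 10.0.0.1, 10.0.0.2, 10.0.0.3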

On Fri, Dec 20, 2013 at 8:17 AM, Don Talton (dotalton)
 wrote:
> This makes sense. So if other mons come up that are *not* defined as initial 
> mons, then they will not be in service until the initial mon is up and ready? 
> At which point they can form their quorum and operate?
>
>
>> -Original Message-
>> From: Gregory Farnum [mailto:g...@inktank.com]
>> Sent: Thursday, December 19, 2013 10:19 PM
>> To: Don Talton (dotalton)
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] ceph-deploy issues with initial mons that aren't up
>>
>> "mon initial members" is a race prevention mechanism whose purpose is to
>> prevent your monitors from forming separate quorums when they're
>> brought up by automated software provisioning systems (by not allowing
>> monitors to form a quorum unless everybody in the list is a member).
>> If you want to add other monitors at a later time you can do so by specifying
>> them elsewhere (including in mon hosts or whatever, so other daemons will
>> attempt to contact them.) -Greg Software Engineer #42 @
>> http://inktank.com | http://ceph.com
>>
>>
>> On Thu, Dec 19, 2013 at 9:13 PM, Don Talton (dotalton)
>>  wrote:
>> > I just realized my email is not clear. If the first mon is up and the 
>> > additional
>> initials are not, then the process fails.
>> >
>> >> -Original Message-
>> >> From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
>> >> boun...@lists.ceph.com] On Behalf Of Don Talton (dotalton)
>> >> Sent: Thursday, December 19, 2013 2:44 PM
>> >> To: ceph-users@lists.ceph.com
>> >> Subject: [ceph-users] ceph-deploy issues with initial mons that
>> >> aren't up
>> >>
>> >> Hi all,
>> >>
>> >> I've been working in some ceph-deploy automation and think I've
>> >> stumbled on an interesting behavior. I create a new cluster, and
>> >> specify 3 machines. If all 3 are not up and able to be ssh'd into with
>> >> the account I created for ceph-deploy, then the mon create process
>> >> will fail and the cluster is not properly setup with keys, etc.
>> >>
>> >> This seems odd to me, since I may want to specify initial mons that
>> >> may not yet be up (say they are waiting for cobbler to finish loading
>> >> them for example), but I want them as part of the initial cluster.
>> >>
>> >>
>> >> Donald Talton
>> >> Cloud Systems Development
>> >> Cisco Systems
>> >>
>> >>
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy issues with initial mons that aren't up

2013-12-20 Thread Don Talton (dotalton)
This makes sense. So if other mons come up that are *not* defined as initial 
mons, then they will not be in service until the initial mon is up and ready? 
At which point they can form their quorum and operate?


> -Original Message-
> From: Gregory Farnum [mailto:g...@inktank.com]
> Sent: Thursday, December 19, 2013 10:19 PM
> To: Don Talton (dotalton)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph-deploy issues with initial mons that aren't up
> 
> "mon initial members" is a race prevention mechanism whose purpose is to
> prevent your monitors from forming separate quorums when they're
> brought up by automated software provisioning systems (by not allowing
> monitors to form a quorum unless everybody in the list is a member).
> If you want to add other monitors at a later time you can do so by specifying
> them elsewhere (including in mon hosts or whatever, so other daemons will
> attempt to contact them.) -Greg Software Engineer #42 @
> http://inktank.com | http://ceph.com
> 
> 
> On Thu, Dec 19, 2013 at 9:13 PM, Don Talton (dotalton)
>  wrote:
> > I just realized my email is not clear. If the first mon is up and the 
> > additional
> initials are not, then the process fails.
> >
> >> -Original Message-
> >> From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
> >> boun...@lists.ceph.com] On Behalf Of Don Talton (dotalton)
> >> Sent: Thursday, December 19, 2013 2:44 PM
> >> To: ceph-users@lists.ceph.com
> >> Subject: [ceph-users] ceph-deploy issues with initial mons that
> >> aren't up
> >>
> >> Hi all,
> >>
> >> I've been working in some ceph-deploy automation and think I've
> >> stumbled on an interesting behavior. I create a new cluster, and
> >> specify 3 machines. If all 3 are not up and able to be ssh'd into with
> >> the account I created for ceph-deploy, then the mon create process
> >> will fail and the cluster is not properly setup with keys, etc.
> >>
> >> This seems odd to me, since I may want to specify initial mons that
> >> may not yet be up (say they are waiting for cobbler to finish loading
> >> them for example), but I want them as part of the initial cluster.
> >>
> >>
> >> Donald Talton
> >> Cloud Systems Development
> >> Cisco Systems
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephx and auth for rbd image

2013-12-20 Thread Laurent Durnez

Hi all,

I've tested client-side authentication for pools, no problem so far.
Now I'm testing granularity down to the rbd image; I've seen in the docs that
we can limit access to an object prefix, so possibly to a single rbd image:

http://ceph.com/docs/master/man/8/ceph-authtool/#osd-capabilities

I've got the following key :
client.test01
key: ...
caps: [mon] allow r
caps: [osd] allow * object_prefix rbd_data.108374b0dc51

The object_prefix is from the rbd info  command:
block_name_prefix: rbd_data.108374b0dc51

And on my client, I get the following error using this key:
rbd --id test01 --keyfile test01 map /
rbd: add failed: (34) Numerical result out of range

However, I get no error when I use the caps [osd] allow rwx. I would say
it's my object_prefix declaration that is wrong. I'm puzzled; has anyone
managed to implement this granularity?
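
A hedged guess at a broader per-image cap set: a client mapping an image
normally also has to read its rbd_id.* and rbd_header.* objects, not just
rbd_data.*, so restricting to the data prefix alone is probably too narrow
(the prefixes below come from the rbd info output above; the image name is
a placeholder):

client.test01
key: ...
caps: [mon] allow r
caps: [osd] allow r object_prefix rbd_id.test01-image, allow rwx object_prefix rbd_header.108374b0dc51, allow rwx object_prefix rbd_data.108374b0dc51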


Regards,
Laurent Durnez


Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

2013-12-20 Thread Christian Balzer

Hello Dan,

On Fri, 20 Dec 2013 14:01:04 +0100 Dan van der Ster wrote:

> On Fri, Dec 20, 2013 at 9:44 AM, Christian Balzer  wrote:
> >
> > Hello,
> >
> > On Fri, 20 Dec 2013 09:20:48 +0100 Dan van der Ster wrote:
> >
> >> Hi,
> >> Our fio tests against qemu-kvm on RBD look quite promising, details
> >> here:
> >>
> >> https://docs.google.com/spreadsheet/ccc?key=0AoB4ekP8AM3RdGlDaHhoSV81MDhUS25EUVZxdmN6WHc&usp=drive_web#gid=0
> >>
> > That data is very interesting and welcome, however it would be a lot
> > more relevant if it included information about your setup (though it is
> > relatively easy to create a Ceph cluster that can saturate GbE ^.^) and
> > your configuration.
> >
> > For example I assume you're using the native QEMU RBD interface.
> > How did you configure caching, just turned it on and left it at the
> > default values?
> >
> 
> It's all RedHat 6.5, qemu-kvm-rhev-0.12.1.2-2.415.el6_5.3 on the HVs,
> ceph 0.67.4 on the servers. Caching is enabled with the usual
>   rbd cache = true
>   rbd cache writethrough until flush = true
> (otherwise defaults)
That's a good data point, I'll probably play with those defaults
eventually. One thinks that the same amount of cache as a consumer HD can
be improved upon, given memory prices and all. ^o^

> The hardware is 47 OSD servers with 24 OSDs each, single 10GbE NIC per
> server, no SSDs, write journal as a file on the OSD partition (which
> is a baaad idea for small write latency, so we are slowly reinstalling
> everything to put the journal on a separate partition)
> 
Ah yes, there is the impressive bit: 47 times 24 OSDs should easily give you
that amount of IOPS, even with the journal not optimized.

Regards,

Christian

> Cheers, Dan
> 
> >> tl;dr: rbd with caching enabled is (1) at least 2x faster than the
> >> local instance storage, and (2) reaches the hypervisor's GbE network
> >> limit in ~all cases except very small random writes.
> >>
> >> BTW, currently we have ~10 VMs running those fio tests in a loop, and
> >> we're seeing ~25,000op/s sustained in the ceph logs. Not bad IMHO.
> > Given the feedback I got from my "Sanity Check" mail, I'm even more
> > interested in the actual setup you're using now.
> > Given your workplace, I expect to be impressed. ^o^
> >
> >> Cheers, Dan
> >> CERN IT/DSS
> >>
> > [snip]
> >
> > Regards,
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

2013-12-20 Thread Dan van der Ster
On Fri, Dec 20, 2013 at 9:44 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Fri, 20 Dec 2013 09:20:48 +0100 Dan van der Ster wrote:
>
>> Hi,
>> Our fio tests against qemu-kvm on RBD look quite promising, details here:
>>
>> https://docs.google.com/spreadsheet/ccc?key=0AoB4ekP8AM3RdGlDaHhoSV81MDhUS25EUVZxdmN6WHc&usp=drive_web#gid=0
>>
> That data is very interesting and welcome, however it would be a lot more
> relevant if it included information about your setup (though it is
> relatively easy to create a Ceph cluster that can saturate GbE ^.^) and
> your configuration.
>
> For example I assume you're using the native QEMU RBD interface.
> How did you configure caching, just turned it on and left it at the
> default values?
>

It's all RedHat 6.5, qemu-kvm-rhev-0.12.1.2-2.415.el6_5.3 on the HVs,
ceph 0.67.4 on the servers. Caching is enabled with the usual
  rbd cache = true
  rbd cache writethrough until flush = true
(otherwise defaults)
The hardware is 47 OSD servers with 24 OSDs each, single 10GbE NIC per
server, no SSDs, write journal as a file on the OSD partition (which
is a baaad idea for small write latency, so we are slowly reinstalling
everything to put the journal on a separate partition)

Cheers, Dan

>> tl;dr: rbd with caching enabled is (1) at least 2x faster than the
>> local instance storage, and (2) reaches the hypervisor's GbE network
>> limit in ~all cases except very small random writes.
>>
>> BTW, currently we have ~10 VMs running those fio tests in a loop, and
>> we're seeing ~25,000op/s sustained in the ceph logs. Not bad IMHO.
> Given the feedback I got from my "Sanity Check" mail, I'm even more
> interested in the actual setup you're using now.
> Given your workplace, I expect to be impressed. ^o^
>
>> Cheers, Dan
>> CERN IT/DSS
>>
> [snip]
>
> Regards,
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

2013-12-20 Thread Peder Jansson


- Original Message -
From: "Wido den Hollander" 
To: ceph-users@lists.ceph.com
Sent: Friday, December 20, 2013 8:04:09 AM
Subject: Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

Hi,


> Hi,
>
> I'm testing CEPH with the RBD/QEMU driver through libvirt to store my VM
> images on. Installation and configuration all went very well with the
> ceph-deploy tool. I have set up cephx authentication in libvirt and that
> works like a charm too.
>
> However, when coming to performance I have big issues getting expected
> results inside the hosted VM. I see high latency and bad write
> performance, down to 20MB/s in VM.
>

Have you tried running "rados bench" to see the throughput that is getting?

Yes i have tried it:

rados bench -p vm_system 50 write
...
 Total time run: 50.578626
Total writes made:  1363
Write size: 4194304
Bandwidth (MB/sec): 107.793 

Stddev Bandwidth:   19.8729
Max bandwidth (MB/sec): 136
Min bandwidth (MB/sec): 0
Average Latency:0.59249
Stddev Latency: 0.341871
Max latency:2.08384
Min latency:0.14101


> My setup:
> 3xDELL R410,
> 2xXeon X5650,
> 48 GB RAM,
> 2xSATA RAID1 for System,
> 2x250GB Samsung Evo SSD for OSD's (with XFS on each one)

So you are running the journal on the same system? With XFS that means 
that you will do three writes for one write coming in to the OSD.

We are running the journal on all XFS disks, but our tests show there is only
a problem when run in QEMU VMs. I have tried turning off the ext4 journal in
the QEMU image, with no effect.

>
> ceph version 0.72.1 (4d923861868f6a15dcb33fef7f50f674997322de)
> Linux server1 3.11.0-14-generic #21-Ubuntu SMP Tue Nov 12 17:04:55 UTC
> 2013 x86_64 x86_64 x86_64 GNU/Linux
> Ubuntu 13.10
>

Which Qemu version do you use? I suggest to use at least Qemu 1.5 and 
enable the RBD write cache.
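
For reference, those cache settings go on the client (hypervisor) side of
ceph.conf, roughly as below; depending on the QEMU version the disk's cache
mode (e.g. cache='writeback' in libvirt) also has to permit caching:

[client]
rbd cache = true
rbd cache writethrough until flush = true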

We are running:
QEMU emulator version 1.5.0 (Debian 1.5.0+dfsg-3ubuntu5.1)

> In total:
> 6 OSD
> 1 MON
> 3 MDS

For RBD the MDS is not required.

>
> So, the question is: is there anyone out there who has experience
> running the RBD/QEMU driver in production and gets good
> performance inside the VM?
>
> I suspect the main performance issue to be caused by high latency, since
> it all feels quite high when running those tests below with bonnie++.
> (bonnie++ -s 4096 -r 2048 -u root -d X -m BenchClient)
>
> Inside VPS running on native image in RBD pool:
>
> -- Without any Cache
>
> Version  1.96   --Sequential Output-- --Sequential Input-
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> BenchClient  4G   733  96 64919   8 20271   3  3013  97 30770   3
> 2887  82
> Latency 17425us1093ms 894ms   16789us   19390us
> 89203us
> Version  1.96   --Sequential Create-- Random
> Create
> BenchClient -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> /sec %CP
>16 27951  52 + +++ + +++ 24921  45 + +++
> 22535  29
> Latency  1986us 826us1065us 216us  41us
> 611us
>
> --With Writeback Cache(QEMU)
> Version  1.96   --Sequential Output-- --Sequential Input-
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> BenchClient  4G   872  96 67327   8 22424   3  2516  94 32013   3
> 2800  82
> Latency 16196us 657ms 843ms   37889us   19207us
> 85407us
> Version  1.96   --Sequential Create-- Random
> Create
> BenchClient -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> /sec %CP
>16 27225  51 + +++ + +++ 27325  47 + +++
> 21645  28
> Latency  1986us 852us 874us 252us  34us
> 595us
>
> --With Writethrough Cache(QEMU)
> Version  1.96   --Sequential Output-- --Sequential Input-
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> BenchClient  4G   833  95 27469   3  6520   1  2743  93 33003   3
> 1912  61
> Latency 17330us2388ms1165ms   48442us   19577us
> 91228us
> Version  1.96   --Sequential Create-- Random
> Create
> BenchClient -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> /sec %CP
>16 16378  31 + +++ 18864  24 18024  33 + +++
> 14734  19
> Latency 

Re: [ceph-users] Need Java bindings for librados....

2013-12-20 Thread Wido den Hollander

On 12/20/2013 12:15 PM, upendrayadav.u wrote:

Hi,

I need Java bindings for librados.
I'm also new to using Java bindings. Could you please help me find the
best way to use librados from a Java program?

Also, what problems will we face if we use the Java bindings?
Are there any alternatives?



Java bindings for librados are available at: 
https://github.com/ceph/rados-java


A Maven repository is available at: http://ceph.com/maven/

Examples can be found in the Unit Test code for the Java bindings.


Thanks & Regards,
Upendra Yadav






--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


[ceph-users] Need Java bindings for librados....

2013-12-20 Thread upendrayadav.u
Hi,

I need Java bindings for librados.
I'm also new to using Java bindings. Could you please help me find the best
way to use librados from a Java program?

Also, what problems will we face if we use the Java bindings?
Are there any alternatives?




Thanks & Regards,
Upendra Yadav





[ceph-users] Ceph at the University

2013-12-20 Thread Loic Dachary
Hi Ceph,

Just wanted to share Yann Dupont's talk about his experience in using Ceph at 
the University. He goes beyond telling his own story and it can probably be a 
source of inspiration for various use cases in the academic world.

   
http://video.renater.fr/jres/2013/index.php?play=jres2013_article_48_720p.mp4 

It was recorded this month during JRES 2013 https://2013.jres.org/

Yann also wrote a paper but I'm not sure if it's publicly available.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

2013-12-20 Thread Christian Balzer

Hello,

On Fri, 20 Dec 2013 09:20:48 +0100 Dan van der Ster wrote:

> Hi,
> Our fio tests against qemu-kvm on RBD look quite promising, details here:
> 
> https://docs.google.com/spreadsheet/ccc?key=0AoB4ekP8AM3RdGlDaHhoSV81MDhUS25EUVZxdmN6WHc&usp=drive_web#gid=0
> 
That data is very interesting and welcome, however it would be a lot more
relevant if it included information about your setup (though it is
relatively easy to create a Ceph cluster that can saturate GbE ^.^) and
your configuration. 

For example I assume you're using the native QEMU RBD interface.
How did you configure caching, just turned it on and left it at the
default values?

> tl;dr: rbd with caching enabled is (1) at least 2x faster than the
> local instance storage, and (2) reaches the hypervisor's GbE network
> limit in ~all cases except very small random writes.
> 
> BTW, currently we have ~10 VMs running those fio tests in a loop, and
> we're seeing ~25,000op/s sustained in the ceph logs. Not bad IMHO.
Given the feedback I got from my "Sanity Check" mail, I'm even more
interested in the actual setup you're using now.
Given your workplace, I expect to be impressed. ^o^

> Cheers, Dan
> CERN IT/DSS
> 
[snip]

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Storing VM Images on CEPH with RBD-QEMU driver

2013-12-20 Thread Dan van der Ster
Hi,
Our fio tests against qemu-kvm on RBD look quite promising, details here:

https://docs.google.com/spreadsheet/ccc?key=0AoB4ekP8AM3RdGlDaHhoSV81MDhUS25EUVZxdmN6WHc&usp=drive_web#gid=0

tl;dr: rbd with caching enabled is (1) at least 2x faster than the
local instance storage, and (2) reaches the hypervisor's GbE network
limit in ~all cases except very small random writes.

BTW, currently we have ~10 VMs running those fio tests in a loop, and
we're seeing ~25,000op/s sustained in the ceph logs. Not bad IMHO.
Cheers, Dan
CERN IT/DSS

On Thu, Dec 19, 2013 at 4:00 PM, Peder Jansson  wrote:
> Hi,
>
> I'm testing CEPH with the RBD/QEMU driver through libvirt to store my VM
> images on. Installation and configuration all went very well with the
> ceph-deploy tool. I have set up cephx authentication in libvirt and that
> works like a charm too.
>
> However, when coming to performance I have big issues getting expected
> results inside the hosted VM. I see high latency and bad write
> performance, down to 20MB/s in VM.
>
> My setup:
> 3xDELL R410,
> 2xXeon X5650,
> 48 GB RAM,
> 2xSATA RAID1 for System,
> 2x250GB Samsung Evo SSD for OSD's (with XFS on each one)
>
> ceph version 0.72.1 (4d923861868f6a15dcb33fef7f50f674997322de)
> Linux server1 3.11.0-14-generic #21-Ubuntu SMP Tue Nov 12 17:04:55 UTC
> 2013 x86_64 x86_64 x86_64 GNU/Linux
> Ubuntu 13.10
>
> In total:
> 6 OSD
> 1 MON
> 3 MDS
>
> So, the question is: is there anyone out there who has experience
> running the RBD/QEMU driver in production and gets good
> performance inside the VM?
>
> I suspect the main performance issue to be caused by high latency, since
> it all feels quite high when running those tests below with bonnie++.
> (bonnie++ -s 4096 -r 2048 -u root -d X -m BenchClient)
>
> Inside VPS running on native image in RBD pool:
>
> -- Without any Cache
>
> Version  1.96   --Sequential Output-- --Sequential Input-
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> BenchClient  4G   733  96 64919   8 20271   3  3013  97 30770   3
> 2887  82
> Latency 17425us1093ms 894ms   16789us   19390us
> 89203us
> Version  1.96   --Sequential Create-- Random
> Create
> BenchClient -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
>files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> /sec %CP
>   16 27951  52 + +++ + +++ 24921  45 + +++
> 22535  29
> Latency  1986us 826us1065us 216us  41us
> 611us
>
> --With Writeback Cache(QEMU)
> Version  1.96   --Sequential Output-- --Sequential Input-
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> BenchClient  4G   872  96 67327   8 22424   3  2516  94 32013   3
> 2800  82
> Latency 16196us 657ms 843ms   37889us   19207us
> 85407us
> Version  1.96   --Sequential Create-- Random
> Create
> BenchClient -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
>files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> /sec %CP
>   16 27225  51 + +++ + +++ 27325  47 + +++
> 21645  28
> Latency  1986us 852us 874us 252us  34us
> 595us
>
> --With Writethrough Cache(QEMU)
> Version  1.96   --Sequential Output-- --Sequential Input-
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> BenchClient  4G   833  95 27469   3  6520   1  2743  93 33003   3
> 1912  61
> Latency 17330us2388ms1165ms   48442us   19577us
> 91228us
> Version  1.96   --Sequential Create-- Random
> Create
> BenchClient -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
>files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> /sec %CP
>   16 16378  31 + +++ 18864  24 18024  33 + +++
> 14734  19
> Latency  2028us 761us1188us 271us  36us
> 567us
>
> ---With Writeback Cache (CEPH)
> Version  1.96   --Sequential Output-- --Sequential Input-
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> BenchClient  4G   785  95 67573   8 19906   3  2777  96 32681   3
> 2764  80
> Latency 17410us 729ms 737ms   15103us   22802us
> 88876us
> Version  1.96   --Sequential Create-- Random
> Create
> BenchClient -Create-- --Read--- -Delete-- -Cr