Re: [ceph-users] Ceph migration to AWS

2015-05-04 Thread Kyle Bader
> To those interested in a tricky problem,
>
> We have a Ceph cluster running at one of our data centers. One of our
> client's requirements is to have them hosted at AWS. My question is: How do
> we effectively migrate our data on our internal Ceph cluster to an AWS Ceph
> cluster?
>
> Ideas currently on the table:
>
> 1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum
> at AWS then sever the connection between AWS and our data center.

I would highly discourage this.

> 2. Build a Ceph cluster at AWS and send snapshots from our data center to
> our AWS cluster allowing us to "migrate" to AWS.

This sounds far more sensible. I'd look at the I2 (iops) or D2
(density) class instances, depending on use case.
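
A rough sketch of the snapshot-shipping approach, assuming RBD images, a
pool named rbd, an image named vm1, and SSH access to a client host of the
AWS cluster (all names are placeholders):

# initial full copy at a consistent point in time
rbd snap create rbd/vm1@base
rbd export rbd/vm1@base - | ssh aws-client rbd import - rbd/vm1
ssh aws-client rbd snap create rbd/vm1@base

# later, ship only the changes since the base snapshot
rbd snap create rbd/vm1@cutover
rbd export-diff --from-snap base rbd/vm1@cutover - | ssh aws-client rbd import-diff - rbd/vm1

Repeat the export-diff/import-diff step until the delta is small enough for
your cutover window.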

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Kyle Bader
> do people consider a UPS + Shutdown procedures a suitable substitute?

I certainly wouldn't; I've seen utility power fail and the transfer
switch fail to transition to the UPS strings. Had this happened to me with
nobarrier it would have been a very sad day.

-- 

Kyle Bader
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] private network - VLAN vs separate switch

2014-11-26 Thread Kyle Bader
> Thanks for all the help. Can the moving over from VLAN to separate
> switches be done on a live cluster? Or does there need to be a down
> time?

You can do it on a live cluster. The more cavalier approach would be
to quickly switch the link over one server at a time, which might
cause a short io stall. The more careful approach would be to 'ceph
osd set noup', mark all the osds on a node down, move the link, 'ceph
osd unset noup', and then wait for their peers to mark them back up
before proceeding to the next host.
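
A rough sketch of the careful approach for one host (the OSD ids are
placeholders):

ceph osd set noup
ceph osd down 10 11 12        # the OSDs on the host being moved
# move the cluster-network link to the new switch
ceph osd unset noup
ceph osd tree                 # wait until those OSDs report up again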

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] private network - VLAN vs separate switch

2014-11-25 Thread Kyle Bader
> For a large network (say 100 servers and 2500 disks), are there any
> strong advantages to using separate switch and physical network
> instead of VLAN?

Physical isolation will ensure that congestion on one network does not
affect the other. On the flip side, asymmetric network failures tend to be
more difficult to troubleshoot, e.g. a backend failure with a functional
front end. That said, in a pinch you can switch to using the front end
network for both until you can repair the backend.

> Also, how difficult it would be to switch from a VLAN to using
> separate switches later?

Should be relatively straightforward. Simply configure the
VLAN/subnets on the new physical switches and move the links over one by
one. Once all the links are moved over you can remove the VLAN and
subnets that are now on the new kit from the original hardware.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-08-06 Thread Kyle Bader
> Can you paste me the whole output of the install? I am curious why/how you 
> are getting el7 and el6 packages.

priority=1 is required in the /etc/yum.repos.d/ceph.repo entries so the
Ceph repository takes precedence over other repositories.
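
Something along these lines, with the yum priorities plugin
(yum-plugin-priorities) installed; the baseurl/gpgkey shown are just the
values from the docs of that era, adjust for your release:

[ceph]
name=Ceph packages for $basearch
baseurl=http://ceph.com/rpm-firefly/el7/$basearch
enabled=1
gpgcheck=1
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
priority=1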

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is OSDs based on VFS?

2014-07-21 Thread Kyle Bader
> I wonder that OSDs use system calls of Virtual File System (i.e. open, read,
> write, etc) when they access disks.
>
> I mean ... Could I monitor I/O command requested by OSD to disks if I
> monitor VFS?

Ceph OSDs run on top of a traditional filesystem, xfs by default, so long
as it supports xattrs. As such you can use kernel instrumentation to view
what is going on "under" the Ceph OSDs.
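
For example (the pid and device name are placeholders), you could watch the
syscalls an OSD issues against its filestore, or the block I/O beneath it:

# syscalls issued by a ceph-osd process
strace -f -tt -e trace=open,read,write,fsync,fdatasync -p <ceph-osd-pid>

# block-level view of the disk backing that OSD
blktrace -d /dev/sdb -o - | blkparse -i -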

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)

2014-07-02 Thread Kyle Bader
> I was wondering, having a cache pool in front of an RBD pool is all fine
> and dandy, but imagine you want to pull backups of all your VMs (or one
> of them, or multiple...). Going to the cache for all those reads isn't
> only pointless, it'll also potentially fill up the cache and possibly
> evict actually frequently used data. Which got me thinking... wouldn't
> it be nifty if there was a special way of doing specific backup reads
> where you'd bypass the cache, ensuring the dirty cache contents get
> written to cold pool first? Or at least doing special reads where a
> cache-miss won't actually cache the requested data?
>
> AFAIK the backup routine for an RBD-backed KVM usually involves creating
> a snapshot of the RBD and putting that into a backup storage/tape, all
> done via librbd/API.
>
> Maybe something like that even already exists?

When used in the context of OpenStack Cinder, it does:

http://ceph.com/docs/next/rbd/rbd-openstack/#configuring-cinder-backup

You can have the backup pool use the default crush rules, assuming the
default isn't your hot pool. Another option might be to put backups on
an erasure coded pool; I'm not sure if that has been tested, but in
principle it should work since the objects composing a snapshot should be
immutable.
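
The relevant bits look roughly like this; the pool name and pg count are
placeholders, and the cinder.conf options are the ones from the
rbd-openstack docs of that era:

# create a pool for backups using the default crush rule
ceph osd pool create backups 128

# cinder.conf on the cinder-backup host
backup_driver = cinder.backup.drivers.ceph
backup_ceph_conf = /etc/ceph/ceph.conf
backup_ceph_user = cinder-backup
backup_ceph_pool = backups
backup_ceph_chunk_size = 134217728
restore_discard_excess_bytes = true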

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Kyle Bader
> TL;DR: Power outages are more common than your colo facility will admit.

Seconded. I've seen power failures in at least 4 different facilities,
all of which had the usual gamut of batteries/generators/etc., and in some
of those facilities I've seen problems multiple times in a single year.
Even a datacenter with five-nines power availability is going to see more
than 5 minutes of downtime per year, and that would qualify for the highest
rating from the Uptime Institute (Tier IV)! I've lost power to Ceph
clusters on several occasions; in all cases the journals were on
spinning media.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-13 Thread Kyle Bader
> Anyway replacing set of monitors means downtime for every client, so
> I`m in doubt if 'no outage' word is still applicable there.

Taking the entire quorum down for migration would be bad. It's better
to add one monitor in the new location, remove one at the old location,
and repeat until the whole quorum has moved.
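
Roughly, for each monitor you want to move (ids and the address are
placeholders):

# on the new host, bootstrap an additional monitor
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i newmon --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
ceph-mon -i newmon --public-addr 10.1.1.21:6789
ceph quorum_status            # wait for the new monitor to join the quorum

# then retire one monitor at the old site
ceph mon remove oldmon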

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-09 Thread Kyle Bader
> Let's assume a test cluster up and running with real data on it.
> Which is the best way to migrate everything to a production (and
> larger) cluster?
>
> I'm thinking to add production MONs to the test cluster, after that,
> add productions OSDs to the test cluster, waiting for a full rebalance
> and then starting to remove test OSDs and test mons.
>
> This should migrate everything with no outage.

It's possible and I've done it, though that was around the argonaut/bobtail
timeframe on a pre-production cluster. If your cluster has a lot of
data then it may take a long time or be disruptive, so make sure you've
tested that your recovery tunables are suitable for your hardware
configuration.
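
For example, recovery/backfill can be throttled on a live cluster with
injectargs (the values here are just conservative placeholders; the same
options can be set under [osd] in ceph.conf):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'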

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-17 Thread Kyle Bader
>> >> I think the timing should work that we'll be deploying with Firefly and
>> >> so
>> >> have Ceph cache pool tiering as an option, but I'm also evaluating
>> >> Bcache
>> >> versus Tier to act as node-local block cache device. Does anybody have
>> >> real
>> >> or anecdotal evidence about which approach has better performance?
>> > New idea that is dependent on failure behaviour of the cache tier...
>>
>> The problem with this type of configuration is it ties a VM to a
>> specific hypervisor, in theory it should be faster because you don't
>> have network latency from round trips to the cache tier, resulting in
>> higher iops. Large sequential workloads may achieve higher throughput
>> by parallelizing across many OSDs in a cache tier, whereas local flash
>> would be limited to single device throughput.
>
> Ah, I was ambiguous. When I said node-local I meant OSD-local. So I'm really
> looking at:
> 2-copy write-back object ssd cache-pool
> versus
> OSD write-back ssd block-cache
> versus
> 1-copy write-around object cache-pool & ssd journal

Ceph cache pools allow you to scale the size of the cache pool
independently of the underlying storage and avoid constraints on
disk:ssd ratios (as with flashcache, bcache, etc). Local block caches
should have lower latency than a cache tier on a cache miss, due to
the extra hop(s) across the network. I would lean towards using Ceph's
cache tiers for the scaling independence.
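
For reference, wiring up a cache tier on Firefly looks something like this
(pool names are placeholders):

ceph osd tier add cold-pool hot-pool
ceph osd tier cache-mode hot-pool writeback
ceph osd tier set-overlay cold-pool hot-pool
ceph osd pool set hot-pool hit_set_type bloom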

> This is undoubtedly true for a write-back cache-tier. But in the scenario
> I'm suggesting, a write-around cache, that needn't be bad news - if a
> cache-tier OSD is lost the cache simply just got smaller and some cached
> objects were unceremoniously flushed. The next read on those objects should
> just miss and bring them into the now smaller cache.
>
> The thing I'm trying to avoid with the above is double read-caching of
> objects (so as to get more aggregate read cache). I assume the standard
> wisdom with write-back cache-tiering is that the backing data pool shouldn't
> bother with ssd journals?

Currently, all cache tiers need to be durable - regardless of cache
mode. As such, cache tiers should be erasure coded or N+1 replicated
(I'd recommend N+2 or 3x replica). Ceph could potentially do what you
described in the future, it just doesn't yet.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-16 Thread Kyle Bader
>> Obviously the ssds could be used as journal devices, but I'm not really
>> convinced whether this is worthwhile when all nodes have 1GB of hardware
>> writeback cache (writes to journal and data areas on the same spindle have
>> time to coalesce in the cache and minimise seek time hurt). Any advice on
>> this?

All writes need to be written to the journal before being written to
the data volume, so it's going to impact your overall throughput and
cause seeking; a hardware cache will only help with the latter (unless
you use btrfs).

>> I think the timing should work that we'll be deploying with Firefly and so
>> have Ceph cache pool tiering as an option, but I'm also evaluating Bcache
>> versus Tier to act as node-local block cache device. Does anybody have real
>> or anecdotal evidence about which approach has better performance?
> New idea that is dependent on failure behaviour of the cache tier...

The problem with this type of configuration is it ties a VM to a
specific hypervisor, in theory it should be faster because you don't
have network latency from round trips to the cache tier, resulting in
higher iops. Large sequential workloads may achieve higher throughput
by parallelizing across many OSDs in a cache tier, whereas local flash
would be limited to single device throughput.

> Carve the ssds 4-ways: each with 3 partitions for journals servicing the
> backing data pool and a fourth larger partition serving a write-around cache
> tier with only 1 object copy. Thus both reads and writes hit ssd but the ssd
> capacity is not halved by replication for availability.
>
> ...The crux is how the current implementation behaves in the face of cache
> tier OSD failures?

Cache tiers are durable by way of replication or erasure coding; OSDs
will remap degraded placement groups and backfill as appropriate. With
single-replica cache pools the loss of OSDs becomes a real concern: in the
case of RBD this means losing arbitrary chunk(s) of your block devices,
which is bad news. If you want host independence, durability and speed your
best bet is a replicated cache pool (2-3x).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on harvesting freed space

2014-04-15 Thread Kyle Bader
> I'm assuming Ceph/RBD doesn't have any direct awareness of this since
> the file system doesn't traditionally have a "give back blocks"
> operation to the block device.  Is there anything special RBD does in
> this case that communicates the release of the Ceph storage back to the
> pool?

VMs running a 3.2+ kernel (iirc) can "give back blocks" by issuing TRIM.

http://wiki.qemu.org/Features/QED/Trim
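
For example (disk element and mount point are placeholders), assuming the
guest disk is attached via a bus that supports discard (virtio-scsi or IDE
at the time, not virtio-blk):

# libvirt driver element for the RBD-backed disk
<driver name='qemu' type='raw' cache='writeback' discard='unmap'/>

# inside the guest, release unused blocks
fstrim -v /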

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: 答复: why object can't be recovered when delete one replica

2014-03-24 Thread Kyle Bader
> I have run the repair command, and the warning info disappears in the
output of "ceph health detail", but the replicas isn't recovered in the
"current" directory.
> In all, the ceph cluster status can recover (the pg's status recover from
inconsistent to active and clean), but not the replica.

If you run a pg query does it still show the osd you removed the object
from in the acting set? It could be that the pg has a different member now
and the restored copy is on another osd.
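
For example (the object name is a placeholder):

ceph osd map .rgw.buckets <object-name>   # shows the pg id and up/acting set
ceph pg <pgid> query                      # inspect the acting set and peers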
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error initializing cluster client: Error

2014-03-22 Thread Kyle Bader
> I have two nodes with 8 OSDs on each. First node running 2 monitors on 
> different virtual machines (mon.1 and mon.2), second node runing mon.3
> After several reboots (I have tested power failure scenarios) "ceph -w" on 
> node 2 always fails with message:
>
> root@bes-mon3:~# ceph --verbose -w
> Error initializing cluster client: Error

The cluster is simply protecting itself from a split brain situation.
Say you have:

mon.1  mon.2  mon.3

If mon.1 fails, no big deal, you still have 2/3 so no problem.

Now instead, say mon.1 is separated from mon.2 and mon.3 because of a
network partition (trunk failure, whatever). If one monitor of the
three could elect itself as leader then you might have divergence
between your monitors. Self-elected mon.1 thinks it's the leader and
mon.{2,3} have elected a leader amongst themselves. The harsh reality
is you really need to have monitors on 3 distinct physical hosts to
protect against the failure of a physical host.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why object can't be recovered when delete one replica

2014-03-22 Thread Kyle Bader
> I upload a file through swift API, then delete it in the “current” directory
> in the secondary OSD manually, why the object can’t be recovered?
>
> If I delete it in the primary OSD, the object is deleted directly in the
> pool .rgw.bucket and it can’t be recovered from the secondary OSD.
>
> Do anyone know this behavior?

This is because the placement group containing that object likely
needs to scrub (just a light scrub should do). The scrub will compare
the two replicas, notice the replica is missing from the secondary and
trigger recovery/backfill. Can you try scrubbing the placement group
containing the removed object and let us know if it triggers recovery?
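
Something like this (the object name is a placeholder):

ceph osd map .rgw.buckets <object-name>   # find the pg holding the object
ceph pg scrub <pgid>
ceph -w                                   # watch for scrub/recovery activity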

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mounting with dmcrypt still fails

2014-03-22 Thread Kyle Bader
> ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir 
> /etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb
> ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping 
> (dm-crypt?): dm-0

It sounds like device-mapper still thinks it's using the volume,
you might be able to track it down with this:

for i in `ls -1 /sys/block/ | grep sd`; do echo $i: `ls /sys/block/$i/${i}1/holders/`; done

Then it's a matter of making sure there are no open file handles on
the encrypted volume and unmounting it. You will still need to
completely clear out the partition table on that disk, which can be
tricky with GPT because it's not as simple as dd'ing over the start of
the volume. This is what the zapdisk parameter is for in
ceph-disk-prepare; I don't know enough about ceph-deploy to know if
you can somehow pass it.

After you know the device/dm mapping you can use udevadm to find out
where it should map to (uuids replaced with xxx's):

udevadm test /block/sdc/sdc1

run: '/sbin/cryptsetup --key-file /etc/ceph/dmcrypt-keys/x
--key-size 256 create  /dev/sdc1'
run: '/bin/bash -c 'while [ ! -e /dev/mapper/x ];do sleep 1; done''
run: '/usr/sbin/ceph-disk-activate /dev/mapper/x'

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd rebalance question

2014-03-22 Thread Kyle Bader
>  I need to add a extend server, which reside several osds, to a
> running ceph cluster. During add osds, ceph would not automatically modify
> the ceph.conf. So I manually modify the ceph.conf
>
> And restart the whole ceph cluster with command: ’service ceph –a restart’.
> I just confused that if I restart the ceph cluster, ceph would rebalance the
> whole data(redistribution whole data) among osds? Or just move some
>
> Data from existed osds to new osds? Anybody knows?

It depends on how you added the OSDs, if the initial crush weight is
set to 0 then no data will be moved to the OSD when it joins the
cluster. Only once the weight has been increased with the rest of the
OSD population will data start to move to the new OSD(s). If you add
new OSD(s) with an initial weight > 0 then they will start accepting
data from peers as soon as they are up/in.
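
For example, to add an OSD without triggering data movement and then ramp it
up gradually (ids, weights and the hostname are placeholders):

ceph osd crush add osd.12 0 host=node4     # joins with weight 0, no data moves
ceph osd crush reweight osd.12 0.5         # increase in steps
ceph osd crush reweight osd.12 1.0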

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + FlashCache vs. Cache Pool for RBD...

2014-03-22 Thread Kyle Bader
> One downside of the above arrangement: I read that support for mapping
> newer-format RBDs is only present in fairly recent kernels.  I'm running
> Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel.  There
> is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking
> at a new deployment it might be better to wait until 14.04: then you'll
> get kernel 3.13.
>
> Anyone else have any ideas on the above?

I don't think there are any hairy udev issues or similar that will
make using a newer kernel on precise problematic. The only caveat of
this kind of setup I can think of is that if you lose a hypervisor the
cache will go with it and you likely won't be able to migrate the guest
to another host. The alternative is to use flashcache on top of the OSD
partition, but then you introduce network hops and it's closer to what
the tiering feature will offer, except the flashcache-on-OSD method is
more particular about the disk:ssd ratio, whereas in a tier the flash
could be on completely separate hosts (possibly dedicated flash
machines).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the difference between using /dev/sdb and /dev/sdb1 as osd?

2014-03-22 Thread Kyle Bader
> If I want to use a disk dedicated for osd, can I just use something like
> /dev/sdb instead of /dev/sdb1? Is there any negative impact on performance?

You can pass /dev/sdb to ceph-disk-prepare and it will create two
partitions, one for the journal (raw partition) and one for the data
volume (defaults to formatting xfs). This is known as a single device
OSD, in contrast with a multi-device OSD where the journal is on a
completely different device (like a partition on a shared journaling
SSD).
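
For example (device names are placeholders):

# single device OSD: journal partition + xfs data partition on /dev/sdb
ceph-disk-prepare --fs-type xfs /dev/sdb

# multi-device OSD: data on /dev/sdb, journal on a partition of a shared SSD
ceph-disk-prepare --fs-type xfs /dev/sdb /dev/sdc1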

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] if partition name changes, will ceph get corrupted?

2014-03-12 Thread Kyle Bader
> We use /dev/disk/by-path for this reason, but we confirmed that is stable
> for our HBAs. Maybe /dev/disk/by-something is consistent with your
> controller.

The upstart/udev scripts will handle mounting and osd id detection, at
least on Ubuntu.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
> This is in my lab. Plain passthrough setup with automap enabled on the F5. s3 
> & curl work fine as far as queries go. But file transfer rate degrades badly 
> once I start file up/download.

Maybe the difference can be attributed to LAN client traffic with
jumbo frames vs F5 using a smaller WAN MTU?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
> You're right.  Sorry didn't specify I was trying this for Radosgw.  Even for 
> this I'm seeing performance degrade once my clients start to hit the LB VIP.

Could you tell us more about your load balancer and configuration?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
> Anybody has a good practice on how to set up a ceph cluster behind a pair of 
> load balancer?

The only place you would want to put a load balancer in the context of
a Ceph cluster would be north of RGW nodes. You can do L3 transparent
load balancing or balance with a L7 proxy, ie Linux Virtual Server or
HAProxy/Nginx. The other components of Ceph are horizontally scalable
and because of the way Ceph's native protocols work you don't need
load balancers doing L2/L3/L7 tricks to achieve HA.
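
A minimal HAProxy sketch for balancing across RGW nodes (addresses are
placeholders):

frontend rgw_frontend
    bind *:80
    default_backend rgw_backend

backend rgw_backend
    balance roundrobin
    option httpchk GET /
    server rgw1 10.0.1.11:80 check
    server rgw2 10.0.1.12:80 check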

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-rbd

2014-03-11 Thread Kyle Bader
> I tried rbd-fuse and it's throughput using fio is approx. 1/4 that of the 
> kernel client.
>
> Can you please let me know how to setup RBD backend for FIO? I'm assuming 
> this RBD backend is also based on librbd?

You will probably have to build fio from source since the rbd engine is new:

https://github.com/axboe/fio

Assuming you already have a cluster and a client configured this
should do the trick:

https://github.com/axboe/fio/blob/master/examples/rbd.fio
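
The job file looks roughly like this (pool and image names are placeholders,
and the image must already exist, e.g. created with rbd create):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32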

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Utilizing DAS on XEN or XCP hosts for Openstack Cinder

2014-03-11 Thread Kyle Bader
> 1.   Is it possible to install Ceph and Ceph monitors on the the XCP
> (XEN) Dom0 or would we need to install it on the DomU containing the
> Openstack components?

I'm not a Xen guru but in the case of KVM I would run the OSDs on the
hypervisor to avoid virtualization overhead.

> 2.   Is Ceph server aware, or Rack aware so that replicas are not stored
> on the same server?

Yes, placement is defined with your crush map and placement rules.

> 3.   Are 4Tb OSD’s too large? We are attempting to restrict the qty of
> OSD’s per server to minimise system overhead

Nope!

> Any other feedback regarding our plan would also be welcomed.

I would probably run each disk as its own OSD, which means you need a
bit more memory per host. Networking could certainly be a bottleneck
with 8 to 16 spindle nodes. YMMV.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption/Multi-tennancy

2014-03-11 Thread Kyle Bader
> There could be millions of tenants. Looking deeper at the docs, it looks 
> like Ceph prefers to have one OSD per disk.  We're aiming at having 
> backblazes, so will be looking at 45 OSDs per machine, many machines.  I want 
> to separate the tenants and separately encrypt their data.  The encryption 
> will be provided by us, but I was originally intending to have 
> passphrase-based encryption, and use programmatic means to either hash the 
> passphrase and/or encrypt it using the same passphrase.  This way, we 
> wouldn't be able to access the tenant's data, or the key for the passphrase, 
> although we'd still be able to store both.


The way I see it you have several options:

1. Encrypted OSDs

Preserve confidentiality in the event someone gets physical access to
a disk, whether theft or RMA. Requires tenant to trust provider.

vm
rbd
rados
osd <-here
disks

2. Whole disk VM encryption

Preserve confidentiality in the event someone gets physical access to
a disk, whether theft or RMA.

tenant: key/passphrase
provider: nothing

tenant: passphrase
provider: key

tenant: nothing
provider: key

vm <--- here
rbd
rados
osd
disks

3. Encryption further up the stack (application, perhaps?)

To me, #1/#2 are identical except in the case of #2 when the rbd
volume is not attached to a VM. Block devices attached to a VM and
mounted will be decrypted, making the encryption only useful at
defending against unauthorized access to storage media. With a
different key per VM, with potentially millions of tenants, you now
have a massive key escrow/management problem that only buys you a bit
of additional security when block devices are detached. Sounds like a
crappy deal to me, I'd either go with #1 or #3.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommended node size for Ceph

2014-03-10 Thread Kyle Bader
> Why the limit of 6 OSDs per SSD?

SATA/SAS throughput generally.

> I am doing testing with a PCI-e based SSD, and showing that even with 15
OSD disk drives per SSD that the SSD is keeping up.

That will probably be fine performance wise but it's worth noting that all
OSDs will fail if the flash fails (same as node failure).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption/Multi-tennancy

2014-03-10 Thread Kyle Bader
> Ceph is seriously badass, but my requirements are to create a cluster in 
> which I can host my customer's data in separate areas which are independently 
> encrypted, with passphrases which we as cloud admins do not have access to.
>
> My current thoughts are:
> 1. Create an OSD per machine stretching over all installed disks, then create 
> a user-sized block device per customer.  Mount this block device on an access 
> VM and create a LUKS container in to it followed by a zpool and then I can 
> allow the users to create separate bins of data as separate ZFS filesystems 
> in the container which is actually a blockdevice striped across the OSDs.
> 2. Create an OSD per customer and use dm-crypt, then store the dm-crypt key 
> somewhere which is rendered in some way so that we cannot access it, such as 
> a pgp-encrypted file using a passphrase which only the customer knows.

> My questions are:
> 1. What are people's comments regarding this problem (irrespective of my 
> thoughts)

What is the threat model that leads to these requirements? The story
"cloud admins do not have access" is not achievable through technology
alone.

> 2. Which would be the most efficient of (1) and (2) above?

In the case of #1 and #2, you are only protecting data at rest. With
#2 you would need to decrypt the key to open the block device, and the
key would remain in memory until the volume is unmounted (which the cloud
admin could access). This means #2 is safe only so long as you never
mount the volume, which makes its utility rather limited (archival,
perhaps). Neither of these schemes buys you much more than the
encryption handling provided by ceph-disk-prepare (dm-crypted OSD
data/journal volumes), and the key management problem becomes more
acute, e.g. per tenant.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running a mon on a USB stick

2014-03-08 Thread Kyle Bader
> Is there an issue with IO performance?

Ceph monitors store cluster maps and various other things in leveldb,
which persists to disk. I wouldn't recommend using SD/USB cards for
the monitor store because they tend to be slow and have poor
durability.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about ceph cluster in multi-dacenter

2014-02-20 Thread Kyle Bader
> What could be the best replication ?

Are you using two sites to increase availability, durability, or both?

For availability you're really better off using three sites and using
CRUSH to place each of three replicas in a different datacenter. In
that setup you can survive losing 1 of 3 datacenters. If two sites are
the only option and your goal is availability and durability then I
would do 4 replicas, using osd_pool_default_min_size = 2.
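
A sketch of what that could look like (bucket and rule names are
placeholders), combining pool defaults in ceph.conf with a CRUSH rule that
picks two hosts in each of two datacenters:

# ceph.conf
[global]
osd pool default size = 4
osd pool default min size = 2

# rule in the decompiled crushmap
rule replicated_two_dc {
    ruleset 1
    type replicated
    min_size 2
    max_size 4
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}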

> How to tune the crushmap of this kind of setup ?
> and last question : It's possible to have the reads from vms on DC1 to always 
> read datas on DC1 ?

Not yet!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How client choose among replications?

2014-02-11 Thread Kyle Bader
> Why would it help? Since it's not that ONE OSD will be primary for all
objects. There will be 1 Primary OSD per PG and you'll probably have a
couple of thousands PGs.

The primary may be across an oversubscribed/expensive link, in which case a
local replica with a common ancestor to the client may be preferable. It's
WIP, with the goal of landing in firefly IIRC.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-01 Thread Kyle Bader
> Change pg_num for .rgw.buckets to power of 2, an 'crush tunables
> optimal' didn't help :(

Did you bump pgp_num as well? The split PGs will stay in place until
pgp_num is bumped too; if you do this, be prepared for potentially a
lot of data movement.
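
For example (the target count is a placeholder):

ceph osd pool set .rgw.buckets pg_num 2048
ceph osd pool set .rgw.buckets pgp_num 2048   # this is what actually remaps data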
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOS Gateway Issues

2014-01-23 Thread Kyle Bader
> HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs
stuck unclean; 7 requests are blocked > 32 sec; 3 osds have slow requests;
pool cloudstack has too few pgs; pool .rgw.buckets has too few pgs
> pg 14.0 is stuck inactive since forever, current state incomplete, last
acting [5,0]
> pg 14.2 is stuck inactive since forever, current state incomplete, last
acting [0,5]
> pg 14.6 is stuck inactive since forever, current state down+incomplete,
last acting [4,2]
> pg 14.0 is stuck unclean since forever, current state incomplete, last
acting [5,0]
> pg 14.2 is stuck unclean since forever, current state incomplete, last
acting [0,5]
> pg 14.6 is stuck unclean since forever, current state down+incomplete,
last acting [4,2]
> pg 14.0 is incomplete, acting [5,0]
> pg 14.2 is incomplete, acting [0,5]
> pg 14.6 is down+incomplete, acting [4,2]
> 3 ops are blocked > 8388.61 sec
> 3 ops are blocked > 4194.3 sec
> 1 ops are blocked > 2097.15 sec
> 1 ops are blocked > 8388.61 sec on osd.0
> 1 ops are blocked > 4194.3 sec on osd.0
> 2 ops are blocked > 8388.61 sec on osd.4
> 2 ops are blocked > 4194.3 sec on osd.5
> 1 ops are blocked > 2097.15 sec on osd.5
> 3 osds have slow requests
> pool cloudstack objects per pg (37316) is more than 27.1587 times cluster
average (1374)
> pool .rgw.buckets objects per pg (76219) is more than 55.4723 times
cluster average (1374)
>
>
> Ignore the cloudstack pool, I was using cloudstack but not anymore, it's
an inactive pool.

You will probably want to check osds 0, 2, 4 and 5 to make sure they are all
up and in. Pg 14.6 needs (4,2) and the others need (0,5). Other than that,
you may find that a pg query on the inactive/incomplete pgs will provide
more insight.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power Cycle Problems

2014-01-16 Thread Kyle Bader
> On two separate occasions I have lost power to my Ceph cluster. Both times, I 
> had trouble bringing the cluster back to good health. I am wondering if I 
> need to config something that would solve this problem?

No special configuration should be necessary; I've had the unfortunate
luck of witnessing several power loss events with large Ceph clusters,
and in both cases something other than Ceph was the source of frustrations
once power was returned.

That said, monitor daemons should be started first and must form a quorum
before the cluster will be usable. It sounds like you have made it that
far if you're getting output from "ceph health" commands.

The next step is to get your Ceph OSD daemons running, which will require
the data partitions to be mounted and the journal device present. In
Ubuntu installations this is handled by udev scripts installed by the Ceph
packages (I think this may be true for RHEL/CentOS as well but have not
verified). Short of the udev method you can mount the data partition
manually. Once the data partition is mounted you can start the OSDs
manually in the event that init still doesn't work after mounting; to do
so you will need to know the location of your keyring, ceph.conf and the
OSD id. If you are unsure of what the OSD id is, look at the root of the
OSD data partition, after it is mounted, in a file named "whoami". To
manually start:

/usr/bin/ceph-osd -i ${OSD_ID} --pid-file
/var/run/ceph/osd.${OSD_ID}.pid -c /etc/ceph/ceph.conf

After that it's a matter of examining the logs if you're still having
issues getting the OSDs to boot.
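
A rough sketch of the manual path (the device name is a placeholder; OSD
data is normally the first partition on the disk):

mount /dev/sdb1 /mnt
cat /mnt/whoami                            # e.g. prints 3
umount /mnt
mount /dev/sdb1 /var/lib/ceph/osd/ceph-3
/usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf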

> After powering back up the cluster, “ceph health” revealed stale pages, mds 
> cluster degraded, 3/3 OSDs down. I tried to issue “sudo /etc/init.d/ceph -a 
> start” but I got no output from the command and the health status did not 
> change.

The placement groups are stale because none of the OSDs have reported
their state recently since they are down.

> I ended up having to re-install the cluster to fix the issue, but as my group 
> wants to use Ceph for VM storage in the future, we need to find a solution.

That's a shame, but at least you will be better prepared if it happens
again, hopefully your luck is not as unfortunate as mine!

-- 

Kyle Bader
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Networking questions

2013-12-26 Thread Kyle Bader
> Do monitors have to be on the cluster network as well or is it sufficient
> for them to be on the public network as
> http://ceph.com/docs/master/rados/configuration/network-config-ref/
> suggests?

Monitors only need to be on the public network.

> Also would the OSDs re-route their traffic over the public network if that
> was still available in case the cluster network fails?

Ceph doesn't currently support this type of configuration.

Hope that clears things up!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-26 Thread Kyle Bader
> Yes, that also makes perfect sense, so the aforementioned 12500 objects
> for a 50GB image, at a 60 TB cluster/pool with 72 disk/OSDs and 3 way
> replication that makes 2400 PGs, following the recommended formula.
>
>> > What amount of disks (OSDs) did you punch in for the following run?
>> >> Disk Modeling Parameters
>> >> size:   3TiB
>> >> FIT rate:826 (MTBF = 138.1 years)
>> >> NRE rate:1.0E-16
>> >> RADOS parameters
>> >> auto mark-out: 10 minutes
>> >> recovery rate:50MiB/s (40 seconds/drive)
>> > Blink???
>> > I guess that goes back to the number of disks, but to restore 2.25GB at
>> > 50MB/s with 40 seconds per drive...
>>
>> The surviving replicas for placement groups that the failed OSDs
>> participated will naturally be distributed across many OSDs in the
>> cluster, when the failed OSD is marked out, it's replicas will be
>> remapped to many OSDs. It's not a 1:1 replacement like you might find
>> in a RAID array.
>>
> I completely get that part, however the total amount of data to be
> rebalanced after a single disk/OSD failure to fully restore redundancy is
> still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
> assumed.
> What I'm still missing in this pictures is how many disks (OSDs) you
> calculated this with. Maybe I'm just misreading the 40 seconds per drive
> bit there. Because if that means each drive is only required to be just
> active for 40 seconds to do it's bit of recovery, we're talking 1100
> drives. ^o^ 1100 PGs would be another story.

To recreate the modeling:

git clone https://github.com/ceph/ceph-tools.git
cd ceph-tools/models/reliability/
python main.py -g

I used the following values:

Disk Type: Enterprise
Size: 3000 GiB
Primary FITs: 826
Secondary FITS: 826
NRE Rate: 1.0E-16

RAID Type: RAID6
Replace (hours): 6
Rebuild (MiB/s): 500
Volumes: 11

RADOS Copies: 3
Mark-out (min): 10
Recovery (MiB/s): 50
Space Usage: 75%
Declustering (pg): 1100
Stripe length: 1100 (limited by pgs anyway)

RADOS sites: 1
Rep Latency (s): 0
Recovery (MiB/s): 10
Disaster (years): 1000
Site Recovery (days): 30

NRE Model: Fail
Period (years): 1
Object Size: 4MB

It seems that the number of disks is not considered when calculating
the recovery window, only the number of pgs:

https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L68

I could also see the recovery rates varying based on the max osd
backfill tunable.

http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling

Doing both would improve the quality of models generated by the tool.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-23 Thread Kyle Bader
> Is an object a CephFS file or a RBD image or is it the 4MB blob on the
> actual OSD FS?

Objects are at the RADOS level, CephFS filesystems, RBD images and RGW
objects are all composed by striping RADOS objects - default is 4MB.

> In my case, I'm only looking at RBD images for KVM volume storage, even
> given the default striping configuration I would assume that those 12500
> OSD objects for a 50GB image  would not be in the same PG and thus just on
> 3 (with 3 replicas set) OSDs total?

Objects are striped across placement groups, so you take your RBD size
/ 4 MB and cap it at the total number of placement groups in your
cluster.

> What amount of disks (OSDs) did you punch in for the following run?
>> Disk Modeling Parameters
>> size:   3TiB
>> FIT rate:826 (MTBF = 138.1 years)
>> NRE rate:1.0E-16
>> RADOS parameters
>> auto mark-out: 10 minutes
>> recovery rate:50MiB/s (40 seconds/drive)
> Blink???
> I guess that goes back to the number of disks, but to restore 2.25GB at
> 50MB/s with 40 seconds per drive...

The surviving replicas for placement groups that the failed OSD
participated in will naturally be distributed across many OSDs in the
cluster; when the failed OSD is marked out, its replicas will be
remapped to many OSDs. It's not a 1:1 replacement like you might find
in a RAID array.

>> osd fullness:  75%
>> declustering:1100 PG/OSD
>> NRE model:  fail
>> object size:  4MB
>> stripe length:   1100
> I take it that is to mean that any RBD volume of sufficient size is indeed
> spread over all disks?

Spread over all placement groups, the difference is subtle but there
is a difference.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-20 Thread Kyle Bader
Using your data as inputs to in the Ceph reliability calculator [1]
results in the following:

Disk Modeling Parameters
size:   3TiB
FIT rate:826 (MTBF = 138.1 years)
NRE rate:1.0E-16
RAID parameters
replace:   6 hours
recovery rate:  500MiB/s (100 minutes)
NRE model:  fail
object size:4MiB

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage:     RAID-6: 9+2
durability:  6-nines
PL(site):    0.000e+00
PL(copies):  2.763e-10
PL(NRE):     0.11%
PL(rep):     0.000e+00
loss/PiB:    9.317e+07


Disk Modeling Parameters
size:   3TiB
FIT rate:826 (MTBF = 138.1 years)
NRE rate:1.0E-16
RADOS parameters
auto mark-out: 10 minutes
recovery rate:50MiB/s (40 seconds/drive)
osd fullness:  75%
declustering:1100 PG/OSD
NRE model:  fail
object size:  4MB
stripe length:   1100

Column legends
1 storage unit/configuration being modeled
2 probability of object survival (per 1 years)
3 probability of loss due to site failures (per 1 years)
4 probability of loss due to drive failures (per 1 years)
5 probability of loss due to NREs during recovery (per 1 years)
6 probability of loss due to replication failure (per 1 years)
7 expected data loss per Petabyte (per 1 years)

storage:     RADOS: 3 cp
durability:  10-nines
PL(site):    0.000e+00
PL(copies):  5.232e-08
PL(NRE):     0.000116%
PL(rep):     0.000e+00
loss/PiB:    6.486e+03

[1] https://github.com/ceph/ceph-tools/tree/master/models/reliability

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Kyle Bader
> The area I'm currently investigating is how to configure the
> networking. To avoid a SPOF I'd like to have redundant switches for
> both the public network and the internal network, most likely running
> at 10Gb. I'm considering splitting the nodes in to two separate racks
> and connecting each half to its own switch, and then trunk the
> switches together to allow the two halves of the cluster to see each
> other. The idea being that if a single switch fails I'd only lose half
> of the cluster.

This is fine if you are using a replication factor of 2; with a replication
factor of 3 and "osd pool default min size" set to 2 you would need 2/3 of
the cluster to survive.

> My question is about configuring the public network. If it's all one
> subnet then the clients consuming the Ceph resources can't have both
> links active, so they'd be configured in an active/standby role. But
> this results in quite heavy usage of the trunk between the two
> switches when a client accesses nodes on the other switch than the one
> they're actively connected to.

The linux bonding driver supports several strategies for teaming network
adapters on L2 networks.

> So, can I configure multiple public networks? I think so, based on the
> documentation, but I'm not completely sure. Can I have one half of the
> cluster on one subnet, and the other half on another? And then the
> client machine can have interfaces in different subnets and "do the
> right thing" with both interfaces to talk to all the nodes. This seems
> like a fairly simple solution that avoids a SPOF in Ceph or the network
> layer.

You can have multiple networks for both the public and cluster networks;
the only restriction is that all subnets of a given type be within the
same supernet. For example:
10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2
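
In ceph.conf the supernets would look something like this (addresses are
the example values above):

[global]
public network = 10.0.0.0/16
cluster network = 10.1.0.0/16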

> Or maybe I'm missing an alternative that would be better? I'm aiming
> for something that keeps things as simple as possible while meeting
> the redundancy requirements.
>
> As an aside, there's a similar issue on the cluster network side with
> heavy traffic on the trunk between the two cluster switches. But I
> can't see that's avoidable, and presumably it's something people just
> have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in
each rack, this means every write is going to cross the trunk. Other
traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/Monitor traffic in the case where an OSD is connected to a
  monitor connected in the adjacent rack (map updates, heartbeats).
* OSD/Client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR) then your
cluster level bandwidth is oversubscribed 4:1. To lower oversubscription
you are going to have to steal some of the other 48 ports, 12 for 2:1 and
24 for a non-blocking fabric. Given number of nodes you have/plan to
have you will be utilizing 6-12 links per switch, leaving you with 12-18
links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw daemon stalls on download of some files

2013-12-19 Thread Kyle Bader
> Do you have any futher detail on this radosgw bug?

https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424
https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd

> Does it only apply to emperor?

The bug is present in dumpling too.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rbd image performance

2013-12-15 Thread Kyle Bader
>> Has anyone tried scaling a VMs io by adding additional disks and
>> striping them in the guest os?  I am curious what effect this would have
>> on io performance?

> Why would it? You can also change the stripe size of the RBD image.
Depending on the workload you might change it from 4MB to something like
1MB or 32MB? That would give you more or less RADOS objects which will also
give you a different I/O pattern.

The question comes up because it's common for people operating on EC2 to
stripe EBS volumes together for higher iops rates. I've tried striping
kernel RBD volumes before but hit some sort of thread limitation where
throughput was consistent despite the volume count. I've since learned the
thread limit is configurable. I don't think there is a thread limit that
needs to be tweaked for RBD via KVM/QEMU but I haven't tested this
empirically. As Wido mentioned, if you are operating your own cluster
configuring the stripe size may achieve similar results. Google used to use
a 64MB chunk size with GFS but switched to 1MB after they started
supporting more and more seek heavy workloads.
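
For reference, the object size is set at image creation time via --order
(log2 of the object size in bytes); the names and sizes here are
placeholders:

rbd create rbd/small-objects --size 51200 --order 20   # 1MB objects
rbd create rbd/big-objects --size 51200 --order 25     # 32MB objects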
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SysAdvent: Day 15 - Distributed Storage with Ceph

2013-12-15 Thread Kyle Bader
For you holiday pleasure I've prepared a SysAdvent article on Ceph:

http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html

Check it out!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH and Savanna Integration

2013-12-14 Thread Kyle Bader
> Introduction of Savanna for those haven't heard of it:
>
> Savanna project aims to provide users with simple means to provision a
> Hadoop
>
> cluster at OpenStack by specifying several parameters like Hadoop version,
> cluster
>
> topology, nodes hardware details and a few more.
>
> For now, Savanna can use Swift as a storage for data that will be processed
> by
> Hadoop jobs. As far as I know, we can use Hadoop with CephFS.
> Is there anybody interested in CEPH and Savanna integration? and how to?

You could use a Ceph RADOS gateway as a drop in replacement that
provides a Swift compatible endpoint. Alternatively, the docs for
using Hadoop in conjunction with CephFS are here:

http://ceph.com/docs/master/cephfs/hadoop/

Hopefully that puts you in the right direction!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NUMA and ceph

2013-12-12 Thread Kyle Bader
It seems that NUMA can be problematic for ceph-osd daemons in certain
circumstances. Namely, if a NUMA zone is running out of memory due to
uneven allocation, the zone can enter reclaim mode when threads/processes
scheduled on a core in that zone request memory allocations greater
than the zone's remaining memory. In order for the kernel to satisfy
the memory allocation for those processes it needs to page out some of
the contents of the contentious zone, which can have dramatic
performance implications due to cache misses, etc. I see two ways an
operator could alleviate these issues:

Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
ceph-osd daemons with "numactl --interleave=all". This should probably
be activated by a flag in /etc/default/ceph and modifying the
ceph-osd.conf upstart script, along with adding a depend to the ceph
package's debian/rules file on the "numactl" package.

The alternative is to use a cgroup for each ceph-osd daemon, pinning
each one to cores in the same NUMA zone using cpuset.cpu and
cpuset.mems. This would probably also live in /etc/default/ceph and
the upstart scripts.
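
A sketch of the first approach (paths assume Ubuntu, OSD id is a
placeholder):

# disable zone reclaim
sysctl vm.zone_reclaim_mode=0
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf

# interleave the OSD's allocations across NUMA nodes
numactl --interleave=all /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf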

-- 
Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reliability in large RBD setups

2013-12-10 Thread Kyle Bader
> I've been running similar calculations recently. I've been using this
> tool from Inktank to calculate RADOS reliabilities with different
> assumptions:
>   https://github.com/ceph/ceph-tools/tree/master/models/reliability
>
> But I've also had similar questions about RBD (or any multi-part files
> stored in RADOS) -- naively, a file/device stored in N objects would
> be N times less reliable than a single object. But I hope there's an
> error in that logic.

It's worth pointing out that Ceph's RGW will actually stripe S3 objects
across many RADOS objects, even when it's not a multi-part upload; this
has been the case since the Bobtail release. There is an in-depth Google
paper about availability modeling, it might provide some insight into what
the math should look like:

When reading it you can think of objects as chunks and pgs as stripes.
CRUSH should be configured based on failure domains that cause correlated
failures, ie power and networking. You also want to consider the
availability of the facility itself:

"Typical availability estimates used in the industry range from 99.7%
availability for tier II datacenters to 99.98% and 99.995% for tiers III
and IV, respectively."

http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006

If you combine the cluster availability metric and the facility
availability metric, you might be surprised. A cluster with 99.995%
availability in a Tier II facility is going to be dragged down to 99.7%
availability.  If a cluster goes down in the forest, does anyone know?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody doing Ceph for OpenStack with OSDs across compute/hypervisor nodes?

2013-12-09 Thread Kyle Bader
> We're running OpenStack (KVM) with local disk for ephemeral storage.
> Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich
> for IOPS and have 20GE across the board. Some recent patches in OpenStack
> Havana make it possible to use Ceph RBD as the source of ephemeral VM
> storage, so I'm interested in the potential for clustered storage across our
> hypervisors for this purpose. Any experience out there?

I believe Piston converges their storage/compute, they refer to it as
a null-tier architecture.

http://www.pistoncloud.com/openstack-cloud-software/technology/#storage
-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-06 Thread Kyle Bader
> looking at tcpdump all the traffic is going exactly where it is supposed to 
> go, in particular an osd on the 192.168.228.x network appears to talk to an 
> osd on the 192.168.229.x network without anything strange happening. I was 
> just wondering if there was anything about ceph that could make this 
> non-optimal, assuming traffic was reasonably balanced between all the osd's 
> (eg all the same weights). I think the only time it would suffer is if writes 
> to other osds result in a replica write to a single osd, and even then a 
> single OSD is still limited to 7200RPM disk speed anyway so the loss isn't 
> going to be that great.

Should be fine given you only have a 1:1 ratio of link to disk.

> I think I'll be moving over to bonded setup anyway, although I'm not sure if 
> rr or lacp is best... rr will give the best potential throughput, but lacp 
> should give similar aggregate throughput if there are plenty of connections 
> going on, and less cpu load as no need to reassemble fragments.

One of the DreamHost clusters is using a pair of bonded 1GbE links on
the public network and another pair for the cluster network, we
configured each to use mode 802.3ad.
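
For reference, an 802.3ad bond on Ubuntu looks roughly like this in
/etc/network/interfaces (addresses are placeholders, and the switch ports
need a matching LACP configuration):

auto bond0
iface bond0 inet static
    address 10.0.1.11
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast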

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-04 Thread Kyle Bader
>> Is having two cluster networks like this a supported configuration? Every
>> osd and mon can reach every other so I think it should be.
>
> Maybe. If your back end network is a supernet and each cluster network is a
> subnet of that supernet. For example:
>
> Ceph.conf cluster network (supernet): 10.0.0.0/8
>
> Cluster network #1:  10.1.1.0/24
> Cluster network #2: 10.1.2.0/24
>
> With that configuration OSD address autodetection *should* just work.

It should work, but thinking more about it, the OSDs will likely be
assigned IPs on a single network: whichever is inspected first and matches
the supernet range (which could be in either subnet). In order to have
OSDs on two distinct networks you will likely have to use a
declarative configuration in /etc/ceph/ceph.conf that lists the OSD
IP addresses for each OSD (making sure to balance between links).

>> 1. move osd traffic to eth1. This obviously limits maximum throughput to
>> ~100Mbytes/second, but I'm getting nowhere near that right now anyway.
>
> Given three links I would probably do this if your replication factor is >=
> 3. Keep in mind 100Mbps links could very well end up being a limiting
> factor.

Sorry, I read Mbytes as Mbps; big difference, the former is much preferable!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-02 Thread Kyle Bader
> Is having two cluster networks like this a supported configuration? Every
osd and mon can reach every other so I think it should be.

Maybe. If your back end network is a supernet and each cluster network is a
subnet of that supernet. For example:

Ceph.conf cluster network (supernet): 10.0.0.0/8

Cluster network #1:  10.1.1.0/24
Cluster network #2: 10.1.2.0/24

With that configuration OSD address autodetection *should* just work.

> 1. move osd traffic to eth1. This obviously limits maximum throughput to
~100Mbytes/second, but I'm getting nowhere near that right now anyway.

Given three links I would probably do this if your replication factor is >=
3. Keep in mind 100Mbps links could very well end up being a limiting
factor.

What are you backing each OSD with, storage-wise, and how many OSDs do you
expect to participate in this cluster?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact of fancy striping

2013-11-30 Thread Kyle Bader
> This journal problem is a bit of wizardry to me, I even had weird
intermittent issues with OSDs not starting because the journal was not
found, so please do not hesitate to suggest a better journal setup.

You mentioned using SAS for the journal; if your OSDs are SATA and an
expander is in the data path it might be slow from MUX/STP/etc. overhead.
If the setup is all SAS you might try collocating the journal with its
matching data partition on a single disk. Two spindles being contended
for by 9 OSDs is asking a lot of them. How are your drives attached?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] installing OS on software RAID

2013-11-30 Thread Kyle Bader
> > Is the OS doing anything apart from ceph? Would booting a ramdisk-only
system from USB or compact flash work?

I haven't tested this kind of configuration myself but I can't think of
anything that would preclude this type of setup. I'd probably use squashfs
layered with a tmpfs via aufs to avoid any writes to the USB drive. I would
also mount spinning high-capacity media for /var/log or set up log streaming
to something like rsyslog/syslog-ng/logstash.
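
For the log streaming part, the client side can be as small as one rsyslog
rule, e.g. (the hostname is a placeholder):

  # /etc/rsyslog.d/60-remote.conf - forward everything over TCP
  *.* @@loghost.example.com:514

A single @ would send over UDP instead, which is fine if you can tolerate
losing the odd message.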
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 回复:Re: testing ceph performance issue

2013-11-27 Thread Kyle Bader
> How much performance can be improved if use SSDs  to storage journals?

You will see roughly twice the throughput unless you are using btrfs
(still improved but not as dramatic). You will also see lower latency
because the disk head doesn't have to seek back and forth between
journal and data partitions.

>   Kernel RBD Driver  ,  what is this ?

There are several RBD implementations, one is the kernel RBD driver in
upstream Linux, another is built into Qemu/KVM.
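
To make the distinction concrete (pool/image names below are placeholders):
the kernel driver exposes a block device on the host, while Qemu/KVM opens
the image directly through librbd.

  # kernel RBD driver: creates a /dev/rbd* device on the host
  rbd map rbd/myimage --id admin

  # Qemu/KVM via librbd: the guest just sees a virtual disk
  qemu-system-x86_64 -m 1024 -drive format=raw,file=rbd:rbd/myimage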

> and we want to know the RBD if  support XEN virual  ?

It is possible, but not nearly as well tested or as prevalent as RBD
via Qemu/KVM. This might be a starting point if you're interested in
testing Xen/RBD integration:

http://wiki.xenproject.org/wiki/Ceph_and_libvirt_technology_preview

Hope that helps!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on an external, shared device

2013-11-26 Thread Kyle Bader
> Is there any way to manually configure which OSDs are started on which
> machines? The osd configuration block includes the osd name and host, so is
> there a way to say that, say, osd.0 should only be started on host vashti
> and osd.1 should only be started on host zadok?  I tried using this
> configuration:

The ceph udev rules are going to automatically mount disks that match
the ceph "magic" guids, to dig through the full logic you need to
inspect these files:

/lib/udev/rules.d/60-ceph-partuuid-workaround.rules
/lib/udev/rules.d/95-ceph-osd.rules

The upstart scripts look to see what is mounted at /var/lib/ceph/osd/
and starts osd daemons as appropriate:

/etc/init/ceph-osd-all-starter.conf

In theory you should be able to remove the udev scripts and mount the
osds in /var/lib/ceph/osd if you're using upstart. You will want to make
sure that upgrades to the ceph package don't replace the files; maybe
that means making a null rule and using -o
Dpkg::Options::="--force-confold" in ceph-deploy/chef/puppet/whatever.
You will also want to avoid putting the mounts in fstab because it
could render your node unbootable if the device or filesystem fails.
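
A rough sketch of what I mean, using the file names above:

  # mask the packaged udev rule with an empty file of the same name
  touch /etc/udev/rules.d/95-ceph-osd.rules

and then something like this in /etc/apt/apt.conf.d/99local (this snippet
is an assumption, verify it against your distro before relying on it):

  // keep dpkg from replacing changed conffiles on upgrade
  Dpkg::Options { "--force-confdef"; "--force-confold"; };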

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] installing OS on software RAID

2013-11-25 Thread Kyle Bader
Several people have reported issues with combining OS and OSD journals
on the same SSD drives/RAID due to contention. If you do something
like this I would definitely test to make sure it meets your
expectations. Ceph logs are going to compose the majority of the
writes to the OS storage devices.

On Mon, Nov 25, 2013 at 12:46 PM, James Harper
 wrote:
>>
>> We need to install the OS on the 3TB harddisks that come with our Dell
>> servers. (After many attempts, I've discovered that Dell servers won't allow
>> attaching an external harddisk via the PCIe slot. (I've tried everything). )
>>
>> But, must I therefore sacrifice two hard disks (RAID-1) for the OS?  I don't 
>> see
>> why I can't just create a small partition  (~30GB) on all 6 of my hard 
>> disks, do a
>> software-based RAID 1 on it, and be done.
>>
>> I know that software based RAID-5 seems computationally expensive, but
>> shouldn't RAID 1 be fast and computationally inexpensive for a computer
>> built over the last 4 years? I wouldn't think that a CEPH systems (with lots 
>> of
>> VMs but little data changes) would even do much writing to the OS
>> partitionbut I'm not sure. (And in the past, I have noticed that RAID5
>> systems did suck up a lot of CPU and caused lots of waits, unlike what the
>> blogs implied. But I'm thinking that a RAID 1 takes little CPU and the OS 
>> does
>> little writing to disk; it's mostly reads, which should hit the RAM.)
>>
>> Does anyone see any holes in the above idea? Any gut instincts? (I would try
>> it, but it's hard to tell how well the system would really behave under 
>> "real"
>> load conditions without some degree of experience and/or strong
>> theoretical knowledge.)
>
> Is the OS doing anything apart from ceph? Would booting a ramdisk-only system 
> from USB or compact flash work?
>
> If the OS doesn't produce a lot of writes then having it on the main disk 
> should work okay. I've done it exactly as you describe before.
>
> James
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] misc performance tuning queries (related to OpenStack in particular)

2013-11-19 Thread Kyle Bader
> So quick correction based on Michael's response. In question 4, I should
> have not made any reference to Ceph objects, since objects are not striped
> (per Michael's response). Instead, I should simply have used the words "Ceph
> VM Image" instead of "Ceph objects". A Ceph VM image would constitute 1000s
> of objects, and the different objects are striped/spread across multiple
> OSDs from multiple servers. In that situation, what's answer to #4

It depends on which linux bonding driver is in use, some drivers load
share on transmit, some load share on receive, some do both and some
only provide active/passive fault tolerance. I have Ceph OSD hosts
using LACP (bond-mode 802.3ad) and they load share on both receive and
transmit. We're utilizing a pair of bonded 1GbE links for the Ceph
public network and another pair of bonded 1GbE links for the cluster
network. The issues we've seen with 1GbE are complexity, shallow
buffers on 1GbE top of rack switch gear (Cisco 4948-10G) and the fact
that not all flows are equal (4x 1GbE != 4GbE).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance

2013-11-15 Thread Kyle Bader
> We have the plan to run ceph as block storage for openstack, but from test
> we found the IOPS is slow.
>
> Our apps primarily use the block storage for saving logs (i.e, nginx's
> access logs).
> How to improve this?

There are a number of things you can do, notably:

1. Tuning cache on the hypervisor

http://ceph.com/docs/master/rbd/rbd-config-ref/

2. Separate device OSD journals, usually SSD is used (no longer
seeking between data and journal volumes on a single disk)

3. Flash based writeback cache for OSD data volume

https://github.com/facebook/flashcache/
http://bcache.evilpiepirate.org/
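
To expand on the first item: the RBD cache is configured on the hypervisor
side in ceph.conf, and a minimal example would be something like this
(sizes are illustrative, tune for your workload):

[client]
  rbd cache = true
  rbd cache size = 67108864
  rbd cache max dirty = 50331648
  rbd cache writethrough until flush = true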

If you have any questions let us know!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Today I’ve encountered multiple OSD down and multiple OSD won’t start and OSD disk access “Input/Output” error”

2013-11-15 Thread Kyle Bader
> 3).Comment out,  #hashtag the bad OSD drives in the “/etc/fstab”.

This is unnecessary if you're using the provided upstart and udev
scripts, OSD data devices will be identified by label and mounted. If
you choose not to use the upstart and udev scripts then you should
write init scripts that do similar so that you don't have to have
/etc/fstab entries.

> 3).Login to Ceph Node  with bad OSD net/serial/video.

I'd put "check dmesg" somewhere near the top of this section; if you
lose an OSD due to a filesystem hiccup it will often be evident in the
dmesg output.

>  4).Stop only this local Ceph node  with “service Ceph stop”

You may want to set "noout" depending on whether you expect it to come
back online within your "mon osd down out interval" threshold.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Committee

2013-11-07 Thread Kyle Bader
> Would this be something like 
> http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ?

Something very much like that :)

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Committee

2013-11-07 Thread Kyle Bader
> I think this is a great idea.  One of the big questions users have is
> "what kind of hardware should I buy."  An easy way for users to publish
> information about their setup (hardware, software versions, use-case,
> performance) when they have successful deployments would be very valuable.
> Maybe a section of wiki?

It would be interesting to have a site where a Ceph admin can download an
API key/package that could be optionally installed and report
configuration information to a community API. The admin could then
supplement/correct that base information. Having much of the data
collection be automated lowers the barrier for contribution.  Bonus
points if this could be extended to SMART and failed drives so we
could have a community generated report similar to Google's disk
population study they presented at FAST'07.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running on disks that lose their head

2013-11-07 Thread Kyle Bader
> Zackc, Loicd, and I have been the main participants in a weekly Teuthology
> call the past few weeks. We've talked mostly about methods to extend
> Teuthology to capture performance metrics. Would you be willing to join us
> during the Teuthology and Ceph-Brag sessions at the Firefly Developer
> Summit?

I'd be happy to!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Kyle Bader
> ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i.

The problem might be SATA transport protocol overhead at the expander.
Have you tried directly connecting the SSDs to SATA2/3 ports on the
mainboard?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw questions

2013-11-07 Thread Kyle Bader
> 1. To build a high performance yet cheap radosgw storage, which pools should
> be placed on ssd and which on hdd backed pools? Upon installation of
> radosgw, it created the following pools: .rgw, .rgw.buckets,
> .rgw.buckets.index, .rgw.control, .rgw.gc, .rgw.root, .usage, .users,
> .users.email.

There is a lot that goes into high performance; a few questions come to mind:

Do you want high performance reads, writes or both?
How hot is your data? Can you get better performance by buying more
memory for caching?
What size objects do you expect to handle, how many per bucket?

> 4. Which number of replaction would you suggest? In other words, which
> replication is need to achive 99.9% durability like dreamobjects states?

DreamObjects Engineer here, we used Ceph's durability modeling tools here:

https://github.com/ceph/ceph-tools

You will need to research your data disks' MTBF numbers and convert them
to FITs (failures in time), measure your OSD backfill MTTR and factor in
your
replication count. DreamObjects uses 3 replicas on enterprise SAS
disks. The durability figures exclude black swan events like fires and
other such datacenter or regional disasters, which is why having a
second location is important for DR.

> 5. Is it possible to map fqdn custom domain to buckets, not only subdomains?

You could map a domain's A/AAAA records to an endpoint, but if the
endpoint changes you're SOL; using a CNAME at the domain root violates
DNS RFCs. Some DNS providers will fake a CNAME by doing a recursive
lookup in response to an A/AAAA request as a workaround.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running on disks that lose their head

2013-11-07 Thread Kyle Bader
>> Once I know a drive has had a head failure, do I trust that the rest of the 
>> drive isn't going to go at an inconvenient moment vs just fixing it right 
>> now when it's not 3AM on Christmas morning? (true story)  As good as Ceph 
>> is, do I trust that Ceph is smart enough to prevent spreading corrupt data 
>> all over the cluster if I leave bad disks in place and they start doing 
>> terrible things to the data?

I have a lot more disks than I have trust in disks. If a drive lost a
head then I want it gone.

I love the idea of using SMART data but can foresee some
implementation issues. We have seen some RAID configurations where
polling SMART will halt all RAID operations momentarily. Also, some
controllers require you to use their CLI tool to poll for SMART data
rather than smartmontools.

It would be similarly awesome to embed something like an Apdex score
against each OSD, especially if it factored in the hierarchy to identify
poor-performing OSDs, nodes, racks, etc.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph node Info

2013-10-30 Thread Kyle Bader
The quick start guide is linked below, it should help you hit the ground
running.

http://ceph.com/docs/master/start/quick-ceph-deploy/

Let us know if you have questions or bump into trouble!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph recovery killing vms

2013-10-29 Thread Kyle Bader
Recovering from a degraded state by copying existing replicas to other OSDs
is going to cause reads on existing replicas and writes to the new
locations. If you have slow media then this is going to be felt more
acutely. Tuning the backfill options I posted is one way to lessen the
impact; another option is to slowly lower the CRUSH weight for the
OSD(s) you want to remove. Hopefully that helps!
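
As an example, draining an OSD gradually rather than yanking it looks
something like this (osd.12 and the weight steps are arbitrary):

  ceph osd crush reweight osd.12 0.8
  # wait for active+clean, then keep stepping down
  ceph osd crush reweight osd.12 0.5
  ceph osd crush reweight osd.12 0.2
  ceph osd crush reweight osd.12 0.0
  ceph osd out 12

Each step moves a smaller slice of data than marking the OSD out in one
shot would.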

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] changing journals post-bobcat?

2013-10-28 Thread Kyle Bader
The bobtail release added udev/upstart capabilities that allowed you
to not have per OSD entries in ceph.conf. Under the covers the new
udev/upstart scripts look for a special label on OSD data volumes,
matching volumes are mounted and then a few files are inspected:

journal_uuid
whoami

The journal_uuid file holds the UUID of the journal device for that OSD,
and whoami indicates the OSD number the data volume belongs to. This
thread might be helpful for changing the journal device:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/005162.html
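
The short version of that thread, from memory (OSD id and journal path are
placeholders, test it somewhere safe before trusting it):

  service ceph stop osd.2          # with upstart: stop ceph-osd id=2
  ceph-osd -i 2 --flush-journal
  # repoint the journal symlink and journal_uuid at the new partition
  ln -sf /dev/disk/by-partuuid/<new-journal-uuid> /var/lib/ceph/osd/ceph-2/journal
  echo "<new-journal-uuid>" > /var/lib/ceph/osd/ceph-2/journal_uuid
  ceph-osd -i 2 --mkjournal
  service ceph start osd.2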

On Mon, Oct 28, 2013 at 11:39 AM, John Kinsella  wrote:
> Hey folks - looking around, I see plenty (OK, some) on how to modify journal 
> size and location for older ceph, when ceph.conf was used (I think the switch 
> from ceph.conf to storing osd/journal config elsewhere happened with 
> bobcat?). I recently deployed a cluster with ceph-deploy on 0.67 and wanted 
> to change the journal size for the OSDs.
>
> Is this a remove/re-create procedure now, or is there an easier way?
>
> John
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph recovery killing vms

2013-10-28 Thread Kyle Bader
You can change some OSD tunables to lower the priority of backfills:

osd recovery max chunk:   8388608
osd recovery op priority: 2

In general a lower op priority means it will take longer for your
placement groups to go from degraded to active+clean; the idea is to
balance recovery time against not starving client requests. I've found 2
to work well on our clusters, YMMV.
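
You can put those in the [osd] section of ceph.conf, or push them to a
running cluster without restarts, e.g.:

  ceph tell osd.* injectargs '--osd-recovery-op-priority 2'
  ceph tell osd.* injectargs '--osd-recovery-max-chunk 8388608'

osd max backfills is another knob worth looking at if backfill is the
painful part.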

On Mon, Oct 28, 2013 at 10:16 AM, Kevin Weiler
 wrote:
> Hi all,
>
> We have a ceph cluster that being used as a backing store for several VMs
> (windows and linux). We notice that when we reboot a node, the cluster
> enters a degraded state (which is expected), but when it begins to recover,
> it starts backfilling and it kills the performance of our VMs. The VMs run
> slow, or not at all, and also seem to switch it's ceph mounts to read-only.
> I was wondering 2 things:
>
> Shouldn't we be recovering instead of backfilling? It seems like backfilling
> is much more intensive operation
> Can we improve the recovery/backfill performance so that our VMs don't go
> down when there is a problem with the cluster?
>
>
> --
>
> Kevin Weiler
>
> IT
>
>
>
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606
> | http://imc-chicago.com/
>
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail:
> kevin.wei...@imc-chicago.com
>
>
> 
>
> The information in this e-mail is intended only for the person or entity to
> which it is addressed.
>
> It may contain confidential and /or privileged material. If someone other
> than the intended recipient should receive this e-mail, he / she shall not
> be entitled to read, disseminate, disclose or duplicate it.
>
> If you receive this e-mail unintentionally, please inform us immediately by
> "reply" and then delete it from your system. Although this information has
> been compiled with great care, neither IMC Financial Markets & Asset
> Management nor any of its related entities shall accept any responsibility
> for any errors, omissions or other inaccuracies in this information or for
> the consequences thereof, nor shall it be bound in any way by the contents
> of this e-mail or its attachments. In the event of incomplete or incorrect
> transmission, please return the e-mail to the sender and permanently delete
> this message and any attachments.
>
> Messages and attachments are scanned for all known viruses. Always scan
> attachments before opening them.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware: SFP+ or 10GBase-T

2013-10-24 Thread Kyle Bader
> I know that 10GBase-T has more delay then SFP+ with direct attached
> cables (.3 usec vs 2.6 usec per link), but does that matter? Some
> sites stay it is a huge hit, but we are talking usec, not ms, so I
> find it hard to believe that it causes that much of an issue. I like
> the lower cost and use of standard cabling vs SFP+, but I don't want
> to sacrifice on performance.

If you are talking about the links from the nodes with OSDs to their
ToR switches then I would suggest going with Twinax cables. Twinax
doesn't go very far but it's really durable and uses less power than
10GBase-T. Here's a blog post that goes into more detail:

http://etherealmind.com/difference-twinax-category-6-10-gigabit-ethernet/

I would probably go with the Arista 7050-S over the 7050-T and use
twinax for ToR to OSD node links and SFP+SR uplinks to spine switches
if you need longer runs.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS & Project Manila (OpenStack)

2013-10-23 Thread Kyle Bader
>> This is going to get horribly ugly when you add neutron into the mix, so
>> much so I'd consider this option a non-starter. If someone is using
>> openvswitch to create network overlays to isolate each tenant I can't
>> imagine this ever working.
>
> I'm not following here.  Are this only needed if ceph shares the same
> subnet as the VM?  I don't know much about how these things work, but I
> would expect that it would be possible to route IP traffic from a guest
> network to the storage network (or anywhere else, for that matter)...
>
> That aside, however, I think it would be a mistake to take the
> availability of cephfs vs nfs clients as a reason alone for a particular
> architectural approach.  One of the whole points of ceph is that we ignore
> legacy when it doesn't make sense.  (Hence rbd, not iscsi; cephfs, not
> [p]nfs.)

In an overlay world, physical VLANs have no relation to virtual
networks. An overlay is literally encapsulating layer 2 inside layer 3
and adding a VNI (virtual network identifier) and using tunnels
(VxLAN, STT, GRE, etc) to connect VMs on disparate hypervisors that
may not even have L2 connectivity to each other.  One of the core
tenets of virtual networks is providing tenants the ability to have
overlapping RFC1918 addressing, in this case you could have tenants
already utilizing the addresses used by the NFS storage at the
physical layer. Even if we could pretend that would never happen
(namespaces or jails maybe?) you would still need to provision a
distinct NFS IP per tenant and run a virtual switch that supports the
tunneling protocol used by the overlay and the southbound API used by
that overlays virtual switch to insert/remove flow information. The
only alternative to embedding a myriad of different virtual switch
protocols on the filer head would be to use a VTEP capable switch for
encapsulation. I think there are only 1-2 vendors that ship these,
Arista's 7150 and something in the Cumulus lineup.  Even if you could
get past all this I'm somewhat terrified by the proposition of
connecting the storage fabric to a tenant network, although this is
a much more acute concern for public clouds.

Here's a good RFC wrt overlays if anyone is in dire need of bed time reading:

http://tools.ietf.org/html/draft-mity-nvo3-use-case-04

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS & Project Manila (OpenStack)

2013-10-23 Thread Kyle Bader
> Option 1) The service plugs your filesystem's IP into the VM's network
> and provides direct IP access. For a shared box (like an NFS server)
> this is fairly straightforward and works well (*everything* has a
> working NFS client). It's more troublesome for CephFS, since we'd need
> to include access to many hosts, lots of operating systems don't
> include good CephFS clients by default, and the client is capable of
> forcing some service disruptions if they misbehave or disappear (most
> likely via lease timeouts), but it may not be impossible.
>

This is going to get horribly ugly when you add neutron into the mix, so
much so I'd consider this option a non-starter. If someone is using
openvswitch to create network overlays to isolate each tenant I can't
imagine this ever working.


> Option 2) The hypervisor mediates access to the FS via some
> pass-through filesystem (presumably P9 — Plan 9 FS, which QEMU/KVM is
> already prepared to work with). This works better for us; the
> hypervisor host can have a single CephFS mount that it shares
> selectively to client VMs or something.
>

This seems like the only sane way to do it IMO.


> Option 3) An agent communicates with the client via a well-understood
> protocol (probably NFS) on their VLAN, and to the the backing
> filesystem on a different VLAN in the native protocol. This would also
> work for CephFS, but of course having to use a gateway agent (either
> on a per-tenant or per-many-tenants basis) is a bit of a bummer in
> terms of latency, etc.
>

Again, this is still tricky with neutron and network overlays. You would
need one agent per tenant network and to encapsulate each agent's traffic
with openvswitch (STT/VxLAN/etc.) or a physical switch (only VxLAN is
supported in silicon).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados bench result when increasing OSDs

2013-10-21 Thread Kyle Bader
Besides what Mark and Greg said it could be due to additional hops through
network devices. What network devices are you using, what is the network
topology and does your CRUSH map reflect the network topology?
On Oct 21, 2013 9:43 AM, "Gregory Farnum"  wrote:

> On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang  wrote:
> > Dear ceph-users,
> > Recently I deployed a ceph cluster with RadosGW, from a small one (24
> OSDs) to a much bigger one (330 OSDs).
> >
> > When using rados bench to test the small cluster (24 OSDs), it showed
> the average latency was around 3ms (object size is 5K), while for the
> larger one (330 OSDs), the average latency was around 7ms (object size 5K),
> twice comparing the small cluster.
> >
> > The OSD within the two cluster have the same configuration, SAS disk,
>  and two partitions for one disk, one for journal and the other for
> metadata.
> >
> > For PG numbers, the small cluster tested with the pool having 100 PGs,
> and for the large cluster, the pool has 4 PGs (as I will to further
> scale the cluster, so I choose a much large PG).
> >
> > Does my test result make sense? Like when the PG number and OSD
> increase, the latency might drop?
>
> Besides what Mark said, can you describe your test in a little more
> detail? Writing/reading, length of time, number of objects, etc.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph configuration data sharing requirements

2013-10-17 Thread Kyle Bader
> > * The IP address of at least one MON in the Ceph cluster
>

If you configure nodes with a single monitor in the "mon hosts" directive
then I believe your nodes will have issues if that one monitor goes down.
With Chef I've gone back and forth between using Chef search and having
monitors be declarative. Chef search is problematic if you are not
declarative about how many monitors to expect; you could end up with 3
monitors and 3 single-monitor quorums during initial cluster creation.
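
What I mean by declarative is listing all of the expected monitors up
front, e.g. (names and IPs made up):

[global]
  mon initial members = mon-a, mon-b, mon-c
  mon host = 10.0.1.1, 10.0.1.2, 10.0.1.3

With mon initial members set, a single monitor won't form a quorum by
itself during bootstrap while it waits for a majority of its listed peers.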


> > If cephx is disabled:
> >
> > * no other requirement
> >
> > If cephx is enabled:
> >
> > * an admin user is created by providing a keyring file with its
> description when the first
> >   MON is bootstraped
> >   http://ceph.com/docs/next/dev/mon-bootstrap/
>
> > * users must be created by injecting them into the MONs, for instance
> with auth import
> >https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h#L162
> >or auth add. There is not need to ask the MONs for a key, although it
> can be done. It is
> >not a requirement. When a user is created or later on, its
> capabilities can be set.
> >
> > * an osd must be created by the mon which return an unique osd ID which
> is then used to
> >further configure the osd.
> >https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h#L471
> >
> > * a client must be given a user id and a secret key
> >
> > It would also be helpful to better understand why people are happy with
> the way ceph-deploy currently works and how it deals with these
> requirements.
>

I haven't used ceph-deploy, but I did write a chef cookbook before
ceph-deploy was a thing.  You will want to get the OSD bootstrap key from
one of the monitors and distribute it to your OSD nodes. Once you have the
bootstrap key you can have puppet enable and start the upstart service.
After ceph-osd-all is running under upstart you can simply use
ceph-disk-prepare and a new OSD will be created based off the OSD bootstrap
key; the OSD id is automatically allocated by the monitor during this
process.
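
Roughly, the sequence looks like this (paths are the defaults the upstart
scripts expect, adjust to taste):

  # on a monitor: export the bootstrap-osd key
  ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring

  # ship that keyring to the same path on the OSD node, then per data disk:
  ceph-disk-prepare /dev/sdb

udev/upstart pick up the prepared disk, the monitor hands out the next
free OSD id, and the daemon starts without the OSD ever appearing in
ceph.conf.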

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mounting RBD in linux containers

2013-10-17 Thread Kyle Bader
My first guess would be that it's due to LXC dropping capabilities; I'd
investigate whether CAP_SYS_ADMIN is being dropped. You need CAP_SYS_ADMIN
for mount and block ioctls, so if the container doesn't have those privs a
map will likely fail. Maybe try tracing the command with strace?
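
Two quick things to check from inside the container, assuming the tools
are installed there:

  # does the container still hold CAP_SYS_ADMIN?
  capsh --print | grep -i sys_admin

  # trace the failing map and look for the syscall/ioctl returning EPERM/EINVAL
  strace -f rbd -p dockers --id dockers --keyring /etc/ceph/ceph.client.dockers.keyring map lxctest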

On Thu, Oct 17, 2013 at 2:45 PM, Kevin Weiler
wrote:

>  Hi all,
>
>  We're trying to mount an rbd image inside of a linux container that has
> been created with docker (https://www.docker.io/). We seem to have access
> to the rbd kernel module from inside the container:
>
>  # lsmod | grep ceph
> libceph   218854  1 rbd
> libcrc32c  12603  3 xfs,libceph,dm_persistent_data
>
>  And we can query the pool for available rbds and create rbds from inside
> the container:
>
>  # rbd -p dockers --id dockers --keyring
> /etc/ceph/ceph.client.dockers.keyring create lxctest --size 51200
> # rbd -p dockers --id dockers --keyring
> /etc/ceph/ceph.client.dockers.keyring ls
> lxctest
>
>  But for some reason, we can't seem to map the device to the container:
>
>  # rbd -p dockers --id dockers --keyring
> /etc/ceph/ceph.client.dockers.keyring map lxctest
> rbd: add failed: (22) Invalid argument
>
>  I don't see anything particularly interesting in dmesg or messages on
> either the container or the host box. Any ideas on how to troubleshoot this?
>
>  Thanks!
>
>
>  --
>
> *Kevin Weiler*
>
> IT
>
>
>
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
> 60606 | http://imc-chicago.com/
>
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: *
> kevin.wei...@imc-chicago.com *
>
> --
>
> The information in this e-mail is intended only for the person or entity
> to which it is addressed.
>
> It may contain confidential and /or privileged material. If someone other
> than the intended recipient should receive this e-mail, he / she shall not
> be entitled to read, disseminate, disclose or duplicate it.
>
> If you receive this e-mail unintentionally, please inform us immediately
> by "reply" and then delete it from your system. Although this information
> has been compiled with great care, neither IMC Financial Markets & Asset
> Management nor any of its related entities shall accept any responsibility
> for any errors, omissions or other inaccuracies in this information or for
> the consequences thereof, nor shall it be bound in any way by the contents
> of this e-mail or its attachments. In the event of incomplete or incorrect
> transmission, please return the e-mail to the sender and permanently delete
> this message and any attachments.
>
> Messages and attachments are scanned for all known viruses. Always scan
> attachments before opening them.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speed limit on RadosGW?

2013-10-14 Thread Kyle Bader
I've personally saturated 1Gbps links on multiple radosgw nodes on a large
cluster, and if I remember correctly Yehuda has tested it up into the 7Gbps
range with 10Gbps gear. Could you describe your cluster's hardware and
connectivity?


On Mon, Oct 14, 2013 at 3:34 AM, Chu Duc Minh  wrote:

> Hi sorry, i missed this mail.
>
>
> > During writes, does the CPU usage on your RadosGW node go way up?
> No, CPU stay the same & very low (< 10%)
>
> When upload small files(300KB/file) over RadosGW:
>  - using 1 process: upload bandwidth ~ 3MB/s
>  - using 100 processes: upload bandwidth ~ 15MB/s
>
> When upload big files(3GB/file) over RadosGW:
>  - using 1 process: upload bandwidth ~ 70MB/s
> (Therefore i don't upload big files using multi-processes any more :D)
>
> Maybe, RadosGW have a problem when write many smail files. Or it's a
> problem of CEPH when simultaneously write many smail files into a bucket,
> that already have millions files?
>
>
> On Wed, Sep 25, 2013 at 7:24 PM, Mark Nelson wrote:
>
>> On 09/25/2013 02:49 AM, Chu Duc Minh wrote:
>>
>>> I have a CEPH cluster with 9 nodes (6 data nodes & 3 mon/mds nodes)
>>> And i setup 4 separate nodes to test performance of Rados-GW:
>>>   - 2 node run Rados-GW
>>>   - 2 node run multi-process put file to [multi] Rados-GW
>>>
>>> Result:
>>> a) When i use 1 RadosGW node & 1 upload-node, speed upload = 50MB/s
>>> /upload-node, Rados-GW input/output speed = 50MB/s
>>>
>>> b) When i use 2 RadosGW node & 1 upload-node, speed upload = 50MB/s
>>> /upload-node; each RadosGW have input/output = 25MB/s ==> sum
>>> input/ouput of 2 Rados-GW = 50MB/s
>>>
>>> c) When i use 1 RadosGW node & 2 upload-node, speed upload = 25MB/s
>>> /upload-node ==> sum output of 2 upload-node = 50MB/s, RadosGW have
>>> input/output = 50MB/s
>>>
>>> d) When i use 2 RadosGW node & 2 upload-node, speed upload = 25MB/s
>>> /upload-node ==> sum output of 2 upload-node = 50MB/s; each RadosGW have
>>> input/output = 25MB/s ==> sum input/ouput of 2 Rados-GW = 50MB/s
>>>
>>> _*Problem*_: i can pass limit 50MB/s when put file over Rados-GW,
>>>
>>> regardless of the number Rados-GW nodes and upload-nodes.
>>> When i use this CEPH cluster over librados (openstack/kvm), i can easily
>>> achieve > 300MB/s
>>>
>>> I don't know why performance of RadosGW is so low. What's bottleneck?
>>>
>>
>> During writes, does the CPU usage on your RadosGW node go way up?
>>
>> If this is a test cluster, you might want to try the wip-6286 build from
>> our gitbuilder site.  There is a fix that depending on the size of your
>> objects, could have a big impact on performance.  We're currently
>> investigating some other radosgw performance issues as well, so stay tuned.
>> :)
>>
>> Mark
>>
>>
>>
>>> Thank you very much!
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding ceph cluster by adding more OSDs

2013-10-10 Thread Kyle Bader
I've contracted and expanded clusters by up to a rack of 216 OSDs - 18
nodes, 12 drives each. New disks are configured with a CRUSH weight of 0
and I slowly add weight (in 0.01 to 0.1 increments), wait for the cluster
to become active+clean and then add more weight. I was expanding after
contraction so my PG count didn't need to be corrected; I tend to be
liberal and opt for more PGs. If I hadn't contracted the cluster prior to
expanding it I would probably add PGs after all the new OSDs have finished
being weighted into the cluster.
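
A rough sketch of weighting a new OSD in, for a hypothetical osd.100 with
a target weight of 1.0 (step sizes and the health check are up to you):

  for w in 0.1 0.2 0.4 0.7 1.0; do
    ceph osd crush reweight osd.100 $w
    until ceph health | grep -q HEALTH_OK; do sleep 60; done
  done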


On Wed, Oct 9, 2013 at 8:55 PM, Michael Lowe wrote:

> I had those same questions, I think the answer I got was that it was
> better to have too few pg's than to have overloaded osd's.  So add osd's
> then add pg's.  I don't know the best increments to grow in, probably
> depends largely on the hardware in your osd's.
>
> Sent from my iPad
>
> > On Oct 9, 2013, at 11:34 PM, Guang  wrote:
> >
> > Thanks Mike. I get your point.
> >
> > There are still a few things confusing me:
> >  1) We expand Ceph cluster by adding more OSDs, which will trigger
> re-balance PGs across the old & new OSDs, and likely it will break the
> optimized PG numbers for the cluster.
> >   2) We can add more PGs which will trigger re-balance objects across
> old & new PGs.
> >
> > So:
> >  1) What is the recommended way to expand the cluster by adding OSDs
> (and potentially adding PGs), should we do them at the same time?
> >  2) What is the recommended way to scale a cluster from like 1PB to 2PB,
> should we scale it to like 1.1PB to 1.2PB or move to 2PB directly?
> >
> > Thanks,
> > Guang
> >
> >> On Oct 10, 2013, at 11:10 AM, Michael Lowe wrote:
> >>
> >> There used to be, can't find it right now.  Something like 'ceph osd
> set pg_num ' then 'ceph osd set pgp_num ' to actually move your
> data into the new pg's.  I successfully did it several months ago, when
> bobtail was current.
> >>
> >> Sent from my iPad
> >>
> >>> On Oct 9, 2013, at 10:30 PM, Guang  wrote:
> >>>
> >>> Thanks Mike.
> >>>
> >>> Is there any documentation for that?
> >>>
> >>> Thanks,
> >>> Guang
> >>>
>  On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote:
> 
>  You can add PGs,  the process is called splitting.  I don't think PG
> merging, the reduction in the number of PGs, is ready yet.
> 
> > On Oct 8, 2013, at 11:58 PM, Guang  wrote:
> >
> > Hi ceph-users,
> > Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas,
> per my understanding, the number of PGs for a pool should be fixed even we
> scale out / in the cluster by adding / removing OSDs, does that mean if we
> double the OSD numbers, the PG number for a pool is not optimal any more
> and there is no chance to correct it?
> >
> >
> > Thanks,
> > Guang
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-10 Thread Kyle Bader
It's hard to comment on how your experience could be made better without
more information about your configuration and how you're testing. Useful
details would be the LSI controller model, PCI-E bus speed, number of
expander cables, drive type, number of SSDs and whether the SSDs were
connected to the controller or directly to a SATA2/SATA3 port on the
mainboard. You mentioned using an SSD journal but nothing about a
writeback cache; did you try both? I'm also curious what kind of workload
didn't get better with an external journal. Was this with rados bench?

I'm really excited about tiering; it will disaggregate the SSDs and allow
more flexibility in cephstore chassis selection because you no longer have
to maintain strict SSD:drive ratios. This seems like a much more elegant
and maintainable solution.


On Wed, Oct 9, 2013 at 3:45 PM, Warren Wang  wrote:

> While in theory this should be true, I'm not finding it to be the case for
> a typical enterprise LSI card with 24 drives attached. We tried a variety
> of ratios and went back to collocated journals on the spinning drives.
>
> Eagerly awaiting the tiered performance changes to implement a faster tier
> via SSD.
>
> --
> Warren
>
> On Oct 9, 2013, at 5:52 PM, Kyle Bader  wrote:
>
> Journal on SSD should effectively double your throughput because data will
> not be written to the same device twice to ensure transactional integrity.
> Additionally, by placing the OSD journal on an SSD you should see less
> latency, the disk head no longer has to seek back and forth between the
> journal and data partitions. For large writes it's not as critical to
> have a device that supports high IOPs or throughput because large writes
> are striped across many 4MB rados objects, relatively evenly distributed
> across the cluster. Small write operations will benefit the most from an
> OSD data partition with a writeback cache like btier/flashcache because it
> can absorbs an order of magnitude more IOPs and allow a slower spinning
> device catch up when there is less activity.
>
>
> On Tue, Oct 8, 2013 at 12:09 AM, Robert van Leeuwen <
> robert.vanleeu...@spilgames.com> wrote:
>
>>  > I tried putting Flashcache on my spindle OSDs using an Intel SSL and
>> it works great.
>> > This is getting me read and write SSD caching instead of just write
>> performance on the journal.
>> > It should also allow me to protect the OSD journal on the same drive as
>> the OSD data and still get benefits of SSD caching for writes.
>>
>> Small note that on Red Hat based distro's + Flashcache + XFS:
>> There is a major issue (kernel panics) running xfs + flashcache on a 6.4
>> kernel. (anything higher then 2.6.32-279)
>> It should be fixed in kernel 2.6.32-387.el6 which, I assume, will be 6.5
>> which only just entered Beta.
>>
>> Fore more info, take a look here:
>> https://github.com/facebook/flashcache/issues/113
>>
>> Since I've hit this issue (thankfully in our dev environment) we are
>> slightly less enthusiastic about running flashcache :(
>> It also adds a layer of complexity so I would rather just run the
>> journals on SSD, at least on Redhat.
>> I'm not sure about the performance difference of just journals v.s.
>> Flashcache but I'd be happy to read any such comparison :)
>>
>> Also, if you want to make use of the SSD trim func
>>
>> P.S. My experience with Flashcache is on Openstack Swift & Nova not Ceph.
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
>
> Kyle
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Same journal device for multiple OSDs?

2013-10-09 Thread Kyle Bader
You can certainly use a similarly named device to back an OSD journal if
the OSDs are on separate hosts. If you want to take a single SSD device and
utilize it as a journal for many OSDs on the same machine then you would
want to partition the SSD device and use a different partition for each OSD
journal. You might consider using /dev/disk/by-id/foo instead of /dev/fioa1
to avoid potential device reordering issues after a reboot. Hope that
helps, sorry if I misunderstood.
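
For the many-OSDs-on-one-host case, carving one SSD into per-OSD journal
partitions and referencing them by id could look like this (device names,
OSD ids and sizes below are made up):

  parted -s /dev/sdf mklabel gpt
  parted -s /dev/sdf mkpart journal-0 1MiB 10GiB
  parted -s /dev/sdf mkpart journal-1 10GiB 20GiB

[osd.0]
  osd journal = /dev/disk/by-id/ata-EXAMPLE_SSD-part1
[osd.1]
  osd journal = /dev/disk/by-id/ata-EXAMPLE_SSD-part2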


On Wed, Oct 9, 2013 at 7:03 AM, Andreas Bluemle  wrote:

> Hi,
>
> to avoid confusion: the configuration did *not* contain
> multiple osds referring to the same journal device (or file).
>
> The snippet from ceph.conf suggests osd.214 and osd.314
> both use the same journal -
> but it doesn't show that these osds run on different hosts.
>
>
> Regards
>
> Andreas Bluemle
>
>
> On Wed, 9 Oct 2013 11:23:18 +0200
> Andreas Friedrich  wrote:
>
> > Hello,
> >
> > I have a Ceph test cluster with 88 OSDs running well.
> >
> > In ceph.conf I found multiple OSDs that are using the same SSD block
> > device (without a file system) for their journal:
> >
> > [osd.214]
> >   osd journal = /dev/fioa1
> >   ...
> > [osd.314]
> >   osd journal = /dev/fioa1
> >   ...
> >
> > Is this a allowed configuration?
> >
> > Regards
> > Andreas Friedrich
> > --
> > FUJITSU
> > Fujitsu Technology Solutions GmbH
> > Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany
> > Tel: +49 (5251) 525-1512
> > Fax: +49 (5251) 525-321512
> > Email: andreas.friedr...@ts.fujitsu.com
> > Web: ts.fujitsu.com
> > Company details: de.ts.fujitsu.com/imprint
> > --
> >
> >
>
>
>
> --
> Andreas Bluemle mailto:andreas.blue...@itxperts.de
> ITXperts GmbH   http://www.itxperts.de
> Balanstrasse 73, Geb. 08Phone: (+49) 89 89044917
> D-81541 Muenchen (Germany)  Fax:   (+49) 89 89044910
>
> Company details: http://www.itxperts.de/imprint.htm
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-09 Thread Kyle Bader
Journal on SSD should effectively double your throughput because data will
not be written to the same device twice to ensure transactional integrity.
Additionally, by placing the OSD journal on an SSD you should see less
latency, the disk head no longer has to seek back and forth between the
journal and data partitions. For large writes it's not as critical to have
a device that supports high IOPs or throughput because large writes are
striped across many 4MB rados objects, relatively evenly distributed across
the cluster. Small write operations will benefit the most from an OSD data
partition with a writeback cache like btier/flashcache because it can
absorb an order of magnitude more IOPs and allow a slower spinning device
to catch up when there is less activity.


On Tue, Oct 8, 2013 at 12:09 AM, Robert van Leeuwen <
robert.vanleeu...@spilgames.com> wrote:

>  > I tried putting Flashcache on my spindle OSDs using an Intel SSL and
> it works great.
> > This is getting me read and write SSD caching instead of just write
> performance on the journal.
> > It should also allow me to protect the OSD journal on the same drive as
> the OSD data and still get benefits of SSD caching for writes.
>
> Small note that on Red Hat based distro's + Flashcache + XFS:
> There is a major issue (kernel panics) running xfs + flashcache on a 6.4
> kernel. (anything higher then 2.6.32-279)
> It should be fixed in kernel 2.6.32-387.el6 which, I assume, will be 6.5
> which only just entered Beta.
>
> Fore more info, take a look here:
> https://github.com/facebook/flashcache/issues/113
>
> Since I've hit this issue (thankfully in our dev environment) we are
> slightly less enthusiastic about running flashcache :(
> It also adds a layer of complexity so I would rather just run the journals
> on SSD, at least on Redhat.
> I'm not sure about the performance difference of just journals v.s.
> Flashcache but I'd be happy to read any such comparison :)
>
> Also, if you want to make use of the SSD trim func
>
> P.S. My experience with Flashcache is on Openstack Swift & Nova not Ceph.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com