Re: [ceph-users] Ceph migration to AWS

2015-05-04 Thread Kyle Bader
 To those interested in a tricky problem,

 We have a Ceph cluster running at one of our data centers. One of our
 client's requirements is to have them hosted at AWS. My question is: How do
 we effectively migrate our data on our internal Ceph cluster to an AWS Ceph
 cluster?

 Ideas currently on the table:

 1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum
 at AWS then sever the connection between AWS and our data center.

I would highly discourage this.

 2. Build a Ceph cluster at AWS and send snapshots from our data center to
 our AWS cluster allowing us to migrate to AWS.

This sounds far more sensible. I'd look at the I2 (iops) or D2
(density) class instances, depending on use case.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Kyle Bader
 do people consider a UPS + Shutdown procedures a suitable substitute?

I certainly wouldn't. I've seen utility power fail and the transfer
switch fail to transition to the UPS strings. Had that happened to me
with nobarrier it would have been a very sad day.

-- 

Kyle Bader
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] private network - VLAN vs separate switch

2014-11-25 Thread Kyle Bader
 For a large network (say 100 servers and 2500 disks), are there any
 strong advantages to using separate switch and physical network
 instead of VLAN?

Physical isolation will ensure that congestion on one does not affect
the other. On the flip side, asymmetric network failures tend to be
more difficult to troubleshoot, e.g. a backend failure with a functional
front end. That said, in a pinch you can switch to using the front-end
network for both until you can repair the backend.

 Also, how difficult it would be to switch from a VLAN to using
 separate switches later?

It should be relatively straightforward. Simply configure the
VLANs/subnets on the new physical switches and move the links over one
by one. Once all the links have been moved you can remove the VLANs and
subnets that now live on the new kit from the original hardware.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-08-06 Thread Kyle Bader
 Can you paste me the whole output of the install? I am curious why/how you 
 are getting el7 and el6 packages.

priority=1 required in /etc/yum.repos.d/ceph.repo entries
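For reference, a minimal sketch of such an entry (the baseurl/release and gpgkey URL here are illustrative, not necessarily what you are running); note the priority line only takes effect if the yum-plugin-priorities package is installed:

[ceph]
name=Ceph packages
baseurl=http://ceph.com/rpm-firefly/el7/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
priority=1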

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is OSDs based on VFS?

2014-07-21 Thread Kyle Bader
 I wonder that OSDs use system calls of Virtual File System (i.e. open, read,
 write, etc) when they access disks.

 I mean ... Could I monitor I/O command requested by OSD to disks if I
 monitor VFS?

Ceph OSDs run on top of a traditional filesystem (XFS by default), as
long as it supports xattrs. As such you can use kernel instrumentation
to see what is going on underneath the Ceph OSDs; a quick sketch follows.
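For example, one rough way to watch the VFS-level syscalls an OSD makes is to attach strace to the daemon (this sketch assumes a single ceph-osd process on the box, and the syscall list is just a starting point):

strace -f -e trace=open,read,write,fsync,fdatasync -p $(pidof ceph-osd)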

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Kyle Bader
 TL;DR: Power outages are more common than your colo facility will admit.

Seconded. I've seen power failures in at least 4 different facilities,
and all of them had the usual gamut of batteries/generators/etc. At some
of those facilities I've seen problems multiple times in a single year.
Even a datacenter with five-nines power availability is going to see
roughly 5 minutes of downtime per year, and that would qualify for the
highest rating from the Uptime Institute (Tier IV)! I've lost power to
Ceph clusters on several occasions; in all cases the journals were on
spinning media.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-09 Thread Kyle Bader
 Let's assume a test cluster up and running with real data on it.
 Which is the best way to migrate everything to a production (and
 larger) cluster?

 I'm thinking to add production MONs to the test cluster, after that,
 add productions OSDs to the test cluster, waiting for a full rebalance
 and then starting to remove test OSDs and test mons.

 This should migrate everything with no outage.

It's possible and I've done it; that was around the argonaut/bobtail
timeframe on a pre-production cluster. If your cluster has a lot of
data then it may take a long time or be disruptive, so make sure you've
tested that your recovery tunables are suitable for your hardware
configuration.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-17 Thread Kyle Bader
  I think the timing should work that we'll be deploying with Firefly and
  so
  have Ceph cache pool tiering as an option, but I'm also evaluating
  Bcache
  versus Tier to act as node-local block cache device. Does anybody have
  real
  or anecdotal evidence about which approach has better performance?
  New idea that is dependent on failure behaviour of the cache tier...

 The problem with this type of configuration is it ties a VM to a
 specific hypervisor, in theory it should be faster because you don't
 have network latency from round trips to the cache tier, resulting in
 higher iops. Large sequential workloads may achieve higher throughput
 by parallelizing across many OSDs in a cache tier, whereas local flash
 would be limited to single device throughput.

 Ah, I was ambiguous. When I said node-local I meant OSD-local. So I'm really
 looking at:
 2-copy write-back object ssd cache-pool
 versus
 OSD write-back ssd block-cache
 versus
 1-copy write-around object cache-pool  ssd journal

Ceph cache pools allow you to scale the size of the cache pool
independently of the underlying storage and avoid the disk:SSD ratio
constraints that flashcache, bcache, etc. impose. On a cache miss a
local block cache should have lower latency than a cache tier, due to
the extra hop(s) across the network. I would lean towards using Ceph's
cache tiers for the scaling independence.

 This is undoubtedly true for a write-back cache-tier. But in the scenario
 I'm suggesting, a write-around cache, that needn't be bad news - if a
 cache-tier OSD is lost the cache simply just got smaller and some cached
 objects were unceremoniously flushed. The next read on those objects should
 just miss and bring them into the now smaller cache.

 The thing I'm trying to avoid with the above is double read-caching of
 objects (so as to get more aggregate read cache). I assume the standard
 wisdom with write-back cache-tiering is that the backing data pool shouldn't
 bother with ssd journals?

Currently, all cache tiers need to be durable - regardless of cache
mode. As such, cache tiers should be erasure coded or N+1 replicated
(I'd recommend N+2 or 3x replica). Ceph could potentially do what you
described in the future, it just doesn't yet.
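If you do go the cache-tier route, the firefly-era commands look roughly like this (the pool names are made up for illustration, and the cache pool is assumed to already exist on your SSD OSDs via a CRUSH rule):

ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd tier set-overlay rbd rbd-cache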

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-16 Thread Kyle Bader
 Obviously the ssds could be used as journal devices, but I'm not really
 convinced whether this is worthwhile when all nodes have 1GB of hardware
 writeback cache (writes to journal and data areas on the same spindle have
 time to coalesce in the cache and minimise seek time hurt). Any advice on
 this?

All writes need to be written to the journal before being written to
the data volume, so it's going to impact your overall throughput and
cause seeking; a hardware cache will only help with the latter (unless
you use btrfs).

 I think the timing should work that we'll be deploying with Firefly and so
 have Ceph cache pool tiering as an option, but I'm also evaluating Bcache
 versus Tier to act as node-local block cache device. Does anybody have real
 or anecdotal evidence about which approach has better performance?
 New idea that is dependent on failure behaviour of the cache tier...

The problem with this type of configuration is it ties a VM to a
specific hypervisor, in theory it should be faster because you don't
have network latency from round trips to the cache tier, resulting in
higher iops. Large sequential workloads may achieve higher throughput
by parallelizing across many OSDs in a cache tier, whereas local flash
would be limited to single device throughput.

 Carve the ssds 4-ways: each with 3 partitions for journals servicing the
 backing data pool and a fourth larger partition serving a write-around cache
 tier with only 1 object copy. Thus both reads and writes hit ssd but the ssd
 capacity is not halved by replication for availability.

 ...The crux is how the current implementation behaves in the face of cache
 tier OSD failures?

Cache tiers are durable by way of replication or erasure coding; OSDs
will remap degraded placement groups and backfill as appropriate. With
single-replica cache pools the loss of an OSD becomes a real concern;
in the case of RBD it means losing arbitrary chunk(s) of your block
devices, which is bad news. If you want host independence, durability
and speed your best bet is a replicated cache pool (2-3x).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on harvesting freed space

2014-04-15 Thread Kyle Bader
 I'm assuming Ceph/RBD doesn't have any direct awareness of this since
 the file system doesn't traditionally have a give back blocks
 operation to the block device.  Is there anything special RBD does in
 this case that communicates the release of the Ceph storage back to the
 pool?

VMs running a 3.2+ kernel (iirc) can give back blocks by issuing TRIM.

http://wiki.qemu.org/Features/QED/Trim
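Inside the guest that usually boils down to something like the following, assuming the virtual disk was exposed with discard support and the filesystem supports it (device and mountpoint are examples):

mount -o discard /dev/vda1 /mnt    # discard on delete
fstrim -v /mnt                     # or batch discard, e.g. from cron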

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the difference between using /dev/sdb and /dev/sdb1 as osd?

2014-03-22 Thread Kyle Bader
 If I want to use a disk dedicated for osd, can I just use something like
 /dev/sdb instead of /dev/sdb1? Is there any negative impact on performance?

You can pass /dev/sdb to ceph-disk-prepare and it will create two
partitions, one for the journal (a raw partition) and one for the data
volume (formatted as XFS by default). This is known as a single-device
OSD, in contrast with a multi-device OSD where the journal is on a
completely different device (like a partition on a shared journaling
SSD); see the sketch below.
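A quick illustration (device names are just examples):

ceph-disk-prepare /dev/sdb             # single-device OSD: journal + data partitions on sdb
ceph-disk-prepare /dev/sdb /dev/sdc1   # multi-device OSD: data on sdb, journal on an SSD partition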

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + FlashCache vs. Cache Pool for RBD...

2014-03-22 Thread Kyle Bader
 One downside of the above arrangement: I read that support for mapping
 newer-format RBDs is only present in fairly recent kernels.  I'm running
 Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel.  There
 is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking
 at a new deployment it might be better to wait until 14.04: then you'll
 get kernel 3.13.

 Anyone else have any ideas on the above?

I don't think there are any hairy udev issues or similar that will
make using a newer kernel on precise problematic. The only caveat of
this kind of setup I can think of is that if you lose a hypervisor the
cache goes with it, and you likely won't be able to migrate the guest
to another host. The alternative is to use flashcache on top of the OSD
partition, but then you introduce network hops and it becomes closer to
what the tiering feature will offer, except that the flashcache-on-OSD
method is more particular about the disk:SSD ratio, whereas in a tier
the flash could be on completely separate hosts (possibly dedicated
flash machines).

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mounting with dmcrypt still fails

2014-03-22 Thread Kyle Bader
 ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir 
 /etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb
 ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping 
 (dm-crypt?): dm-0

It sounds like device-mapper still thinks it's using the volume; you
might be able to track it down with this:

for i in `ls -1 /sys/block/ | grep sd`; do echo $i: `ls /sys/block/$i/${i}1/holders/`; done

Then it's a matter of making sure there are no open file handles on
the encrypted volume and unmounting it. You will still need to
completely clear out the partition table on that disk, which can be
tricky with GPT because it's not as simple as dd'ing over the start of
the volume. This is what the zapdisk parameter is for in
ceph-disk-prepare; I don't know enough about ceph-deploy to know if
you can somehow pass it. A sketch of doing it by hand follows.
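If you end up clearing the GPT by hand, sgdisk can do it (double-check the device name first, this is destructive):

sgdisk --zap-all /dev/sdb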

After you know the device/dm mapping you can use udevadm to find out
where it should map to (uuids replaced with xxx's):

udevadm test /block/sdc/sdc1
<snip>
run: '/sbin/cryptsetup --key-file /etc/ceph/dmcrypt-keys/x --key-size 256 create x /dev/sdc1'
run: '/bin/bash -c 'while [ ! -e /dev/mapper/x ];do sleep 1; done''
run: '/usr/sbin/ceph-disk-activate /dev/mapper/x'

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
 Anybody has a good practice on how to set up a ceph cluster behind a pair of 
 load balancer?

The only place you would want to put a load balancer in the context of
a Ceph cluster is north of RGW nodes. You can do L3 transparent load
balancing or balance with an L7 proxy, e.g. Linux Virtual Server or
HAProxy/Nginx; a rough HAProxy sketch follows. The other components of
Ceph are horizontally scalable, and because of the way Ceph's native
protocols work you don't need load balancers doing L2/L3/L7 tricks to
achieve HA.
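As a minimal sketch of the L7 approach, an HAProxy frontend/backend pair fronting a couple of RGW nodes might look like this (addresses and ports are assumptions for illustration):

frontend rgw
    bind *:80
    default_backend rgw_nodes

backend rgw_nodes
    balance leastconn
    server rgw1 10.0.1.11:80 check
    server rgw2 10.0.1.12:80 check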

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
 You're right.  Sorry didn't specify I was trying this for Radosgw.  Even for 
 this I'm seeing performance degrade once my clients start to hit the LB VIP.

Could you tell us more about your load balancer and configuration?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
 This is in my lab. Plain passthrough setup with automap enabled on the F5. s3 
  curl work fine as far as queries go. But file transfer rate degrades badly 
 once I start file up/download.

Maybe the difference can be attributed to LAN client traffic with
jumbo frames vs F5 using a smaller WAN MTU?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption/Multi-tennancy

2014-03-11 Thread Kyle Bader
 There could be millions of tennants. Looking deeper at the docs, it looks 
 like Ceph prefers to have one OSD per disk.  We're aiming at having 
 backblazes, so will be looking at 45 OSDs per machine, many machines.  I want 
 to separate the tennants and separately encrypt their data.  The encryption 
 will be provided by us, but I was originally intending to have 
 passphrase-based encryption, and use programmatic means to either hash the 
 passphrase or/and encrypt it using the same passphrase.  This way, we 
 wouldn't be able to access the tennant's data, or the key for the passphrase, 
 although we'd still be able to store both.


The way I see it you have several options:

1. Encrypted OSDs

Preserve confidentiality in the event someone gets physical access to
a disk, whether theft or RMA. Requires tenant to trust provider.

vm
rbd
rados
osd   <-- here
disks

2. Whole disk VM encryption

Preserve confidentiality in the event someone gets physical access to
a disk, whether theft or RMA.

tenant: key/passphrase
provider: nothing

tenant: passphrase
provider: key

tenant: nothing
provider: key

vm   <-- here
rbd
rados
osd
disks

3. Encryption further up stack (application perhaps?)

To me, #1 and #2 are identical except in the case of #2 when the RBD
volume is not attached to a VM. Block devices attached to a VM and
mounted will be decrypted, making the encryption only useful for
defending against unauthorized access to storage media. With a
different key per VM, and potentially millions of tenants, you now
have a massive key escrow/management problem that only buys you a bit
of additional security when block devices are detached. Sounds like a
crappy deal to me; I'd either go with #1 or #3.
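For what it's worth, option #1 is already supported by the ceph-disk tooling via dm-crypt; a sketch (the device is an example):

ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys /dev/sdb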

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Utilizing DAS on XEN or XCP hosts for Openstack Cinder

2014-03-11 Thread Kyle Bader
 1.   Is it possible to install Ceph and Ceph monitors on the the XCP
 (XEN) Dom0 or would we need to install it on the DomU containing the
 Openstack components?

I'm not a Xen guru but in the case of KVM I would run the OSDs on the
hypervisor to avoid virtualization overhead.

 2.   Is Ceph server aware, or Rack aware so that replicas are not stored
 on the same server?

Yes, placement is defined with your crush map and placement rules.
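For example, a replicated CRUSH rule that forces each replica onto a different server does a chooseleaf over the host bucket type; swap host for rack if you want rack-level separation. A sketch:

rule replicated_hosts {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}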

 3.   Are 4Tb OSD’s too large? We are attempting to restrict the qty of
 OSD’s per server to minimise system overhead

Nope!

 Any other feedback regarding our plan would also be welcomed.

I would probably run each disk as its own OSD, which means you need a
bit more memory per host. Networking could certainly be a bottleneck
with 8 to 16 spindle nodes. YMMV.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-rbd

2014-03-11 Thread Kyle Bader
 I tried rbd-fuse and it's throughput using fio is approx. 1/4 that of the 
 kernel client.

 Can you please let me know how to setup RBD backend for FIO? I'm assuming 
 this RBD backend is also based on librbd?

You will probably have to build fio from source since the rbd engine is new:

https://github.com/axboe/fio

Assuming you already have a cluster and a client configured this
should do the trick:

https://github.com/axboe/fio/blob/master/examples/rbd.fio
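That example job boils down to roughly the following (the client name, pool and image name are assumptions, and the RBD image must already exist):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32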

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about ceph cluster in multi-dacenter

2014-02-20 Thread Kyle Bader
 What could be the best replication ?

Are you using two sites to increase availability, durability, or both?

For availability you're really better off using three sites and using
CRUSH to place each of three replicas in a different datacenter. In
that setup you can survive losing 1 of 3 datacenters. If two sites are
the only option and your goal is availability and durability, then I
would do 4 replicas, using osd_pool_default_min_size = 2.
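Concretely, that is something like the following per pool (the pool name is an example):

ceph osd pool set volumes size 4
ceph osd pool set volumes min_size 2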

 How to tune the crushmap of this kind of setup ?
 and last question : It's possible to have the reads from vms on DC1 to always 
 read datas on DC1 ?

Not yet!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How client choose among replications?

2014-02-11 Thread Kyle Bader
 Why would it help? Since it's not that ONE OSD will be primary for all
objects. There will be 1 Primary OSD per PG and you'll probably have a
couple of thousands PGs.

The primary may be across an oversubscribed/expensive link, in which case a
local replica with a common ancestor to the client may be preferable. It's
a WIP with the goal of landing in firefly, iirc.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] poor data distribution

2014-02-01 Thread Kyle Bader
 Change pg_num for .rgw.buckets to power of 2, an 'crush tunables
 optimal' didn't help :(

Did you bump pgp_num as well? The split PGs will stay in place until
pgp_num is bumped too; if you do this, be prepared for (potentially
lots of) data movement.
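Something like this, using the pg_num value you already chose (the number here is just an example):

ceph osd pool set .rgw.buckets pgp_num 2048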
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power Cycle Problems

2014-01-16 Thread Kyle Bader
 On two separate occasions I have lost power to my Ceph cluster. Both times, I 
 had trouble bringing the cluster back to good health. I am wondering if I 
 need to config something that would solve this problem?

No special configuration should be necessary; I've had the unfortunate
luck of witnessing several power loss events with large Ceph clusters.
In each case something other than Ceph was the source of frustrations
once power was returned. That said, monitor daemons should be started
first and must form a quorum before the cluster will be usable. It
sounds like you have made it that far if you're getting output from
ceph health commands. The next step is to get your Ceph OSD daemons
running, which will require the data partitions to be mounted and the
journal device present. On Ubuntu installations this is handled by
udev scripts installed by the Ceph packages (I think this may also be
true for RHEL/CentOS but have not verified). Short of the udev method
you can mount the data partition manually. Once the data partition is
mounted you can start the OSDs manually in the event that init still
doesn't work after mounting; to do so you will need to know the
location of your keyring, your ceph.conf and the OSD id. If you are
unsure what the OSD id is, look at the root of the OSD data partition,
after it is mounted, for a file named whoami. To manually start:

/usr/bin/ceph-osd -i ${OSD_ID} --pid-file /var/run/ceph/osd.${OSD_ID}.pid -c /etc/ceph/ceph.conf

After that it's a matter of examining the logs if your still having
issues getting the OSDs to boot.
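Putting the mount-and-start steps together, a rough sketch (the device and OSD id are examples):

mount /dev/sdb1 /mnt       # mount the data partition somewhere temporary
cat /mnt/whoami            # e.g. prints 2, so this is osd.2
umount /mnt && mount /dev/sdb1 /var/lib/ceph/osd/ceph-2
/usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf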

 After powering back up the cluster, “ceph health” revealed stale pages, mds 
 cluster degraded, 3/3 OSDs down. I tried to issue “sudo /etc/init.d/ceph -a 
 start” but I got no output from the command and the health status did not 
 change.

The placement groups are stale because none of the OSDs have reported
their state recently since they are down.

 I ended up having to re-install the cluster to fix the issue, but as my group 
 wants to use Ceph for VM storage in the future, we need to find a solution.

That's a shame, but at least you will be better prepared if it happens
again. Hopefully your luck is not as unfortunate as mine!

-- 

Kyle Bader
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-26 Thread Kyle Bader
 Yes, that also makes perfect sense, so the aforementioned 12500 objects
 for a 50GB image, at a 60 TB cluster/pool with 72 disk/OSDs and 3 way
 replication that makes 2400 PGs, following the recommended formula.

  What amount of disks (OSDs) did you punch in for the following run?
  Disk Modeling Parameters
  size:   3TiB
  FIT rate:826 (MTBF = 138.1 years)
  NRE rate:1.0E-16
  RADOS parameters
  auto mark-out: 10 minutes
  recovery rate:50MiB/s (40 seconds/drive)
  Blink???
  I guess that goes back to the number of disks, but to restore 2.25GB at
  50MB/s with 40 seconds per drive...

 The surviving replicas for placement groups that the failed OSDs
 participated will naturally be distributed across many OSDs in the
 cluster, when the failed OSD is marked out, it's replicas will be
 remapped to many OSDs. It's not a 1:1 replacement like you might find
 in a RAID array.

 I completely get that part, however the total amount of data to be
 rebalanced after a single disk/OSD failure to fully restore redundancy is
 still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
 assumed.
 What I'm still missing in this pictures is how many disks (OSDs) you
 calculated this with. Maybe I'm just misreading the 40 seconds per drive
 bit there. Because if that means each drive is only required to be just
 active for 40 seconds to do it's bit of recovery, we're talking 1100
 drives. ^o^ 1100 PGs would be another story.

To recreate the modeling:

git clone https://github.com/ceph/ceph-tools.git
cd ceph-tools/models/reliability/
python main.py -g

I used the following values:

Disk Type: Enterprise
Size: 3000 GiB
Primary FITs: 826
Secondary FITS: 826
NRE Rate: 1.0E-16

RAID Type: RAID6
Replace (hours): 6
Rebuild (MiB/s): 500
Volumes: 11

RADOS Copies: 3
Mark-out (min): 10
Recovery (MiB/s): 50
Space Usage: 75%
Declustering (pg): 1100
Stripe length: 1100 (limited by pgs anyway)

RADOS sites: 1
Rep Latency (s): 0
Recovery (MiB/s): 10
Disaster (years): 1000
Site Recovery (days): 30

NRE Model: Fail
Period (years): 1
Object Size: 4MB

It seems that the number of disks is not considered when calculating
the recovery window, only the number of PGs:

https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L68

I could also see the recovery rates varying based on the max osd
backfill tunable.

http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling

Doing both would improve the quality of models generated by the tool.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Networking questions

2013-12-26 Thread Kyle Bader
 Do monitors have to be on the cluster network as well or is it sufficient
 for them to be on the public network as
 http://ceph.com/docs/master/rados/configuration/network-config-ref/
 suggests?

Monitors only need to be on the public network.

 Also would the OSDs re-route their traffic over the public network if that
 was still available in case the cluster network fails?

Ceph doesn't currently support this type of configuration.

Hope that clears things up!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failure probability with largish deployments

2013-12-23 Thread Kyle Bader
 Is an object a CephFS file or a RBD image or is it the 4MB blob on the
 actual OSD FS?

Objects are at the RADOS level; CephFS filesystems, RBD images and RGW
objects are all composed by striping RADOS objects (4MB by default).

 In my case, I'm only looking at RBD images for KVM volume storage, even
 given the default striping configuration I would assume that those 12500
 OSD objects for a 50GB image  would not be in the same PG and thus just on
 3 (with 3 replicas set) OSDs total?

Objects are striped across placement groups, so you take your RBD size
/ 4 MB and cap it at the total number of placement groups in your
cluster.

 What amount of disks (OSDs) did you punch in for the following run?
 Disk Modeling Parameters
 size:   3TiB
 FIT rate:826 (MTBF = 138.1 years)
 NRE rate:1.0E-16
 RADOS parameters
 auto mark-out: 10 minutes
 recovery rate:50MiB/s (40 seconds/drive)
 Blink???
 I guess that goes back to the number of disks, but to restore 2.25GB at
 50MB/s with 40 seconds per drive...

The surviving replicas for placement groups that the failed OSD
participated in will naturally be distributed across many OSDs in the
cluster; when the failed OSD is marked out, its replicas will be
remapped to many OSDs. It's not a 1:1 replacement like you might find
in a RAID array.

 osd fullness:  75%
 declustering:1100 PG/OSD
 NRE model:  fail
 object size:  4MB
 stripe length:   1100
 I take it that is to mean that any RBD volume of sufficient size is indeed
 spread over all disks?

Spread over all placement groups; the difference is subtle, but there
is a difference.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Kyle Bader
 The area I'm currently investigating is how to configure the
 networking. To avoid a SPOF I'd like to have redundant switches for
 both the public network and the internal network, most likely running
 at 10Gb. I'm considering splitting the nodes in to two separate racks
 and connecting each half to its own switch, and then trunk the
 switches together to allow the two halves of the cluster to see each
 other. The idea being that if a single switch fails I'd only lose half
 of the cluster.

This is fine if you are using a replication factor of 2; you would need 2/3 of
the cluster surviving if using a replication factor of 3 with osd pool default min
size set to 2.

 My question is about configuring the public network. If it's all one
 subnet then the clients consuming the Ceph resources can't have both
 links active, so they'd be configured in an active/standby role. But
 this results in quite heavy usage of the trunk between the two
 switches when a client accesses nodes on the other switch than the one
 they're actively connected to.

The linux bonding driver supports several strategies for teaming network
adapters on L2 networks.

 So, can I configure multiple public networks? I think so, based on the
 documentation, but I'm not completely sure. Can I have one half of the
 cluster on one subnet, and the other half on another? And then the
 client machine can have interfaces in different subnets and do the
 right thing with both interfaces to talk to all the nodes. This seems
 like a fairly simple solution that avoids a SPOF in Ceph or the network
 layer.

You can have multiple networks for both the public and cluster roles;
the only restriction is that all subnets of a given type be within the
same supernet. For example (a matching ceph.conf sketch follows the list):

10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2
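In ceph.conf that corresponds to something like:

[global]
    public network  = 10.0.0.0/16
    cluster network = 10.1.0.0/16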

 Or maybe I'm missing an alternative that would be better? I'm aiming
 for something that keeps things as simple as possible while meeting
 the redundancy requirements.

 As an aside, there's a similar issue on the cluster network side with
 heavy traffic on the trunk between the two cluster switches. But I
 can't see that's avoidable, and presumably it's something people just
 have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in
each rack, which means every write is going to cross the trunk. Other
traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/Monitor traffic in the case where an OSD is connected to a
  monitor in the adjacent rack (map updates, heartbeats)
* OSD/Client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR) then your
cluster-level bandwidth is oversubscribed 4:1. To lower oversubscription
you are going to have to steal some of the other 48 ports: 12 for 2:1 and
24 for a non-blocking fabric. Given the number of nodes you have/plan to
have you will be utilizing 6-12 links per switch, leaving you with 12-18
links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw daemon stalls on download of some files

2013-12-19 Thread Kyle Bader
 Do you have any futher detail on this radosgw bug?

https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424
https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd

 Does it only apply to emperor?

The bug is present in dumpling too.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SysAdvent: Day 15 - Distributed Storage with Ceph

2013-12-15 Thread Kyle Bader
For your holiday pleasure I've prepared a SysAdvent article on Ceph:

http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html

Check it out!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rbd image performance

2013-12-15 Thread Kyle Bader
 Has anyone tried scaling a VMs io by adding additional disks and
 striping them in the guest os?  I am curious what effect this would have
 on io performance?

 Why would it? You can also change the stripe size of the RBD image.
Depending on the workload you might change it from 4MB to something like
1MB or 32MB? That would give you more or less RADOS objects which will also
give you a different I/O pattern.

The question comes up because it's common for people operating on EC2 to
stripe EBS volumes together for higher iops rates. I've tried striping
kernel RBD volumes before but hit some sort of thread limitation where
throughput was consistent despite the volume count. I've since learned the
thread limit is configurable. I don't think there is a thread limit that
needs to be tweaked for RBD via KVM/QEMU but I haven't tested this
empirically. As Wido mentioned, if you are operating your own cluster
configuring the stripe size may achieve similar results. Google used to use
a 64MB chunk size with GFS but switched to 1MB after they started
supporting more and more seek heavy workloads.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NUMA and ceph

2013-12-12 Thread Kyle Bader
It seems that NUMA can be problematic for ceph-osd daemons in certain
circumstances. Namely, if a NUMA zone is running out of memory due to
uneven allocation, it is possible for that zone to enter reclaim mode
when threads/processes scheduled on a core in the zone request memory
allocations greater than the zone's remaining memory. In order for the
kernel to satisfy those allocations it needs to page out some of the
contents of the contentious zone, which can have dramatic performance
implications due to cache misses, etc. I see two ways an operator could
alleviate these issues:

Set the vm.zone_reclaim_mode sysctl setting to 0, and prefix the
ceph-osd daemons with numactl --interleave=all. This should probably
be activated by a flag in /etc/default/ceph and by modifying the
ceph-osd.conf upstart script, along with adding a dependency on the
numactl package to the ceph package's debian/rules file.

The alternative is to use a cgroup for each ceph-osd daemon, pinning
each one to cores in the same NUMA zone using cpuset.cpu and
cpuset.mems. This would probably also live in /etc/default/ceph and
the upstart scripts.
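A sketch of the first approach (the OSD id and paths are illustrative; in practice the numactl prefix would be baked into the upstart job rather than typed by hand):

echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.d/90-ceph.conf
sysctl -w vm.zone_reclaim_mode=0
numactl --interleave=all /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf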

-- 
Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph reliability in large RBD setups

2013-12-10 Thread Kyle Bader
 I've been running similar calculations recently. I've been using this
 tool from Inktank to calculate RADOS reliabilities with different
 assumptions:
   https://github.com/ceph/ceph-tools/tree/master/models/reliability

 But I've also had similar questions about RBD (or any multi-part files
 stored in RADOS) -- naively, a file/device stored in N objects would
 be N times less reliable than a single object. But I hope there's an
 error in that logic.

It's worth pointing out that Ceph's RGW will actually stripe S3 objects
across many RADOS objects, even when it's not a multi-part upload; this
has been the case since the Bobtail release. There is an in-depth Google
paper about availability modeling that might provide some insight into what
the math should look like:
http://research.google.com/pubs/archive/36737.pdf

When reading it you can think of objects as chunks and PGs as stripes.
CRUSH should be configured based on failure domains that cause correlated
failures, e.g. power and networking. You also want to consider the
availability of the facility itself:

Typical availability estimates used in the industry range from 99.7%
availability for tier II datacenters to 99.98% and 99.995% for tiers III
and IV, respectively.

http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006

If you combine the cluster availability metric and the facility
availability metric, you might be surprised. A cluster with 99.995%
availability in a Tier II facility is going to be dragged down to 99.7%
availability.  If a cluster goes down in the forest, does anyone know?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody doing Ceph for OpenStack with OSDs across compute/hypervisor nodes?

2013-12-09 Thread Kyle Bader
 We're running OpenStack (KVM) with local disk for ephemeral storage.
 Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich
 for IOPS and have 20GE across the board. Some recent patches in OpenStack
 Havana make it possible to use Ceph RBD as the source of ephemeral VM
 storage, so I'm interested in the potential for clustered storage across our
 hypervisors for this purpose. Any experience out there?

I believe Piston converges their storage/compute; they refer to it as
a null-tier architecture.

http://www.pistoncloud.com/openstack-cloud-software/technology/#storage
-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-06 Thread Kyle Bader
 looking at tcpdump all the traffic is going exactly where it is supposed to 
 go, in particular an osd on the 192.168.228.x network appears to talk to an 
 osd on the 192.168.229.x network without anything strange happening. I was 
 just wondering if there was anything about ceph that could make this 
 non-optimal, assuming traffic was reasonably balanced between all the osd's 
 (eg all the same weights). I think the only time it would suffer is if writes 
 to other osds result in a replica write to a single osd, and even then a 
 single OSD is still limited to 7200RPM disk speed anyway so the loss isn't 
 going to be that great.

Should be fine given you only have a 1:1 ratio of link to disk.

 I think I'll be moving over to bonded setup anyway, although I'm not sure if 
 rr or lacp is best... rr will give the best potential throughput, but lacp 
 should give similar aggregate throughput if there are plenty of connections 
 going on, and less cpu load as no need to reassemble fragments.

One of the DreamHost clusters is using a pair of bonded 1GbE links on
the public network and another pair for the cluster network, we
configured each to use mode 802.3ad.
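On Ubuntu that looks roughly like the following in /etc/network/interfaces (addresses and interface names are examples):

auto bond0
iface bond0 inet static
    address 10.0.1.11
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100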

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-04 Thread Kyle Bader
 Is having two cluster networks like this a supported configuration? Every
 osd and mon can reach every other so I think it should be.

 Maybe. If your back end network is a supernet and each cluster network is a
 subnet of that supernet. For example:

 Ceph.conf cluster network (supernet): 10.0.0.0/8

 Cluster network #1:  10.1.1.0/24
 Cluster network #2: 10.1.2.0/24

 With that configuration OSD address autodection *should* just work.

It should work, but thinking about it more, the OSDs will likely be
assigned IPs on a single network, whichever one is inspected first and
matches the supernet range (which could be either subnet). In order to
have OSDs on two distinct networks you will likely have to use a
declarative configuration in /etc/ceph/ceph.conf that lists the IP
address for each OSD (making sure to balance between links), as
sketched below.
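Something along these lines, with addresses balanced across the two cluster subnets (addresses are illustrative):

[osd.0]
    cluster addr = 10.1.1.11
[osd.1]
    cluster addr = 10.1.2.11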

 1. move osd traffic to eth1. This obviously limits maximum throughput to
 ~100Mbytes/second, but I'm getting nowhere near that right now anyway.

 Given three links I would probably do this if your replication factor is >=
 3. Keep in mind 100Mbps links could very well end up being a limiting
 factor.

Sorry, I read Mbytes as Mbps; big difference, and the former is much preferable!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-02 Thread Kyle Bader
 Is having two cluster networks like this a supported configuration? Every
osd and mon can reach every other so I think it should be.

Maybe. If your back end network is a supernet and each cluster network is a
subnet of that supernet. For example:

Ceph.conf cluster network (supernet): 10.0.0.0/8

Cluster network #1:  10.1.1.0/24
Cluster network #2: 10.1.2.0/24

With that configuration OSD address autodetection *should* just work.

 1. move osd traffic to eth1. This obviously limits maximum throughput to
~100Mbytes/second, but I'm getting nowhere near that right now anyway.

Given three links I would probably do this if your replication factor is >=
3. Keep in mind 100Mbps links could very well end up being a limiting
factor.

What are you backing each OSD with storage wise and how many OSDs do you
expect to participate in this cluster?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] installing OS on software RAID

2013-11-30 Thread Kyle Bader
  Is the OS doing anything apart from ceph? Would booting a ramdisk-only
system from USB or compact flash work?

I haven't tested this kind of configuration myself, but I can't think of
anything that would preclude this type of setup. I'd probably use squashfs
layered with a tmpfs via aufs to avoid any writes to the USB drive. I would
also mount spinning high-capacity media for /var/log or set up log streaming
to something like rsyslog/syslog-ng/logstash.
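The log streaming piece can be as small as a single rsyslog rule pointing at a central host (the hostname is an example; @@ selects TCP):

*.* @@loghost.example.com:514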
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact of fancy striping

2013-11-30 Thread Kyle Bader
 This journal problem is a bit of wizardry to me, I even had weird
intermittent issues with OSDs not starting because the journal was not
found, so please do not hesitate to suggest a better journal setup.

You mentioned using SAS for the journal; if your OSDs are SATA and an
expander is in the data path it might be slow from MUX/STP/etc. overhead.
If the setup is all SAS you might try collocating each journal with its
matching data partition on a single disk. Two journal spindles being
contended for by 9 OSDs is a lot. How are your drives attached?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 回复:Re: testing ceph performance issue

2013-11-27 Thread Kyle Bader
 How much performance can be improved if use SSDs  to storage journals?

You will see roughly twice the throughput unless you are using btrfs
(still improved but not as dramatic). You will also see lower latency
because the disk head doesn't have to seek back and forth between
journal and data partitions.

   Kernel RBD Driver  ,  what is this ?

There are several RBD implementations: one is the kernel RBD driver in
upstream Linux, another is built into Qemu/KVM.

 and we want to know the RBD if  support XEN virual  ?

It is possible, but not nearly as well tested or as prevalent as RBD
via Qemu/KVM. This might be a starting point if you're interested in
testing Xen/RBD integration:

http://wiki.xenproject.org/wiki/Ceph_and_libvirt_technology_preview

Hope that helps!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on an external, shared device

2013-11-26 Thread Kyle Bader
 Is there any way to manually configure which OSDs are started on which
 machines? The osd configuration block includes the osd name and host, so is
 there a way to say that, say, osd.0 should only be started on host vashti
 and osd.1 should only be started on host zadok?  I tried using this
 configuration:

The ceph udev rules are going to automatically mount disks that match
the ceph magic GUIDs; to dig through the full logic you need to
inspect these files:

/lib/udev/rules.d/60-ceph-partuuid-workaround.rules
/lib/udev/rules.d/95-ceph-osd.rules

The upstart scripts look to see what is mounted at /var/lib/ceph/osd/
and starts osd daemons as appropriate:

/etc/init/ceph-osd-all-starter.conf

In theory you should be able to remove the udev scripts and mount the
OSDs in /var/lib/ceph/osd if you're using upstart. You will want to make
sure that upgrades to the ceph package don't replace the files; maybe
that means making a null rule and using -o
Dpkg::Options::='--force-confold' in ceph-deploy/chef/puppet/whatever.
You will also want to avoid putting the mounts in fstab because it
could render your node unbootable if the device or filesystem fails.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] installing OS on software RAID

2013-11-25 Thread Kyle Bader
Several people have reported issues with combining OS and OSD journals
on the same SSD drives/RAID due to contention. If you do something
like this I would definitely test to make sure it meets your
expectations. Ceph logs are going to compose the majority of the
writes to the OS storage devices.

On Mon, Nov 25, 2013 at 12:46 PM, James Harper
james.har...@bendigoit.com.au wrote:

 We need to install the OS on the 3TB harddisks that come with our Dell
 servers. (After many attempts, I've discovered that Dell servers won't allow
 attaching an external harddisk via the PCIe slot. (I've tried everything). )

 But, must I therefore sacrifice two hard disks (RAID-1) for the OS?  I don't 
 see
 why I can't just create a small partition  (~30GB) on all 6 of my hard 
 disks, do a
 software-based RAID 1 on it, and be done.

 I know that software based RAID-5 seems computationally expensive, but
 shouldn't RAID 1 be fast and computationally inexpensive for a computer
 built over the last 4 years? I wouldn't think that a CEPH systems (with lots 
 of
 VMs but little data changes) would even do much writing to the OS
 partitionbut I'm not sure. (And in the past, I have noticed that RAID5
 systems did suck up a lot of CPU and caused lots of waits, unlike what the
 blogs implied. But I'm thinking that a RAID 1 takes little CPU and the OS 
 does
 little writing to disk; it's mostly reads, which should hit the RAM.)

 Does anyone see any holes in the above idea? Any gut instincts? (I would try
 it, but it's hard to tell how well the system would really behave under 
 real
 load conditions without some degree of experience and/or strong
 theoretical knowledge.)

 Is the OS doing anything apart from ceph? Would booting a ramdisk-only system 
 from USB or compact flash work?

 If the OS doesn't produce a lot of writes then having it on the main disk 
 should work okay. I've done it exactly as you describe before.

 James

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running on disks that lose their head

2013-11-07 Thread Kyle Bader
 Once I know a drive has had a head failure, do I trust that the rest of the 
 drive isn't going to go at an inconvenient moment vs just fixing it right 
 now when it's not 3AM on Christmas morning? (true story)  As good as Ceph 
 is, do I trust that Ceph is smart enough to prevent spreading corrupt data 
 all over the cluster if I leave bad disks in place and they start doing 
 terrible things to the data?

I have a lot more disks than I have trust in disks. If a drive lost a
head then I want it gone.

I love the idea of using SMART data but can foresee some
implementation issues. We have seen some RAID configurations where
polling SMART will halt all RAID operations momentarily. Also, some
controllers require you to use their CLI tool to poll for SMART data
instead of smartmontools.

It would be similarly awesome to embed something like an Apdex score
against each OSD, especially if it factored in hierarchy to identify
poor-performing OSDs, nodes, racks, etc.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw questions

2013-11-07 Thread Kyle Bader
 1. To build a high performance yet cheap radosgw storage, which pools should
 be placed on ssd and which on hdd backed pools? Upon installation of
 radosgw, it created the following pools: .rgw, .rgw.buckets,
 .rgw.buckets.index, .rgw.control, .rgw.gc, .rgw.root, .usage, .users,
 .users.email.

There is a lot that goes into high performance; a few questions come to mind:

Do you want high performance reads, writes or both?
How hot is your data? Can you get better performance from buying more
memory for caching?
What size objects do you expect to handle, and how many per bucket?

 4. Which number of replaction would you suggest? In other words, which
 replication is need to achive 99.9% durability like dreamobjects states?

DreamObjects engineer here; we used Ceph's durability modeling tools:

https://github.com/ceph/ceph-tools

You will need to research your data disks' MTBF numbers and convert
them to FITs, measure your OSD backfill MTTR and factor in your
replication count. DreamObjects uses 3 replicas on enterprise SAS
disks. The durability figures exclude black swan events like fires and
other such datacenter or regional disasters, which is why having a
second location is important for DR.
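For the FIT conversion: FITs are failures per 10^9 device-hours, so FIT = 10^9 / MTBF(hours); for example a drive rated at a 1.2M-hour MTBF works out to roughly 833 FITs.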

 5. Is it possible to map fqdn custom domain to buckets, not only subdomains?

You could map a domain's A/AAAA records to an endpoint, but if the
endpoint changes you're SOL, and using a CNAME at the domain root violates
DNS RFCs. Some DNS providers will fake a CNAME by doing a recursive
lookup in response to an A/AAAA request as a workaround.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Kyle Bader
 ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i.

The problem might be SATA transport protocol overhead at the expander.
Have you tried directly connecting the SSDs to SATA2/3 ports on the
mainboard?

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running on disks that lose their head

2013-11-07 Thread Kyle Bader
 Zackc, Loicd, and I have been the main participants in a weekly Teuthology
 call the past few weeks. We've talked mostly about methods to extend
 Teuthology to capture performance metrics. Would you be willing to join us
 during the Teuthology and Ceph-Brag sessions at the Firefly Developer
 Summit?

I'd be happy to!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Committee

2013-11-07 Thread Kyle Bader
 I think this is a great idea.  One of the big questions users have is
 what kind of hardware should I buy.  An easy way for users to publish
 information about their setup (hardware, software versions, use-case,
 performance) when they have successful deployments would be very valuable.
 Maybe a section of wiki?

It would be interesting to have a site where a Ceph admin can download an
API key/package that could be optionally installed and report
configuration information to a community API. The admin could then
supplement/correct that base information. Having much of the data
collection be automated lowers the barrier for contribution. Bonus
points if this could be extended to SMART and failed drives so we
could have a community-generated report similar to the disk
population study Google presented at FAST'07.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Committee

2013-11-07 Thread Kyle Bader
 Would this be something like 
 http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ?

Something very much like that :)

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph node Info

2013-10-30 Thread Kyle Bader
The quick start guide is linked below; it should help you hit the ground
running.

http://ceph.com/docs/master/start/quick-ceph-deploy/

Let us know if you have questions or bump into trouble!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph recovery killing vms

2013-10-29 Thread Kyle Bader
Recovering from a degraded state by copying existing replicas to other OSDs
is going to cause reads on the existing replicas and writes to the new
locations. If you have slow media then this is going to be felt more
acutely. Tuning the backfill options I posted is one way to lessen the
impact; another option is to slowly lower the CRUSH weight of the
OSD(s) you want to remove. Hopefully that helps!

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph recovery killing vms

2013-10-28 Thread Kyle Bader
You can change some OSD tunables to lower the priority of backfills:

osd recovery max chunk:   8388608
osd recovery op priority: 2

In general a lower op priority means it will take longer for your
placement groups to go from degraded to active+clean; the idea is to
balance recovery time against not starving client requests. I've found 2
to work well on our clusters, but YMMV.
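You can also apply these to a running cluster without restarting the OSDs, e.g. with the values above:

ceph tell osd.* injectargs '--osd-recovery-max-chunk 8388608 --osd-recovery-op-priority 2'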

On Mon, Oct 28, 2013 at 10:16 AM, Kevin Weiler
kevin.wei...@imc-chicago.com wrote:
 Hi all,

 We have a ceph cluster that being used as a backing store for several VMs
 (windows and linux). We notice that when we reboot a node, the cluster
 enters a degraded state (which is expected), but when it begins to recover,
 it starts backfilling and it kills the performance of our VMs. The VMs run
 slow, or not at all, and also seem to switch it's ceph mounts to read-only.
 I was wondering 2 things:

 Shouldn't we be recovering instead of backfilling? It seems like backfilling
 is much more intensive operation
 Can we improve the recovery/backfill performance so that our VMs don't go
 down when there is a problem with the cluster?


 --

 Kevin Weiler

 IT



 IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606
 | http://imc-chicago.com/

 Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail:
 kevin.wei...@imc-chicago.com


 

 The information in this e-mail is intended only for the person or entity to
 which it is addressed.

 It may contain confidential and /or privileged material. If someone other
 than the intended recipient should receive this e-mail, he / she shall not
 be entitled to read, disseminate, disclose or duplicate it.

 If you receive this e-mail unintentionally, please inform us immediately by
 reply and then delete it from your system. Although this information has
 been compiled with great care, neither IMC Financial Markets  Asset
 Management nor any of its related entities shall accept any responsibility
 for any errors, omissions or other inaccuracies in this information or for
 the consequences thereof, nor shall it be bound in any way by the contents
 of this e-mail or its attachments. In the event of incomplete or incorrect
 transmission, please return the e-mail to the sender and permanently delete
 this message and any attachments.

 Messages and attachments are scanned for all known viruses. Always scan
 attachments before opening them.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] changing journals post-bobcat?

2013-10-28 Thread Kyle Bader
The bobtail release added udev/upstart capabilities that allow you
to not have per-OSD entries in ceph.conf. Under the covers the new
udev/upstart scripts look for a special label on OSD data volumes;
matching volumes are mounted and then a few files are inspected:

journal_uuid and whoami

The journal_uuid is the uuid of the journal device for that OSD, and
whoami indicates the OSD number the data volume belongs to. This
thread might be helpful for changing the journal device:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/005162.html
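
If it helps, the usual flush/recreate dance looks roughly like this
(untested sketch: $ID is the OSD number, NEW_UUID is the partition uuid
of the new journal device, and the paths assume the default ceph-disk
layout under /var/lib/ceph):

  stop ceph-osd id=$ID                    # or: service ceph stop osd.$ID
  ceph-osd -i $ID --flush-journal         # drain the old journal
  ln -sf /dev/disk/by-partuuid/NEW_UUID /var/lib/ceph/osd/ceph-$ID/journal
  echo NEW_UUID > /var/lib/ceph/osd/ceph-$ID/journal_uuid
  ceph-osd -i $ID --mkjournal             # initialize the new journal
  start ceph-osd id=$ID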

On Mon, Oct 28, 2013 at 11:39 AM, John Kinsella j...@stratosec.co wrote:
 Hey folks - looking around, I see plenty (OK, some) on how to modify journal 
 size and location for older ceph, when ceph.conf was used (I think the switch 
 from ceph.conf to storing osd/journal config elsewhere happened with 
 bobcat?). I recently deployed a cluster with ceph-deploy on 0.67 and wanted 
 to change the journal size for the OSDs.

 Is this a remove/re-create procedure now, or is there an easier way?

 John
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware: SFP+ or 10GBase-T

2013-10-24 Thread Kyle Bader
 I know that 10GBase-T has more delay then SFP+ with direct attached
 cables (.3 usec vs 2.6 usec per link), but does that matter? Some
 sites stay it is a huge hit, but we are talking usec, not ms, so I
 find it hard to believe that it causes that much of an issue. I like
 the lower cost and use of standard cabling vs SFP+, but I don't want
 to sacrifice on performance.

If you are talking about the links from the nodes with OSDs to their
ToR switches then I would suggest going with Twinax cables. Twinax
doesn't go very far but it's really durable and uses less power than
10GBase-T. Here's a blog post that goes into more detail:

http://etherealmind.com/difference-twinax-category-6-10-gigabit-ethernet/

I would probably go with the Arista 7050-S over the 7050-T and use
twinax for ToR to OSD node links and SFP+SR uplinks to spine switches
if you need longer runs.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Project Manila (OpenStack)

2013-10-23 Thread Kyle Bader
 This is going to get horribly ugly when you add neutron into the mix, so
 much so I'd consider this option a non-starter. If someone is using
 openvswitch to create network overlays to isolate each tenant I can't
 imagine this ever working.

 I'm not following here.  Are this only needed if ceph shares the same
 subnet as the VM?  I don't know much about how these things work, but I
 would expect that it would be possible to route IP traffic from a guest
 network to the storage network (or anywhere else, for that matter)...

 That aside, however, I think it would be a mistake to take the
 availability of cephfs vs nfs clients as a reason alone for a particular
 architectural approach.  One of the whole points of ceph is that we ignore
 legacy when it doesn't make sense.  (Hence rbd, not iscsi; cephfs, not
 [p]nfs.)

In an overlay world, physical VLANs have no relation to virtual
networks. An overlay literally encapsulates layer 2 inside layer 3,
adds a VNI (virtual network identifier), and uses tunnels (VxLAN, STT,
GRE, etc.) to connect VMs on disparate hypervisors that may not even
have L2 connectivity to each other. One of the core tenets of virtual
networks is giving tenants the ability to have overlapping RFC1918
addressing; in this case you could have tenants already utilizing the
addresses used by the NFS storage at the physical layer. Even if we
could pretend that would never happen (namespaces or jails maybe?) you
would still need to provision a distinct NFS IP per tenant and run a
virtual switch that supports the tunneling protocol used by the overlay
and the southbound API used by that overlay's virtual switch to
insert/remove flow information. The only alternative to embedding a
myriad of different virtual switch protocols on the filer head would be
to use a VTEP-capable switch for encapsulation; I think there are only
1-2 vendors that ship these, Arista's 7150 and something in the Cumulus
lineup. Even if you could get past all this, I'm somewhat terrified by
the proposition of connecting the storage fabric to a tenant network,
although this is a much more acute concern for public clouds.

Here's a good IETF draft wrt overlays if anyone is in dire need of bedtime reading:

http://tools.ietf.org/html/draft-mity-nvo3-use-case-04

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mounting RBD in linux containers

2013-10-17 Thread Kyle Bader
My first guess would be that it's due to LXC dropping capabilities; I'd
investigate whether CAP_SYS_ADMIN is being dropped. You need CAP_SYS_ADMIN
for mount and block ioctls, so if the container doesn't have those privs a
map will likely fail. Maybe try tracing the command with strace?
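
A couple of quick things to try from inside the container (just a
sketch; capsh comes from libcap2-bin, and the rbd arguments are copied
from your example below):

  # what capabilities does the container process actually have?
  grep CapEff /proc/self/status
  capsh --decode=$(awk '/CapEff/ {print $2}' /proc/self/status)

  # trace the failing map to see which call returns EINVAL
  strace -f rbd -p dockers --id dockers \
      --keyring /etc/ceph/ceph.client.dockers.keyring map lxctest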

On Thu, Oct 17, 2013 at 2:45 PM, Kevin Weiler
kevin.wei...@imc-chicago.com wrote:

  Hi all,

  We're trying to mount an rbd image inside of a linux container that has
 been created with docker (https://www.docker.io/). We seem to have access
 to the rbd kernel module from inside the container:

  # lsmod | grep ceph
 libceph   218854  1 rbd
 libcrc32c  12603  3 xfs,libceph,dm_persistent_data

  And we can query the pool for available rbds and create rbds from inside
 the container:

  # rbd -p dockers --id dockers --keyring
 /etc/ceph/ceph.client.dockers.keyring create lxctest --size 51200
 # rbd -p dockers --id dockers --keyring
 /etc/ceph/ceph.client.dockers.keyring ls
 lxctest

  But for some reason, we can't seem to map the device to the container:

  # rbd -p dockers --id dockers --keyring
 /etc/ceph/ceph.client.dockers.keyring map lxctest
 rbd: add failed: (22) Invalid argument

  I don't see anything particularly interesting in dmesg or messages on
 either the container or the host box. Any ideas on how to troubleshoot this?

  Thanks!


  --

 *Kevin Weiler*

 IT



 IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
 60606 | http://imc-chicago.com/

 Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: *
 kevin.wei...@imc-chicago.com kevin.wei...@imc-chicago.com*

 --


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph configuration data sharing requirements

2013-10-17 Thread Kyle Bader
  * The IP address of at least one MON in the Ceph cluster


If you configure nodes with a single monitor in the mon host directive
then I believe your nodes will have issues if that one monitor goes down.
With Chef I've gone back and forth between using Chef search and having
monitors be declarative. Chef search is problematic if you are not
declarative about how many monitors to expect; you could end up with 3
monitors and 3 single-monitor quorums during initial cluster creation.
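
In other words, something along these lines in ceph.conf (the names and
addresses are made up) lets clients fall back to another mon and tells
the mons how many peers to expect when forming the initial quorum:

  [global]
  mon initial members = mon-a, mon-b, mon-c
  mon host = 10.0.0.11,10.0.0.12,10.0.0.13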


  If cephx is disabled:
 
  * no other requirement
 
  If cephx is enabled:
 
  * an admin user is created by providing a keyring file with its
 description when the first
MON is bootstraped
http://ceph.com/docs/next/dev/mon-bootstrap/

  * users must be created by injecting them into the MONs, for instance
 with auth import
 https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h#L162
 or auth add. There is not need to ask the MONs for a key, although it
 can be done. It is
 not a requirement. When a user is created or later on, its
 capabilities can be set.
 
  * an osd must be created by the mon which return an unique osd ID which
 is then used to
 further configure the osd.
 https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h#L471
 
  * a client must be given a user id and a secret key
 
  It would also be helpful to better understand why people are happy with
 the way ceph-deploy currently works and how it deals with these
 requirements.


I haven't used ceph-deploy, but I did write a chef cookbook before
ceph-deploy was a thing. You will want to get the OSD bootstrap key from
one of the monitors and distribute it to your OSD nodes. Once you have the
bootstrap key you can have puppet enable and start the upstart service.
After ceph-osd-all is running under upstart you can simply use
ceph-disk-prepare and a new OSD will be created based on the OSD bootstrap
key; the OSD id is automatically allocated by the monitor during this
process.
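
Roughly, on the OSD node it boils down to something like this (untested
sketch, default paths assumed):

  # key fetched from a monitor, e.g. with 'ceph auth get client.bootstrap-osd',
  # and dropped in place by your config management:
  #   /var/lib/ceph/bootstrap-osd/ceph.keyring

  start ceph-osd-all            # if the upstart job isn't already running
  ceph-disk-prepare /dev/sdb    # OSD id is allocated by the mon automatically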

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speed limit on RadosGW?

2013-10-14 Thread Kyle Bader
I've personally saturated 1Gbps links on multiple radosgw nodes on a large
cluster, and if I remember correctly Yehuda has tested it up into the 7Gbps
range with 10Gbps gear. Could you describe your cluster's hardware and
connectivity?
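
If you want to rule the RADOS layer in or out first, a quick run of
rados bench against a scratch pool from one of the radosgw hosts is a
cheap test (the pool name, PG count and thread count are arbitrary):

  ceph osd pool create benchtest 512
  rados bench -p benchtest 60 write -t 32
  # remember to delete the scratch pool afterwards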


On Mon, Oct 14, 2013 at 3:34 AM, Chu Duc Minh chu.ducm...@gmail.com wrote:

 Hi sorry, i missed this mail.


  During writes, does the CPU usage on your RadosGW node go way up?
 No, CPU stay the same & very low (< 10%)

 When upload small files(300KB/file) over RadosGW:
  - using 1 process: upload bandwidth ~ 3MB/s
  - using 100 processes: upload bandwidth ~ 15MB/s

 When upload big files(3GB/file) over RadosGW:
  - using 1 process: upload bandwidth ~ 70MB/s
 (Therefore i don't upload big files using multi-processes any more :D)

 Maybe, RadosGW have a problem when write many smail files. Or it's a
 problem of CEPH when simultaneously write many smail files into a bucket,
 that already have millions files?


 On Wed, Sep 25, 2013 at 7:24 PM, Mark Nelson mark.nel...@inktank.com wrote:

 On 09/25/2013 02:49 AM, Chu Duc Minh wrote:

 I have a CEPH cluster with 9 nodes (6 data nodes & 3 mon/mds nodes)
 And i setup 4 separate nodes to test performance of Rados-GW:
   - 2 node run Rados-GW
   - 2 node run multi-process put file to [multi] Rados-GW

 Result:
 a) When i use 1 RadosGW node & 1 upload-node, speed upload = 50MB/s
 /upload-node, Rados-GW input/output speed = 50MB/s

 b) When i use 2 RadosGW node & 1 upload-node, speed upload = 50MB/s
 /upload-node; each RadosGW have input/output = 25MB/s ==> sum
 input/ouput of 2 Rados-GW = 50MB/s

 c) When i use 1 RadosGW node & 2 upload-node, speed upload = 25MB/s
 /upload-node ==> sum output of 2 upload-node = 50MB/s, RadosGW have
 input/output = 50MB/s

 d) When i use 2 RadosGW node & 2 upload-node, speed upload = 25MB/s
 /upload-node ==> sum output of 2 upload-node = 50MB/s; each RadosGW have
 input/output = 25MB/s ==> sum input/ouput of 2 Rados-GW = 50MB/s

 _*Problem*_: i can't pass limit 50MB/s when put file over Rados-GW,

 regardless of the number Rados-GW nodes and upload-nodes.
 When i use this CEPH cluster over librados (openstack/kvm), i can easily
 achieve > 300MB/s

 I don't know why performance of RadosGW is so low. What's bottleneck?


 During writes, does the CPU usage on your RadosGW node go way up?

 If this is a test cluster, you might want to try the wip-6286 build from
 our gitbuilder site.  There is a fix that depending on the size of your
 objects, could have a big impact on performance.  We're currently
 investigating some other radosgw performance issues as well, so stay tuned.
 :)

 Mark



 Thank you very much!




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-10 Thread Kyle Bader
It's hard to comment on how your experience could be made better without
more information about your configuration and how you're testing. Anything
along the lines of which LSI controller model, PCI-E bus speed, number of
expander cables, drive type, number of SSDs, and whether the SSDs were
connected to the controller or directly to a SATA2/SATA3 port on the
mainboard would help. You mentioned using an SSD journal but nothing about
a writeback cache; did you try both? I'm also curious about what kind of
workload didn't get better with an external journal; was this with
rados-bench?

I'm really excited about tiering; it will disaggregate the SSDs and allow
more flexibility in cephstore chassis selection because you no longer have
to maintain strict SSD:drive ratios - this seems like a much more elegant
and maintainable solution.


On Wed, Oct 9, 2013 at 3:45 PM, Warren Wang war...@wangspeed.com wrote:

 While in theory this should be true, I'm not finding it to be the case for
 a typical enterprise LSI card with 24 drives attached. We tried a variety
 of ratios and went back to collocated journals on the spinning drives.

 Eagerly awaiting the tiered performance changes to implement a faster tier
 via SSD.

 --
 Warren

 On Oct 9, 2013, at 5:52 PM, Kyle Bader kyle.ba...@gmail.com wrote:

 Journal on SSD should effectively double your throughput because data will
 not be written to the same device twice to ensure transactional integrity.
 Additionally, by placing the OSD journal on an SSD you should see less
 latency; the disk head no longer has to seek back and forth between the
 journal and data partitions. For large writes it's not as critical to
 have a device that supports high IOPs or throughput because large writes
 are striped across many 4MB rados objects, relatively evenly distributed
 across the cluster. Small write operations will benefit the most from an
 OSD data partition with a writeback cache like btier/flashcache because it
 can absorb an order of magnitude more IOPs and allow a slower spinning
 device to catch up when there is less activity.


 On Tue, Oct 8, 2013 at 12:09 AM, Robert van Leeuwen 
 robert.vanleeu...@spilgames.com wrote:

   I tried putting Flashcache on my spindle OSDs using an Intel SSL and
 it works great.
  This is getting me read and write SSD caching instead of just write
 performance on the journal.
  It should also allow me to protect the OSD journal on the same drive as
 the OSD data and still get benefits of SSD caching for writes.

 Small note that on Red Hat based distro's + Flashcache + XFS:
 There is a major issue (kernel panics) running xfs + flashcache on a 6.4
 kernel. (anything higher then 2.6.32-279)
 It should be fixed in kernel 2.6.32-387.el6 which, I assume, will be 6.5
 which only just entered Beta.

 Fore more info, take a look here:
 https://github.com/facebook/flashcache/issues/113

 Since I've hit this issue (thankfully in our dev environment) we are
 slightly less enthusiastic about running flashcache :(
 It also adds a layer of complexity so I would rather just run the
 journals on SSD, at least on Redhat.
 I'm not sure about the performance difference of just journals v.s.
 Flashcache but I'd be happy to read any such comparison :)

 Also, if you want to make use of the SSD trim func

 P.S. My experience with Flashcache is on Openstack Swift & Nova not Ceph.



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --

 Kyle

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Expanding ceph cluster by adding more OSDs

2013-10-10 Thread Kyle Bader
I've contracted and expanded clusters by up to a rack of 216 OSDs - 18
nodes, 12 drives each. New disks are configured with a CRUSH weight of 0
and I slowly add weight (in increments of 0.01 to 0.1), wait for the
cluster to become active+clean, and then add more weight. I was expanding
after a contraction so my PG count didn't need to be corrected; I tend to
be liberal and opt for more PGs. If I hadn't contracted the cluster prior
to expanding it I would probably add PGs after all the new OSDs have
finished being weighted into the cluster.
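
The mechanics are just repeated CRUSH reweights per new OSD, along the
lines of the following (the id and step size are only examples, and we
weight roughly 1.0 per TB):

  ceph osd crush reweight osd.216 0.05
  ceph -s          # wait until all PGs are active+clean again
  ceph osd crush reweight osd.216 0.10
  # ...and so on until the OSD reaches its full weight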


On Wed, Oct 9, 2013 at 8:55 PM, Michael Lowe j.michael.l...@gmail.com wrote:

 I had those same questions, I think the answer I got was that it was
 better to have too few pg's than to have overloaded osd's.  So add osd's
 then add pg's.  I don't know the best increments to grow in, probably
 depends largely on the hardware in your osd's.

 Sent from my iPad

  On Oct 9, 2013, at 11:34 PM, Guang yguan...@yahoo.com wrote:
 
  Thanks Mike. I get your point.
 
  There are still a few things confusing me:
   1) We expand Ceph cluster by adding more OSDs, which will trigger
 re-balance PGs across the old & new OSDs, and likely it will break the
 optimized PG numbers for the cluster.
2) We can add more PGs which will trigger re-balance objects across
 old & new PGs.
 
  So:
   1) What is the recommended way to expand the cluster by adding OSDs
 (and potentially adding PGs), should we do them at the same time?
   2) What is the recommended way to scale a cluster from like 1PB to 2PB,
 should we scale it to like 1.1PB to 1.2PB or move to 2PB directly?
 
  Thanks,
  Guang
 
  On Oct 10, 2013, at 11:10 AM, Michael Lowe wrote:
 
  There used to be, can't find it right now.  Something like 'ceph osd
 set pg_num num' then 'ceph osd set pgp_num num' to actually move your
 data into the new pg's.  I successfully did it several months ago, when
 bobtail was current.
 
  Sent from my iPad
 
  On Oct 9, 2013, at 10:30 PM, Guang yguan...@yahoo.com wrote:
 
  Thanks Mike.
 
  Is there any documentation for that?
 
  Thanks,
  Guang
 
  On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote:
 
  You can add PGs,  the process is called splitting.  I don't think PG
 merging, the reduction in the number of PGs, is ready yet.
 
  On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote:
 
  Hi ceph-users,
  Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas,
 per my understanding, the number of PGs for a pool should be fixed even we
 scale out / in the cluster by adding / removing OSDs, does that mean if we
 double the OSD numbers, the PG number for a pool is not optimal any more
 and there is no chance to correct it?
 
 
  Thanks,
  Guang
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-09 Thread Kyle Bader
Journal on SSD should effectively double your throughput because data will
not be written to the same device twice to ensure transactional integrity.
Additionally, by placing the OSD journal on an SSD you should see less
latency; the disk head no longer has to seek back and forth between the
journal and data partitions. For large writes it's not as critical to have
a device that supports high IOPs or throughput because large writes are
striped across many 4MB rados objects, relatively evenly distributed across
the cluster. Small write operations will benefit the most from an OSD data
partition with a writeback cache like btier/flashcache because it can
absorb an order of magnitude more IOPs and allow a slower spinning device
to catch up when there is less activity.


On Tue, Oct 8, 2013 at 12:09 AM, Robert van Leeuwen 
robert.vanleeu...@spilgames.com wrote:

   I tried putting Flashcache on my spindle OSDs using an Intel SSL and
 it works great.
  This is getting me read and write SSD caching instead of just write
 performance on the journal.
  It should also allow me to protect the OSD journal on the same drive as
 the OSD data and still get benefits of SSD caching for writes.

 Small note that on Red Hat based distro's + Flashcache + XFS:
 There is a major issue (kernel panics) running xfs + flashcache on a 6.4
 kernel. (anything higher then 2.6.32-279)
 It should be fixed in kernel 2.6.32-387.el6 which, I assume, will be 6.5
 which only just entered Beta.

 Fore more info, take a look here:
 https://github.com/facebook/flashcache/issues/113

 Since I've hit this issue (thankfully in our dev environment) we are
 slightly less enthusiastic about running flashcache :(
 It also adds a layer of complexity so I would rather just run the journals
 on SSD, at least on Redhat.
 I'm not sure about the performance difference of just journals v.s.
 Flashcache but I'd be happy to read any such comparison :)

 Also, if you want to make use of the SSD trim func

 P.S. My experience with Flashcache is on Openstack Swift & Nova not Ceph.



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Same journal device for multiple OSDs?

2013-10-09 Thread Kyle Bader
You can certainly use a similarly named device to back an OSD journal if
the OSDs are on separate hosts. If you want to take a single SSD device and
utilize it as a journal for many OSDs on the same machine then you would
want to partition the SSD device and use a different partition for each OSD
journal. You might consider using /dev/disk/by-id/foo instead of /dev/fioa1
to avoid potential device reordering issues after a reboot. Hope that
helps, sorry if I misunderstood.
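
If you do end up putting several OSDs on one host behind a single SSD,
in ceph.conf that would look something like this (the osd ids, partition
layout and by-id name below are made up):

  [osd.0]
  osd journal = /dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL-part1

  [osd.1]
  osd journal = /dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL-part2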


On Wed, Oct 9, 2013 at 7:03 AM, Andreas Bluemle andreas.blue...@itxperts.de
 wrote:

 Hi,

 to avoid confusion: the configuration did *not* contain
 multiple osds referring to the same journal device (or file).

 The snippet from ceph.conf suggests osd.214 and osd.314
 both use the same journal -
 but it doesn't show that these osds run on different hosts.


 Regards

 Andreas Bluemle


 On Wed, 9 Oct 2013 11:23:18 +0200
 Andreas Friedrich andreas.friedr...@ts.fujitsu.com wrote:

  Hello,
 
  I have a Ceph test cluster with 88 OSDs running well.
 
  In ceph.conf I found multiple OSDs that are using the same SSD block
  device (without a file system) for their journal:
 
  [osd.214]
osd journal = /dev/fioa1
...
  [osd.314]
osd journal = /dev/fioa1
...
 
  Is this a allowed configuration?
 
  Regards
  Andreas Friedrich
  --
  FUJITSU
  Fujitsu Technology Solutions GmbH
  Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany
  Tel: +49 (5251) 525-1512
  Fax: +49 (5251) 525-321512
  Email: andreas.friedr...@ts.fujitsu.com
  Web: ts.fujitsu.com
  Company details: de.ts.fujitsu.com/imprint
  --
 
 



 --
 Andreas Bluemle mailto:andreas.blue...@itxperts.de
 ITXperts GmbH   http://www.itxperts.de
 Balanstrasse 73, Geb. 08Phone: (+49) 89 89044917
 D-81541 Muenchen (Germany)  Fax:   (+49) 89 89044910

 Company details: http://www.itxperts.de/imprint.htm
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com