Re: [ceph-users] Ceph migration to AWS

2015-05-04 Thread Kyle Bader
To those interested in a tricky problem: we have a Ceph cluster running at one of our data centers. One of our clients requires being hosted at AWS. My question is: how do we effectively migrate the data from our internal Ceph cluster to an AWS Ceph cluster? Ideas currently on

Re: [ceph-users] xfs/nobarrier

2014-12-27 Thread Kyle Bader
Do people consider a UPS + shutdown procedure a suitable substitute? I certainly wouldn't; I've seen utility power fail and the transfer switch fail to transition to the UPS strings. Had this happened to me with nobarrier it would have been a very sad day. -- Kyle Bader
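
For context, the option being warned against here is the xfs nobarrier mount flag; a minimal fstab sketch (device and mount point are assumptions) showing the safe default next to the risky variant:

    # safe default - write barriers left enabled
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,inode64            0 0
    # risky - only defensible with non-volatile cache you fully trust
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,inode64,nobarrier  0 0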

Re: [ceph-users] private network - VLAN vs separate switch

2014-11-25 Thread Kyle Bader
For a large network (say 100 servers and 2500 disks), are there any strong advantages to using a separate switch and physical network instead of a VLAN? Physical isolation will ensure that congestion on one does not affect the other. On the flip side, asymmetric network failures tend to be more
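
Whichever way the links are physically separated, the split itself is declared in ceph.conf; a minimal sketch with assumed subnets:

    [global]
        public network  = 192.168.1.0/24   ; client-facing traffic
        cluster network = 192.168.2.0/24   ; replication/backfill traffic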

Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-08-06 Thread Kyle Bader
Can you paste me the whole output of the install? I am curious why/how you are getting el7 and el6 packages. priority=1 is required in the /etc/yum.repos.d/ceph.repo entries -- Kyle
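
A hedged sketch of the pinned repo entry (the baseurl shown is an assumption for el7, and the priority line requires the yum-plugin-priorities package to be installed):

    [ceph]
    name=Ceph packages for el7
    baseurl=http://ceph.com/rpm/el7/x86_64/
    enabled=1
    gpgcheck=1
    priority=1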

Re: [ceph-users] Is OSDs based on VFS?

2014-07-21 Thread Kyle Bader
I wonder whether OSDs use Virtual File System (VFS) system calls (i.e. open, read, write, etc.) when they access disks. I mean ... could I monitor the I/O commands an OSD issues to its disks by monitoring the VFS? Ceph OSDs run on top of a traditional filesystem, so long as it supports xattrs - xfs by
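
As a rough illustration of that filesystem layer, you can inspect the xattrs the OSD stores alongside each object on its backing filesystem; the path and object name below are placeholders:

    getfattr -d -m '.*' /var/lib/ceph/osd/ceph-0/current/0.1_head/<object-file>
    # typically shows user.ceph._ and user.cephos.* attributes set by the OSD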

Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Kyle Bader
TL;DR: Power outages are more common than your colo facility will admit. Seconded. I've seen power failures in at least 4 different facilities, and all of them had the usual gamut of batteries/generators/etc. In some of those facilities I've seen problems multiple times in a single year. Even a

Re: [ceph-users] Migrate whole clusters

2014-05-09 Thread Kyle Bader
Let's assume a test cluster is up and running with real data on it. Which is the best way to migrate everything to a production (and larger) cluster? I'm thinking of adding production MONs to the test cluster, then adding production OSDs to the test cluster, waiting for a full rebalance and

Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-17 Thread Kyle Bader
I think the timing should work that we'll be deploying with Firefly and so have Ceph cache pool tiering as an option, but I'm also evaluating Bcache versus Tier to act as node-local block cache device. Does anybody have real or anecdotal evidence about which approach has better
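
For reference, the Firefly-style cache tier setup under discussion looks roughly like this; the pool names are placeholders:

    ceph osd tier add cold-pool hot-ssd-pool
    ceph osd tier cache-mode hot-ssd-pool writeback
    ceph osd tier set-overlay cold-pool hot-ssd-pool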

Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-16 Thread Kyle Bader
Obviously the ssds could be used as journal devices, but I'm not really convinced whether this is worthwhile when all nodes have 1GB of hardware writeback cache (writes to journal and data areas on the same spindle have time to coalesce in the cache and minimise seek time hurt). Any advice on

Re: [ceph-users] question on harvesting freed space

2014-04-15 Thread Kyle Bader
I'm assuming Ceph/RBD doesn't have any direct awareness of this, since the filesystem doesn't traditionally have a "give back blocks" operation to the block device. Is there anything special RBD does in this case that communicates the release of the storage back to the Ceph pool? VMs running
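
Freed blocks can only flow back to the pool if discard/TRIM is passed through the whole stack; a sketch, assuming qemu with virtio-scsi and an image named rbd/vm-disk (libvirt has equivalent XML):

    -device virtio-scsi-pci,id=scsi0 \
    -drive file=rbd:rbd/vm-disk,format=raw,if=none,id=d0,discard=unmap \
    -device scsi-hd,drive=d0,bus=scsi0.0
    # inside the guest, release freed blocks back to the image periodically
    fstrim /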

Re: [ceph-users] What's the difference between using /dev/sdb and /dev/sdb1 as osd?

2014-03-22 Thread Kyle Bader
If I want to use a disk dedicated for osd, can I just use something like /dev/sdb instead of /dev/sdb1? Is there any negative impact on performance? You can pass /dev/sdb to ceph-disk-prepare and it will create two partitions, one for the journal (raw partition) and one for the data volume
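
A minimal sketch of handing the whole device to the tooling, as described above:

    ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdb
    # approximate result:
    #   /dev/sdb1  ceph data    (xfs, mounted under /var/lib/ceph/osd/)
    #   /dev/sdb2  ceph journal (raw partition)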

Re: [ceph-users] OSD + FlashCache vs. Cache Pool for RBD...

2014-03-22 Thread Kyle Bader
One downside of the above arrangement: I read that support for mapping newer-format RBDs is only present in fairly recent kernels. I'm running Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel. There is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking at
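
For reference, the "newer-format" images in question are the format 2 ones; a sketch of creating and mapping one (image name is a placeholder, and mapping it needs a recent enough kernel client):

    rbd create --size 10240 --image-format 2 rbd/test-image
    rbd map rbd/test-image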

Re: [ceph-users] Mounting with dmcrypt still fails

2014-03-22 Thread Kyle Bader
ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping (dm-crypt?): dm-0 It sounds like device-mapper still thinks it's using the volume, you might be able to
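
To find and clear a stale device-mapper mapping holding the partition, something along these lines (the mapping name is whatever dmsetup reports):

    dmsetup ls                      # list mappings and find the one backed by /dev/sdb2
    dmsetup info <mapping-name>
    dmsetup remove <mapping-name>   # or cryptsetup luksClose <mapping-name> for LUKS
    # then retry ceph-disk-prepare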

Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
Does anybody have a good practice for setting up a Ceph cluster behind a pair of load balancers? The only place you would want to put a load balancer in the context of a Ceph cluster is north of the RGW nodes. You can do L3 transparent load balancing or balance with an L7 proxy, i.e. Linux Virtual
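
A minimal haproxy sketch for the L7 case; addresses and the RGW port are assumptions (older FastCGI deployments usually listened on 80 behind Apache):

    frontend rgw_front
        mode http
        bind 0.0.0.0:80
        default_backend rgw_back
    backend rgw_back
        mode http
        balance roundrobin
        server rgw1 10.0.0.11:7480 check
        server rgw2 10.0.0.12:7480 check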

Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
You're right. Sorry, I didn't specify that I was trying this for radosgw. Even for this I'm seeing performance degrade once my clients start to hit the LB VIP. Could you tell us more about your load balancer and its configuration? -- Kyle

Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB

2014-03-12 Thread Kyle Bader
This is in my lab. Plain passthrough setup with automap enabled on the F5. S3 curl requests work fine as far as queries go, but the file transfer rate degrades badly once I start file up/downloads. Maybe the difference can be attributed to LAN client traffic with jumbo frames vs. the F5 using a smaller WAN

Re: [ceph-users] Encryption/Multi-tennancy

2014-03-11 Thread Kyle Bader
There could be millions of tenants. Looking deeper at the docs, it looks like Ceph prefers to have one OSD per disk. We're aiming at Backblaze-style pods, so we will be looking at 45 OSDs per machine, many machines. I want to separate the tenants and separately encrypt their data. The

Re: [ceph-users] Utilizing DAS on XEN or XCP hosts for Openstack Cinder

2014-03-11 Thread Kyle Bader
1. Is it possible to install Ceph and Ceph monitors on the XCP (Xen) Dom0, or would we need to install it on the DomU containing the OpenStack components? I'm not a Xen guru, but in the case of KVM I would run the OSDs on the hypervisor to avoid virtualization overhead. 2. Is

Re: [ceph-users] qemu-rbd

2014-03-11 Thread Kyle Bader
I tried rbd-fuse and its throughput using fio is approx. 1/4 that of the kernel client. Can you please let me know how to set up the RBD backend for fio? I'm assuming this RBD backend is also based on librbd? You will probably have to build fio from source since the rbd engine is new:
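
Once fio is built with rbd support, a job file for the librbd engine looks roughly like this; the pool, image and client names are assumptions:

    [rbd-test]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fio-test
    rw=randwrite
    bs=4k
    iodepth=32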

Re: [ceph-users] questions about ceph cluster in multi-dacenter

2014-02-20 Thread Kyle Bader
What could be the best replication? Are you using two sites to increase availability, durability, or both? For availability you're really better off using three sites and using CRUSH to place each of three replicas in a different datacenter. In this setup you can survive losing 1 of 3 datacenters.
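
A sketch of a CRUSH rule that places one replica in each of three datacenters; the bucket names and ruleset number are assumptions, and the CRUSH map must define datacenter buckets:

    rule replicated_3dc {
        ruleset 1
        type replicated
        min_size 3
        max_size 3
        step take default
        step chooseleaf firstn 0 type datacenter
        step emit
    }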

Re: [ceph-users] How client choose among replications?

2014-02-11 Thread Kyle Bader
Why would it help? Since it's not that ONE OSD will be primary for all objects - there will be 1 primary OSD per PG and you'll probably have a couple of thousand PGs. The primary may be across an oversubscribed/expensive link, in which case a local replica with a common ancestor to the client may

Re: [ceph-users] poor data distribution

2014-02-01 Thread Kyle Bader
Changing pg_num for .rgw.buckets to a power of 2 and 'crush tunables optimal' didn't help :( Did you bump pgp_num as well? The split PGs will stay in place until pgp_num is bumped too; if you do this, be prepared for potentially lots of data movement.
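
In other words, both values need the bump; a sketch assuming 2048 is the target:

    ceph osd pool set .rgw.buckets pg_num 2048
    ceph osd pool set .rgw.buckets pgp_num 2048   # this is what triggers the data movement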

Re: [ceph-users] Power Cycle Problems

2014-01-16 Thread Kyle Bader
That's a shame, but at least you will be better prepared if it happens again. Hopefully your luck is not as unfortunate as mine! -- Kyle Bader

Re: [ceph-users] Failure probability with largish deployments

2013-12-26 Thread Kyle Bader
Yes, that also makes perfect sense: the aforementioned 12500 objects for a 50GB image; at a 60 TB cluster/pool with 72 disks/OSDs and 3-way replication that makes 2400 PGs, following the recommended formula. What number of disks (OSDs) did you punch in for the following run? Disk
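
For reference, the arithmetic behind those two figures:

    objects per image:  50 GB / 4 MB per object ≈ 12,500 objects
    PG count:           (72 OSDs * 100) / 3 replicas = 2,400 PGs
    (in practice rounded to the nearest power of two)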

Re: [ceph-users] Networking questions

2013-12-26 Thread Kyle Bader
Do monitors have to be on the cluster network as well or is it sufficient for them to be on the public network as http://ceph.com/docs/master/rados/configuration/network-config-ref/ suggests? Monitors only need to be on the public network. Also would the OSDs re-route their traffic over the

Re: [ceph-users] Failure probability with largish deployments

2013-12-23 Thread Kyle Bader
Is an object a CephFS file or an RBD image, or is it the 4MB blob on the actual OSD FS? Objects are at the RADOS level; CephFS files, RBD images and RGW objects are all composed by striping over RADOS objects - the default object size is 4MB. In my case, I'm only looking at RBD images for KVM volume storage,

Re: [ceph-users] Ceph network topology with redundant switches

2013-12-20 Thread Kyle Bader
The area I'm currently investigating is how to configure the networking. To avoid a SPOF I'd like to have redundant switches for both the public network and the internal network, most likely running at 10Gb. I'm considering splitting the nodes in to two separate racks and connecting each half

Re: [ceph-users] radosgw daemon stalls on download of some files

2013-12-19 Thread Kyle Bader
Do you have any further detail on this radosgw bug? https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424 https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd Does it only apply to emperor? The bug is present in dumpling too.

[ceph-users] SysAdvent: Day 15 - Distributed Storage with Ceph

2013-12-15 Thread Kyle Bader
For your holiday pleasure I've prepared a SysAdvent article on Ceph: http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html Check it out! -- Kyle

Re: [ceph-users] Rbd image performance

2013-12-15 Thread Kyle Bader
Has anyone tried scaling a VM's IO by adding additional disks and striping them in the guest OS? I am curious what effect this would have on IO performance. Why would it? You can also change the stripe size of the RBD image. Depending on the workload you might change it from 4MB to something
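
The object/stripe parameters are set at image creation; a sketch with example values only:

    # larger default objects (order 23 = 8MB) instead of the 4MB default (order 22)
    rbd create --size 102400 --order 23 rbd/big-objects
    # finer-grained striping requires a format 2 image
    rbd create --size 102400 --image-format 2 \
        --stripe-unit 65536 --stripe-count 16 rbd/striped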

[ceph-users] NUMA and ceph

2013-12-12 Thread Kyle Bader
It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, it seems that if a NUMA zone is running out of memory due to uneven allocation, it is possible for the zone to enter reclaim mode when threads/processes are scheduled on a core in that zone and those
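
One common mitigation (offered as an assumption, not something verified in this thread) is to disable zone reclaim and/or interleave the daemon's allocations across nodes:

    sysctl vm.zone_reclaim_mode=0
    # or launch the daemon with interleaved memory allocation
    numactl --interleave=all ceph-osd -i 0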

Re: [ceph-users] ceph reliability in large RBD setups

2013-12-10 Thread Kyle Bader
I've been running similar calculations recently. I've been using this tool from Inktank to calculate RADOS reliabilities with different assumptions: https://github.com/ceph/ceph-tools/tree/master/models/reliability But I've also had similar questions about RBD (or any multi-part files

Re: [ceph-users] Anybody doing Ceph for OpenStack with OSDs across compute/hypervisor nodes?

2013-12-09 Thread Kyle Bader
We're running OpenStack (KVM) with local disk for ephemeral storage. Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich for IOPS and have 20GE across the board. Some recent patches in OpenStack Havana make it possible to use Ceph RBD as the source of ephemeral VM
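
With those Havana-era patches applied, the relevant nova.conf knobs look something like the following; the option names shifted between releases and the pool name is a placeholder, so treat this as a sketch:

    [DEFAULT]
    libvirt_images_type = rbd
    libvirt_images_rbd_pool = vms
    libvirt_images_rbd_ceph_conf = /etc/ceph/ceph.conf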

Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-06 Thread Kyle Bader
looking at tcpdump all the traffic is going exactly where it is supposed to go, in particular an osd on the 192.168.228.x network appears to talk to an osd on the 192.168.229.x network without anything strange happening. I was just wondering if there was anything about ceph that could make

Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-04 Thread Kyle Bader
Is having two cluster networks like this a supported configuration? Every osd and mon can reach every other so I think it should be. Maybe. If your back end network is a supernet and each cluster network is a subnet of that supernet. For example: Ceph.conf cluster network (supernet):

Re: [ceph-users] optimal setup with 4 x ethernet ports

2013-12-02 Thread Kyle Bader
Is having two cluster networks like this a supported configuration? Every osd and mon can reach every other so I think it should be. Maybe. If your back end network is a supernet and each cluster network is a subnet of that supernet. For example: Ceph.conf cluster network (supernet): 10.0.0.0/8
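
That is, something along these lines, with example addresses: the declared cluster network is the supernet, and each physical segment is a subnet inside it.

    [global]
        cluster network = 10.0.0.0/8   ; supernet
    # e.g. one set of OSDs lives in 10.1.1.0/24 and another in 10.1.2.0/24,
    # both of which fall inside the declared supernet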

Re: [ceph-users] installing OS on software RAID

2013-11-30 Thread Kyle Bader
Is the OS doing anything apart from Ceph? Would booting a ramdisk-only system from USB or compact flash work? I haven't tested this kind of configuration myself but I can't think of anything that would preclude this type of setup. I'd probably use squashfs layered with a tmpfs via aufs to avoid

Re: [ceph-users] Impact of fancy striping

2013-11-30 Thread Kyle Bader
This journal problem is a bit of wizardry to me, I even had weird intermittent issues with OSDs not starting because the journal was not found, so please do not hesitate to suggest a better journal setup. You mentioned using SAS for journals; if your OSDs are SATA and an expander is in the data

Re: [ceph-users] 回复:Re: testing ceph performance issue

2013-11-27 Thread Kyle Bader
How much performance can be improved if use SSDs to storage journals? You will see roughly twice the throughput unless you are using btrfs (still improved but not as dramatic). You will also see lower latency because the disk head doesn't have to seek back and forth between journal and data

Re: [ceph-users] OSD on an external, shared device

2013-11-26 Thread Kyle Bader
Is there any way to manually configure which OSDs are started on which machines? The osd configuration block includes the osd name and host, so is there a way to say that, say, osd.0 should only be started on host vashti and osd.1 should only be started on host zadok? I tried using this
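
For reference, the per-OSD block being referred to looks like this, using the hostnames from the question:

    [osd.0]
        host = vashti
    [osd.1]
        host = zadok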

Re: [ceph-users] installing OS on software RAID

2013-11-25 Thread Kyle Bader
Several people have reported issues with combining OS and OSD journals on the same SSD drives/RAID due to contention. If you do something like this I would definitely test to make sure it meets your expectations. Ceph logs are going to compose the majority of the writes to the OS storage devices.

Re: [ceph-users] Running on disks that lose their head

2013-11-07 Thread Kyle Bader
Once I know a drive has had a head failure, do I trust that the rest of the drive isn't going to go at an inconvenient moment vs just fixing it right now when it's not 3AM on Christmas morning? (true story) As good as Ceph is, do I trust that Ceph is smart enough to prevent spreading

Re: [ceph-users] radosgw questions

2013-11-07 Thread Kyle Bader
1. To build a high performance yet cheap radosgw storage, which pools should be placed on ssd and which on hdd backed pools? Upon installation of radosgw, it created the following pools: .rgw, .rgw.buckets, .rgw.buckets.index, .rgw.control, .rgw.gc, .rgw.root, .usage, .users, .users.email.
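
Once you have a CRUSH rule targeting SSD-backed OSDs, pointing the hot pools at it is one command per pool; the ruleset number here is an assumption:

    ceph osd pool set .rgw.buckets.index crush_ruleset 4
    ceph osd pool set .rgw crush_ruleset 4
    # leave .rgw.buckets (the bulk object data) on the HDD ruleset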

Re: [ceph-users] ceph cluster performance

2013-11-07 Thread Kyle Bader
ST240FN0021 connected via a SAS2x36 to an LSI 9207-8i. The problem might be SATA transport protocol overhead at the expander. Have you tried directly connecting the SSDs to SATA2/3 ports on the mainboard? -- Kyle

Re: [ceph-users] Running on disks that lose their head

2013-11-07 Thread Kyle Bader
Zackc, Loicd, and I have been the main participants in a weekly Teuthology call the past few weeks. We've talked mostly about methods to extend Teuthology to capture performance metrics. Would you be willing to join us during the Teuthology and Ceph-Brag sessions at the Firefly Developer

Re: [ceph-users] Ceph User Committee

2013-11-07 Thread Kyle Bader
I think this is a great idea. One of the big questions users have is what kind of hardware should I buy. An easy way for users to publish information about their setup (hardware, software versions, use-case, performance) when they have successful deployments would be very valuable. Maybe a

Re: [ceph-users] Ceph User Committee

2013-11-07 Thread Kyle Bader
Would this be something like http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ? Something very much like that :) -- Kyle

Re: [ceph-users] Ceph node Info

2013-10-30 Thread Kyle Bader
The quick start guide is linked below; it should help you hit the ground running. http://ceph.com/docs/master/start/quick-ceph-deploy/ Let us know if you have questions or bump into trouble!

Re: [ceph-users] ceph recovery killing vms

2013-10-29 Thread Kyle Bader
Recovering from a degraded state by copying existing replicas to other OSDs is going to cause reads on existing replicas and writes to the new locations. If you have slow media then this is going to be felt more acutely. Tuning the backfill options I posted is one way to lessen the impact, another

Re: [ceph-users] ceph recovery killing vms

2013-10-28 Thread Kyle Bader
You can change some OSD tunables to lower the priority of backfills: osd recovery max chunk: 8388608 osd recovery op priority: 2 In general a lower op priority means it will take longer for your placement groups to go from degraded to active+clean, the idea is to balance
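
Applied to a live cluster with injectargs, roughly (or persist the same values under [osd] in ceph.conf):

    ceph tell osd.* injectargs '--osd-recovery-max-chunk 8388608 --osd-recovery-op-priority 2'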

Re: [ceph-users] changing journals post-bobcat?

2013-10-28 Thread Kyle Bader
The bobtail release added udev/upstart capabilities that allowed you to not have per-OSD entries in ceph.conf. Under the covers the new udev/upstart scripts look for a special label on OSD data volumes; matching volumes are mounted and then a few files are inspected: journal_uuid and whoami. The
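
You can inspect those files on any mounted OSD data volume; the path below assumes the default layout:

    cat /var/lib/ceph/osd/ceph-0/whoami          # the OSD id
    cat /var/lib/ceph/osd/ceph-0/journal_uuid    # partition the journal symlink resolves to
    ls -l /var/lib/ceph/osd/ceph-0/journal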

Re: [ceph-users] Hardware: SFP+ or 10GBase-T

2013-10-24 Thread Kyle Bader
I know that 10GBase-T has more delay than SFP+ with direct attached cables (roughly 2.6 usec vs 0.3 usec per link), but does that matter? Some sites say it is a huge hit, but we are talking usec, not ms, so I find it hard to believe that it causes that much of an issue. I like the lower cost and use

Re: [ceph-users] CephFS Project Manila (OpenStack)

2013-10-23 Thread Kyle Bader
This is going to get horribly ugly when you add Neutron into the mix, so much so I'd consider this option a non-starter. If someone is using Open vSwitch to create network overlays to isolate each tenant, I can't imagine this ever working. I'm not following here. Is this only needed if Ceph

Re: [ceph-users] mounting RBD in linux containers

2013-10-17 Thread Kyle Bader
My first guess would be that it's due to LXC dropping capabilities, I'd investigate whether CAP_SYS_ADMIN is being dropped. You need CAP_SYS_ADMIN for mount and block ioctls, if the container doesn't have those privs a map will likely fail. Maybe try tracing the command with strace? On Thu, Oct
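
To test the capability theory, something along these lines from inside the container (image name is a placeholder):

    # confirm whether cap_sys_admin survived the container's capability drop
    capsh --print | grep cap_sys_admin
    # trace the map to see which syscall/ioctl returns EPERM
    strace -f rbd map rbd/test-image 2>&1 | grep -i eperm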

Re: [ceph-users] Ceph configuration data sharing requirements

2013-10-17 Thread Kyle Bader
* The IP address of at least one MON in the Ceph cluster If you configure nodes with a single monitor in the mon hosts directive then I believe your nodes will have issues if that one monitor goes down. With Chef I've gone back and forth between using Chef search and having monitors be
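
Listing every monitor rather than just one avoids that single point of failure; a ceph.conf sketch with assumed names and addresses:

    [global]
        mon initial members = mon-a, mon-b, mon-c
        mon host = 10.0.0.1,10.0.0.2,10.0.0.3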

Re: [ceph-users] Speed limit on RadosGW?

2013-10-14 Thread Kyle Bader
I've personally saturated 1Gbps links on multiple radosgw nodes on a large cluster, if I remember correctly, Yehuda has tested it up into the 7Gbps range with 10Gbps gear. Could you describe your clusters hardware and connectivity? On Mon, Oct 14, 2013 at 3:34 AM, Chu Duc Minh

Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-10 Thread Kyle Bader
On Oct 9, 2013, at 5:52 PM, Kyle Bader wrote: Journal on SSD should effectively double your throughput because data will not be written to the same device twice to ensure transactional integrity. Additionally, by placing the OSD journal on an SSD you

Re: [ceph-users] Expanding ceph cluster by adding more OSDs

2013-10-10 Thread Kyle Bader
I've contracted and expanded clusters by up to a rack of 216 OSDs - 18 nodes, 12 drives each. New disks are configured with a CRUSH weight of 0 and I slowly add weight (0.1 to 0.01 increments), wait for the cluster to become active+clean and then add more weight. I was expanding after contraction
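
The incremental weighting described above is just repeated crush reweight calls; a sketch with an assumed OSD id and host:

    ceph osd crush add osd.216 0 host=node19     # new OSD enters the map at weight 0
    ceph osd crush reweight osd.216 0.1          # bump, wait for active+clean
    ceph osd crush reweight osd.216 0.2          # ...and so on up to the target weight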

Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-09 Thread Kyle Bader
Journal on SSD should effectively double your throughput because data will not be written to the same device twice to ensure transactional integrity. Additionally, by placing the OSD journal on an SSD you should see lower latency, since the disk head no longer has to seek back and forth between the

Re: [ceph-users] Same journal device for multiple OSDs?

2013-10-09 Thread Kyle Bader
You can certainly use a similarly named device to back an OSD journal if the OSDs are on separate hosts. If you want to take a single SSD device and use it as a journal for many OSDs on the same machine, then you would want to partition the SSD and use a different partition for each OSD
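
A sketch of carving one SSD into per-OSD journal partitions; device names and sizes are assumptions:

    sgdisk -n 1:0:+10G /dev/sdk   # journal partition for one OSD
    sgdisk -n 2:0:+10G /dev/sdk   # journal partition for another
    # then point each OSD's data disk at its own journal partition
    ceph-disk-prepare --fs-type xfs /dev/sdb /dev/sdk1
    ceph-disk-prepare --fs-type xfs /dev/sdc /dev/sdk2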