Re: [ceph-users] Ceph migration to AWS
> To those interested in a tricky problem: we have a Ceph cluster running at one of our data centers. One of our client's requirements is to be hosted at AWS. My question is: how do we effectively migrate the data on our internal Ceph cluster to an AWS Ceph cluster? Ideas currently on the table:
>
> 1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum at AWS, then sever the connection between AWS and our data center.

I would highly discourage this.

> 2. Build a Ceph cluster at AWS and send snapshots from our data center to our AWS cluster, allowing us to migrate to AWS.

This sounds far more sensible. I'd look at the I2 (IOPS) or D2 (density) class instances, depending on use case.

-- Kyle
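The snapshot approach maps naturally onto RBD export/import. A minimal sketch, assuming RBD images, an "rbd" pool on both sides, an "aws-gw" host that can reach the AWS cluster, and an rbd build recent enough to export/import via stdin - image and host names are illustrative:

  # one-time full copy, then a matching base snapshot on both ends
  rbd export rbd/vm01 - | ssh aws-gw rbd import - rbd/vm01
  rbd snap create rbd/vm01@base
  ssh aws-gw rbd snap create rbd/vm01@base

  # repeatable incremental passes until cutover
  rbd snap create rbd/vm01@sync1
  rbd export-diff --from-snap base rbd/vm01@sync1 - | ssh aws-gw rbd import-diff - rbd/vm01

Each incremental pass only ships blocks changed since the previous snapshot, so the final cutover window can be kept short.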
Re: [ceph-users] xfs/nobarrier
> Do people consider a UPS + shutdown procedures a suitable substitute?

I certainly wouldn't. I've seen utility power fail and the transfer switch fail to transition to the UPS strings. Had this happened to me with nobarrier it would have been a very sad day.

-- Kyle Bader
Re: [ceph-users] private network - VLAN vs separate switch
> For a large network (say 100 servers and 2500 disks), are there any strong advantages to using a separate switch and physical network instead of a VLAN?

Physical isolation ensures that congestion on one network does not affect the other. On the flip side, asymmetric network failures tend to be more difficult to troubleshoot, e.g. a backend failure with a functional front end. That said, in a pinch you can switch to using the front-end network for both until you can repair the backend.

> Also, how difficult would it be to switch from a VLAN to separate switches later?

It should be relatively straightforward: configure the VLAN/subnets on the new physical switches and move links over one by one. Once all the links are moved, remove the VLAN and subnets that now live on the new kit from the original hardware.

-- Kyle
Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install
> Can you paste me the whole output of the install? I am curious why/how you are getting el7 and el6 packages.

priority=1 is required in the /etc/yum.repos.d/ceph.repo entries so the Ceph repositories take precedence over packages from the distribution and EPEL repositories.

-- Kyle
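For reference, a sketch of a ceph.repo entry with the priority set - the baseurl is illustrative, and the yum-plugin-priorities package must be installed for priority= to take effect:

  [ceph]
  name=Ceph packages for $basearch
  baseurl=http://ceph.com/rpm/el7/$basearch
  enabled=1
  gpgcheck=1
  gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
  priority=1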
Re: [ceph-users] Is OSDs based on VFS?
> I wonder whether OSDs use the system calls of the Virtual File System (i.e. open, read, write, etc.) when they access disks. I mean, could I monitor the I/O commands an OSD issues to its disks by monitoring the VFS?

Ceph OSDs run on top of a traditional filesystem - xfs by default, anything works as long as it supports xattrs. As such, you can use kernel instrumentation to view what is going on under the Ceph OSDs.

-- Kyle
Re: [ceph-users] Journal SSD durability
> TL;DR: Power outages are more common than your colo facility will admit.

Seconded. I've seen power failures in at least 4 different facilities, and all of them had the usual gamut of batteries/generators/etc. At some of those facilities I've seen problems multiple times in a single year. Even a datacenter with five-nines power availability is going to see 5 minutes of downtime per year, and that would qualify for the highest rating from the Uptime Institute (Tier IV)! I've lost power to Ceph clusters on several occasions; in every case the journals were on spinning media.

-- Kyle
Re: [ceph-users] Migrate whole clusters
> Let's assume a test cluster is up and running with real data on it. What is the best way to migrate everything to a production (and larger) cluster? I'm thinking of adding production MONs to the test cluster, then adding production OSDs to the test cluster, waiting for a full rebalance, and then starting to remove the test OSDs and test MONs. This should migrate everything with no outage.

It's possible and I've done it; this was around the Argonaut/Bobtail timeframe on a pre-production cluster. If your cluster has a lot of data then it may take a long time or be disruptive, so make sure you've tested that your recovery tunables are suitable for your hardware configuration.

-- Kyle
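A sketch of conservative recovery tunables in ceph.conf - the values are a starting point to test against your hardware, not a universal recommendation:

  [osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 2

Lower values lengthen the rebalance but leave more headroom for client I/O while the production OSDs fill.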
Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache
> I think the timing should work out that we'll be deploying with Firefly and so have Ceph cache pool tiering as an option, but I'm also evaluating bcache versus a tier to act as a node-local block cache device. Does anybody have real or anecdotal evidence about which approach has better performance? New idea that is dependent on the failure behaviour of the cache tier...

The problem with this type of configuration is that it ties a VM to a specific hypervisor. In theory it should be faster because you don't have the network latency from round trips to the cache tier, resulting in higher IOPS. Large sequential workloads may achieve higher throughput by parallelizing across many OSDs in a cache tier, whereas local flash would be limited to single-device throughput.

> Ah, I was ambiguous. When I said node-local I meant OSD-local. So I'm really looking at: a 2-copy write-back object ssd cache-pool, versus an OSD write-back ssd block-cache, versus a 1-copy write-around object cache-pool + ssd journals.

Ceph cache pools allow you to scale the size of the cache pool independently of the underlying storage and avoid the constraints on disk:ssd ratios that come with flashcache, bcache, etc. Local block caches should have lower latency than a cache tier on a cache miss, because the cache tier adds extra hop(s) across the network. I would lean towards Ceph's cache tiers for the scaling independence.

> This is undoubtedly true for a write-back cache-tier. But in the scenario I'm suggesting, a write-around cache, that needn't be bad news - if a cache-tier OSD is lost, the cache simply got smaller and some cached objects were unceremoniously flushed. The next read on those objects should just miss and bring them into the now-smaller cache. The thing I'm trying to avoid with the above is double read-caching of objects (so as to get more aggregate read cache). I assume the standard wisdom with write-back cache-tiering is that the backing data pool shouldn't bother with ssd journals?

Currently, all cache tiers need to be durable, regardless of cache mode. As such, cache tiers should be replicated with at least N+1 copies (I'd recommend N+2, i.e. 3x replication). Ceph could potentially do what you describe in the future; it just doesn't yet.

-- Kyle
Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache
> Obviously the ssds could be used as journal devices, but I'm not really convinced whether this is worthwhile when all nodes have 1GB of hardware writeback cache (writes to journal and data areas on the same spindle have time to coalesce in the cache and minimise seek-time hurt). Any advice on this?

All writes need to be written to the journal before being written to the data volume, so it's going to impact your overall throughput and cause seeking; a hardware cache will only help with the latter (unless you use btrfs).

> I think the timing should work out that we'll be deploying with Firefly and so have Ceph cache pool tiering as an option, but I'm also evaluating bcache versus a tier to act as a node-local block cache device. Does anybody have real or anecdotal evidence about which approach has better performance? New idea that is dependent on the failure behaviour of the cache tier...

The problem with this type of configuration is that it ties a VM to a specific hypervisor. In theory it should be faster because you don't have the network latency from round trips to the cache tier, resulting in higher IOPS. Large sequential workloads may achieve higher throughput by parallelizing across many OSDs in a cache tier, whereas local flash would be limited to single-device throughput.

> Carve the ssds 4 ways: each with 3 partitions for journals servicing the backing data pool, and a fourth, larger partition serving a write-around cache tier with only 1 object copy. Thus both reads and writes hit ssd, but the ssd capacity is not halved by replication for availability. ...The crux is how the current implementation behaves in the face of cache-tier OSD failures?

Cache tiers are durable by way of replication; OSDs will remap degraded placement groups and backfill as appropriate. With single-replica cache pools the loss of an OSD becomes a real concern - in the case of RBD this means losing arbitrary chunk(s) of your block devices. Bad news. If you want host independence, durability and speed, your best bet is a replicated cache pool (2-3x).

-- Kyle
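For anyone experimenting along these lines, a minimal sketch of setting up a Firefly-era write-back cache tier - pool names and the PG count are illustrative:

  ceph osd pool create cache-pool 512
  ceph osd pool set cache-pool size 3
  ceph osd tier add data-pool cache-pool
  ceph osd tier cache-mode cache-pool writeback
  ceph osd tier set-overlay data-pool cache-pool

Note that the 1-copy write-around behaviour discussed above isn't just a matter of cache-mode flags; it would also depend on the durability changes Kyle mentions.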
Re: [ceph-users] question on harvesting freed space
> I'm assuming Ceph/RBD doesn't have any direct awareness of this, since the filesystem doesn't traditionally have a "give back blocks" operation to the block device. Is there anything special RBD does in this case that communicates the release of the storage back to the pool?

VMs running a 3.2+ kernel (iirc) can give back blocks by issuing TRIM: http://wiki.qemu.org/Features/QED/Trim

-- Kyle
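A sketch of wiring discard through to RBD, assuming Qemu/KVM with a virtio-scsi disk - device IDs and the image spec are examples:

  qemu-system-x86_64 ... \
    -device virtio-scsi-pci \
    -drive file=rbd:rbd/vm01,format=raw,if=none,id=d0,discard=unmap \
    -device scsi-hd,drive=d0

  # inside the guest: release freed blocks, either periodically...
  fstrim -v /
  # ...or continuously, via the discard mount option
  mount -o discard /dev/sda1 /mnt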
Re: [ceph-users] What's the difference between using /dev/sdb and /dev/sdb1 as osd?
> If I want to use a disk dedicated to an osd, can I just use something like /dev/sdb instead of /dev/sdb1? Is there any negative impact on performance?

You can pass /dev/sdb to ceph-disk-prepare and it will create two partitions: one for the journal (a raw partition) and one for the data volume (formatted xfs by default). This is known as a single-device OSD, in contrast with a multi-device OSD where the journal is on a completely different device (such as a partition on a shared journaling SSD).

-- Kyle
Re: [ceph-users] OSD + FlashCache vs. Cache Pool for RBD...
> One downside of the above arrangement: I read that support for mapping newer-format RBDs is only present in fairly recent kernels. I'm running Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel. There is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking at a new deployment it might be better to wait until 14.04: then you'll get kernel 3.13. Anyone else have any ideas on the above?

I don't think there are any hairy udev issues or similar that will make using a newer kernel on precise problematic. The only caveat of this kind of setup I can think of is that if you lose a hypervisor the cache will go with it, and you likely won't be able to migrate the guest to another host. The alternative is to use flashcache on top of the OSD partition, but then you introduce network hops; that is closer to what the tiering feature will offer, except that the flashcache-on-OSD method is more particular about the disk:ssd ratio, whereas in a tier the flash could be on completely separate hosts (possibly dedicated flash machines).

-- Kyle
Re: [ceph-users] Mounting with dmcrypt still fails
> ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb
> ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping (dm-crypt?): dm-0

It sounds like device-mapper still thinks it's using the volume. You might be able to track it down with this:

  for i in `ls -1 /sys/block/ | grep sd`; do echo $i: `ls /sys/block/$i/${i}1/holders/`; done

Then it's a matter of making sure there are no open file handles on the encrypted volume and unmounting it. You will still need to completely clear out the partition table on that disk, which can be tricky with GPT because it's not as simple as dd'ing over the start of the volume. This is what the zap-disk option to ceph-disk-prepare is for; I don't know enough about ceph-deploy to know if you can somehow pass it. Once you know the device/dm mapping you can use udevadm to find out where it should map to (uuids replaced with x's):

  udevadm test /block/sdc/sdc1
  <snip>
  run: '/sbin/cryptsetup --key-file /etc/ceph/dmcrypt-keys/x --key-size 256 create /dev/sdc1'
  run: '/bin/bash -c 'while [ ! -e /dev/mapper/x ];do sleep 1; done''
  run: '/usr/sbin/ceph-disk-activate /dev/mapper/x'

-- Kyle
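If it helps, a sketch of tearing down the stale mapping and clearing the GPT label - mapping and device names are examples, and both of the last two commands are destructive:

  dmsetup ls                      # find the stale dm-crypt mapping name
  dmsetup remove <mapping-name>   # release /dev/sdb2
  sgdisk --zap-all /dev/sdb       # wipe GPT (and protective MBR) structures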
Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB
> Does anybody have a good practice for setting up a ceph cluster behind a pair of load balancers?

The only place you would want to put a load balancer in the context of a Ceph cluster is north of the RGW nodes. You can do L3 transparent load balancing, or balance with an L7 proxy such as Linux Virtual Server or HAProxy/Nginx. The other components of Ceph are horizontally scalable, and because of the way Ceph's native protocols work you don't need load balancers doing L2/L3/L7 tricks to achieve HA.

-- Kyle
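A minimal HAProxy sketch for the RGW case - backend addresses and timeouts are examples:

  defaults
      mode http
      timeout connect 5s
      timeout client  30s
      timeout server  30s

  frontend rgw_front
      bind *:80
      default_backend rgw_back

  backend rgw_back
      balance leastconn
      server rgw1 10.0.0.11:80 check
      server rgw2 10.0.0.12:80 check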
Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB
> You're right. Sorry, I didn't specify that I was trying this for radosgw. Even for this, I'm seeing performance degrade once my clients start to hit the LB VIP.

Could you tell us more about your load balancer and its configuration?

-- Kyle
Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB
> This is in my lab: a plain passthrough setup with automap enabled on the F5. S3 curl requests work fine as far as queries go, but the transfer rate degrades badly once I start file up/downloads.

Maybe the difference can be attributed to LAN client traffic using jumbo frames versus the F5 using a smaller WAN MTU?

-- Kyle
Re: [ceph-users] Encryption/Multi-tennancy
> There could be millions of tenants. Looking deeper at the docs, it looks like Ceph prefers to have one OSD per disk. We're aiming at Backblaze-style pods, so we will be looking at 45 OSDs per machine, many machines. I want to separate the tenants and separately encrypt their data. The encryption will be provided by us, but I was originally intending to have passphrase-based encryption, and use programmatic means to either hash the passphrase or/and encrypt it using the same passphrase. This way, we wouldn't be able to access the tenant's data, or the key for the passphrase, although we'd still be able to store both.

The way I see it, you have several options:

1. Encrypted OSDs. Preserves confidentiality in the event someone gets physical access to a disk, whether through theft or RMA. Requires the tenant to trust the provider.

     vm
     rbd
     rados
     osd    <-- encryption here
     disks

2. Whole-disk VM encryption. Preserves confidentiality in the event someone gets physical access to a disk, whether through theft or RMA. Key custody can be split several ways:

     tenant: key/passphrase, provider: nothing
     tenant: passphrase, provider: key
     tenant: nothing, provider: key

     vm     <-- encryption here
     rbd
     rados
     osd
     disks

3. Encryption further up the stack (at the application, perhaps?).

To me, #1 and #2 are identical except that in the case of #2 the data stays protected when the rbd volume is not attached to a VM. Block devices attached to a VM and mounted will be decrypted, making the encryption only useful for defending against unauthorized access to storage media. With a different key per VM, and potentially millions of tenants, you now have a massive key escrow/management problem that only buys you a bit of additional security while block devices are detached. Sounds like a crappy deal to me; I'd go with either #1 or #3.

-- Kyle
Re: [ceph-users] Utilizing DAS on XEN or XCP hosts for Openstack Cinder
> 1. Is it possible to install Ceph and Ceph monitors on the XCP (Xen) Dom0, or would we need to install it on the DomU containing the OpenStack components?

I'm not a Xen guru, but in the case of KVM I would run the OSDs on the hypervisor to avoid virtualization overhead.

> 2. Is Ceph server-aware or rack-aware, so that replicas are not stored on the same server?

Yes, placement is defined by your CRUSH map and placement rules.

> 3. Are 4TB OSDs too large? We are attempting to restrict the quantity of OSDs per server to minimise system overhead.

Nope!

> Any other feedback regarding our plan would also be welcomed.

I would probably run each disk as its own OSD, which means you need a bit more memory per host. Networking could certainly be a bottleneck with 8-16 spindle nodes. YMMV.

-- Kyle
Re: [ceph-users] qemu-rbd
> I tried rbd-fuse, and its throughput under fio is approx. 1/4 that of the kernel client. Can you please let me know how to set up the RBD backend for fio? I'm assuming this RBD backend is also based on librbd?

You will probably have to build fio from source, since the rbd engine is new: https://github.com/axboe/fio

Assuming you already have a cluster and a client configured, this should do the trick: https://github.com/axboe/fio/blob/master/examples/rbd.fio

-- Kyle
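For reference, a job file along the lines of the linked example - the image must already exist (rbd create), and the pool/image/client names are assumptions:

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio_test
  rw=randwrite
  bs=4k

  [rbd_iodepth32]
  iodepth=32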
Re: [ceph-users] questions about ceph cluster in multi-dacenter
> What could be the best replication setup?

Are you using two sites to increase availability, durability, or both? For availability you're really better off using three sites and letting CRUSH place each of three replicas in a different datacenter; in that setup you can survive losing 1 of 3 datacenters. If two sites are the only option and your goal is both availability and durability, then I would do 4 replicas with osd_pool_default_min_size = 2.

> How should the crushmap be tuned for this kind of setup? And a last question: is it possible to have reads from VMs in DC1 always read data in DC1?

Not yet!

-- Kyle
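A sketch of the three-site CRUSH rule described above, assuming the CRUSH hierarchy already contains datacenter buckets - the rule name and ruleset number are illustrative:

  rule replicated_3dc {
      ruleset 1
      type replicated
      min_size 3
      max_size 3
      step take default
      step chooseleaf firstn 0 type datacenter
      step emit
  }

  # then point a pool at it
  ceph osd pool set rbd crush_ruleset 1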
Re: [ceph-users] How client choose among replications?
> Why would it help? It's not as though ONE OSD will be primary for all objects; there will be 1 primary OSD per PG, and you'll probably have a couple of thousand PGs.

The primary may be across an oversubscribed/expensive link, in which case a local replica with a common ancestor to the client may be preferable. It's WIP, with the goal of landing in Firefly iirc.
Re: [ceph-users] poor data distribution
> Changing pg_num for .rgw.buckets to a power of 2, and 'crush tunables optimal', didn't help :(

Did you bump pgp_num as well? The split PGs will stay in place until pgp_num is bumped too; if you do this, be prepared for (potentially lots of) data movement.
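For concreteness, a sketch with an assumed target of 2048 PGs:

  ceph osd pool set .rgw.buckets pg_num 2048    # splits placement groups
  ceph osd pool set .rgw.buckets pgp_num 2048   # actually rebalances data onto the new PGs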
Re: [ceph-users] Power Cycle Problems
> On two separate occasions I have lost power to my Ceph cluster. Both times, I had trouble bringing the cluster back to good health. I am wondering if I need to configure something that would solve this problem?

No special configuration should be necessary. I've had the unfortunate luck of witnessing several power loss events with large Ceph clusters, and in each case something other than Ceph was the source of frustration once power was returned. That said, monitor daemons should be started first and must form a quorum before the cluster will be usable. It sounds like you made it that far, since you're getting output from ceph health commands. The next step is to get your Ceph OSD daemons running, which requires the data partitions to be mounted and the journal devices present. In Ubuntu installations this is handled by udev scripts installed by the Ceph packages (I think this may be true for RHEL/CentOS but have not verified). Short of the udev method, you can mount the data partition manually. Once the data partition is mounted you can start the OSDs manually in the event that init still doesn't work after mounting; to do so you will need to know the location of your keyring, your ceph.conf and the OSD id. If you are unsure what the OSD id is, look at the root of the OSD data partition, once it is mounted, in a file named whoami. To manually start:

  /usr/bin/ceph-osd -i ${OSD_ID} --pid-file /var/run/ceph/osd.${OSD_ID}.pid -c /etc/ceph/ceph.conf

After that, it's a matter of examining the logs if you're still having issues getting the OSDs to boot.

> After powering back up the cluster, "ceph health" revealed stale PGs, mds cluster degraded, 3/3 OSDs down. I tried to issue "sudo /etc/init.d/ceph -a start" but I got no output from the command and the health status did not change.

The placement groups are stale because none of the OSDs have reported their state recently - they are down.

> I ended up having to re-install the cluster to fix the issue, but as my group wants to use Ceph for VM storage in the future, we need to find a solution.

That's a shame, but at least you will be better prepared if it happens again. Hopefully your luck is not as unfortunate as mine!

-- Kyle Bader
Re: [ceph-users] Failure probability with largish deployments
> Yes, that also makes perfect sense: the aforementioned 12500 objects for a 50GB image, on a 60 TB cluster/pool with 72 disks/OSDs and 3-way replication, makes 2400 PGs, following the recommended formula. What number of disks (OSDs) did you punch in for the following run?
>
>   Disk Modeling Parameters
>     size:      3TiB
>     FIT rate:  826 (MTBF = 138.1 years)
>     NRE rate:  1.0E-16
>   RADOS parameters
>     auto mark-out:  10 minutes
>     recovery rate:  50MiB/s (40 seconds/drive)
>
> Blink??? I guess that goes back to the number of disks, but to restore 2.25TB at 50MB/s with 40 seconds per drive...

The surviving replicas for placement groups that the failed OSD participated in will naturally be distributed across many OSDs in the cluster; when the failed OSD is marked out, its replicas will be remapped to many OSDs. It's not a 1:1 replacement like you might find in a RAID array.

> I completely get that part; however, the total amount of data to be rebalanced after a single disk/OSD failure to fully restore redundancy is still 2.25TB (mistyped that as GB earlier) at the 75% utilization you assumed. What I'm still missing in this picture is how many disks (OSDs) you calculated this with. Maybe I'm just misreading the "40 seconds per drive" bit, because if that means each drive only needs to be active for 40 seconds to do its bit of recovery, we're talking 1100 drives. ^o^ 1100 PGs would be another story.

To recreate the modeling:

  git clone https://github.com/ceph/ceph-tools.git
  cd ceph-tools/models/reliability/
  python main.py -g

I used the following values:

  Disk Type: Enterprise
  Size: 3000 GiB
  Primary FITs: 826
  Secondary FITs: 826
  NRE Rate: 1.0E-16
  RAID Type: RAID6
  Replace (hours): 6
  Rebuild (MiB/s): 500
  Volumes: 11
  RADOS Copies: 3
  Mark-out (min): 10
  Recovery (MiB/s): 50
  Space Usage: 75%
  Declustering (pg): 1100
  Stripe length: 1100 (limited by pgs anyway)
  RADOS sites: 1
  Rep Latency (s): 0
  Recovery (MiB/s): 10
  Disaster (years): 1000
  Site Recovery (days): 30
  NRE Model: Fail
  Period (years): 1
  Object Size: 4MB

It seems that the number of disks is not considered when calculating the recovery window, only the number of pgs:

https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L68

I could also see the recovery rates varying based on the "osd max backfills" tunable:

http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling

Doing both would improve the quality of the models generated by the tool.

-- Kyle
Re: [ceph-users] Networking questions
> Do monitors have to be on the cluster network as well, or is it sufficient for them to be on the public network, as http://ceph.com/docs/master/rados/configuration/network-config-ref/ suggests?

Monitors only need to be on the public network.

> Also, would the OSDs re-route their traffic over the public network, if that was still available, in case the cluster network fails?

Ceph doesn't currently support this type of configuration.

Hope that clears things up!

-- Kyle
Re: [ceph-users] Failure probability with largish deployments
> Is an object a CephFS file or an RBD image, or is it the 4MB blob on the actual OSD filesystem?

Objects are at the RADOS level; CephFS filesystems, RBD images and RGW objects are all composed by striping RADOS objects - 4MB by default.

> In my case, I'm only looking at RBD images for KVM volume storage. Even given the default striping configuration, I would assume that those 12500 objects for a 50GB image would not be in the same PG and thus on just 3 (with 3 replicas set) OSDs total?

Objects are striped across placement groups, so you take your RBD size / 4 MB and cap it at the total number of placement groups in your cluster.

> What number of disks (OSDs) did you punch in for the following run?
>
>   Disk Modeling Parameters
>     size:      3TiB
>     FIT rate:  826 (MTBF = 138.1 years)
>     NRE rate:  1.0E-16
>   RADOS parameters
>     auto mark-out:  10 minutes
>     recovery rate:  50MiB/s (40 seconds/drive)
>
> Blink??? I guess that goes back to the number of disks, but to restore 2.25GB at 50MB/s with 40 seconds per drive...

The surviving replicas for placement groups that the failed OSD participated in will naturally be distributed across many OSDs in the cluster; when the failed OSD is marked out, its replicas will be remapped to many OSDs. It's not a 1:1 replacement like you might find in a RAID array.

>   osd fullness:  75%
>   declustering:  1100 PG/OSD
>   NRE model:     fail
>   object size:   4MB
>   stripe length: 1100
>
> I take that to mean that any RBD volume of sufficient size is indeed spread over all disks?

Spread over all placement groups - the difference is subtle, but there is a difference.

-- Kyle
Re: [ceph-users] Ceph network topology with redundant switches
> The area I'm currently investigating is how to configure the networking. To avoid a SPOF I'd like to have redundant switches for both the public network and the internal network, most likely running at 10Gb. I'm considering splitting the nodes into two separate racks and connecting each half to its own switch, then trunking the switches together to allow the two halves of the cluster to see each other. The idea being that if a single switch fails I'd only lose half of the cluster.

This is fine if you are using a replication factor of 2; you would need 2/3 of the cluster surviving if using a replication factor of 3 with osd pool default min size set to 2.

> My question is about configuring the public network. If it's all one subnet then the clients consuming the Ceph resources can't have both links active, so they'd be configured in an active/standby role. But this results in quite heavy usage of the trunk between the two switches when a client accesses nodes on the switch other than the one it's actively connected to.

The Linux bonding driver supports several strategies for teaming network adapters on L2 networks.

> So, can I configure multiple public networks? I think so, based on the documentation, but I'm not completely sure. Can I have one half of the cluster on one subnet, and the other half on another? Then the client machine can have interfaces in different subnets and do the right thing with both interfaces to talk to all the nodes. This seems like a fairly simple solution that avoids a SPOF in Ceph or the network layer.

You can have multiple networks for both the public and cluster networks; the only restriction is that all subnets of a given type be within the same supernet. For example:

  10.0.0.0/16 - public supernet (configured in ceph.conf)
  10.0.1.0/24 - public network, rack 1
  10.0.2.0/24 - public network, rack 2
  10.1.0.0/16 - cluster supernet (configured in ceph.conf)
  10.1.1.0/24 - cluster network, rack 1
  10.1.2.0/24 - cluster network, rack 2

> Or maybe I'm missing an alternative that would be better? I'm aiming for something that keeps things as simple as possible while meeting the redundancy requirements. As an aside, there's a similar issue on the cluster network side with heavy traffic on the trunk between the two cluster switches. But I can't see that that's avoidable, and presumably it's something people just have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in each rack, which means every write is going to cross the trunk. Other traffic that you will see on the trunk:

  * OSDs gossiping with one another
  * OSD/monitor traffic where an OSD is connected to a monitor in the adjacent rack (map updates, heartbeats)
  * OSD/client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR switches) then your cluster-level bandwidth is oversubscribed 4:1. To lower oversubscription you are going to have to steal some of the other 48 ports: 12 for 2:1, 24 for a non-blocking fabric. Given the number of nodes you have/plan to have, you will be utilizing 6-12 links per switch, leaving you with 12-18 links for clients on a non-blocking fabric, 24-30 at 2:1, and 36-48 at 4:1.

-- Kyle
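A ceph.conf sketch matching the supernet example above - the addresses are, of course, illustrative:

  [global]
      public network  = 10.0.0.0/16
      cluster network = 10.1.0.0/16

Each daemon then binds to whichever local address falls inside the configured supernet, regardless of which rack-level /24 it lives in.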
Re: [ceph-users] radosgw daemon stalls on download of some files
> Do you have any further detail on this radosgw bug?

https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424
https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd

> Does it only apply to emperor?

The bug is present in dumpling too.
[ceph-users] SysAdvent: Day 15 - Distributed Storage with Ceph
For your holiday pleasure, I've prepared a SysAdvent article on Ceph: http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html

Check it out!

-- Kyle
Re: [ceph-users] Rbd image performance
> Has anyone tried scaling a VM's I/O by adding additional disks and striping them in the guest OS? I am curious what effect this would have on I/O performance?

> Why would it? You can also change the stripe size of the RBD image. Depending on the workload you might change it from 4MB to something like 1MB or 32MB. That would give you more or fewer RADOS objects, which will also give you a different I/O pattern.

> The question comes up because it's common for people operating on EC2 to stripe EBS volumes together for higher IOPS rates. I've tried striping kernel RBD volumes before but hit some sort of thread limitation where throughput was consistent despite the volume count. I've since learned the thread limit is configurable.

I don't think there is a thread limit that needs to be tweaked for RBD via KVM/QEMU, but I haven't tested this empirically. As Wido mentioned, if you are operating your own cluster, configuring the stripe size may achieve similar results. Google used to use a 64MB chunk size with GFS but switched to 1MB after they started supporting more and more seek-heavy workloads.
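A sketch of creating an image with smaller RADOS objects instead of striping in the guest - the image name and size are examples:

  rbd create rbd/vm01 --size 20480 --order 20   # 2^20 = 1MB objects (the default is --order 22, i.e. 4MB)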
[ceph-users] NUMA and ceph
It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running out of memory due to uneven allocation, it is possible for the zone to enter reclaim mode when threads/processes scheduled on a core in that zone request memory allocations greater than the zone's remaining memory. For the kernel to satisfy those allocations, it needs to page out some of the contents of the contentious zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues:

1. Set the vm.zone_reclaim_mode sysctl to 0, and prefix ceph-osd daemons with numactl --interleave=all. This should probably be activated by a flag in /etc/default/ceph and by modifying the ceph-osd.conf upstart script, along with adding a dependency on the numactl package to the ceph package's debian/rules file.

2. Use a cgroup for each ceph-osd daemon, pinning each one to cores in the same NUMA zone using cpuset.cpus and cpuset.mems. This would probably also live in /etc/default/ceph and the upstart scripts.

-- Kyle
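A hand-run sketch of both approaches - the core/node numbers are illustrative and assume osd.0 should live on NUMA node 0 (cores 0-7):

  # approach 1: interleave allocations across zones
  sysctl -w vm.zone_reclaim_mode=0
  numactl --interleave=all /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf

  # approach 2: pin the daemon to one zone with a cpuset cgroup
  mkdir -p /sys/fs/cgroup/cpuset/ceph-osd-0
  echo 0-7 > /sys/fs/cgroup/cpuset/ceph-osd-0/cpuset.cpus
  echo 0   > /sys/fs/cgroup/cpuset/ceph-osd-0/cpuset.mems
  echo $(pidof -s ceph-osd) > /sys/fs/cgroup/cpuset/ceph-osd-0/tasks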
Re: [ceph-users] ceph reliability in large RBD setups
> I've been running similar calculations recently. I've been using this tool from Inktank to calculate RADOS reliabilities with different assumptions: https://github.com/ceph/ceph-tools/tree/master/models/reliability But I've also had similar questions about RBD (or any multi-part files stored in RADOS) - naively, a file/device stored in N objects would be N times less reliable than a single object. But I hope there's an error in that logic.

It's worth pointing out that Ceph's RGW will actually stripe S3 objects across many RADOS objects even when it's not a multi-part upload; this has been the case since the Bobtail release. There is an in-depth Google paper about availability modeling that might provide some insight into what the math should look like: http://research.google.com/pubs/archive/36737.pdf When reading it, you can think of objects as chunks and pgs as stripes. CRUSH should be configured based on the failure domains that cause correlated failures, i.e. power and networking. You also want to consider the availability of the facility itself:

> Typical availability estimates used in the industry range from 99.7% availability for tier II datacenters to 99.98% and 99.995% for tiers III and IV, respectively.

http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006

If you combine the cluster availability metric and the facility availability metric, you might be surprised: a cluster with 99.995% availability in a Tier II facility is going to be dragged down to 99.7% availability. If a cluster goes down in the forest, does anyone know?
Re: [ceph-users] Anybody doing Ceph for OpenStack with OSDs across compute/hypervisor nodes?
> We're running OpenStack (KVM) with local disk for ephemeral storage. Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich for IOPS, and we have 20GE across the board. Some recent patches in OpenStack Havana make it possible to use Ceph RBD as the source of ephemeral VM storage, so I'm interested in the potential for clustered storage across our hypervisors for this purpose. Any experience out there?

I believe Piston converges their storage/compute; they refer to it as a null-tier architecture: http://www.pistoncloud.com/openstack-cloud-software/technology/#storage

-- Kyle
Re: [ceph-users] optimal setup with 4 x ethernet ports
> Looking at tcpdump, all the traffic is going exactly where it is supposed to go; in particular, an osd on the 192.168.228.x network appears to talk to an osd on the 192.168.229.x network without anything strange happening. I was just wondering if there was anything about ceph that could make this non-optimal, assuming traffic was reasonably balanced between all the osds (e.g. all the same weights). I think the only time it would suffer is if writes to other osds result in a replica write to a single osd, and even then a single OSD is still limited to 7200RPM disk speed anyway, so the loss isn't going to be that great.

Should be fine, given you only have a 1:1 ratio of links to disks.

> I think I'll be moving over to a bonded setup anyway, although I'm not sure if rr or lacp is best... rr will give the best potential throughput, but lacp should give similar aggregate throughput if there are plenty of connections going on, and less cpu load since there is no need to reassemble fragments.

One of the DreamHost clusters is using a pair of bonded 1GbE links on the public network and another pair for the cluster network; we configured each to use mode 802.3ad.

-- Kyle
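For reference, an /etc/network/interfaces sketch of an 802.3ad bond on Ubuntu - it requires the ifenslave package and LACP configured on the switch, and the addresses are examples:

  auto bond0
  iface bond0 inet static
      address 192.168.228.10
      netmask 255.255.255.0
      bond-slaves eth0 eth1
      bond-mode 802.3ad
      bond-miimon 100
      bond-xmit-hash-policy layer3+4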
Re: [ceph-users] optimal setup with 4 x ethernet ports
> > Is having two cluster networks like this a supported configuration? Every osd and mon can reach every other, so I think it should be.
>
> Maybe - if your back-end network is a supernet and each cluster network is a subnet of that supernet. For example:
>
>   ceph.conf cluster network (supernet): 10.0.0.0/8
>   Cluster network #1: 10.1.1.0/24
>   Cluster network #2: 10.1.2.0/24
>
> With that configuration OSD address autodetection *should* just work.

It should work, but thinking more about it, the OSDs will likely be assigned IPs on a single network - whichever is inspected first and matches the supernet range (which could be either subnet). In order to have OSDs on two distinct networks you will likely have to be declarative in /etc/ceph/ceph.conf, listing the IP address for each OSD (making sure to balance them between links).

> > 1. Move osd traffic to eth1. This obviously limits maximum throughput to ~100Mbytes/second, but I'm getting nowhere near that right now anyway.
>
> Given three links I would probably do this if your replication factor is >= 3. Keep in mind 100Mbps links could very well end up being a limiting factor.

Sorry, I read Mbytes as Mbps - big difference; the former is much preferable!

-- Kyle
Re: [ceph-users] optimal setup with 4 x ethernet ports
> Is having two cluster networks like this a supported configuration? Every osd and mon can reach every other, so I think it should be.

Maybe - if your back-end network is a supernet and each cluster network is a subnet of that supernet. For example:

  ceph.conf cluster network (supernet): 10.0.0.0/8
  Cluster network #1: 10.1.1.0/24
  Cluster network #2: 10.1.2.0/24

With that configuration OSD address autodetection *should* just work.

> 1. Move osd traffic to eth1. This obviously limits maximum throughput to ~100Mbytes/second, but I'm getting nowhere near that right now anyway.

Given three links I would probably do this if your replication factor is >= 3. Keep in mind 100Mbps links could very well end up being a limiting factor. What are you backing each OSD with storage-wise, and how many OSDs do you expect to participate in this cluster?
Re: [ceph-users] installing OS on software RAID
> Is the OS doing anything apart from ceph? Would booting a ramdisk-only system from USB or compact flash work?

I haven't tested this kind of configuration myself, but I can't think of anything that would preclude it. I'd probably use squashfs layered with a tmpfs via aufs to avoid any writes to the USB drive. I would also mount spinning high-capacity media for /var/log, or set up log streaming to something like rsyslog/syslog-ng/logstash.
Re: [ceph-users] Impact of fancy striping
> This journal problem is a bit of wizardry to me. I even had weird intermittent issues with OSDs not starting because the journal was not found, so please do not hesitate to suggest a better journal setup.

You mentioned using SAS for journals: if your OSDs are SATA and an expander is in the data path, it might be slow due to MUX/STP/etc. overhead. If the setup is all SAS, you might try collocating each journal with its matching data partition on a single disk - two spindles being contended for by 9 OSDs is a lot. How are your drives attached?
Re: [ceph-users] 回复:Re: testing ceph performance issue
> How much can performance be improved by using SSDs to store journals?

You will see roughly twice the throughput, unless you are using btrfs (still improved, but not as dramatic). You will also see lower latency, because the disk head doesn't have to seek back and forth between the journal and data partitions.

> Kernel RBD driver - what is this?

There are several RBD implementations: one is the kernel RBD driver in upstream Linux, another is built into Qemu/KVM.

> And we want to know whether RBD supports Xen virtualization.

It is possible, but not nearly as well tested and not as prevalent as RBD via Qemu/KVM. This might be a starting point if you're interested in testing Xen/RBD integration: http://wiki.xenproject.org/wiki/Ceph_and_libvirt_technology_preview

Hope that helps!

-- Kyle
Re: [ceph-users] OSD on an external, shared device
> Is there any way to manually configure which OSDs are started on which machines? The osd configuration block includes the osd name and host, so is there a way to say that, say, osd.0 should only be started on host vashti and osd.1 should only be started on host zadok? I tried using this configuration:

The ceph udev rules are going to automatically mount disks that match the ceph magic guids; to dig through the full logic, inspect these files:

  /lib/udev/rules.d/60-ceph-partuuid-workaround.rules
  /lib/udev/rules.d/95-ceph-osd.rules

The upstart scripts look to see what is mounted under /var/lib/ceph/osd/ and start osd daemons as appropriate:

  /etc/init/ceph-osd-all-starter.conf

In theory you should be able to remove the udev scripts and mount the osds in /var/lib/ceph/osd if you're using upstart. You will want to make sure that upgrades to the ceph package don't replace the files; maybe that means making a null rule and using -o Dpkg::Options::='--force-confold' in ceph-deploy/chef/puppet/whatever. You will also want to avoid putting the mounts in fstab, because that could render your node unbootable if the device or filesystem fails.

-- Kyle
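A sketch of the null-rule idea - this relies on udev preferring a file in /etc/udev/rules.d/ over a packaged rule of the same name in /lib/udev/rules.d/:

  touch /etc/udev/rules.d/95-ceph-osd.rules   # an empty file masks the packaged rule
  udevadm control --reload-rules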
Re: [ceph-users] installing OS on software RAID
Several people have reported issues with combining OS and OSD journals on the same SSD drives/RAID due to contention. If you do something like this, definitely test to make sure it meets your expectations. Ceph logs will compose the majority of the writes to the OS storage devices.

On Mon, Nov 25, 2013 at 12:46 PM, James Harper <james.har...@bendigoit.com.au> wrote:

> > We need to install the OS on the 3TB harddisks that come with our Dell servers. (After many attempts, I've discovered that Dell servers won't allow attaching an external harddisk via the PCIe slot. I've tried everything.) But must I therefore sacrifice two hard disks (RAID-1) for the OS? I don't see why I can't just create a small partition (~30GB) on all 6 of my hard disks, do a software-based RAID 1 on it, and be done. I know that software-based RAID-5 seems computationally expensive, but shouldn't RAID 1 be fast and computationally inexpensive for a computer built in the last 4 years? I wouldn't think that a Ceph system (with lots of VMs but few data changes) would do much writing to the OS partition, but I'm not sure. (And in the past, I have noticed that RAID5 systems did suck up a lot of CPU and caused lots of waits, unlike what the blogs implied. But I'm thinking that RAID 1 takes little CPU and the OS does little writing to disk; it's mostly reads, which should hit the RAM.) Does anyone see any holes in the above idea? Any gut instincts? (I would try it, but it's hard to tell how well the system would really behave under real load conditions without some degree of experience and/or strong theoretical knowledge.)
>
> Is the OS doing anything apart from ceph? Would booting a ramdisk-only system from USB or compact flash work? If the OS doesn't produce a lot of writes then having it on the main disk should work okay. I've done it exactly as you describe before.
>
> James

-- Kyle
Re: [ceph-users] Running on disks that lose their head
> Once I know a drive has had a head failure, do I trust that the rest of the drive isn't going to go at an inconvenient moment vs. just fixing it right now, when it's not 3AM on Christmas morning? (True story.) As good as Ceph is, do I trust that Ceph is smart enough to prevent spreading corrupt data all over the cluster if I leave bad disks in place and they start doing terrible things to the data? I have a lot more disks than I have trust in disks. If a drive has lost a head then I want it gone.

I love the idea of using SMART data but can foresee some implementation issues. We have seen some raid configurations where polling SMART will halt all raid operations momentarily. Also, some controllers require you to use their CLI tool to poll SMART, versus smartmontools. It would be similarly awesome to embed something like an apdex score against each osd, especially if it factored in the hierarchy to identify poorly performing osds, nodes, racks, etc.

-- Kyle
Re: [ceph-users] radosgw questions
> 1. To build a high-performance yet cheap radosgw storage, which pools should be placed on ssd-backed pools and which on hdd-backed pools? Upon installation of radosgw, it created the following pools: .rgw, .rgw.buckets, .rgw.buckets.index, .rgw.control, .rgw.gc, .rgw.root, .usage, .users, .users.email.

There is a lot that goes into high performance; a few questions come to mind: Do you want high-performance reads, writes or both? How hot is your data - can you get better performance by buying more memory for caching? What size objects do you expect to handle, and how many per bucket?

> 4. Which replication count would you suggest? In other words, what replication is needed to achieve 99.9% durability, as DreamObjects states?

DreamObjects engineer here; we used Ceph's durability modeling tools: https://github.com/ceph/ceph-tools You will need to research your data disks' MTBF numbers and convert them to FITs, measure your OSD backfill MTTR, and factor in your replication count. DreamObjects uses 3 replicas on enterprise SAS disks. The durability figures exclude black swan events like fires and other such datacenter or regional disasters, which is why having a second location is important for DR.

> 5. Is it possible to map a fqdn custom domain to a bucket, not only subdomains?

You could map a domain's A/AAAA records to an endpoint, but if the endpoint changes you're SOL; using a CNAME at the domain root violates DNS RFCs. Some DNS providers will fake a CNAME by doing a recursive lookup in response to an A/AAAA request as a workaround.

-- Kyle
Re: [ceph-users] ceph cluster performance
> ST240FN0021 connected via a SAS2x36 to an LSI 9207-8i.

The problem might be SATA transport protocol overhead at the expander. Have you tried directly connecting the SSDs to SATA2/3 ports on the mainboard?

-- Kyle
Re: [ceph-users] Running on disks that lose their head
> Zackc, Loicd, and I have been the main participants in a weekly Teuthology call the past few weeks. We've talked mostly about methods to extend Teuthology to capture performance metrics. Would you be willing to join us during the Teuthology and Ceph-Brag sessions at the Firefly Developer Summit?

I'd be happy to!

-- Kyle
Re: [ceph-users] Ceph User Committee
> I think this is a great idea. One of the big questions users have is what kind of hardware they should buy. An easy way for users to publish information about their setups (hardware, software versions, use-case, performance) when they have successful deployments would be very valuable. Maybe a section of the wiki?

It would be interesting to have a site where a Ceph admin can download an API key/package that could be optionally installed and report configuration information to a community API; the admin could then supplement or correct that base information. Having much of the data collection be automated lowers the barrier to contribution. Bonus points if this could be extended to SMART data and failed drives, so we could have a community-generated report similar to the disk population study Google presented at FAST '07.

-- Kyle
Re: [ceph-users] Ceph User Committee
> Would this be something like http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ?

Something very much like that :)

-- Kyle
Re: [ceph-users] Ceph node Info
The quick start guide is linked below; it should help you hit the ground running: http://ceph.com/docs/master/start/quick-ceph-deploy/

Let us know if you have questions or bump into trouble!
Re: [ceph-users] ceph recovery killing vms
Recovering from a degraded state by copying existing replicas to other OSDs is going to cause reads on the existing replicas and writes to the new locations. If you have slow media then this is going to be felt more acutely. Tuning the backfill options I posted is one way to lessen the impact; another option is to slowly lower the CRUSH weight of the OSD(s) you want to remove. Hopefully that helps!

-- Kyle
Re: [ceph-users] ceph recovery killing vms
You can change some OSD tunables to lower the priority of backfills:

  osd recovery max chunk = 8388608
  osd recovery op priority = 2

In general, a lower op priority means it will take longer for your placement groups to go from degraded to active+clean; the idea is to balance recovery time against not starving client requests. I've found 2 to work well on our clusters, YMMV.

On Mon, Oct 28, 2013 at 10:16 AM, Kevin Weiler <kevin.wei...@imc-chicago.com> wrote:

> Hi all,
>
> We have a ceph cluster being used as a backing store for several VMs (Windows and Linux). We notice that when we reboot a node, the cluster enters a degraded state (which is expected), but when it begins to recover, it starts backfilling and kills the performance of our VMs. The VMs run slowly, or not at all, and also seem to switch their ceph mounts to read-only. I was wondering 2 things:
>
> 1. Shouldn't we be recovering instead of backfilling? It seems like backfilling is a much more intensive operation.
> 2. Can we improve the recovery/backfill performance so that our VMs don't go down when there is a problem with the cluster?
>
> -- Kevin Weiler, IT
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.wei...@imc-chicago.com

-- Kyle
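These can also be applied at runtime, without restarting daemons - a sketch using injectargs with the values above:

  ceph tell osd.* injectargs '--osd-recovery-op-priority 2 --osd-recovery-max-chunk 8388608'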
Re: [ceph-users] changing journals post-bobcat?
The Bobtail release added udev/upstart capabilities that allow you to omit per-OSD entries from ceph.conf. Under the covers, the new udev/upstart scripts look for a special label on OSD data volumes; matching volumes are mounted and then a few files are inspected:

  journal_uuid
  whoami

The journal_uuid is the uuid of the journal device for that OSD; whoami indicates the OSD number the data volume belongs to. This thread might be helpful for changing the journal device: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/005162.html

On Mon, Oct 28, 2013 at 11:39 AM, John Kinsella <j...@stratosec.co> wrote:

> Hey folks - looking around, I see plenty (OK, some) on how to modify journal size and location for older ceph, when ceph.conf was used (I think the switch from ceph.conf to storing osd/journal config elsewhere happened with Bobtail?). I recently deployed a cluster with ceph-deploy on 0.67 and wanted to change the journal size for the OSDs. Is this a remove/re-create procedure now, or is there an easier way?
>
> John

-- Kyle
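Building on that thread, a sketch of moving osd.0's journal to a new device - the partition uuid is a placeholder, and the OSD must be stopped first:

  stop ceph-osd id=0                          # upstart; or: service ceph stop osd.0
  ceph-osd -i 0 --flush-journal               # drain outstanding journal entries
  rm /var/lib/ceph/osd/ceph-0/journal
  ln -s /dev/disk/by-partuuid/<new-uuid> /var/lib/ceph/osd/ceph-0/journal
  echo <new-uuid> > /var/lib/ceph/osd/ceph-0/journal_uuid
  ceph-osd -i 0 --mkjournal                   # initialize the new journal
  start ceph-osd id=0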
Re: [ceph-users] Hardware: SFP+ or 10GBase-T
> I know that 10GBase-T has more delay than SFP+ with direct-attach cables (2.6 usec vs 0.3 usec per link), but does that matter? Some sites say it is a huge hit, but we are talking usec, not ms, so I find it hard to believe that it causes that much of an issue. I like the lower cost and use of standard cabling vs SFP+, but I don't want to sacrifice performance.

If you are talking about the links from the nodes with OSDs to their ToR switches, then I would suggest going with Twinax cables. Twinax doesn't go very far, but it's really durable and uses less power than 10GBase-T. Here's a blog post that goes into more detail: http://etherealmind.com/difference-twinax-category-6-10-gigabit-ethernet/ I would probably go with the Arista 7050-S over the 7050-T, using Twinax for the ToR-to-OSD-node links and SFP+ SR uplinks to spine switches if you need longer runs.

-- Kyle
Re: [ceph-users] CephFS Project Manila (OpenStack)
> > This is going to get horribly ugly when you add Neutron into the mix - so much so I'd consider this option a non-starter. If someone is using Open vSwitch to create network overlays to isolate each tenant, I can't imagine this ever working.
>
> I'm not following here. Is this only needed if ceph shares the same subnet as the VM? I don't know much about how these things work, but I would expect it to be possible to route IP traffic from a guest network to the storage network (or anywhere else, for that matter)... That aside, however, I think it would be a mistake to take the availability of cephfs vs nfs clients as a reason alone for a particular architectural approach. One of the whole points of ceph is that we ignore legacy when it doesn't make sense. (Hence rbd, not iscsi; cephfs, not [p]nfs.)

In an overlay world, physical VLANs have no relation to virtual networks. An overlay literally encapsulates layer 2 inside layer 3, adds a VNI (virtual network identifier), and uses tunnels (VxLAN, STT, GRE, etc.) to connect VMs on disparate hypervisors that may not even have L2 connectivity to each other. One of the core tenets of virtual networks is giving tenants the ability to have overlapping RFC 1918 addressing; in this case you could have tenants already using the addresses assigned to the NFS storage at the physical layer. Even if we could pretend that would never happen (namespaces or jails, maybe?), you would still need to provision a distinct NFS IP per tenant and run a virtual switch that supports both the tunneling protocol used by the overlay and the southbound API used by that overlay's virtual switch to insert/remove flow information. The only alternative to embedding a myriad of virtual-switch protocols on the filer head would be to use a VTEP-capable switch for encapsulation; I think only 1-2 vendors ship these - Arista's 7150 and something in the Cumulus lineup. Even if you could get past all this, I'm somewhat terrified by the proposition of connecting the storage fabric to a tenant network, an especially acute concern for public clouds. Here's a good RFC draft on overlays if anyone is in dire need of bedtime reading: http://tools.ietf.org/html/draft-mity-nvo3-use-case-04

-- Kyle
Re: [ceph-users] mounting RBD in linux containers
My first guess would be that it's due to LXC dropping capabilities; I'd investigate whether CAP_SYS_ADMIN is being dropped. You need CAP_SYS_ADMIN for mount and block ioctls, so if the container doesn't have those privs a map will likely fail. Maybe try tracing the command with strace? (A short sketch follows this message.) On Thu, Oct 17, 2013 at 2:45 PM, Kevin Weiler kevin.wei...@imc-chicago.com wrote: Hi all, We're trying to mount an rbd image inside of a linux container that has been created with docker (https://www.docker.io/). We seem to have access to the rbd kernel module from inside the container: # lsmod | grep ceph libceph 218854 1 rbd libcrc32c 12603 3 xfs,libceph,dm_persistent_data And we can query the pool for available rbds and create rbds from inside the container: # rbd -p dockers --id dockers --keyring /etc/ceph/ceph.client.dockers.keyring create lxctest --size 51200 # rbd -p dockers --id dockers --keyring /etc/ceph/ceph.client.dockers.keyring ls lxctest But for some reason, we can't seem to map the device to the container: # rbd -p dockers --id dockers --keyring /etc/ceph/ceph.client.dockers.keyring map lxctest rbd: add failed: (22) Invalid argument I don't see anything particularly interesting in dmesg or messages on either the container or the host box. Any ideas on how to troubleshoot this? Thanks! -- *Kevin Weiler* IT IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/ Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.wei...@imc-chicago.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
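A short sketch of how one might confirm the capability theory, assuming a shell inside the container and that capsh (part of libcap) is installed there:

# inside the container: check whether CAP_SYS_ADMIN survived
capsh --print | grep -i sys_admin
# trace the failing map to see which syscall comes back with EINVAL/EPERM
strace -f -o /tmp/rbd-map.trace rbd -p dockers --id dockers --keyring /etc/ceph/ceph.client.dockers.keyring map lxctest

Running the container with docker's --privileged flag is a blunt way to test whether dropped capabilities are the culprit.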
Re: [ceph-users] Ceph configuration data sharing requirements
* The IP address of at least one MON in the Ceph cluster If you configure nodes with a single monitor in the mon hosts directive then I believe your nodes will have issues if that one monitor goes down. With Chef I've gone back and forth between using Chef search and having monitors be declarative. Chef search is problematic if you are not declarative about how many monitors to expect; you could end up with 3 monitors and 3 single-monitor quorums during initial cluster creation. If cephx is disabled: * no other requirement If cephx is enabled: * an admin user is created by providing a keyring file with its description when the first MON is bootstrapped http://ceph.com/docs/next/dev/mon-bootstrap/ * users must be created by injecting them into the MONs, for instance with auth import https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h#L162 or auth add. There is no need to ask the MONs for a key, although it can be done; it is not a requirement. A user's capabilities can be set when it is created or later on. * an osd must be created by the mon, which returns a unique osd ID that is then used to further configure the osd. https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h#L471 * a client must be given a user id and a secret key It would also be helpful to better understand why people are happy with the way ceph-deploy currently works and how it deals with these requirements. I haven't used ceph-deploy, but I did write a chef cookbook before ceph-deploy was a thing. You will want to get the OSD bootstrap key from one of the monitors and distribute it to your OSD nodes. Once you have the bootstrap key you can have puppet enable and start the upstart service. After ceph-osd-all is running under upstart you can simply use ceph-disk-prepare and a new OSD will be created based off the OSD bootstrap key; the OSD id is automatically allocated by the monitor during this process (sketched below). -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
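A condensed sketch of that bootstrap flow, with a hypothetical device name (it assumes the bootstrap-osd key was created when the cluster was first set up):

# on a monitor: export the OSD bootstrap key
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
# copy the keyring to the new OSD node, then prepare a disk; the monitor
# allocates the OSD id and the udev/upstart scripts bring the OSD up
ceph-disk-prepare /dev/sdb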
Re: [ceph-users] Speed limit on RadosGW?
I've personally saturated 1Gbps links on multiple radosgw nodes on a large cluster; if I remember correctly, Yehuda has tested it up into the 7Gbps range with 10Gbps gear. Could you describe your cluster's hardware and connectivity? On Mon, Oct 14, 2013 at 3:34 AM, Chu Duc Minh chu.ducm...@gmail.com wrote: Hi, sorry, I missed this mail. During writes, does the CPU usage on your RadosGW node go way up? No, CPU stays the same, very low (< 10%). When uploading small files (300KB/file) over RadosGW: - using 1 process: upload bandwidth ~ 3MB/s - using 100 processes: upload bandwidth ~ 15MB/s When uploading big files (3GB/file) over RadosGW: - using 1 process: upload bandwidth ~ 70MB/s (Therefore I don't upload big files using multi-processes any more :D) Maybe RadosGW has a problem writing many small files. Or it's a problem in Ceph when simultaneously writing many small files into a bucket that already has millions of files? On Wed, Sep 25, 2013 at 7:24 PM, Mark Nelson mark.nel...@inktank.com wrote: On 09/25/2013 02:49 AM, Chu Duc Minh wrote: I have a CEPH cluster with 9 nodes (6 data nodes + 3 mon/mds nodes). And I set up 4 separate nodes to test performance of Rados-GW: - 2 nodes run Rados-GW - 2 nodes run multi-process put file to [multi] Rados-GW Result: a) When I use 1 RadosGW node + 1 upload-node, upload speed = 50MB/s per upload-node, Rados-GW input/output speed = 50MB/s b) When I use 2 RadosGW nodes + 1 upload-node, upload speed = 50MB/s per upload-node; each RadosGW has input/output = 25MB/s => sum input/output of 2 Rados-GW = 50MB/s c) When I use 1 RadosGW node + 2 upload-nodes, upload speed = 25MB/s per upload-node => sum output of 2 upload-nodes = 50MB/s, RadosGW has input/output = 50MB/s d) When I use 2 RadosGW nodes + 2 upload-nodes, upload speed = 25MB/s per upload-node => sum output of 2 upload-nodes = 50MB/s; each RadosGW has input/output = 25MB/s => sum input/output of 2 Rados-GW = 50MB/s Problem: I can't get past the 50MB/s limit when putting files over Rados-GW, regardless of the number of Rados-GW nodes and upload-nodes. When I use this CEPH cluster over librados (openstack/kvm), I can easily achieve 300MB/s. I don't know why performance of RadosGW is so low. What's the bottleneck? During writes, does the CPU usage on your RadosGW node go way up? If this is a test cluster, you might want to try the wip-6286 build from our gitbuilder site. There is a fix that, depending on the size of your objects, could have a big impact on performance. We're currently investigating some other radosgw performance issues as well, so stay tuned. :) Mark Thank you very much! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
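For anyone wanting to reproduce this kind of test, a minimal sketch that drives concurrent uploads at a gateway, assuming s3cmd is already configured against the radosgw endpoint (the bucket and file names are hypothetical):

# 100 concurrent PUTs of a small object
for i in $(seq 1 100); do
  s3cmd put smallfile.bin s3://testbucket/obj-$i &
done
wait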
Re: [ceph-users] About Ceph SSD and HDD strategy
It's hard to comment on how your experience could be made better without more information about your configuration and how you're testing. Things like: which LSI controller model, PCI-E bus speed, number of expander cables, drive type, number of SSDs, and whether the SSDs were connected to the controller or directly to a SATA2/SATA3 port on the mainboard. You mentioned using an SSD journal but nothing about a writeback cache - did you try both? I'm also curious about what kind of workload didn't get better with an external journal; was this with rados-bench? I'm really excited about tiering; it will disaggregate the SSDs and allow more flexibility in cephstore chassis selection because you no longer have to maintain strict SSD:drive ratios - this seems like a much more elegant and maintainable solution. On Wed, Oct 9, 2013 at 3:45 PM, Warren Wang war...@wangspeed.com wrote: While in theory this should be true, I'm not finding it to be the case for a typical enterprise LSI card with 24 drives attached. We tried a variety of ratios and went back to collocated journals on the spinning drives. Eagerly awaiting the tiered performance changes to implement a faster tier via SSD. -- Warren On Oct 9, 2013, at 5:52 PM, Kyle Bader kyle.ba...@gmail.com wrote: Journal on SSD should effectively double your throughput because data will not be written to the same device twice to ensure transactional integrity. Additionally, by placing the OSD journal on an SSD you should see less latency; the disk head no longer has to seek back and forth between the journal and data partitions. For large writes it's not as critical to have a device that supports high IOPS or throughput because large writes are striped across many 4MB rados objects, relatively evenly distributed across the cluster. Small write operations will benefit the most from an OSD data partition with a writeback cache like btier/flashcache because it can absorb an order of magnitude more IOPS and allow a slower spinning device to catch up when there is less activity. On Tue, Oct 8, 2013 at 12:09 AM, Robert van Leeuwen robert.vanleeu...@spilgames.com wrote: I tried putting Flashcache on my spindle OSDs using an Intel SSD and it works great. This is getting me read and write SSD caching instead of just write performance on the journal. It should also allow me to protect the OSD journal on the same drive as the OSD data and still get the benefits of SSD caching for writes. Small note that on Red Hat based distros + Flashcache + XFS: There is a major issue (kernel panics) running xfs + flashcache on a 6.4 kernel (anything higher than 2.6.32-279). It should be fixed in kernel 2.6.32-387.el6 which, I assume, will be 6.5, which only just entered Beta. For more info, take a look here: https://github.com/facebook/flashcache/issues/113 Since I've hit this issue (thankfully in our dev environment) we are slightly less enthusiastic about running flashcache :( It also adds a layer of complexity so I would rather just run the journals on SSD, at least on Red Hat. I'm not sure about the performance difference of just journals vs. Flashcache but I'd be happy to read any such comparison :) Also, if you want to make use of the SSD trim func P.S. My experience with Flashcache is on Openstack Swift Nova, not Ceph. 
-- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Expanding ceph cluster by adding more OSDs
I've contracted and expanded clusters by up to a rack of 216 OSDs - 18 nodes, 12 drives each. New disks are configured with a CRUSH weight of 0 and I slowly add weight (in 0.01 to 0.1 increments), wait for the cluster to become active+clean, and then add more weight. I was expanding after contraction so my PG count didn't need to be corrected; I tend to be liberal and opt for more PGs. If I hadn't contracted the cluster prior to expanding it, I would probably add PGs after all the new OSDs have finished being weighted into the cluster (a sketch of both steps follows this message). On Wed, Oct 9, 2013 at 8:55 PM, Michael Lowe j.michael.l...@gmail.com wrote: I had those same questions. I think the answer I got was that it was better to have too few PGs than to have overloaded OSDs. So add OSDs, then add PGs. I don't know the best increments to grow in; it probably depends largely on the hardware in your OSDs. Sent from my iPad On Oct 9, 2013, at 11:34 PM, Guang yguan...@yahoo.com wrote: Thanks Mike. I get your point. There are still a few things confusing me: 1) We expand a Ceph cluster by adding more OSDs, which will trigger re-balancing of PGs across the old and new OSDs, and will likely break the optimized PG number for the cluster. 2) We can add more PGs, which will trigger re-balancing of objects across the old and new PGs. So: 1) What is the recommended way to expand the cluster by adding OSDs (and potentially adding PGs), should we do them at the same time? 2) What is the recommended way to scale a cluster from like 1PB to 2PB, should we scale it like 1.1PB to 1.2PB or move to 2PB directly? Thanks, Guang On Oct 10, 2013, at 11:10 AM, Michael Lowe wrote: There used to be, can't find it right now. Something like 'ceph osd set pg_num num' then 'ceph osd set pgp_num num' to actually move your data into the new pg's. I successfully did it several months ago, when bobtail was current. Sent from my iPad On Oct 9, 2013, at 10:30 PM, Guang yguan...@yahoo.com wrote: Thanks Mike. Is there any documentation for that? Thanks, Guang On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote: You can add PGs, the process is called splitting. I don't think PG merging, the reduction in the number of PGs, is ready yet. On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote: Hi ceph-users, Ceph recommends the PG number of a pool be (100 * OSDs) / Replicas. Per my understanding, the number of PGs for a pool should stay fixed even as we scale the cluster out or in by adding or removing OSDs. Does that mean that if we double the OSD count, the PG number for a pool is no longer optimal and there is no chance to correct it? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
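A sketch of the weight-in loop plus a subsequent PG split, with hypothetical OSD id, pool name, and PG counts. Note that PG splitting is done per pool with ceph osd pool set (rather than the ceph osd set recollected above), and that pgp_num is what actually moves data:

# weight a new OSD in gradually, waiting for recovery between steps
for w in 0.2 0.4 0.6 0.8 1.0; do
  ceph osd crush reweight osd.216 $w
  while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
done
# once everything is weighted in, split PGs if the pool needs more
ceph osd pool set rbd pg_num 4096
ceph osd pool set rbd pgp_num 4096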
Re: [ceph-users] About Ceph SSD and HDD strategy
Journal on SSD should effectively double your throughput because data will not be written to the same device twice to ensure transactional integrity. Additionally, by placing the OSD journal on an SSD you should see less latency; the disk head no longer has to seek back and forth between the journal and data partitions. For large writes it's not as critical to have a device that supports high IOPS or throughput because large writes are striped across many 4MB rados objects, relatively evenly distributed across the cluster. Small write operations will benefit the most from an OSD data partition with a writeback cache like btier/flashcache because it can absorb an order of magnitude more IOPS and allow a slower spinning device to catch up when there is less activity (both approaches are sketched below). On Tue, Oct 8, 2013 at 12:09 AM, Robert van Leeuwen robert.vanleeu...@spilgames.com wrote: I tried putting Flashcache on my spindle OSDs using an Intel SSD and it works great. This is getting me read and write SSD caching instead of just write performance on the journal. It should also allow me to protect the OSD journal on the same drive as the OSD data and still get the benefits of SSD caching for writes. Small note that on Red Hat based distros + Flashcache + XFS: There is a major issue (kernel panics) running xfs + flashcache on a 6.4 kernel (anything higher than 2.6.32-279). It should be fixed in kernel 2.6.32-387.el6 which, I assume, will be 6.5, which only just entered Beta. For more info, take a look here: https://github.com/facebook/flashcache/issues/113 Since I've hit this issue (thankfully in our dev environment) we are slightly less enthusiastic about running flashcache :( It also adds a layer of complexity so I would rather just run the journals on SSD, at least on Red Hat. I'm not sure about the performance difference of just journals vs. Flashcache but I'd be happy to read any such comparison :) Also, if you want to make use of the SSD trim func P.S. My experience with Flashcache is on Openstack Swift Nova, not Ceph. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
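A sketch of the two approaches, with hypothetical device names. An SSD journal is just a ceph.conf pointer, while flashcache builds a writeback cache device that the OSD data filesystem then lives on:

# option 1: journal on a dedicated SSD partition
[osd.12]
osd journal = /dev/disk/by-id/ata-EXAMPLE_SSD-part1

# option 2: flashcache writeback device in front of the spinning data disk
flashcache_create -p back cachedev /dev/sdc1 /dev/sdb
mkfs.xfs /dev/mapper/cachedev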
Re: [ceph-users] Same journal device for multiple OSDs?
You can certainly use a similarly named device to back an OSD journal if the OSDs are on separate hosts. If you want to take a single SSD device and utilize it as a journal for many OSDs on the same machine, then you would want to partition the SSD and use a different partition for each OSD journal (sketched below). You might consider using /dev/disk/by-id/foo instead of /dev/fioa1 to avoid potential device reordering issues after a reboot. Hope that helps; sorry if I misunderstood. On Wed, Oct 9, 2013 at 7:03 AM, Andreas Bluemle andreas.blue...@itxperts.de wrote: Hi, to avoid confusion: the configuration did *not* contain multiple osds referring to the same journal device (or file). The snippet from ceph.conf suggests osd.214 and osd.314 both use the same journal - but it doesn't show that these osds run on different hosts. Regards Andreas Bluemle On Wed, 9 Oct 2013 11:23:18 +0200 Andreas Friedrich andreas.friedr...@ts.fujitsu.com wrote: Hello, I have a Ceph test cluster with 88 OSDs running well. In ceph.conf I found multiple OSDs that are using the same SSD block device (without a file system) for their journal: [osd.214] osd journal = /dev/fioa1 ... [osd.314] osd journal = /dev/fioa1 ... Is this an allowed configuration? Regards Andreas Friedrich -- FUJITSU Fujitsu Technology Solutions GmbH Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany Tel: +49 (5251) 525-1512 Fax: +49 (5251) 525-321512 Email: andreas.friedr...@ts.fujitsu.com Web: ts.fujitsu.com Company details: de.ts.fujitsu.com/imprint -- Andreas Bluemle mailto:andreas.blue...@itxperts.de ITXperts GmbH http://www.itxperts.de Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917 D-81541 Muenchen (Germany) Fax: (+49) 89 89044910 Company details: http://www.itxperts.de/imprint.htm ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
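A sketch of carving one SSD into per-OSD journal partitions on a single host, with hypothetical device names and sizes:

# two 10G journal partitions on one SSD (GPT)
parted -s /dev/sdc mklabel gpt
parted -s /dev/sdc mkpart journal-0 1MiB 10GiB
parted -s /dev/sdc mkpart journal-1 10GiB 20GiB

# reference stable ids rather than /dev/sdc1 in case devices reorder
[osd.0]
osd journal = /dev/disk/by-id/ata-EXAMPLE_SSD-part1
[osd.1]
osd journal = /dev/disk/by-id/ata-EXAMPLE_SSD-part2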