To those interested in a tricky problem,
We have a Ceph cluster running at one of our data centers. One of our
clients requires their data to be hosted at AWS. My question is: how do
we effectively migrate the data on our internal Ceph cluster to a Ceph
cluster running at AWS?
Ideas currently on
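One rough sketch, assuming RBD images and SSH reachability between the two
clusters (pool, image, and host names here are hypothetical):
rbd export rbd/vm1 - | ssh aws-host 'rbd import - rbd/vm1'
rbd export-diff --from-snap base rbd/vm1@snap1 - | ssh aws-host 'rbd import-diff - rbd/vm1'
The first command does a full copy; the second ships only the delta since a
common snapshot, which keeps repeated syncs cheap.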
Do people consider a UPS + shutdown procedure a suitable substitute?
I certainly wouldn't. I've seen utility power fail and the transfer
switch fail to transition to the UPS strings. Had this happened to me with
nobarrier set it would have been a very sad day.
--
Kyle Bader
For a large network (say 100 servers and 2500 disks), are there any
strong advantages to using separate switches and physical networks
instead of VLANs?
Physical isolation will ensure that congestion on one does not affect
the other. On the flip side, asymmetric network failures tend to be
more
Can you paste me the whole output of the install? I am curious why/how you
are getting el7 and el6 packages.
priority=1 required in /etc/yum.repos.d/ceph.repo entries
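A hedged sketch of such an entry (the baseurl is illustrative, and the
priority= line only takes effect with the yum-plugin-priorities package
installed):
[ceph]
name=Ceph packages
baseurl=http://ceph.com/rpm-firefly/el6/x86_64/
enabled=1
priority=1
gpgcheck=1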
--
Kyle
I wonder whether OSDs use the system calls of the Virtual File System
layer (i.e. open, read, write, etc.) when they access disks.
I mean... could I monitor the I/O commands an OSD issues to its disks by
monitoring the VFS layer?
Ceph OSDs run on top of a traditional filesystem, so long as it
supports xattrs - xfs by default.
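Because the OSD goes through a normal filesystem, ordinary syscall tracing
works. A minimal sketch (assumes one ceph-osd process; with several, pick a
single PID):
strace -f -e trace=open,read,write,fsync -p $(pidof ceph-osd | awk '{print $1}')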
TL;DR: Power outages are more common than your colo facility will admit.
Seconded. I've seen power failures in at least 4 different facilities,
and all of them had the usual gamut of batteries/generators/etc. At some
of those facilities I've seen problems multiple times in a single
year. Even a
Let's assume a test cluster is up and running with real data on it.
What is the best way to migrate everything to a (larger) production
cluster?
I'm thinking of adding the production MONs to the test cluster, then
adding the production OSDs to the test cluster, waiting for a full rebalance,
and
I think the timing should work out so that we'll be deploying with Firefly,
and so will have Ceph cache pool tiering as an option, but I'm also
evaluating bcache versus cache tiering to act as a node-local block cache
device. Does anybody have real or anecdotal evidence about which approach
has better
Obviously the SSDs could be used as journal devices, but I'm not really
convinced this is worthwhile when all nodes have 1GB of hardware
writeback cache (writes to the journal and data areas on the same spindle
have time to coalesce in the cache and minimise the seek-time hit). Any
advice on
I'm assuming Ceph/RBD doesn't have any direct awareness of this, since
a filesystem doesn't traditionally have a "give back blocks"
operation for the block device. Is there anything special RBD does in
this case that communicates the release of the Ceph storage back to the
pool?
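librbd does support discard, so if your stack passes it through (check your
QEMU/driver versions), trimming from inside the guest releases the backing
RADOS objects. A hedged sketch, mount point hypothetical:
fstrim -v /mnt/data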
VMs running
If I want to use a disk dedicated for osd, can I just use something like
/dev/sdb instead of /dev/sdb1? Is there any negative impact on performance?
You can pass /dev/sdb to ceph-disk-prepare and it will create two
partitions, one for the journal (a raw partition) and one for the data
volume.
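A sketch of what that looks like (device name hypothetical):
ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdb
parted /dev/sdb print
Afterwards /dev/sdb1 holds the xfs data volume and /dev/sdb2 the raw journal.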
One downside of the above arrangement: I read that support for mapping
newer-format RBDs is only present in fairly recent kernels. I'm running
Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel. There
is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking
at
ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir
/etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb
ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping
(dm-crypt?): dm-0
It sounds like device-mapper still thinks it's using the volume;
you might be able to clean up the stale mapping with dmsetup.
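A hedged sketch (the mapping name is whatever dmsetup reports; shown here as
a placeholder):
dmsetup ls                      # find the mapping holding /dev/sdb2
dmsetup remove <mapping-name>   # remove the stale dm-crypt mapping
Then retry ceph-disk-prepare.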
Does anybody have a good practice for setting up a Ceph cluster behind a
pair of load balancers?
The only place you would want to put a load balancer in the context of
a Ceph cluster is north of the RGW nodes. You can do L3 transparent
load balancing or balance with an L7 proxy, i.e. Linux Virtual Server (LVS).
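As one illustration of the L7 option, a minimal sketch using HAProxy (not
from the original thread; IPs and names are hypothetical):
listen rgw
    bind 0.0.0.0:80
    mode http
    balance leastconn
    server rgw1 10.0.0.11:80 check
    server rgw2 10.0.0.12:80 check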
You're right. Sorry, I didn't specify that I was trying this for radosgw.
Even for this I'm seeing performance degrade once my clients start to hit
the LB VIP.
Could you tell us more about your load balancer and configuration?
--
Kyle
This is in my lab. Plain passthrough setup with automap enabled on the F5.
S3 curl queries work fine. But the file transfer rate degrades badly
once I start file up/downloads.
Maybe the difference can be attributed to LAN client traffic using
jumbo frames vs the F5 using a smaller WAN MTU.
There could be millions of tenants. Looking deeper at the docs, it looks
like Ceph prefers to have one OSD per disk. We're aiming at
Backblaze-style chassis, so will be looking at 45 OSDs per machine, across
many machines. I want to separate the tenants and encrypt their data
separately. The
1. Is it possible to install Ceph and the Ceph monitors on the XCP
(Xen) Dom0, or would we need to install them in the DomU containing the
OpenStack components?
I'm not a Xen guru, but in the case of KVM I would run the OSDs on the
hypervisor to avoid virtualization overhead.
2. Is
I tried rbd-fuse, and its throughput using fio is approx. 1/4 that of the
kernel client.
Can you please let me know how to set up the RBD backend for fio? I'm
assuming this RBD backend is also based on librbd?
You will probably have to build fio from source since the rbd engine is new:
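Once built, a minimal fio job sketch for the rbd engine (pool/image names
are hypothetical, and the image must already exist, e.g.
rbd create fio-test --size 1024):
[rbd-test]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=randwrite
bs=4k
iodepth=32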
What would be the best replication setup?
Are you using two sites to increase availability, durability, or both?
For availability you're really better off using three sites and using
CRUSH to place each of three replicas in a different datacenter. In
this setup you can survive losing 1 of 3 datacenters.
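A sketch of such a CRUSH rule, assuming the map defines a datacenter bucket
type with three datacenters under the default root (decompile with
ceph osd getcrushmap and crushtool -d, recompile with crushtool -c):
rule replicated_3dc {
    ruleset 1
    type replicated
    min_size 3
    max_size 3
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}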
Why would it help? It's not as if ONE OSD will be primary for all
objects. There will be one primary OSD per PG, and you'll probably have a
couple of thousand PGs.
The primary may be across an oversubscribed/expensive link, in which case a
local replica with a common ancestor to the client may
Changing pg_num for .rgw.buckets to a power of 2 and setting 'crush tunables
optimal' didn't help :(
Did you bump pgp_num as well? The split PGs will stay in place until
pgp_num is bumped too; if you do this, be prepared for (potentially
lots of) data movement.
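Concretely, a sketch (pool name from the thread, PG count hypothetical):
ceph osd pool set .rgw.buckets pg_num 2048
ceph osd pool set .rgw.buckets pgp_num 2048
Raising pg_num splits the PGs; raising pgp_num lets CRUSH actually start
placing the new PGs, which is what triggers the data movement.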
a solution.
That's a shame, but at least you will be better prepared if it happens
again. Hopefully your luck is not as unfortunate as mine!
--
Kyle Bader
Yes, that also makes perfect sense: the aforementioned 12,500 objects
for a 50GB image. For a 60TB cluster/pool with 72 disks/OSDs and 3-way
replication that makes 2400 PGs, following the recommended formula.
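Worked out, using the usual rule of thumb of ~100 PGs per OSD divided by the
replica count:
(72 OSDs x 100) / 3 replicas = 2400 PGs
and 50GB / 4MB per RADOS object = 12,500 objects per image.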
How many disks (OSDs) did you punch in for the following run?
Disk
Do monitors have to be on the cluster network as well or is it sufficient
for them to be on the public network as
http://ceph.com/docs/master/rados/configuration/network-config-ref/
suggests?
Monitors only need to be on the public network.
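A sketch of the relevant ceph.conf stanza (subnets hypothetical); the MONs
bind only to an address in the public network:
[global]
public network = 192.168.1.0/24
cluster network = 192.168.2.0/24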
Also would the OSDs re-route their traffic over the
Is an object a CephFS file or an RBD image, or is it the 4MB blob on the
actual OSD filesystem?
Objects exist at the RADOS level; CephFS files, RBD images and RGW
objects are all composed by striping RADOS objects - the default size is 4MB.
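You can see this on any image; for instance (image name hypothetical):
rbd info rbd/vm1    # the reported "order 22" means 2^22 bytes = 4MB objects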
In my case, I'm only looking at RBD images for KVM volume storage,
The area I'm currently investigating is how to configure the
networking. To avoid a SPOF I'd like to have redundant switches for
both the public network and the internal network, most likely running
at 10Gb. I'm considering splitting the nodes into two separate racks
and connecting each half
Do you have any further detail on this radosgw bug?
https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424
https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd
Does it only apply to emperor?
The bug is present in dumpling too.
For your holiday pleasure I've prepared a SysAdvent article on Ceph:
http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html
Check it out!
--
Kyle
Has anyone tried scaling a VM's I/O by adding additional disks and
striping them in the guest OS? I am curious what effect this would have
on I/O performance.
Why would it? You can also change the stripe size of the RBD image.
Depending on the workload you might change it from 4MB to something
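For example, a hedged sketch creating an image with smaller objects (name
and size hypothetical):
rbd create rbd/small-objects --size 10240 --order 20
# order 20 = 2^20 bytes = 1MB objects instead of the default 4MB (order 22)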
It seems that NUMA can be problematic for ceph-osd daemons in certain
circumstances. Namely, if a NUMA zone is running out of memory due to
uneven allocation, it is possible for that zone to enter reclaim mode
when threads/processes are scheduled on a core in that zone, and those
I've been running similar calculations recently. I've been using this
tool from Inktank to calculate RADOS reliabilities with different
assumptions:
https://github.com/ceph/ceph-tools/tree/master/models/reliability
But I've also had similar questions about RBD (or any multi-part files
We're running OpenStack (KVM) with local disk for ephemeral storage.
Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich
for IOPS and have 20GE across the board. Some recent patches in OpenStack
Havana make it possible to use Ceph RBD as the source of ephemeral VM
Looking at tcpdump, all the traffic is going exactly where it is supposed to
go; in particular, an OSD on the 192.168.228.x network appears to talk to an
OSD on the 192.168.229.x network without anything strange happening. I was
just wondering if there was anything about Ceph that could make
Is having two cluster networks like this a supported configuration? Every
OSD and MON can reach every other, so I think it should be.
Maybe. If your back end network is a supernet and each cluster network is a
subnet of that supernet. For example:
Ceph.conf cluster network (supernet): 10.0.0.0/8
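Filling in the example with hypothetical per-rack subnets:
[global]
cluster network = 10.0.0.0/8
# rack A OSDs live on 10.1.0.0/16, rack B OSDs on 10.2.0.0/16;
# both are subnets of the 10.0.0.0/8 supernet, so peers can route to each other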
Is the OS doing anything apart from Ceph? Would booting a ramdisk-only
system from USB or CompactFlash work?
I haven't tested this kind of configuration myself, but I can't think of
anything that would preclude this type of setup. I'd probably use squashfs
layered with a tmpfs via aufs to avoid writes hitting the boot media.
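A very rough sketch of that layering (devices and mount points hypothetical,
and it assumes an aufs-enabled kernel):
mount -t squashfs /dev/sdb1 /mnt/ro        # read-only OS image
mount -t tmpfs tmpfs /mnt/rw               # volatile write layer
mount -t aufs -o br=/mnt/rw=rw:/mnt/ro=ro none /mnt/root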
This journal problem is a bit of wizardry to me; I even had weird
intermittent issues with OSDs not starting because the journal was not
found, so please do not hesitate to suggest a better journal setup.
You mentioned using SAS for the journal; if your OSDs are SATA and an
expander is in the data path
How much can performance be improved by using SSDs to store the journals?
You will see roughly twice the throughput unless you are using btrfs
(still improved, but not as dramatic). You will also see lower latency
because the disk head doesn't have to seek back and forth between the
journal and data areas.
Is there any way to manually configure which OSDs are started on which
machines? The osd configuration block includes the osd name and host, so is
there a way to say that, say, osd.0 should only be started on host vashti
and osd.1 should only be started on host zadok? I tried using this
Several people have reported issues with combining OS and OSD journals
on the same SSD drives/RAID due to contention. If you do something
like this I would definitely test to make sure it meets your
expectations. Ceph logs are going to compose the majority of the
writes to the OS storage devices.
Once I know a drive has had a head failure, do I trust that the rest of the
drive isn't going to go at an inconvenient moment, versus just fixing it
right now when it's not 3AM on Christmas morning? (True story.) As good as
Ceph is, do I trust that Ceph is smart enough to prevent spreading
1. To build high-performance yet cheap radosgw storage, which pools should
be placed on SSD-backed pools and which on HDD-backed? Upon installation of
radosgw, it created the following pools: .rgw, .rgw.buckets,
.rgw.buckets.index, .rgw.control, .rgw.gc, .rgw.root, .usage, .users,
.users.email.
ST240FN0021 SSDs connected via a SAS2x36 expander to an LSI 9207-8i.
The problem might be SATA transport protocol overhead at the expander.
Have you tried directly connecting the SSDs to SATA2/3 ports on the
mainboard?
--
Kyle
Zackc, Loicd, and I have been the main participants in a weekly Teuthology
call over the past few weeks. We've talked mostly about methods of extending
Teuthology to capture performance metrics. Would you be willing to join us
during the Teuthology and Ceph-Brag sessions at the Firefly Developer
Summit?
I think this is a great idea. One of the big questions users have is
what kind of hardware they should buy. An easy way for users to publish
information about their setups (hardware, software versions, use-case,
performance) when they have successful deployments would be very valuable.
Maybe a
Would this be something like
http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ?
Something very much like that :)
--
Kyle
The quick start guide is linked below, it should help you hit the ground
running.
http://ceph.com/docs/master/start/quick-ceph-deploy/
Let us know if you have questions or bump into trouble!
Recovering from a degraded state by copying existing replicas to other OSDs
is going to cause reads on existing replicas and writes to the new
locations. If you have slow media then this is going to be felt more
acutely. Tuning the backfill options I posted is one way to lessen the
impact, another
You can change some OSD tunables to lower the priority of backfills:
osd recovery max chunk: 8388608
osd recovery op priority: 2
In general a lower op priority means it will take longer for your
placement groups to go from degraded to active+clean; the idea is to
balance recovery speed against client I/O impact.
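These can be applied to running OSDs without a restart; a sketch using
injectargs with the values above:
ceph tell 'osd.*' injectargs '--osd_recovery_max_chunk 8388608 --osd_recovery_op_priority 2'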
The Bobtail release added udev/upstart capabilities that allow you
to avoid having per-OSD entries in ceph.conf. Under the covers, the new
udev/upstart scripts look for a special label on OSD data volumes;
matching volumes are mounted and then a few files are inspected:
journal_uuid and whoami.
The
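You can inspect those files on any prepared OSD data volume; a sketch (mount
point hypothetical):
mount /dev/sdb1 /mnt/osd
cat /mnt/osd/whoami          # the OSD id, e.g. 12
cat /mnt/osd/journal_uuid    # partition uuid of the matching journal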
I know that 10GBase-T has more delay than SFP+ with direct-attach
cables (2.6 usec vs 0.3 usec per link), but does that matter? Some
sites say it is a huge hit, but we are talking usec, not ms, so I
find it hard to believe that it causes that much of an issue. I like
the lower cost and use
This is going to get horribly ugly when you add Neutron into the mix; so
much so that I'd consider this option a non-starter. If someone is using
Open vSwitch to create network overlays to isolate each tenant, I can't
imagine this ever working.
I'm not following here. Is this only needed if Ceph
My first guess would be that it's due to LXC dropping capabilities; I'd
investigate whether CAP_SYS_ADMIN is being dropped. You need CAP_SYS_ADMIN
for mount and block ioctls, so if the container doesn't have those
privileges a map will likely fail. Maybe try tracing the command with strace?
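A quick hedged check from inside the container (the hex mask is whatever
your kernel reports, and the image name is hypothetical):
grep CapEff /proc/self/status        # effective capability mask
capsh --decode=0000001fffffffff      # decode it; look for cap_sys_admin
strace -f rbd map rbd/vm1            # see which syscall gets EPERM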
* The IP address of at least one MON in the Ceph cluster
If you configure nodes with a single monitor in the mon hosts directive
then I believe your nodes will have issues if that one monitor goes down.
With Chef I've gone back and forth between using Chef search and having
monitors be
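To avoid depending on a single monitor, list several in ceph.conf; a sketch
with hypothetical names and addresses:
[global]
mon initial members = mon1, mon2, mon3
mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3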
I've personally saturated 1Gbps links on multiple radosgw nodes on a large
cluster, and if I remember correctly, Yehuda has tested it up into the 7Gbps
range with 10Gbps gear. Could you describe your cluster's hardware and
connectivity?
I've contracted and expanded clusters by up to a rack of 216 OSDs - 18
nodes with 12 drives each. New disks are configured with a CRUSH weight of
0, and I slowly add weight (in 0.1 to 0.01 increments), wait for the cluster
to become active+clean, and then add more weight. I was expanding after
contraction
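A sketch of one increment of that loop (OSD id, host, and weights
hypothetical):
ceph osd crush add osd.216 0 root=default host=node19   # enters the map at weight 0
ceph osd crush reweight osd.216 0.1                     # nudge the weight up
ceph -s                                                 # wait for active+clean, then repeat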
Journal on SSD should effectively double your throughput, because data will
not be written to the same device twice to ensure transactional integrity.
Additionally, by placing the OSD journal on an SSD you should see less
latency; the disk head no longer has to seek back and forth between the
journal and data areas.
You can certainly use a similarly named device to back an OSD journal if
the OSDs are on separate hosts. If you want to take a single SSD device and
utilize it as a journal for many OSDs on the same machine, then you would
want to partition the SSD device and use a different partition for each OSD.
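ceph-disk can do the carving for you; a sketch where /dev/sdc and /dev/sdd
are data disks sharing the SSD /dev/sdb (device names hypothetical):
ceph-disk-prepare --fs-type xfs -- /dev/sdc /dev/sdb
ceph-disk-prepare --fs-type xfs -- /dev/sdd /dev/sdb
# each run creates a new journal partition on /dev/sdb for that OSD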