Hi Greg,

Thank you for your (fast) answer.

Since we're going more in-depth, I should mention:

 * we're running 2 Gentoo GNU/Linux servers doing both storage and
   virtualization (I know this is not recommended but we mostly have a
   low load and virtually no writes outside of ceph)
 * sys-cluster/ceph-0.56.4  USE="radosgw -debug -fuse -gtk -libatomic
   -static-libs -tcmalloc"
 * app-emulation/qemu-1.2.2-r3  USE="aio caps curl jpeg ncurses png rbd
   sasl seccomp threads uuid vhost-net vnc-alsa -bluetooth -brltty
   -debug -doc -fdt -mixemu -opengl -pulseaudio -python -sdl (-selinux)
   -smartcard -spice -static -static-softmmu -static-user -systemtap
   -tci -tls -usbredir -vde -virtfs -xattr -xen -xfs"
 * app-emulation/libvirt-1.0.2-r2  USE="caps iscsi libvirtd lvm lxc
   macvtap nls pcap python qemu rbd sasl udev vepa virt-network -audit
   -avahi -firewalld -fuse -nfs -numa -openvz -parted -phyp -policykit
   (-selinux) -uml -virtualbox -xen"
 * 1 SSD, 3 HDDs per host.
 * monitor filesystems on SSD
 * OSD journals on SSD
 * OSD data on spinnies
 * [client]
        rbd cache = true
        rbd cache size = 128M
        rbd cache max dirty = 32M
 * We can pay for some support if required ;)
 * I know cuttlefish has some scrub-related optimizations, but we cannot
   upgrade right now (see the scrub settings sketch right after this list)
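
For reference, these are the scrub-related settings I'd consider tuning as a
stopgap while we stay on bobtail. The values below are only examples and I
haven't verified that every one of them is honored on 0.56, so treat this as
a sketch rather than a tested recipe:

   [osd]
          # never run more than one scrub at a time per OSD
          osd max scrubs = 1
          # don't start a scrub while the host load average is above this
          osd scrub load threshold = 0.5
          # spread regular scrubs over a longer window (seconds)
          osd scrub min interval = 86400
          osd scrub max interval = 604800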


On 09/07/2013 13:04, Gregory Farnum wrote:
> What kinds of performance drops are you seeing during recovery?

Mostly high latencies that make some websites unresponsive (LAMP stacks, mostly). The same goes for some email servers. Another problem is that Munin has trouble fetching its data from the VMs during scrubs (the Munin server is itself a VM, and writing its data at that time still works).

On a sample host HDD, my averages are:

                               Read (ms)  Write (ms)  Util (%)  Read (kB/s)  Write (kB/s)
 not scrubbing (07:26-09:58)       10.08      195.41     19.06        80.40        816.84
 scrubbing (10:00-11:20)           14.02      198.08     27.73       102.30        797.76


On a sample web and email server:

                               data coverage (approx.)  Read (ms)  Write (ms)
 not scrubbing (07:26-09:58)   100%                         45.02        7.36
 scrubbing (10:00-11:20)       20-30%                      432.73      181.19



> If for instance you've got clients sending lots of operations that are small
> compared to object size then the bounding won't work out quite right, or maybe
> you're just knocking out a bunch of servers and getting bad long-tail latency
> effects.

I'm not sure I can answer this. I tend to think it's the first case, because the drives don't seem to hit even 50% utilization (CPU is around 3% and I have more than 40 GB of "free" RAM).
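
If it helps narrow that down, my plan is to compare the average request size
the OSD disks actually see with the RBD object size. A rough sketch (the
device names and the image name are just placeholders for our setup):

        # avgrq-sz column = average request size, in 512-byte sectors
        iostat -x sdb sdc sdd 5

        # object size of one of the VM images (default order 22 = 4 MB objects)
        rbd info rbd/one-vm-image

If avgrq-sz stays tiny compared to the 4 MB objects, that would point at the
"lots of small operations" case described above.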
