We have a cluster in a sub-optimal configuration, with data and journal
co-located on the OSDs (which happen to be spinning disks).

During recovery/backfill, the entire cluster suffers degraded performance
because of the IO storm the backfills cause: client IO latency goes through
the roof. I've tried to reduce the impact of recovery/backfill with the
following:

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
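In case it's useful to anyone checking the same thing: the injected values
can be verified per-OSD through the admin socket (osd.0 below is just an
example id; adjust to your deployment):

```shell
# Confirm the injected values actually took effect on a given OSD.
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery|osd_client_op_priority'
```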

The only other option I have left would be to use Linux traffic shaping to
artificially reduce the bandwidth available to the interfaces tagged for
cluster traffic (instead of separate physical networks, we use VLAN
tagging). We are nowhere _near_ the point where network saturation would
cause the latency we're seeing, so I'm led to believe it is simply disk IO
saturation.
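For concreteness, the shaping I had in mind is something like the sketch
below (the interface name and rates are placeholders, not our actual
values). Note it caps all egress on the cluster VLAN, so it would throttle
normal replication traffic along with recovery:

```shell
# Rough sketch: HTB egress shaping on the cluster-traffic VLAN interface.
# eth0.100 and 500mbit are illustrative placeholders only.
tc qdisc add dev eth0.100 root handle 1: htb default 10
tc class add dev eth0.100 parent 1: classid 1:10 htb rate 500mbit ceil 500mbit
```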

I could be wrong about that assumption, though, since the iostat output
doesn't look alarming. It could also be a suboptimal network configuration
on the cluster; I'm still looking into that possibility, but I wanted
feedback on what I'd done already first--as well as on the proposed traffic
shaping idea.
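For reference, this is roughly what I've been watching (assuming sysstat's
iostat); sustained high await and %util near 100 on the OSD data disks
during backfill would point at disk IO saturation:

```shell
# Extended per-device stats at a 1-second interval.
iostat -x 1
```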

Thoughts?
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
