We have a cluster in a sub-optimal configuration: data and journal are colocated on the OSDs (which happen to be spinning disks).
During recovery/backfill, the entire cluster suffers degraded performance because of the IO storm that the backfills cause; client IO latency becomes extreme. I've tried to reduce the impact of recovery/backfill with the following:

  ceph tell osd.* injectargs '--osd-max-backfills 1'
  ceph tell osd.* injectargs '--osd-recovery-threads 1'
  ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
  ceph tell osd.* injectargs '--osd-client-op-priority 63'
  ceph tell osd.* injectargs '--osd-recovery-max-active 1'

The only other option I have left would be to use Linux traffic shaping to artificially reduce the bandwidth available to the interface tagged for cluster traffic (instead of separate physical networks, we use VLAN tagging).

We are nowhere _near_ the point where network saturation would cause the latency we're seeing, so I'm left to believe that it is simply disk IO saturation. I could be wrong about that assumption, though, as iostat doesn't terrify me. This could be a suboptimal network configuration on the cluster as well. I'm still looking into that possibility, but I wanted to get feedback on what I'd done already first, as well as on the proposed traffic-shaping idea. Thoughts?
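For what it's worth, injectargs only changes the running daemons; to make the same throttles survive an OSD restart they can also go in ceph.conf. A minimal sketch (assuming the option names above and the usual /etc/ceph/ceph.conf location):

```ini
# Sketch: persist the recovery/backfill throttles in /etc/ceph/ceph.conf
# so they survive OSD restarts (injectargs is runtime-only).
[osd]
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
osd client op priority = 63
osd recovery max active = 1
```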
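In case it helps the discussion, the traffic-shaping idea could be sketched with a simple token bucket filter on the tagged sub-interface. The interface name and rate here are assumptions (adjust for your VLAN and link speed); needs root:

```shell
# Hypothetical sketch: cap bandwidth on the cluster-traffic VLAN
# sub-interface with tc's token bucket filter (tbf).
IFACE=eth0.100   # assumed VLAN sub-interface carrying cluster traffic

# Cap egress at 1 Gbit/s; burst and latency values are starting points.
tc qdisc add dev "$IFACE" root tbf rate 1gbit burst 256kbit latency 50ms

# Verify the qdisc is in place:
tc qdisc show dev "$IFACE"

# To undo:
# tc qdisc del dev "$IFACE" root
```

Note this only shapes egress on that interface; it wouldn't distinguish recovery traffic from any other cluster traffic on the same VLAN.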
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com