Hi,

I have yet to install 0.80.7 on one node to confirm its stability and to use the new IO priority tuning parameters, which enable prioritized access to data for client requests.

In the meantime, faced with large slowdowns caused by resync or external IO load (external IO load is not expected, but it can happen during migrations from other storage solutions, as in our recent experience), I've had an idea related to the underlying problem (IO load concurrent with client requests, or even concentrated client requests). It might already be implemented, or might not be of much value, so I'll write it down to get feedback.

When IO load is not balanced correctly across OSDs, the most loaded OSD becomes a bottleneck for both write and read requests, and for many (most?) workloads it becomes a bottleneck for the whole storage network as seen by the client. This has happened to us on numerous occasions (low filesystem performance, OSD restarts triggering backfills or recoveries).

For read requests, would it be beneficial for OSDs to communicate with their peers to find out their recent IO mean/median/... service time, and to be able to proxy requests to less loaded nodes when they are substantially more loaded than their peers? If the additional network load generated by proxying requests proves detrimental to overall performance, maybe an update to librados to accept a hint to redirect read requests for a given PG and a given period might be a solution.

I understand that even if this is possible for read requests, it doesn't apply to write requests because they are synchronized across all replicas. That said, reducing the read load on one OSD without modifying write behavior will obviously help that OSD process write requests faster.

Even if the general idea isn't bad or already obsoleted by another one, it's obviously not trivial.
For example, it can create unstable feedback loops, so if I were to try to implement it I would probably start with a "selective" proxy/redirect, with the probability of proxying/redirecting computed from the respective loads of all OSDs storing a given PG. This should avoid "ping-pong" situations where read requests overload one OSD, then overload another, and then come round again.

Any thoughts? Is it based on wrong assumptions? Would it prove to be a can of worms if someone tried to implement it?

Best regards,

Lionel Bouton
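
P.S. To make the "selective" proxy/redirect idea a bit more concrete, here is a rough Python sketch of how the probability could be computed. Everything in it is an assumption made for illustration: the function names, the use of a recent service time in milliseconds as the load metric, the `damping` factor and the inverse-service-time weighting are all hypothetical, and none of it corresponds to existing Ceph or librados code.

```python
import random

def redirect_probability(local_ms, peer_ms, damping=0.5):
    """Probability that the local OSD hands a read to a peer.

    local_ms: recent service time of the local OSD (hypothetical metric).
    peer_ms:  dict mapping peer OSD id -> recent service time, for the
              OSDs storing the same PG.
    damping:  caps the probability (0..1) so that load shifts gradually,
              which is meant to limit "ping-pong" oscillations.
    """
    if not peer_ms or local_ms <= 0:
        return 0.0
    best = min(peer_ms.values())
    if best >= local_ms:
        # No peer is faster than we are: always serve locally.
        return 0.0
    # Relative advantage of the best peer, scaled down by the damping factor.
    return damping * (local_ms - best) / local_ms

def pick_target(local_id, local_ms, peer_ms, rng=random, damping=0.5):
    """Return the id of the OSD that should serve this read."""
    p = redirect_probability(local_ms, peer_ms, damping)
    if rng.random() >= p:
        return local_id
    # Redirect only to peers that are faster than the local OSD,
    # weighted by the inverse of their service time.
    faster = {osd: ms for osd, ms in peer_ms.items() if ms < local_ms}
    osds = list(faster)
    weights = [1.0 / max(faster[osd], 1e-9) for osd in osds]
    return rng.choices(osds, weights=weights, k=1)[0]
```

With these assumptions, an OSD that is as fast as its peers never redirects, and even a badly overloaded OSD redirects at most a `damping` fraction of its reads, so the load moves gradually instead of all at once. Whether that is enough to keep the feedback loop stable in practice is exactly the part I'm unsure about.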