Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs
On 10/21/2014 01:06 PM, Lionel Bouton wrote:
> [quoted exchange with Gregory Farnum trimmed; full messages below]
At some point I'd like to experiment with creating some kind of datastore proxy layer to sit below the OSDs and do a sort of similar scheme, where latency statistics are tracked and writes get directed to different stores. The idea would be to generally keep the benefits of CRUSH and deterministic placement (at least as far as getting the data to some OSD on a node), but then allow some level of flexibility in terms of avoiding hotspots on specific disks (heavy reads, seek contention, vibration, leveldb compaction stalls, etc.).

This unfortunately reintroduces something like a lookup table, but perhaps at the node level this could be made fast enough that it wouldn't be as much of a problem. I don't know if this would actually work in practice, but I think it would be a very interesting project to explore.

Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
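[Editor's note: the node-local proxy idea above could be sketched roughly as follows. This is a hypothetical illustration, not Ceph code; all names (`StoreProxy`, `record`, `pick_store`) are invented for the example. It tracks an exponentially weighted moving average (EWMA) of write latency per backing store, routes new writes to the least-loaded store, and keeps the small object-to-store lookup table Mark mentions so reads can still find the data.]

```python
class StoreProxy:
    """Hypothetical node-local proxy: latency-aware write placement
    across the disks of one node, with a lookup table for reads."""

    def __init__(self, store_names, alpha=0.2):
        self.alpha = alpha  # EWMA smoothing factor
        # All stores start at 0.0, i.e. "unloaded", until observations arrive.
        self.latency = {name: 0.0 for name in store_names}
        self.location = {}  # the reintroduced lookup table: obj_id -> store

    def record(self, store, seconds):
        """Fold one observed write latency into the store's EWMA."""
        old = self.latency[store]
        self.latency[store] = (1 - self.alpha) * old + self.alpha * seconds

    def pick_store(self, obj_id):
        """Direct the write to whichever store currently looks least loaded,
        and remember where it went."""
        store = min(self.latency, key=self.latency.get)
        self.location[obj_id] = store
        return store

    def lookup(self, obj_id):
        """Reads must consult the table, since placement below the node
        is no longer purely deterministic."""
        return self.location[obj_id]
```

The trade-off Mark points out is visible here: `pick_store` avoids hotspots, but `lookup` is exactly the per-node table that deterministic placement was supposed to make unnecessary.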
Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs
Hi Gregory,

On 21/10/2014 19:39, Gregory Farnum wrote:
> On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton wrote:
>> [...]
>> Any thought? Is it based on wrong assumptions? Would it prove to be a
>> can of worms if someone tried to implement it?
> Yeah, there's one big thing you're missing: we strictly order reads
> and writes to an object, and the primary is the serialization point.

Of course... I should have anticipated this. As you explain later (thanks for the detailed explanation, by the way), implementing redirects would need a whole new way of coordinating accesses. I'm not yet familiar with Ceph internals, but I suspect this would mutate Ceph into another beast entirely...

> If we were to proxy reads to another replica it would be easy enough
> for the primary to continue handling the ordering, but if it were just
> a redirect it wouldn't be able to do so (the primary doesn't know when
> the read is completed, allowing it to start a write). Setting up the
> proxy of course requires a lot more code, but more importantly it's
> more resource-intensive on the primary, so I'm not sure if it's worth
> it. :/

Difficult to know without real-life testing. It's a non-trivial CPU/network/disk capacity trade-off...

> The "primary affinity" value we recently introduced is designed to
> help alleviate persistent balancing problems around this by letting
> you reduce how many PGs an OSD is primary for without changing the
> location of the actual data in the cluster. But dynamic updates to
> that aren't really feasible either (it's a map change and requires
> repeering). [...]

I forgot about this. Thanks for the reminder: this would definitely help in some of my use cases where the load is predictable over a relatively long period.

I'll have to dig into the sources one day; I can't stop wondering about various aspects of the internals since I began using Ceph (I've worked on the code of distributed systems on several occasions and I've always been hooked easily)...
Best regards,
Lionel Bouton
Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs
On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton wrote:
> [...]
> If the general idea isn't bad or already obsoleted by another it's
> obviously not trivial.
> For example, it can create unstable feedback loops, so if I were to
> try and implement it I'd probably start with a "selective"
> proxy/redirect, with the probability of proxying/redirecting computed
> from the respective loads of all OSDs storing a given PG, to avoid
> "ping-pong" situations where read requests overload one OSD, then
> another, and come round again.
>
> Any thought? Is it based on wrong assumptions? Would it prove to be a
> can of worms if someone tried to implement it?

Yeah, there's one big thing you're missing: we strictly order reads and writes to an object, and the primary is the serialization point.

If we were to proxy reads to another replica it would be easy enough for the primary to continue handling the ordering, but if it were just a redirect it wouldn't be able to do so (the primary doesn't know when the read is completed, allowing it to start a write). Setting up the proxy of course requires a lot more code, but more importantly it's more resource-intensive on the primary, so I'm not sure it's worth it. :/

The "primary affinity" value we recently introduced is designed to help alleviate persistent balancing problems around this by letting you reduce how many PGs an OSD is primary for without changing the location of the actual data in the cluster. But dynamic updates to that aren't really feasible either (it's a map change and requires repeering).

There are also relaxed consistency mechanisms that let clients read from a replica (randomly, or the one "closest" to them, etc.), but with these there's no good way to get load data from the OSDs to the clients.

So redirects of some kind sound like a good feature, but I'm not sure how one could go about implementing them reasonably. I think the actual proxy is probably the best bet, but that's an awful lot of code in critical places and with lots of dependencies whose performance/balancing benefits I'm a little dubious of.
:/

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
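[Editor's note: a minimal sketch of the idea behind primary affinity as Greg describes it, simplified and not the actual Ceph algorithm; the function name and hashing scheme are invented for illustration. The key property is that the choice must be deterministic per PG, so every client computes the same primary without coordination: walk the CRUSH-ordered replicas and keep each one as primary with probability equal to its affinity.]

```python
import hashlib

def choose_primary(pg_id, osds, affinity):
    """osds: CRUSH-ordered replica list for the PG.
    affinity: osd -> value in [0, 1]; missing entries default to 1.0.
    Deterministic per (pg, osd), so all clients agree on the primary."""
    for osd in osds:
        a = affinity.get(osd, 1.0)
        if a >= 1.0:
            return osd  # full affinity: always accepted as primary
        # Deterministic pseudo-random draw in [0, 1) derived from (pg, osd).
        h = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
        draw = int.from_bytes(h[:8], "big") / 2**64
        if draw < a:
            return osd
    return osds[0]  # nobody passed the draw: fall back to the CRUSH primary
```

Lowering an OSD's affinity (in a real cluster, via something like `ceph osd primary-affinity <osd-id> <weight>`) shifts primary duty onto later replicas in the list without moving any data, which is why a change is cheap in bytes but, as Greg notes, still an OSDMap change requiring repeering.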
[ceph-users] Question/idea about performance problems with a few overloaded OSDs
Hi,

I've yet to install 0.80.7 on one node to confirm its stability and use the new IO priority tuning parameters enabling prioritized access to data from client requests.

In the meantime, faced with large slowdowns caused by resync or external IO load (although external IO load is not expected, it can happen in migrations from other storage solutions, as in our recent experience), I've got an idea related to the underlying problem (IO load concurrent with client requests, or even concentrated client requests) that might already be implemented (or not be of much value), so I'll write it down to get feedback.

When IO load is not balanced correctly across OSDs, the most loaded OSD becomes a bottleneck for both write and read requests, and for many (most?) workloads will become a bottleneck for the whole storage network as seen by the client. This has happened to us on numerous occasions (low filesystem performance, OSD restarts triggering backfills or recoveries).

For read requests, would it be beneficial for OSDs to communicate with their peers to find out their recent IO mean/median/... service time, and make OSDs able to proxy requests to less loaded nodes when they are substantially more loaded than their peers? If the additional network load generated by proxying requests proves detrimental to overall performance, maybe an update to librados to accept a hint to redirect read requests for a given PG for a given period might be a solution.

I understand that even if this is possible for read requests, it doesn't apply to write requests, because those are synchronized across all replicas. That said, diminishing the read load on one OSD without modifying write behavior will obviously help that OSD process write requests faster.

Even if the general idea isn't bad or already obsoleted by another, it's obviously not trivial.
For example, it can create unstable feedback loops, so if I were to try and implement it I'd probably start with a "selective" proxy/redirect, with the probability of proxying/redirecting computed from the respective loads of all OSDs storing a given PG, to avoid "ping-pong" situations where read requests overload one OSD, then another, and come round again.

Any thought? Is it based on wrong assumptions? Would it prove to be a can of worms if someone tried to implement it?

Best regards,
Lionel Bouton
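[Editor's note: the "selective" redirect above could be sketched as follows. This is a hypothetical illustration with invented names, not Ceph code. Each replica's chance of serving a read is inversely proportional to its recent service time, so an overloaded OSD sheds only a fraction of its read traffic rather than all of it, which damps the ping-pong feedback loop Lionel is worried about.]

```python
import random

def pick_replica(service_times, rng=random):
    """service_times: mapping osd -> recent mean IO service time (seconds).
    Returns one OSD, chosen with probability proportional to 1/service_time,
    so load shifts gradually rather than flipping entirely to one replica."""
    weights = {osd: 1.0 / t for osd, t in service_times.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for osd, w in weights.items():
        r -= w
        if r <= 0:
            return osd
    return osd  # floating-point edge case: fall back to the last replica
```

For instance, with service times {0: 0.010, 1: 0.010, 2: 0.030}, OSD 2 still serves about 1/7 of the reads instead of its usual 1/3, so it recovers without its peers being suddenly swamped and the roles reversing.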