Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs

2014-10-21 Thread Mark Nelson

On 10/21/2014 01:06 PM, Lionel Bouton wrote:

> Hi Gregory,
>
> On 21/10/2014 19:39, Gregory Farnum wrote:
>
>> On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton wrote:
>>
>>> [...]
>>> Any thoughts? Is it based on wrong assumptions? Would it prove to be a
>>> can of worms if someone tried to implement it?
>>
>> Yeah, there's one big thing you're missing: we strictly order reads
>> and writes to an object, and the primary is the serialization point.
>
> Of course... I should have anticipated this. As you explain later
> (thanks for the detailed explanation, by the way), implementing redirects
> would require a whole new way of coordinating accesses. I'm not yet
> familiar with Ceph internals, but I suspect this would mutate Ceph into
> another beast entirely...



>> If we were to proxy reads to another replica it would be easy enough
>> for the primary to continue handling the ordering, but if it were just
>> a redirect it wouldn't be able to do so (the primary doesn't know when
>> the read is completed, allowing it to start a write). Setting up the
>> proxy of course requires a lot more code, but more importantly it's
>> more resource-intensive on the primary, so I'm not sure if it's worth
>> it. :/
>
> Difficult to know without real-life testing. It's a non-trivial
> CPU/network/disk capacity trade-off...
>
>> The "primary affinity" value we recently introduced is designed to
>> help alleviate persistent balancing problems around this by letting
>> you reduce how many PGs an OSD is primary for without changing the
>> location of the actual data in the cluster. But dynamic updates to
>> that aren't really feasible either (it's a map change and requires
>> repeering). [...]


> I forgot about this. Thanks for the reminder: this would definitely help
> in some of my use cases where the load is predictable over a relatively
> long period.
>
> I'll have to dig into the sources one day; I haven't stopped wondering about
> various aspects of the internals since I began using Ceph (I've worked
> on the code of distributed systems on several occasions and I've always
> been hooked easily)...


At some point I'd like to experiment with creating some kind of
datastore proxy layer to sit below the OSDs and implement a similar scheme
where latency statistics are tracked and writes get directed to
different stores. The idea would be to generally keep the benefits of
CRUSH and deterministic placement (at least as far as getting the data
to some OSD on a node), but then allow some level of flexibility in
terms of avoiding hotspots on specific disks (heavy reads, seek
contention, vibration, leveldb compaction stalls, etc.). This
unfortunately reintroduces something like a lookup table, but perhaps at
the node level this could be made fast enough that it wouldn't be as
much of a problem.
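
As a very rough sketch of what such a node-local proxy could look like
(nothing like this exists in Ceph; the class names, the EWMA smoothing and
the dict used as a lookup table are all invented for illustration):

# Hypothetical sketch only -- nothing like this exists in Ceph.
import time


class BackingStore:
    """One local data store (e.g. a disk) behind the node proxy."""

    def __init__(self, name):
        self.name = name
        self.ewma_latency = 0.0   # smoothed write latency in seconds
        self.data = {}

    def write(self, key, value):
        start = time.monotonic()
        self.data[key] = value    # stand-in for a real disk write
        elapsed = time.monotonic() - start
        # Exponentially weighted moving average keeps the stat cheap to update.
        self.ewma_latency = 0.9 * self.ewma_latency + 0.1 * elapsed


class NodeProxy:
    """Directs writes to the currently least-loaded store on this node."""

    def __init__(self, stores):
        self.stores = stores
        self.placement = {}       # object key -> store name (the lookup table)

    def write(self, key, value):
        target = min(self.stores, key=lambda s: s.ewma_latency)
        target.write(key, value)
        self.placement[key] = target.name

    def read(self, key):
        # Reads can no longer derive the store deterministically;
        # they have to consult the lookup table.
        name = self.placement[key]
        store = next(s for s in self.stores if s.name == name)
        return store.data[key]


stores = [BackingStore("hdd0"), BackingStore("hdd1")]
proxy = NodeProxy(stores)
proxy.write("obj1", b"payload")
print(proxy.read("obj1"))

The placement dict is exactly the lookup table mentioned above; the open
question is whether it stays cheap enough when kept per node rather than
cluster-wide.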


I don't know if this would actually work in practice, but I think it 
would be a very interesting project to explore.


Mark





Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs

2014-10-21 Thread Lionel Bouton
Hi Gregory,

On 21/10/2014 19:39, Gregory Farnum wrote:
> On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton  
> wrote:
>> [...]
>> Any thoughts? Is it based on wrong assumptions? Would it prove to be a
>> can of worms if someone tried to implement it?
> Yeah, there's one big thing you're missing: we strictly order reads
> and writes to an object, and the primary is the serialization point.

Of course... I should have anticipated this. As you explain later
(thanks for the detailed explanation, by the way), implementing redirects
would require a whole new way of coordinating accesses. I'm not yet
familiar with Ceph internals, but I suspect this would mutate Ceph into
another beast entirely...


> If we were to proxy reads to another replica it would be easy enough
> for the primary to continue handling the ordering, but if it were just
> a redirect it wouldn't be able to do so (the primary doesn't know when
> the read is completed, allowing it to start a write). Setting up the
> proxy of course requires a lot more code, but more importantly it's
> more resource-intensive on the primary, so I'm not sure if it's worth
> it. :/

Difficult to know without real-life testing. It's a non-trivial
CPU/network/disk capacity trade-off...

> The "primary affinity" value we recently introduced is designed to
> help alleviate persistent balancing problems around this by letting
> you reduce how many PGs an OSD is primary for without changing the
> location of the actual data in the cluster. But dynamic updates to
> that aren't really feasible either (it's a map change and requires
> repeering). [...]

I forgot about this. Thanks for the reminder: this would definitely help
in some of my use cases where the load is predictable over a relatively
long period.

I'll have to dig into the sources one day; I haven't stopped wondering about
various aspects of the internals since I began using Ceph (I've worked
on the code of distributed systems on several occasions and I've always
been hooked easily)...

Best regards,

Lionel Bouton


Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs

2014-10-21 Thread Gregory Farnum
On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton  wrote:
> Hi,
>
> I've yet to install 0.80.7 on one node to confirm its stability and use
> the new IO priority tuning parameters enabling prioritized access to
> data from client requests.
>
> In the meantime, faced with large slowdowns caused by resync or external
> IO load (external IO load is not expected, but it can happen during
> migrations from other storage solutions, as in our recent experience),
> I've got an idea related to the underlying problem (IO load concurrent
> with client requests, or even concentrated client requests) that might
> already be implemented (or not be of much value), so I'll write it
> down to get feedback.
>
> When IO load is not balanced correctly across OSDs, the most loaded OSD
> becomes a bottleneck for both write and read requests, and for many
> (most?) workloads it becomes a bottleneck for the whole storage network
> as seen by the client. This has happened to us on numerous occasions (low
> filesystem performance, OSD restarts triggering backfills or recoveries).
> For read requests, would it be beneficial for OSDs to communicate with
> their peers to learn their recent IO mean/median/... service times, and
> to let an OSD proxy requests to less loaded peers when it is
> substantially more loaded than they are?
> If the additional network load generated by proxying requests proves
> detrimental to overall performance, maybe an update to librados to
> accept a hint redirecting read requests for a given PG for a given
> period might be a solution.
>
> I understand that even if this is possible for read requests, it
> doesn't apply to write requests, because they are synchronized across all
> replicas. That said, diminishing the read load on one OSD without modifying
> write behavior will obviously help that OSD process write requests faster.
> Even if the general idea isn't bad or already obsoleted by another, it's
> obviously not trivial. For example, it can create unstable feedback loops,
> so if I were to try to implement it I'd probably start with a
> "selective" proxy/redirect, with the probability of proxying/redirecting
> computed from the respective loads of all OSDs storing a given PG,
> to avoid "ping-pong" situations where read requests overload one OSD,
> then another, and come round again.
>
> Any thoughts? Is it based on wrong assumptions? Would it prove to be a
> can of worms if someone tried to implement it?

Yeah, there's one big thing you're missing: we strictly order reads
and writes to an object, and the primary is the serialization point.
If we were to proxy reads to another replica it would be easy enough
for the primary to continue handling the ordering, but if it were just
a redirect it wouldn't be able to do so (the primary doesn't know when
the read is completed, allowing it to start a write). Setting up the
proxy of course requires a lot more code, but more importantly it's
more resource-intensive on the primary, so I'm not sure if it's worth
it. :/
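
To make the ordering point concrete, here is a toy sketch (invented names,
not Ceph code) of a primary that serializes ops per object: with a proxied
read the primary still sees the completion and can release the next op,
while a bare client redirect would bypass it entirely:

# Toy illustration only -- not Ceph code; all names are invented.
from collections import defaultdict, deque


class Replica:
    def read(self, obj, on_done):
        # Pretend to serve the read, then notify the caller (the primary).
        on_done()


class Primary:
    """Serializes reads and writes per object, as the primary OSD does."""

    def __init__(self, replica):
        self.replica = replica
        self.queues = defaultdict(deque)   # object -> pending (kind, proxied)

    def submit(self, obj, kind, proxied=False):
        q = self.queues[obj]
        q.append((kind, proxied))
        if len(q) == 1:
            self._dispatch(obj)

    def _dispatch(self, obj):
        kind, proxied = self.queues[obj][0]
        done = lambda: self._complete(obj)
        if kind == "read" and proxied:
            # Proxied read: a replica serves it, but the completion comes
            # back to the primary, which can then release the next op.
            self.replica.read(obj, on_done=done)
        else:
            print("primary handles %s on %s" % (kind, obj))
            done()
        # A bare redirect would bypass this queue entirely: the primary
        # would never see the read complete, so it couldn't know when it
        # is safe to start a write queued behind it.

    def _complete(self, obj):
        self.queues[obj].popleft()
        if self.queues[obj]:
            self._dispatch(obj)


p = Primary(Replica())
p.submit("obj1", "read", proxied=True)  # served by a replica, ordered by the primary
p.submit("obj1", "write")               # released only once the read has completed
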
The "primary affinity" value we recently introduced is designed to
help alleviate persistent balancing problems around this by letting
you reduce how many PGs an OSD is primary for without changing the
location of the actual data in the cluster. But dynamic updates to
that aren't really feasible either (it's a map change and requires
repeering).
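
For reference, primary affinity can be set per OSD from the CLI with
"ceph osd primary-affinity" (Firefly and later; it may also require
"mon osd allow primary affinity = true" on the monitors). The script below
is only a sketch of how one might lower it for persistently slow OSDs: the
JSON field names read from "ceph osd perf", the 100 ms threshold and the
0.5 affinity value are assumptions/examples, not recommendations.

# Sketch only. "ceph osd primary-affinity" is a real command; the JSON
# layout of "ceph osd perf" below is an assumption -- verify it on your
# cluster before using anything like this.
import json
import subprocess

THRESHOLD_MS = 100   # arbitrary example threshold
LOW_AFFINITY = 0.5   # arbitrary example value


def ceph_json(*args):
    out = subprocess.check_output(["ceph"] + list(args) + ["--format", "json"])
    return json.loads(out)


perf = ceph_json("osd", "perf")
for info in perf.get("osd_perf_infos", []):               # field name: assumption
    osd_id = info["id"]
    commit_ms = info["perf_stats"]["commit_latency_ms"]   # field name: assumption
    if commit_ms > THRESHOLD_MS:
        # As noted above, this is a map change and triggers repeering, so it
        # only makes sense for slow-moving, predictable load patterns.
        subprocess.check_call(
            ["ceph", "osd", "primary-affinity",
             "osd.%d" % osd_id, str(LOW_AFFINITY)])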

There are also relaxed consistency mechanisms that let clients read
from a replica (randomly, or the one "closest" to them, etc), but with
these there's no good way to get load data from the OSDs to the
clients.

So redirects of some kind sound like a good feature, but I'm not sure
how one could go about implementing them reasonably. I think the
actual proxy is probably the best bet, but that's an awful lot of code
in critical places and with lots of dependencies whose
performance/balancing benefits I'm a little dubious of. :/
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


[ceph-users] Question/idea about performance problems with a few overloaded OSDs

2014-10-21 Thread Lionel Bouton
Hi,

I've yet to install 0.80.7 on one node to confirm its stability and use
the new IO priority tuning parameters enabling prioritized access to
data from client requests.

In the meantime, faced with large slowdowns caused by resync or external
IO load (external IO load is not expected, but it can happen during
migrations from other storage solutions, as in our recent experience),
I've got an idea related to the underlying problem (IO load concurrent
with client requests, or even concentrated client requests) that might
already be implemented (or not be of much value), so I'll write it
down to get feedback.

When IO load is not balanced correctly across OSDs, the most loaded OSD
becomes a bottleneck for both write and read requests, and for many
(most?) workloads it becomes a bottleneck for the whole storage network
as seen by the client. This has happened to us on numerous occasions (low
filesystem performance, OSD restarts triggering backfills or recoveries).
For read requests, would it be beneficial for OSDs to communicate with
their peers to learn their recent IO mean/median/... service times, and
to let an OSD proxy requests to less loaded peers when it is
substantially more loaded than they are?
If the additional network load generated by proxying requests proves
detrimental to overall performance, maybe an update to librados to
accept a hint redirecting read requests for a given PG for a given
period might be a solution.
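
Purely as an illustration of the idea (nothing here exists in Ceph; the
names, the EWMA smoothing and the 2x threshold are invented): each OSD could
keep a smoothed service time, learn its peers' values through periodic
exchanges, and proxy reads only when it is substantially slower than the
best peer:

# Sketch of the idea only; not Ceph code.
class ServiceTimeTracker:
    """Smoothed per-OSD IO service time, exchanged with PG peers."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.local = 0.0     # our own smoothed service time (seconds)
        self.peers = {}      # peer osd id -> last reported service time

    def record_local(self, seconds):
        self.local = (1 - self.alpha) * self.local + self.alpha * seconds

    def update_peer(self, osd_id, seconds):
        # Would be fed by periodic stats exchanged between the OSDs of a PG.
        self.peers[osd_id] = seconds

    def proxy_target(self, factor=2.0):
        """Return a peer to proxy reads to, or None to serve locally.

        Only proxy when we are "substantially" slower than the best peer,
        to avoid bouncing load around for small differences.
        """
        if not self.peers:
            return None
        best_id = min(self.peers, key=self.peers.get)
        if self.local > factor * self.peers[best_id]:
            return best_id
        return None


t = ServiceTimeTracker()
t.record_local(0.050)       # we are averaging ~50 ms per IO
t.update_peer(12, 0.010)    # osd.12 reports ~10 ms
t.update_peer(27, 0.030)
print(t.proxy_target())     # -> 12: proxy reads there for now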

I understand that even if this is possible for read requests, it
doesn't apply to write requests, because they are synchronized across all
replicas. That said, diminishing the read load on one OSD without modifying
write behavior will obviously help that OSD process write requests faster.
Even if the general idea isn't bad or already obsoleted by another, it's
obviously not trivial. For example, it can create unstable feedback loops,
so if I were to try to implement it I'd probably start with a
"selective" proxy/redirect, with the probability of proxying/redirecting
computed from the respective loads of all OSDs storing a given PG,
to avoid "ping-pong" situations where read requests overload one OSD,
then another, and come round again.
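
Again purely illustrative (invented code, not Ceph): instead of always
redirecting to the least loaded replica, each replica of the PG could be
picked with a probability proportional to the inverse of its reported load,
so the least loaded OSD receives more of the reads without receiving all of
them:

# Illustrative only; not Ceph code. Assumes a non-empty loads dict.
import random


def pick_replica(loads, rng=random):
    """loads: dict of osd id -> recent service time in seconds (> 0)."""
    weights = {osd: 1.0 / max(t, 1e-6) for osd, t in loads.items()}
    total = sum(weights.values())
    r = rng.uniform(0, total)
    acc = 0.0
    for osd, w in weights.items():
        acc += w
        if r <= acc:
            return osd
    return osd   # numerical edge case: fall back to the last one


# Example: osd.3 is heavily loaded, osd.7 and osd.9 are not. Most reads go
# to 7 and 9, but osd.3 still gets a share, so the load shifts gradually
# instead of stampeding onto whichever OSD currently looks fastest.
loads = {3: 0.120, 7: 0.015, 9: 0.020}
sample = [pick_replica(loads) for _ in range(10000)]
print({osd: sample.count(osd) for osd in loads})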

Any thoughts? Is it based on wrong assumptions? Would it prove to be a
can of worms if someone tried to implement it?

Best regards,

Lionel Bouton