>>> 
>>> One node might be busy doing GC and stay unresponsive for a whole
>>> second or longer, another one might actually have crashed without you
>>> knowing it yet. These are unlikely but possible.
>> All these are possible, but I would rather consider them exceptional 
>> situations, possibly handled by retry logic. We should *not* optimise for 
>> these situations IMO.
>> Thinking about our last performance results, we average 26k gets per 
>> second. With numOwners = 2, this means each node handles 26k *redundant* 
>> gets every second. I'm not concerned about the network load, as Bela 
>> mentioned in a previous mail that the network link should not be the 
>> bottleneck, but there's a huge amount of unnecessary activity in the OOB 
>> threads, which should rather be used for releasing locks or whatever else 
>> is needed. On top of that, this wasted activity makes GC pauses more 
>> likely, as the effort for a get is practically numOwners times higher than 
>> it should be.
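To make the redundancy concrete, the pattern described above is roughly the
following (a sketch with made-up names, not the actual Infinispan code;
locateOwners() and remoteGet() are hypothetical helpers):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // The get is sent to every owner and the first reply wins, so with
    // numOwners = 2 each owner spends an OOB thread on every get even
    // though half of the replies are discarded on arrival.
    CompletableFuture<Object> parallelGet(Object key) {
        List<Address> owners = locateOwners(key);      // numOwners nodes
        CompletableFuture<?>[] replies = owners.stream()
                .map(owner -> remoteGet(owner, key))   // one RPC per owner
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.anyOf(replies);       // first reply wins
    }
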
>> 
>>> More likely, a rehash is in progress, and you could then be asking a node
>>> which doesn't have the value yet (or doesn't have it anymore).
>> 
>> this is a consistency issue and I think we can find a way to handle it 
>> differently.
>>> 
>>> All good reasons for which imho it makes sense to send out "a couple"
>>> of requests in parallel, but I'd be unlikely to want to send more than 2,
>>> and I agree often 1 might be enough.
>>> Maybe it should even optimize for the most common case: send out just
>>> one, use a more aggressive timeout, and in case of trouble ask the
>>> next node.
>> +1
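As a sketch, the staggered version could look like this (hypothetical names
again, assuming remoteGet() returns a CompletableFuture; error handling
elided):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    // Ask the first owner only; if no reply arrives within an aggressive
    // timeout, fall back to the next owner. Note that using null as the
    // timeout marker runs into the same "are null results valid?" question
    // raised below.
    CompletableFuture<Object> staggeredGet(Object key, long timeoutMs) {
        List<Address> owners = locateOwners(key);
        CompletableFuture<Object> result = remoteGet(owners.get(0), key);
        for (Address next : owners.subList(1, owners.size())) {
            result = result
                .completeOnTimeout(null, timeoutMs, TimeUnit.MILLISECONDS)
                .thenCompose(v -> v != null
                        ? CompletableFuture.completedFuture(v)
                        : remoteGet(next, key));   // try the next owner
        }
        return result;
    }
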
>>> 
>>> In addition, sending a single request might spare us some Future,
>>> await+notify messiness in terms of the CPU cost of sending the request.
>> it's the remote OOB thread that's the most costly resource imo.
> 
> I think I agree on all points, it makes more sense.
> Just that in a large cluster of, say, 1000 nodes, I might want 20 owners
> as a sweet spot in the read/write performance tradeoff, and with such
> high numbers I guess doing 2-3 gets in parallel might make sense, as
> those "unlikely" events suddenly become almost certain, especially a
> rehash being in progress.
> So I'd propose a separate configuration option for the number of
> parallel get requests, and one to define a "try next node" policy. Or
> this policy could be the whole strategy, with the number of gets one of
> the options for the default implementation.

Agreed that having a configurable remote get policy makes sense.
We already have a JIRA for this[1]; I'll start working on it, as the
performance results are haunting me.
I'd like to have Dan's input on this first as well, since he has worked with
remote gets and I still don't know why null results are not considered valid :)
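One possible shape for it, just to have something concrete to discuss (all
names are made up; the real SPI is what ISPN-825 should define):

    // A pluggable policy: the default sends a single request with a quick
    // fallback, while a large cluster with many owners could ask 2-3
    // owners up front.
    public interface RemoteGetPolicy {
        int parallelGets(int numOwners);   // # of owners asked in round one
        long staggerTimeoutMs();           // when to try the next owner
    }

    public final class StaggeredGetPolicy implements RemoteGetPolicy {
        @Override public int parallelGets(int numOwners) { return 1; }
        @Override public long staggerTimeoutMs() { return 100; }
    }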

[1] https://issues.jboss.org/browse/ISPN-825