[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-10-27 Thread Moti Nisenson-Ken (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960532#comment-16960532
 ] 

Moti Nisenson-Ken commented on IGNITE-12133:


After looking a bit deeper into it, I think [~ivan.glukos] is on the right 
track. It would make the most sense to implement this first as a general 
mechanism for broadcasting a message to a set of nodes, before looking into 
using it for discovery. That way it could be leveraged wherever it makes 
sense, and a decision on whether to use it as part of discovery could be made 
once it is stable.

In this case we'd talk about the initiator (instead of the coordinator), and 
the skip list would then start from the initiator. Since the mechanism is 
general, there should be efficient ways to support common cases from a 
message-format perspective. Common cases would likely be things like: the 
coordinator is the initiator, the set of nodes is all nodes, the set of nodes 
is those holding data for a cache, all baseline topology nodes, etc. For a 
random subset of nodes the message would of course have to include the nodes 
to be contacted.
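One way to picture the message-format support for those common cases is a small recipient-scope descriptor, so the usual targets don't require enumerating every node in the message. This is purely an illustrative sketch; the type and names are hypothetical, not Ignite APIs:

```java
import java.util.Collections;
import java.util.List;
import java.util.UUID;

/** Illustrative sketch (not an Ignite API): a compact recipient-set descriptor
 *  so common broadcast targets don't require listing every node in the message. */
public final class BroadcastScope {
    public enum Kind { ALL_NODES, BASELINE_NODES, CACHE_DATA_NODES, EXPLICIT_LIST }

    public final Kind kind;
    public final String cacheName;   // set only for CACHE_DATA_NODES
    public final List<UUID> nodeIds; // set only for EXPLICIT_LIST

    private BroadcastScope(Kind kind, String cacheName, List<UUID> nodeIds) {
        this.kind = kind;
        this.cacheName = cacheName;
        this.nodeIds = nodeIds;
    }

    public static BroadcastScope allNodes() {
        return new BroadcastScope(Kind.ALL_NODES, null, Collections.emptyList());
    }

    public static BroadcastScope baselineNodes() {
        return new BroadcastScope(Kind.BASELINE_NODES, null, Collections.emptyList());
    }

    public static BroadcastScope cacheDataNodes(String cacheName) {
        return new BroadcastScope(Kind.CACHE_DATA_NODES, cacheName, Collections.emptyList());
    }

    public static BroadcastScope explicit(List<UUID> nodeIds) {
        return new BroadcastScope(Kind.EXPLICIT_LIST, null, List.copyOf(nodeIds));
    }

    public static void main(String[] args) {
        System.out.println(cacheDataNodes("myCache").kind); // CACHE_DATA_NODES
    }
}
```

Only the EXPLICIT_LIST case pays the cost of carrying node IDs; the other kinds can be resolved by each receiver from its local topology view.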

> O(log n) partition exchange
> ---
>
> Key: IGNITE-12133
> URL: https://issues.apache.org/jira/browse/IGNITE-12133
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Moti Nisenson-Ken
>Priority: Major
>
> Currently, partition exchange leverages a ring. This means that 
> communication is O(n) in the number of nodes. It also means that if 
> non-coordinator nodes hang, it can take much longer to successfully resolve 
> the topology.
> Instead, why not use something like a skip-list where the coordinator is 
> first. The coordinator notifies the first node at each level of the 
> skip-list. Each node then notifies all of its "near-neighbours" in the 
> skip-list, where node B is a near-neighbour of node A if max-level(B) <= 
> max-level(A) and B is the first node at its level when traversing from A in 
> the direction of B, skipping over nodes C which have 
> max-level(C) > max-level(A).
> 1
> 1 . . . 3
> 1 . . . 3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, and 4 -> 
> 6, and 5 -> 6.
> One can achieve better redundancy by having each node traverse in both 
> directions, and having the coordinator also notify the last node in the list 
> at each level. This way in the above example if 2 and 3 were both down, 4 
> would still get notified from 5 and 6 (in the backwards direction).
>  
> The idea is that each individual node has O(log n) nodes to notify, so the 
> overall time is reduced. Additionally, we can deal well with at least one 
> node failure; if one includes the option of processing backwards, two 
> consecutive node failures can be handled as well. By taking this kind of 
> approach, the coordinator can treat any nodes it didn't receive a message 
> from as not connected, and update the topology accordingly (disconnecting 
> any nodes it didn't get a notification from). While there are some edge 
> cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 
> disconnected nodes - the connected node would be wrongly ejected from the 
> topology), these would generally be too rare to need explicit handling.
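Under one reading of the near-neighbour rule above, a node's notification set can be computed from the per-node max-levels alone: walk right, skip nodes above the sender's max-level, and notify each node whose max-level strictly exceeds everything seen so far on the walk. A rough Java sketch (illustrative, not Ignite code) that reproduces the example diagram:

```java
import java.util.ArrayList;
import java.util.List;

public class SkipListBroadcast {
    /**
     * Returns the indices the node at position 'from' notifies, given the
     * max-level of every node in list order. Walks to the right, skipping
     * nodes whose max-level exceeds the sender's, and notifies each node whose
     * max-level strictly exceeds every max-level seen so far on the walk.
     */
    static List<Integer> nearNeighbours(int[] maxLevel, int from) {
        List<Integer> out = new ArrayList<>();
        int senderLvl = maxLevel[from];
        int highestSeen = 0;
        for (int i = from + 1; i < maxLevel.length; i++) {
            if (maxLevel[i] > senderLvl)
                continue; // skip nodes above the sender's max-level
            if (maxLevel[i] > highestSeen) {
                out.add(i);
                highestSeen = maxLevel[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Max-levels for nodes 1..6 in the diagram:
        // node 1 -> 4, node 2 -> 1, node 3 -> 3, node 4 -> 1, node 5 -> 2, node 6 -> 1.
        int[] lvl = {4, 1, 3, 1, 2, 1};
        System.out.println(nearNeighbours(lvl, 0)); // node 1 notifies nodes 2 and 3
        System.out.println(nearNeighbours(lvl, 2)); // node 3 notifies nodes 4 and 5
        System.out.println(nearNeighbours(lvl, 1)); // node 2 notifies node 4
    }
}
```

The backwards pass for redundancy would be the same walk mirrored to the left.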



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-10-06 Thread Moti Nisenson-Ken (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945269#comment-16945269
 ] 

Moti Nisenson-Ken commented on IGNITE-12133:


I'm not familiar enough with Ignite internals - from a message ordering 
perspective, since the coordinator is the one initiating this, and the 
coordinator doesn't change without a change to the major topology version, 
shouldn't it be simple enough to establish an ordering? The change to the next 
major topology version should be the last message sent under the previous 
topology version. Nodes can then hold on to messages received out of order 
(possibly up to some maximum size) for handling in order. I'd assume that when 
the coordinator goes down, the new coordinator will need to notify under the 
previous topology version.
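The hold-back idea above (buffering out-of-order messages until the gap fills, then handling them in order) can be sketched as a small buffer keyed by a sequence number. Names are hypothetical; this is not Ignite code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Buffers out-of-order messages and delivers them in sequence order. */
public class HoldBackBuffer<M> {
    private final Map<Long, M> pending = new HashMap<>();
    private final Consumer<M> deliver;
    private long nextSeq;

    public HoldBackBuffer(long firstSeq, Consumer<M> deliver) {
        this.nextSeq = firstSeq;
        this.deliver = deliver;
    }

    /** Accepts a message with its sequence number; flushes any run that is now in order. */
    public void receive(long seq, M msg) {
        if (seq < nextSeq)
            return; // duplicate: already delivered
        pending.put(seq, msg);
        M next;
        while ((next = pending.remove(nextSeq)) != null) {
            deliver.accept(next);
            nextSeq++;
        }
    }
}
```

A production version would also need the maximum-size bound mentioned above, plus a way to re-request missing messages instead of waiting forever.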

I agree that the ability to mutate messages is lost; this requires sending the 
updates back to the coordinator which then has additional work to do merging 
them. 

 

Regarding [~ivan.glukos]'s note: top-level node failures are not an issue. The 
issue is around consecutive node failures at the bottommost layer. If you look 
at my diagram above, 1 -> 3 and 1 -> 2. Thus, failure of 3 doesn't prevent 4 
from getting a message, since 2 -> 4 (3 is not a near-neighbour of 2 because 
its max-level is higher than 2's). Again, adding comms in both directions 
(where we have 1 -> 5 and 1 -> 6 as well) makes this even more robust.



[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-10-03 Thread Andrey Mashenkov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943433#comment-16943433
 ] 

Andrey Mashenkov commented on IGNITE-12133:
---

Yandex DB uses the SWIM protocol [1] for discovery and message propagation, 
and they are going to share their experience at the HL++ conference in 
Moscow, Russia in November.
Maybe we can adapt it or use some of its ideas?

[1] https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf



[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Alexei Scherbakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922809#comment-16922809
 ] 

Alexei Scherbakov commented on IGNITE-12133:


The PME protocol itself doesn't leverage the ring; it uses direct node-to-node 
communication for sending partition maps (except for a special case). The ring 
is used by the discovery protocol, which detects topology changes ("node 
left", "node added", etc.) and delivers the corresponding events to grid 
nodes; those events are what trigger PME.
The discovery protocol also provides guaranteed ordered message delivery, 
which is used extensively by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, having O(n) complexity for the default TcpDiscoverySpi 
implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, having k * O(m) complexity, where m is the number of I/O sender 
threads and k depends on the topology size.

Total PME complexity is the sum of 1 and 3, so to speed up PME we should 
improve both.

How to improve 1?
The ring was initially designed for small topologies and still works very 
well for them with default settings.
ZooKeeper-based discovery, which has better complexity, was introduced 
specifically for large topologies.
So for small topologies I suggest using the defaults; for large topologies 
ZooKeeper discovery should be used.

How to improve 3?
For small topologies, same as 1: use the defaults.
For large topologies we could take [~mnk]'s proposal and use a tree-like 
message propagation pattern to achieve O(log n) complexity.
I agree with [~ivan.glukos] that failover complexity increases, but I think 
it's doable.
NOTE: the same idea could be used to improve replicated cache performance on 
large topologies. We have a long-known issue with performance degradation 
when the topology is large.

[~Jokser] 
The gossip idea looks interesting, but it seems like a complicated change and 
reinventing the wheel. Why not stick to ZooKeeper?








[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Pavel Kovalenko (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922393#comment-16922393
 ] 

Pavel Kovalenko commented on IGNITE-12133:
--

[~ivan.glukos] I guess the general solution to this problem is to introduce a 
gossip protocol, as in Cassandra. In that case we wouldn't need a 
pre-determined skip-list topology for the nodes. The probabilistic nature of 
gossip also gives us ~log(N) rounds for disseminating such messages.
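As a sanity check on the ~log(N) claim, here is a toy push-gossip simulation (a simplified model, not Cassandra's actual protocol): each round, every informed node forwards the message to one peer chosen uniformly at random.

```java
import java.util.Random;

public class GossipRounds {
    /** Returns the number of push-gossip rounds until all n nodes are informed. */
    static int roundsToInformAll(int n, long seed) {
        Random rnd = new Random(seed);
        boolean[] informed = new boolean[n];
        informed[0] = true; // the initiator
        int informedCount = 1;
        int rounds = 0;
        while (informedCount < n) {
            boolean[] next = informed.clone();
            for (int i = 0; i < n; i++) {
                if (informed[i]) {
                    int target = rnd.nextInt(n); // push to a uniformly random peer
                    if (!next[target]) {
                        next[target] = true;
                        informedCount++;
                    }
                }
            }
            informed = next;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        // For n = 1000, push gossip typically needs around log2(n) + ln(n) rounds.
        System.out.println(roundsToInformAll(1000, 42L));
    }
}
```

The informed set can at most double per round, so the count can never beat ceil(log2(n)) rounds; the random-target collisions only add a logarithmic tail on top of that.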



[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-04 Thread Andrew Mashenkov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922349#comment-16922349
 ] 

Andrew Mashenkov commented on IGNITE-12133:
---

[~avinogradov], message ordering is important.
What if some nodes get the "create cache" and "drop cache" events in a 
different order?



[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-03 Thread Ivan Rakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921575#comment-16921575
 ] 

Ivan Rakov commented on IGNITE-12133:
-

Folks, what if we try to apply the skip-list approach not to discovery 
message routing, but to GridDhtPartitionsFullMessage delivery to all server 
nodes?
This may make the onDone() phase of PME faster, especially when the full 
message is heavy and the cluster topology is large. On the other hand, 
handling of failed nodes during PME would be dramatically complicated: if one 
of the top-level nodes fails, its "children" would somehow have to request 
the full map from the coordinator or another server node (currently, we have 
extra failover logic only for the coordinator-leave case).



[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-02 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920699#comment-16920699
 ] 

Aleksey Plekhanov commented on IGNITE-12133:


[~mnk],

As an addition to [~avinogradov]'s comment, we would also lose the ability to 
mutate messages and would have the same limitations as ZooKeeper discovery.

Perhaps it's better to make an effort to stabilize ZooKeeper discovery (which 
works faster than ring discovery for big topologies but still has some 
problems) instead of implementing and supporting a new discovery protocol.



[jira] [Commented] (IGNITE-12133) O(log n) partition exchange

2019-09-02 Thread Anton Vinogradov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920687#comment-16920687
 ] 

Anton Vinogradov commented on IGNITE-12133:
---

[~mnk], 

1) With the skip-list approach we would lose the discovery message processing 
order guarantee.

P.s. Not sure we need it in every case :) just an addition.


2) BTW, we already have a discovery implementation based on ZooKeeper. 
([https://apacheignite.readme.io/docs/zookeeper-discovery])

Anyway, it seems this optimization would speed up the ring rather than 
replace it with a star.

Could you start a dev-list discussion before the implementation?
