[
https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920687#comment-16920687
]
Anton Vinogradov commented on IGNITE-12133:
-------------------------------------------
[~mnk],
1) In the case of skip-list approach usage, we will lose discovery message
processing order guarantee.
P.s. Not sure we need it at some case :) just an addition.
2) BTW, we already have a discovery implementation based on ZooKeeper.
([https://apacheignite.readme.io/docs/zookeeper-discovery])
Anyway, seems, this optimization will speed-up the ring instead of replacement
with a star.
Could you start devlist discussion before the implementation?
> O(log n) partition exchange
> ---------------------------
>
> Key: IGNITE-12133
> URL: https://issues.apache.org/jira/browse/IGNITE-12133
> Project: Ignite
> Issue Type: Improvement
> Reporter: Moti Nisenson-Ken
> Priority: Major
>
> Currently, partition exchange leverages a ring. This means that
> communications is O(n) in number of nodes. It also means that if
> non-coordinator nodes hang it can take much longer to successfully resolve
> the topology.
> Instead, why not use something like a skip-list where the coordinator is
> first. The coordinator can notify the first node at each level of the
> skip-list. Each node then notifies all of its "near-neighbours" in the
> skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <=
> max-level(nodeA), and nodeB is the first node at its level when traversing
> from nodeA in the direction of nodeB, skipping over nodes C which have
> max-level(C) > max-level(A).
> 1
> 1 . . .3
> 1 3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, and 4 ->
> 6, and 5 -> 6.
> One can achieve better redundancy by having each node traverse in both
> directions, and having the coordinator also notify the last node in the list
> at each level. This way in the above example if 2 and 3 were both down, 4
> would still get notified from 5 and 6 (in the backwards direction).
>
> The idea is that each individual node has O(log n) nodes to notify - so the
> overall time is reduced. Additionally, we can deal well with at least 1 node
> failure - if one includes the option of processing backwards, 2 consecutive
> node failures can be handled as well. By taking this kind of an approach,
> then the coordinator can basically treat any nodes it didn't receive a
> message from as not-connected, and update the topology as well (disconnecting
> any nodes that it didn't get a notification from). While there are some edge
> cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2
> disconnected nodes - the connected node would be wrongly ejected from the
> topology), these would generally be too rare to need explicit handling for.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)