[ https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922349#comment-16922349 ]
Andrew Mashenkov commented on IGNITE-12133: ------------------------------------------- [~avinogradov], message ordering is important. What if some nodes will got "create cache" and "drop cache" events in different order? > O(log n) partition exchange > --------------------------- > > Key: IGNITE-12133 > URL: https://issues.apache.org/jira/browse/IGNITE-12133 > Project: Ignite > Issue Type: Improvement > Reporter: Moti Nisenson-Ken > Priority: Major > > Currently, partition exchange leverages a ring. This means that > communications is O\(n) in number of nodes. It also means that if > non-coordinator nodes hang it can take much longer to successfully resolve > the topology. > Instead, why not use something like a skip-list where the coordinator is > first. The coordinator can notify the first node at each level of the > skip-list. Each node then notifies all of its "near-neighbours" in the > skip-list, where node B is a near-neighbour of node-A, if max-level(nodeB) <= > max-level(nodeA), and nodeB is the first node at its level when traversing > from nodeA in the direction of nodeB, skipping over nodes C which have > max-level(C) > max-level(A). > 1 > 1 . . .3 > 1 3 . . . 5 > 1 . 2 . 3 . 4 . 5 . 6 > In the above 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, and 4 -> > 6, and 5 -> 6. > One can achieve better redundancy by having each node traverse in both > directions, and having the coordinator also notify the last node in the list > at each level. This way in the above example if 2 and 3 were both down, 4 > would still get notified from 5 and 6 (in the backwards direction). > > The idea is that each individual node has O(log n) nodes to notify - so the > overall time is reduced. Additionally, we can deal well with at least 1 node > failure - if one includes the option of processing backwards, 2 consecutive > node failures can be handled as well. By taking this kind of an approach, > then the coordinator can basically treat any nodes it didn't receive a > message from as not-connected, and update the topology as well (disconnecting > any nodes that it didn't get a notification from). While there are some edge > cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 > disconnected nodes - the connected node would be wrongly ejected from the > topology), these would generally be too rare to need explicit handling for. -- This message was sent by Atlassian Jira (v8.3.2#803003)