>> But data exchange and node discovery are taking place in different SPI.

I am just worried about whether the discovery thread has high enough priority to finish the join process first when the communication threads are also very busy.
So when a new server node is joining the topology, after the coordinator adds it to its local NodeRing, the coordinator begins the partition rebalance and doesn't wait for all the server nodes to confirm that the join process is done (that is, for the coordinator to receive the NodeAddFinishedMessage again), right? Could you share a design doc or a diagram of how the discovery and communication SPIs work during the join process?

>> Can you provide log files from all nodes?

I will try to provide them, but there are too many logs because I turned on DEBUG for the Discovery and Communication SPIs. Do you have any suggestion on which modules should have DEBUG turned on? BTW, could you suggest some keywords to look for in the logs, so that I can extract the relevant parts and ease your debugging?

Thanks,
-Jason

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Fail-to-join-topology-and-repeat-join-process-tp6987p7084.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.