Thanks Val. But it seems that the failed node cannot be isolated by setting FailureDetectionTimeout, say to 2000 ms.
When the cluster (22 server nodes and 146 clients) has been running for a while, e.g. half a day, and some server nodes fail, e.g. because of a network issue, newly started nodes hang with the "Failed to wait for initial partition map exchange" error.

I did some debugging on a hung node. After it sends the GridDhtAffinityAssignmentRequest in the requestFromNextNode method, the "exchange worker" thread blocks on GridDhtAssignmentFetchFuture and no response ever arrives. This lasted for almost a day and the node still did not recover. I also checked the log on the node that received the request from requestFromNextNode: there are no errors, only the normal metrics output.

BTW, attached are the thread dumps for the failed node and the requested node, along with their logs. Please take a look; any suggestions will be appreciated.

FYI, we have spent over a month testing Ignite so far, but if this cannot be root-caused and resolved, we will have to give up all our effort on Ignite.

thread_dump_for_failed_to_wait_for_initial_partition_map_exchange.txt <http://apache-ignite-users.70518.x6.nabble.com/file/n6830/thread_dump_for_failed_to_wait_for_initial_partition_map_exchange.txt>
thread_dump_for_requested_node.log <http://apache-ignite-users.70518.x6.nabble.com/file/n6830/thread_dump_for_requested_node.log>
failed_node.log <http://apache-ignite-users.70518.x6.nabble.com/file/n6830/failed_node.log>
reqeusted_node.log <http://apache-ignite-users.70518.x6.nabble.com/file/n6830/reqeusted_node.log>

Thanks,
-Jason
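P.S. To make the symptom concrete, here is a minimal plain-Java sketch of what the hang looks like from the outside. It is not Ignite's code: ExchangeWaitSketch, requestWithoutResponse and awaitBounded are hypothetical names made up for illustration, and it only assumes that the waiting thread blocks on a future with no upper bound. With a bounded wait the caller at least gets control back and could try another node.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustration only: a plain-Java stand-in, not Ignite internals.
class ExchangeWaitSketch {
    /** Hypothetical stand-in for a request whose response never arrives. */
    static CompletableFuture<String> requestWithoutResponse() {
        return new CompletableFuture<>(); // nobody ever completes this future
    }

    /**
     * Waits for the response, but only up to timeoutMs milliseconds.
     * Returns "TIMED_OUT" when nothing comes back in time, so the caller
     * can decide what to do next instead of blocking indefinitely.
     */
    static String awaitBounded(CompletableFuture<String> fut, long timeoutMs)
            throws InterruptedException, ExecutionException {
        try {
            return fut.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMED_OUT";
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> pending = requestWithoutResponse();
        // Calling pending.get() with no timeout would block this thread forever.
        System.out.println(awaitBounded(pending, 200));

        // A future that already has a response returns it immediately.
        CompletableFuture<String> answered = CompletableFuture.completedFuture("assignment");
        System.out.println(awaitBounded(answered, 200));
    }
}
```

Running main prints TIMED_OUT for the request that never gets a response and then assignment for the one that does.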
