Re: Failed to wait for initial partition map exchange

Jason Sat, 06 Aug 2016 07:40:36 -0700

Thanks Val.

But it seems that the failed node cannot be isolated by setting
FailureDetectionTimeout to, say, 2000 ms.


When the cluster, with 22 server nodes and 146 clients, has been running for
some time (e.g. half a day) and some server nodes fail (e.g. because of a
network issue), newly started nodes hang with the "Failed to wait for initial
partition map exchange" error.

I've done some debugging on the hung node and found that, after it sends the
GridDhtAffinityAssignmentRequest in the requestFromNextNode method, the
"exchange worker" thread blocks on GridDhtAssignmentFetchFuture and no
response ever arrives. This went on for almost a day, and the node still did
not recover.
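
For anyone who wants to check the same thing from inside the JVM rather than
with an external thread dump, here is a minimal sketch using only the standard
java.lang.management API. It is not Ignite code. The class name, the helper
method, and the demo thread name "exchange-worker-demo" are made up for
illustration, and the name fragment to search for is an assumption that should
be adjusted to whatever thread name appears in your own dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper, not part of Ignite.
class ExchangeWorkerCheck {
    /** Returns one summary line per live thread whose name contains the fragment. */
    static List<String> describeThreads(String nameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();
        // false, false: skip locked monitor/synchronizer details, keep stack traces.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || !info.getThreadName().contains(nameFragment))
                continue;
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "<no stack>";
            out.add(info.getThreadName() + " state=" + info.getThreadState() + " at " + top);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a stuck worker: waits on a latch that is never released.
        CountDownLatch never = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            try {
                never.await();
            } catch (InterruptedException ignored) {
                // Demo thread only; nothing to clean up.
            }
        }, "exchange-worker-demo");
        t.setDaemon(true); // Let the JVM exit even though the thread never finishes.
        t.start();
        Thread.sleep(200); // Give the demo thread time to block.
        describeThreads("exchange-worker").forEach(System.out::println);
    }
}
```

A thread that stays in the WAITING state with the same top frame across
several samples taken minutes apart is consistent with the behaviour
described above, though it does not by itself say why the response is missing.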

I've also checked the log on the node that the request from
requestFromNextNode was sent to; there are no errors, only the normal metrics
output.

BTW, attached are the thread dumps for this failed node and for the requested
node. Please take a look; any suggestions will be appreciated.

FYI, we've spent over a month testing Ignite so far, but if the root cause of
this cannot be found and resolved, it seems we will have to abandon all our
effort on Ignite.

thread_dump_for_failed_to_wait_for_initial_partition_map_exchange.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/n6830/thread_dump_for_failed_to_wait_for_initial_partition_map_exchange.txt>
  
thread_dump_for_requested_node.log
<http://apache-ignite-users.70518.x6.nabble.com/file/n6830/thread_dump_for_requested_node.log>
  
failed_node.log
<http://apache-ignite-users.70518.x6.nabble.com/file/n6830/failed_node.log>  
reqeusted_node.log
<http://apache-ignite-users.70518.x6.nabble.com/file/n6830/reqeusted_node.log>  


Thanks,
-Jason

--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6830.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
