I'm testing a 32-node cluster with a partitioned cache with one backup. When 2 of the nodes crash (not if, when), I run into the lost partitions problem.

Now I ssh to one of the nodes and execute *control.sh --baseline*. From every node other than the one marked as "coordinator" (?) I get this output:

--------------------------------------------------------------------------------
Failed to execute baseline command='collect'
Failed to communicate with grid nodes (maximum count of retries reached).
Connection to cluster failed.
Failed to communicate with grid nodes (maximum count of retries reached).

OK, so I went to every node and did the same until I found the 'coordinator'. Once I had brought the failed nodes back online, I executed:

*control.sh --cache reset_lost_partitions mycache*

To my surprise, I got:

--------------------------------------------------------------------------------
Connection to cluster failed.
Failed to communicate with grid nodes (maximum count of retries reached).

So I started looking again for the node where that command actually works.

I'm sure I'm doing something wrong. Could someone help me?