Hi. I have two Ignite instances that use IgniteCache to store some cache values. The cache is configured with replication on, so both instances hold the same data.
I am running JNI code to get the cache values, and it sometimes (on rare occasions) crashes, which in turn kills the Ignite instance. I have an external script that restarts the failed Ignite instance as soon as it crashes. I was expecting the surviving instance (ignite1) to quickly bring the restarted instance (ignite2) up to date, and both to continue working as usual.

This is exactly what happened for a few days, until one time ignite2 crashed and ignite1 seemed to get into a deadlock. As soon as ignite2 came back up, it failed to recognize ignite1 and failed to replicate from it. All client connections to the Ignite instances stopped working as well.

I am seeing this error in the log:

Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.

and also:

Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'

I am using Ignite v1.4.

Any suggestions or ideas will be highly appreciated. Thanks!
