Denis Magda created IGNITE-8967:
-----------------------------------

Summary: Automatic Handling of Long Stop-the-World Pauses
Key: IGNITE-8967
URL: https://issues.apache.org/jira/browse/IGNITE-8967
Project: Ignite
Issue Type: New Feature
Reporter: Denis Magda

Based on the discussion on the dev list:
http://apache-ignite-developers.2346864.n4.nabble.com/Automatic-Handling-of-Long-Stop-the-World-Pauses-td31847.html

Ignite comes with a number of self-healing capabilities:
* Ignite can already handle critical failures such as OOM, file I/O issues, etc. [1]
* There is an ongoing effort to fix cluster lock-ins caused by partition map exchange issues. [2]

One more notorious problem that can affect Ignite deployments is long stop-the-world GC pauses. We have made some progress in this direction [3] by providing metrics that help monitor the pauses.

Presently, I would either create specific policies similar to the critical failure policies [4] or just add a long stop-the-world pause to the list of critical failures [1].

[1] https://apacheignite.readme.io/docs/critical-failures-handling
[2] http://apache-ignite-developers.2346864.n4.nabble.com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
[3] https://issues.apache.org/jira/browse/IGNITE-6171
[4] https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
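
To make the policy idea above more concrete, here is a minimal sketch of what such a policy could look like. It is not Ignite code and does not use the Ignite failure-handling API: the class `LongPauseWatchdog`, the `Action` enum, and the `warnMs`/`failMs` thresholds are all hypothetical names chosen for illustration. It assumes a pause can be approximated by how much longer a short sleep took than requested. That overshoot also includes ordinary scheduler delays, so it is only an estimate of stop-the-world time.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of a long-pause policy; not part of the Ignite API. */
final class LongPauseWatchdog {
    /** Illustrative actions a pause policy could take. */
    enum Action { IGNORE, WARN, FAIL_NODE }

    private LongPauseWatchdog() {
        // Static helpers only.
    }

    /** Maps an observed pause to an action using two assumed thresholds. */
    static Action classify(long pauseMs, long warnMs, long failMs) {
        if (pauseMs >= failMs)
            return Action.FAIL_NODE;
        if (pauseMs >= warnMs)
            return Action.WARN;
        return Action.IGNORE;
    }

    /** Estimates a pause as the time a sleep took beyond what was requested. */
    static long overshootMs(long startNanos, long endNanos, long sleepMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - sleepMs);
    }

    /**
     * Starts a daemon thread that sleeps in short intervals and reports
     * every non-IGNORE action to the supplied handler.
     */
    static Thread start(long sleepMs, long warnMs, long failMs, Consumer<Action> handler) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                long start = System.nanoTime();
                try {
                    Thread.sleep(sleepMs);
                }
                catch (InterruptedException e) {
                    return; // Stop quietly when interrupted.
                }
                long pauseMs = overshootMs(start, System.nanoTime(), sleepMs);
                Action action = classify(pauseMs, warnMs, failMs);
                if (action != Action.IGNORE)
                    handler.accept(action);
            }
        }, "long-pause-watchdog");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

In a real integration, the handler passed to `start` would be the place to either apply a dedicated pause policy or report the pause as a critical failure, matching the two options described above. The threshold values would need to be configurable per deployment.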