Hi all,

I would like to start a discussion about introducing chaos experiments in Apache Spark.
This SPIP proposes introducing chaos experiments in Apache Spark to make sure Spark can withstand the unpredictability of production environments, and to help developers more quickly identify and resolve issues that unit and integration tests might not capture.

Distributed systems face complex and unpredictable production environments: disk failures, machine power loss, and network isolation are just the tip of the iceberg. To make distributed systems more robust, we need a method to simulate unpredictable failures and test how the system responds to them.

With chaos experiments in place, we expect:
- Increased reliability and resiliency for Apache Spark.
- Unplanned downtime and outages that are far less likely to occur, thanks to proactive and constant testing.
- Stronger system integrity.

Chaos experiments will also help expose reliability and resiliency issues in Apache Spark faster and earlier, and make it easier to reproduce user-reported production issues.

See more in the SPIP doc: https://docs.google.com/document/d/17dpBLUJcmqqKz7LMoyr4UJgr5t5ZUS3FdwXoybbHDCE

--
Best!
Qian Sun