Hi all,

I would like to start a discussion about introducing chaos experiments in
Apache Spark.

This SPIP proposes introducing chaos experiments in Apache Spark to make
sure Spark can withstand the unpredictability of production environments,
and to help developers more quickly identify and resolve issues that unit
and integration tests might not capture.

Distributed systems face complex and unpredictable production environments:
disk failures, machine power loss, and network isolation are just the tip of
the iceberg. To make distributed systems more robust, we need a method to
simulate such failures and test how the system responds to them.


Expected benefits of chaos experiments:

   - Increased reliability and resiliency for Apache Spark.
   - Unplanned downtime and outages become far less likely, thanks to
     proactive and continuous testing.
   - Stronger system integrity.


It will also help Apache Spark expose reliability and resiliency issues
faster and earlier, and make user-reported production issues easier to
reproduce.

See the SPIP doc for more details:
https://docs.google.com/document/d/17dpBLUJcmqqKz7LMoyr4UJgr5t5ZUS3FdwXoybbHDCE

-- 
Best!
Qian Sun
