Robin Wolters created SPARK-29861:
-------------------------------------

             Summary: Reduce leader election downtime in Spark standalone HA
                 Key: SPARK-29861
                 URL: https://issues.apache.org/jira/browse/SPARK-29861
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.1
            Reporter: Robin Wolters


As stated in the official Spark [HA 
documentation|https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper],
 recovery of a Spark standalone master in HA mode with ZooKeeper takes 
about 1-2 minutes. During this time no Spark master is active, which makes 
interaction with Spark essentially impossible. 

After looking for a way to reduce this downtime, it seems that it is mainly 
caused by the leader election, which waits for open ZooKeeper connections to be 
closed. This downtime seems unnecessary, for example in the case of a 
planned VM update.

I have fixed this in my setup by:
 # Closing open ZooKeeper connections during Spark shutdown
 # Bumping the Curator version and implementing a custom error policy that 
tolerates a ZooKeeper connection suspension.
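
The two steps above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the actual patch: `HaSketch`, `ConnState`, `ErrorPolicy`, `StrictPolicy`, `SuspensionTolerantPolicy` and `ToyMaster` are hypothetical names invented for illustration, loosely modelled on the connection-state error policy concept in newer Curator releases. The strict policy treats both a suspended and a lost connection as an error, so leadership is dropped on a brief suspension; the tolerant policy gives up leadership only once the connection is reported lost. The `close()` method stands in for explicitly closing the ZooKeeper connection at shutdown, so that the standby does not have to wait for the old connection to expire.

```java
// Illustrative sketch only: every type below is a hypothetical stand-in,
// not Spark's or Curator's actual code.
final class HaSketch {

    // Simplified model of the states a ZooKeeper client connection can report.
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Decides which connection states count as errors that revoke leadership.
    interface ErrorPolicy {
        boolean isErrorState(ConnState state);
    }

    // Strict behaviour: a suspension already counts as an error.
    static final class StrictPolicy implements ErrorPolicy {
        @Override
        public boolean isErrorState(ConnState state) {
            return state == ConnState.SUSPENDED || state == ConnState.LOST;
        }
    }

    // Tolerant behaviour (step 2): only a lost connection counts as an error.
    static final class SuspensionTolerantPolicy implements ErrorPolicy {
        @Override
        public boolean isErrorState(ConnState state) {
            return state == ConnState.LOST;
        }
    }

    // Toy master that starts as leader and reacts to connection state changes.
    static final class ToyMaster implements AutoCloseable {
        private final ErrorPolicy policy;
        private boolean leader = true;
        private boolean sessionOpen = true;

        ToyMaster(ErrorPolicy policy) {
            this.policy = policy;
        }

        // Drop leadership only if the policy classifies the state as an error.
        void onStateChange(ConnState state) {
            if (policy.isErrorState(state)) {
                leader = false;
            }
        }

        boolean isLeader() {
            return leader;
        }

        boolean isSessionOpen() {
            return sessionOpen;
        }

        // Step 1: release leadership and close the connection explicitly at
        // shutdown instead of leaving it open.
        @Override
        public void close() {
            leader = false;
            sessionOpen = false;
        }
    }

    public static void main(String[] args) {
        try (ToyMaster master = new ToyMaster(new SuspensionTolerantPolicy())) {
            master.onStateChange(ConnState.SUSPENDED);
            System.out.println("leader after SUSPENDED: " + master.isLeader());
            master.onStateChange(ConnState.LOST);
            System.out.println("leader after LOST: " + master.isLeader());
        }
    }
}
```

In a real deployment the policy would be registered with the Curator client rather than with a toy class, and the close would happen in the master's shutdown path; how exactly that is wired up is left to the pull request.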

I am preparing a pull request for review / further discussion on this issue.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
