[ https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005063#comment-17005063 ]
Marco Gaido commented on LIVY-718: ---------------------------------- Thanks for your comments [~bikassaha]. I am not sure (actually I think it is not true) that all the metadata information is present in the Spark driver process and there is metadata which is frequently accessed/changed (eg. the ongoing statements for a given session) on the Livy server side (at least for the thrift part). Indeed, there are definitely metadata which are currently kept in the server memory which would need to be saved for HA sake. Hence, I am afraid that at least for the thrift case, the usage of a slow storage like HDFS would at least would require a significant revisit of the thrift part. I agree that active-active is by far the most desirable choice. I see, though, that it is not easy to implement, IMHO, because for the metadata above, it would require a distributed state store being the source of truth for that. Given your negative opinion on ZK, I hardly see any other system which would fit (a relational DB cluster maybe? but not easier to maintain than ZK for sure, I'd say). Hence I am drawn to consider that we would need to trade off things here, unless I am very mistaken on the point above: namely, the REST part has really no significant metadata on Livy server side and we keep the thrift one out of scope here. > Support multi-active high availability in Livy > ---------------------------------------------- > > Key: LIVY-718 > URL: https://issues.apache.org/jira/browse/LIVY-718 > Project: Livy > Issue Type: Epic > Components: RSC, Server > Reporter: Yiheng Wang > Priority: Major > > In this JIRA we want to discuss how to implement multi-active high > availability in Livy. > Currently, Livy only supports single node recovery. This is not sufficient in > some production environments. In our scenario, the Livy server serves many > notebook and JDBC services. We want to make Livy service more fault-tolerant > and scalable. > There're already some proposals in the community for high availability. But > they're not so complete or just for active-standby high availability. So we > propose a multi-active high availability design to achieve the following > goals: > # One or more servers will serve the client requests at the same time. > # Sessions are allocated among different servers. > # When one node crashes, the affected sessions will be moved to other active > services. > Here's our design document, please review and comment: > https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing > -- This message was sent by Atlassian Jira (v8.3.4#803005)