[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005063#comment-17005063
 ] 

Marco Gaido commented on LIVY-718:
----------------------------------

Thanks for your comments [~bikassaha]. I am not sure (actually I think it is 
not true) that all the metadata information is present in the Spark driver 
process and there is metadata which is frequently accessed/changed (eg. the 
ongoing statements for a given session) on the Livy server side (at least for 
the thrift part). Indeed, there are definitely metadata which are currently 
kept in the server memory which would need to be saved for HA sake. Hence, I am 
afraid that at least for the thrift case, the usage of a slow storage like HDFS 
would at least would require a significant revisit of the thrift part.

I agree that active-active is by far the most desirable choice. I see, though, 
that it is not easy to implement, IMHO, because for the metadata above, it 
would require a distributed state store being the source of truth for that. 
Given your negative opinion on ZK, I hardly see any other system which would 
fit (a relational DB cluster maybe? but not easier to maintain than ZK for 
sure, I'd say). Hence I am drawn to consider that we would need to trade off 
things here, unless I am very mistaken on the point above: namely, the REST 
part has really no significant metadata on Livy server side and we keep the 
thrift one out of scope here.


> Support multi-active high availability in Livy
> ----------------------------------------------
>
>                 Key: LIVY-718
>                 URL: https://issues.apache.org/jira/browse/LIVY-718
>             Project: Livy
>          Issue Type: Epic
>          Components: RSC, Server
>            Reporter: Yiheng Wang
>            Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high 
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in 
> some production environments. In our scenario, the Livy server serves many 
> notebook and JDBC services. We want to make Livy service more fault-tolerant 
> and scalable.
> There're already some proposals in the community for high availability. But 
> they're not so complete or just for active-standby high availability. So we 
> propose a multi-active high availability design to achieve the following 
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active 
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to