[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004873#comment-17004873
 ] 

Meisam commented on LIVY-718:
-----------------------------

I also have concerns about the second point that Bikas raised, i.e. "Service 
availability for a given object". When a server fails, its sessions become 
unavailable until other servers are designated to handle them. This was not 
acceptable behavior, at least for the clusters that I worked with in my 
previous job. I'd like to share my experience with those clusters, which ran 
tens of thousands of Spark jobs daily. That experience makes me think that 
consistent hashing is overkill for Livy HA.
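
To make the availability concern concrete, here is a minimal Python sketch of 
a generic consistent-hash ring. It is not taken from the design document or 
from Livy: the HashRing class, the server names, and the session IDs are all 
hypothetical. It only illustrates that when a server leaves the ring, the 
sessions that change owner are exactly the ones it owned, and those are the 
sessions that would be unavailable during failover.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring; an illustration, not Livy code."""

    def __init__(self, servers, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, server) pairs
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server is placed at several virtual positions to even out load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        # Simulates a server failure: all of its positions disappear.
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def owner(self, session_id):
        if not self._ring:
            raise LookupError("no servers in ring")
        # First ring position at or after the session's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(session_id), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["livy-1", "livy-2", "livy-3"])
sessions = [f"session-{i}" for i in range(1000)]

before = {s: ring.owner(s) for s in sessions}
ring.remove("livy-2")  # assumed failure of one server
after = {s: ring.owner(s) for s in sessions}

# Sessions whose owner changed because of the failure.
moved = [s for s in sessions if before[s] != after[s]]
print(f"{len(moved)} of {len(sessions)} sessions had to be reassigned")
```

Sessions owned by the surviving servers keep their owner, which is the main 
benefit of consistent hashing. The sessions in `moved` are the ones clients 
cannot reach until another server takes them over.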

My first observation was that, even with the largest and busiest clusters, 
the whole session metadata easily fit into a Java process with 500MB of RAM 
and less than 20% CPU usage at peak (the YARN and ZooKeeper connections were 
the bottlenecks).

My second observation was that ease of use of the API was extremely important 
for many end users, especially for users with a data science background. Ease 
of use was also important for integrating Livy with other tools and services.

My third observation was that the number of firewall ports that need to be 
opened for Livy HA can become a security concern. It was important that any 
HA Livy server could handle any request for any session, so that only a 
single port had to be opened in the firewall, for a load balancer in front of 
all the HA Livy servers.

 

> Support multi-active high availability in Livy
> ----------------------------------------------
>
>                 Key: LIVY-718
>                 URL: https://issues.apache.org/jira/browse/LIVY-718
>             Project: Livy
>          Issue Type: Epic
>          Components: RSC, Server
>            Reporter: Yiheng Wang
>            Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high 
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in 
> some production environments. In our scenario, the Livy server serves many 
> notebook and JDBC services. We want to make Livy service more fault-tolerant 
> and scalable.
> There are already some proposals in the community for high availability, but 
> they are either incomplete or cover only active-standby high availability. 
> So we propose a multi-active high availability design to achieve the 
> following goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active 
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
