[jira] [Commented] (ATLAS-511) Ability to run multiple instances of Atlas Server with automatic failover to one active server

Hemanth Yamijala (JIRA) Tue, 08 Mar 2016 19:11:35 -0800

    [ 
https://issues.apache.org/jira/browse/ATLAS-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186402#comment-15186402
 ]


Hemanth Yamijala commented on ATLAS-511:
----------------------------------------

[~vmadugun] / [~cassiodossantos], from my (possibly incomplete) understanding 
of the core backend, a few points stand out regarding the TypeSystem cache:
* Currently Atlas relies on it *completely* for all reads. As Venkat mentioned 
in his comments, translating a DSL query into a Gremlin query relies on this 
information. Since the volume of reads is expected to be high, I intuitively 
feel that the cache is of value: possibly not in the aggressive manner in which 
Atlas currently relies on it, but at least as a significant performance 
optimization. Turning off the cache completely seems to me a bit too extreme. 
If we are modeling this, we could model it as a strategy, of which no caching 
is one alternative and read with fall-through is another. I am convinced by 
Cassio's point that letting the set of cached types grow unbounded (the 
current implementation) is too extreme as well.
* You mention that types will be relatively unchanging. I assume you are 
saying this based on the usage pattern you have seen (or expect to see). I 
have a question on this: since trait definitions are also types, are also 
cached in the TypeSystem, and all trait lookups happen from there, how 
frequently are they created, updated, or deleted in your case? Of course, this 
can be solved by a programmatic refresh (or a dirty-read mechanism), as you 
both have suggested.
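
To make the "cache as a strategy" idea above concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: `TypeCacheStrategy`, `NoCacheStrategy` and `ReadThroughStrategy` are hypothetical names, not existing Atlas classes, type definitions are stood in for by plain strings, and the backing store is an arbitrary lookup function. It shows a no-caching alternative, a bounded read-through alternative with least-recently-used eviction, and an explicit invalidation hook for the programmatic refresh.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical strategy interface; not part of the Atlas codebase.
interface TypeCacheStrategy {
    String lookup(String typeName);      // returns the type definition, or null if unknown
    void invalidate(String typeName);    // programmatic refresh hook
}

// Alternative 1: no caching, every read goes to the backing store.
final class NoCacheStrategy implements TypeCacheStrategy {
    private final Function<String, String> store;

    NoCacheStrategy(Function<String, String> store) {
        this.store = store;
    }

    public String lookup(String typeName) {
        return store.apply(typeName);
    }

    public void invalidate(String typeName) {
        // Nothing is cached, so there is nothing to invalidate.
    }
}

// Alternative 2: bounded read-through cache with LRU eviction.
final class ReadThroughStrategy implements TypeCacheStrategy {
    private final Function<String, String> store;
    private final Map<String, String> cache;

    ReadThroughStrategy(Function<String, String> store, final int maxEntries) {
        this.store = store;
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized String lookup(String typeName) {
        String definition = cache.get(typeName);
        if (definition == null) {
            // Cache miss: fall through to the store and remember the result.
            definition = store.apply(typeName);
            if (definition != null) {
                cache.put(typeName, definition);
            }
        }
        return definition;
    }

    public synchronized void invalidate(String typeName) {
        cache.remove(typeName);
    }

    synchronized int size() {
        return cache.size();
    }
}
```

With this shape, a trait that is updated in the store stays stale in the read-through cache until `invalidate` is called for it, which is exactly the trade-off that needs to be weighed against how often traits change.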

I am happy that we are aligned on basing any of these decisions on concrete 
measurements. We have been working to set up some very basic test suites that 
will help us get started with performance measurement. I will open JIRAs to 
spell out more details on this.
 
Venkat, thanks for your offer to help with this task. At this stage, since you 
have a specific interest in improving the cache behavior, it may be good if you 
can spend some energy on this and see what you find. Please feel free to open 
JIRAs and propose your approach / solutions.

Needless to say, there are folks more experienced in this area than I am. I am 
hoping they will chime in with thoughts (in particular, if we're going down the 
wrong track).

> Ability to run multiple instances of Atlas Server with automatic failover to 
> one active server
> ----------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-511
>                 URL: https://issues.apache.org/jira/browse/ATLAS-511
>             Project: Atlas
>          Issue Type: Sub-task
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>         Attachments: HADesign.pdf
>
>
> One of the most important components that only supports active-standby mode 
> currently is the Atlas server which hosts the API / UI for Atlas. As 
> described in the [HA 
> Documentation|http://atlas.incubator.apache.org/0.6.0-incubating/HighAvailability.html],
>  we currently are limited to running only one instance of the Atlas server 
> behind a proxy service. If the running instance goes down, a manual process 
> is required to bring up another instance.
> In this JIRA, we propose to have an ability to run multiple Atlas server 
> instances. However, as a first step, only one of them will be actively 
> processing requests. To have a consistent terminology, let us call that 
> server the *master*. Any requests sent to the other servers will be 
> redirected to the master.
> When the master suffers a partition, one of the other servers must 
> automatically become the master and start processing requests. What this mode 
> brings us over the current system is the ability to automatically fail over 
> the Atlas server instance without any manual intervention. Note that this 
> can arguably be called an [active/passive 
> setup|https://en.wikipedia.org/wiki/High-availability_cluster].
> ATLAS-488 was raised to support multiple active Atlas server instances. While 
> that would be ideal, we have to learn more about the underlying system 
> behavior before we can get there, and hopefully we can take smaller steps to 
> improve the system systematically. The method proposed here is similar to 
> what is adopted in many other Hadoop components including HDFS NameNode, 
> HBase HMaster etc.
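
The redirect-and-failover behaviour described in the quoted issue above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the design in HADesign.pdf: `MasterRegistry` and `AtlasServerSketch` are hypothetical names, the registry is an in-memory stand-in for whatever coordination service would track live servers, the "next server in join order becomes master" rule is a placeholder for a real election, and responses are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a coordination service.
final class MasterRegistry {
    private final List<String> liveServers = new ArrayList<>();

    synchronized void join(String address) {
        if (!liveServers.contains(address)) {
            liveServers.add(address);
        }
    }

    // Removing the current master implicitly promotes the next server in join order.
    synchronized void markDown(String address) {
        liveServers.remove(address);
    }

    // The longest-lived live server is treated as the master; null if none is live.
    synchronized String currentMaster() {
        return liveServers.isEmpty() ? null : liveServers.get(0);
    }
}

// Hypothetical server instance: serves requests only while it is the master.
final class AtlasServerSketch {
    private final String address;
    private final MasterRegistry registry;

    AtlasServerSketch(String address, MasterRegistry registry) {
        this.address = address;
        this.registry = registry;
        registry.join(address);
    }

    String handle(String requestPath) {
        String master = registry.currentMaster();
        if (master == null) {
            return "503 no master available";
        }
        if (address.equals(master)) {
            return "200 served by " + address;
        }
        // Non-master instances do no processing; they point the client at the master.
        return "302 redirect to " + master + requestPath;
    }
}
```

For example, with two instances registered, the second one redirects every request to the first; once the first is marked down, the second starts serving requests itself without manual intervention.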



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
