[ https://issues.apache.org/jira/browse/ATLAS-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186402#comment-15186402 ]
Hemanth Yamijala commented on ATLAS-511:
----------------------------------------

[~vmadugun] / [~cassiodossantos],

From my (possibly incomplete) understanding of the core backend, a few points stand out when considering the TypeSystem cache:

* Currently Atlas relies on it *completely* for all reads. As Venkat mentioned in his comments, translating a DSL query to a Gremlin query relies on this information. Since the volume of reads is expected to be high, I intuitively feel that the cache is of value. Possibly not in the aggressive manner in which it is currently relied on, but at least as a significant performance optimization. Completely turning off the cache seems to me a bit too extreme. If we are modeling this, we could possibly model it as a strategy, of which no caching is one alternative and read with fall-through could be another. I am convinced by Cassio's point that letting the types grow unbounded (the current implementation) feels a little too extreme as well.
* You mention that types will be relatively unchanging. I am assuming that you are saying this based on the usage pattern you have seen (or expect to see). I had a question on this: since trait definitions are also types, are also cached in the TypeSystem, and all lookups of traits happen from there, how frequently are these CRUD'ed in your case? Of course, this can be solved by a programmatic refresh (or by using a dirty read mechanism), as you both have suggested.

I am happy that we are aligned on basing any of these decisions on concrete measurements. We have been working to set up some very basic test suites that will help us get started with performance measurement. I will open JIRAs to spell out more details on this.

Venkat, thanks for your offer to help with this task. At this stage, since you have a specific interest in improving the cache behavior, it may be good if you can spend some energy on this and see what you find.
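To make the strategy idea above a little more concrete, here is a minimal sketch of how "no caching" and a bounded "read with fall-through" could sit behind one interface. This is not Atlas code: `TypeCacheStrategy`, `NoCache`, `BoundedReadThrough` and the `Function`-based backing store are all hypothetical names made up for illustration, and the LRU bound is just one possible way to stop the cache from growing without limit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: every name here is hypothetical, none is an Atlas API.
final class TypeCacheSketch {

    /** Pluggable caching strategy for type (and trait) definitions. */
    interface TypeCacheStrategy<K, V> {
        Optional<V> get(K name);
        /** Programmatic refresh: drop an entry so the next read reloads it. */
        void invalidate(K name);
    }

    /** "No caching" alternative: every read goes to the backing store. */
    static final class NoCache<K, V> implements TypeCacheStrategy<K, V> {
        private final Function<K, V> store;

        NoCache(Function<K, V> store) {
            this.store = store;
        }

        @Override
        public Optional<V> get(K name) {
            return Optional.ofNullable(store.apply(name));
        }

        @Override
        public void invalidate(K name) {
            // Nothing is cached, so there is nothing to drop.
        }
    }

    /** "Read with fall-through" alternative, bounded by LRU eviction. */
    static final class BoundedReadThrough<K, V> implements TypeCacheStrategy<K, V> {
        private final Function<K, V> store;
        private final Map<K, V> lru;

        BoundedReadThrough(Function<K, V> store, int maxEntries) {
            this.store = store;
            // accessOrder=true makes iteration order least-recently-used first.
            this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxEntries;
                }
            };
        }

        @Override
        public synchronized Optional<V> get(K name) {
            V value = lru.get(name);
            if (value == null) {
                // Cache miss: fall through to the store and remember the result.
                value = store.apply(name);
                if (value != null) {
                    lru.put(name, value);
                }
            }
            return Optional.ofNullable(value);
        }

        @Override
        public synchronized void invalidate(K name) {
            lru.remove(name);
        }

        synchronized int size() {
            return lru.size();
        }
    }
}
```

With something like this, a type or trait that changes in the store stays stale in `BoundedReadThrough` until `invalidate` is called for it, which is where the programmatic refresh (or a dirty read check) would have to hook in. Whether the bound and the extra store reads are worth it is exactly what the measurements should tell us.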
Please feel free to open JIRAs and propose your approach / solutions.

Needless to say, there are folks more experienced in this area than I am. I am hoping they will chime in with thoughts (in particular, if we're going down the wrong track).

> Ability to run multiple instances of Atlas Server with automatic failover to
> one active server
> ----------------------------------------------------------------------------------------------
>
> Key: ATLAS-511
> URL: https://issues.apache.org/jira/browse/ATLAS-511
> Project: Atlas
> Issue Type: Sub-task
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
> Attachments: HADesign.pdf
>
>
> One of the most important components that only supports active-standby mode
> currently is the Atlas server which hosts the API / UI for Atlas. As
> described in the [HA Documentation|http://atlas.incubator.apache.org/0.6.0-incubating/HighAvailability.html],
> we currently are limited to running only one instance of the Atlas server
> behind a proxy service. If the running instance goes down, a manual process
> is required to bring up another instance.
> In this JIRA, we propose to have an ability to run multiple Atlas server
> instances. However, as a first step, only one of them will be actively
> processing requests. To have a consistent terminology, let us call that
> server the *master*. Any requests sent to the other servers will be
> redirected to the master.
> When the master suffers a partition, one of the other servers must
> automatically become the master and start processing requests. What this mode
> brings us over the current system is the ability to automatically failover
> the Atlas server instance without any manual intervention. Note that this
> can be arguably called an [active/active
> setup|https://en.wikipedia.org/wiki/High-availability_cluster]
> ATLAS-488 raised to support multiple active Atlas server instances. While
> that would be ideal, we have to learn more about the underlying system
> behavior before we can get there, and hopefully we can take smaller steps to
> improve the system systematically. The method proposed here is similar to
> what is adopted in many other Hadoop components including HDFS NameNode,
> HBase HMaster etc.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)