Repository: incubator-atlas
Updated Branches:
  refs/heads/master a90800197 -> b8f4ffb68


ATLAS-603 Document High Availability of Atlas (yhemanth via sumasai)


Project: http://git-wip-us.apache.org/repos/asf/incubator-atlas/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-atlas/commit/b8f4ffb6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-atlas/tree/b8f4ffb6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-atlas/diff/b8f4ffb6

Branch: refs/heads/master
Commit: b8f4ffb68cf6adab4015d545715d1370a1ee57b9
Parents: a908001
Author: Suma Shivaprasad <[email protected]>
Authored: Fri Apr 8 10:35:49 2016 -0700
Committer: Suma Shivaprasad <[email protected]>
Committed: Fri Apr 8 10:35:49 2016 -0700

----------------------------------------------------------------------
 docs/src/site/twiki/Configuration.twiki    |  41 +++++++
 docs/src/site/twiki/HighAvailability.twiki | 157 +++++++++++++++++++++---
 release-log.txt                            |   1 +
 3 files changed, 182 insertions(+), 17 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-atlas/blob/b8f4ffb6/docs/src/site/twiki/Configuration.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Configuration.twiki b/docs/src/site/twiki/Configuration.twiki
index 460f2aa..023f5a0 100644
--- a/docs/src/site/twiki/Configuration.twiki
+++ b/docs/src/site/twiki/Configuration.twiki
@@ -163,3 +163,44 @@ The following property is used to toggle the SSL feature.
 atlas.enableTLS=false
 </verbatim>
 
+---++ High Availability Properties
+
+The following properties describe High Availability related configuration options:
+
+<verbatim>
+# Set the following property to true to enable High Availability. Default = false.
+atlas.server.ha.enabled=true
+
+# Define a unique set of strings, as a comma separated list, to identify each host that should run an Atlas Web Service instance.
+atlas.server.ids=id1,id2
+# For each string defined above, define the host and port on which the Atlas server binds.
+atlas.server.address.id1=host1.company.com:21000
+atlas.server.address.id2=host2.company.com:31000
+
+# Specify Zookeeper properties needed for HA.
+# Specify the list of hosts running Zookeeper servers as a comma separated list.
+atlas.server.ha.zookeeper.connect=zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181
+# Specify how many times a connection to the Zookeeper cluster should be retried in case of connection issues.
+atlas.server.ha.zookeeper.num.retries=3
+# Specify how long the server should wait between connection attempts to Zookeeper in case of connection issues.
+atlas.server.ha.zookeeper.retry.sleeptime.ms=1000
+# Specify how long a Zookeeper session can remain inactive before the instance is deemed unreachable.
+atlas.server.ha.zookeeper.session.timeout.ms=20000
+
+# Specify the scheme and the identity to be used for setting up ACLs on nodes created in Zookeeper for HA.
+# The format of these options is <scheme>:<identity>. For more information refer to http://zookeeper.apache.org/doc/r3.2.2/zookeeperProgrammers.html#sc_ZooKeeperAccessControl.
+# The 'acl' option specifies the scheme, identity pair for which to set up an ACL.
+atlas.server.ha.zookeeper.acl=auth:sasl:[email protected]
+# The 'auth' option specifies the authentication that should be used for connecting to Zookeeper.
+atlas.server.ha.zookeeper.auth=sasl:[email protected]
+
+# Since Zookeeper is a shared service that is typically used by many components,
+# it is preferable for each component to set its znodes under a namespace.
+# Specify the namespace under which the znodes should be written. Default = /apache_atlas
+atlas.server.ha.zookeeper.zkroot=/apache_atlas
+
+# Specify the number of times a client should retry with an instance before selecting another active instance, or failing an operation.
+atlas.client.ha.retries=4
+# Specify the interval between retries for a client.
+atlas.client.ha.sleep.interval.ms=5000
+</verbatim>
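For illustration only, the sketch below shows how the id-based address scheme above can be resolved from =atlas-application.properties= using plain Java. This is not how Atlas itself loads its configuration; the class name and file path are hypothetical.

<verbatim>
// Illustration only: resolve the id-based HA addresses from a properties file.
// The class name and the relative file path are hypothetical.
import java.io.FileInputStream;
import java.util.Properties;

public class HaConfigCheck {
    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        try (FileInputStream in = new FileInputStream("conf/atlas-application.properties")) {
            conf.load(in);
        }

        boolean haEnabled = Boolean.parseBoolean(conf.getProperty("atlas.server.ha.enabled", "false"));
        System.out.println("HA enabled: " + haEnabled);

        // Each id in atlas.server.ids maps to a host:port under atlas.server.address.<id>.
        for (String id : conf.getProperty("atlas.server.ids", "").split(",")) {
            String address = conf.getProperty("atlas.server.address." + id.trim());
            System.out.println(id.trim() + " -> " + address);
        }
    }
}
</verbatim>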

http://git-wip-us.apache.org/repos/asf/incubator-atlas/blob/b8f4ffb6/docs/src/site/twiki/HighAvailability.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/HighAvailability.twiki b/docs/src/site/twiki/HighAvailability.twiki
index 2a87067..1e52c85 100644
--- a/docs/src/site/twiki/HighAvailability.twiki
+++ b/docs/src/site/twiki/HighAvailability.twiki
@@ -3,9 +3,9 @@
 ---++ Introduction
 
 Apache Atlas uses and interacts with a variety of systems to provide metadata management and data lineage to data
-administrators. By choosing and configuring these dependencies appropriately, it is possible to achieve a good degree
+administrators. By choosing and configuring these dependencies appropriately, it is possible to achieve a high degree
 of service availability with Atlas. This document describes the state of high availability support in Atlas,
-including its capabilities and current limitations, and also the configuration required for achieving a this level of
+including its capabilities and current limitations, and also the configuration required for achieving this level of
 high availability.
 
 [[Architecture][The architecture page]] in the wiki gives an overview of the various components that make up Atlas.
@@ -14,22 +14,146 @@ review before proceeding to read this page.
 
 ---++ Atlas Web Service
 
-Currently, the Atlas Web service has a limitation that it can only have one active instance at a time. Therefore, in
-case of errors to the host running the service, a new Atlas web service instance should be brought up and pointed to
-from the clients. In future versions of the system, we plan to provide full High Availability of the service, thereby
-enabling hot failover. To minimize service loss, we recommend the following:
+Currently, the Atlas Web Service has a limitation that it can only have one active instance at a time. In earlier
+releases of Atlas, a backup instance could be provisioned and kept available. However, a manual failover was
+required to make this backup instance active.
 
-   * An extra physical host with the Atlas system software and configuration is available to be brought up on demand.
-   * It would be convenient to have the web service fronted by a proxy solution like [[https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#5.2][HAProxy]] which can be used to provide both the monitoring and transparent switching of the backend instance clients talk to.
-      * An example HAProxy configuration of this form will allow a transparent failover to a backup server:
+From this release, Atlas will support multiple instances of the Atlas Web Service in an active/passive configuration
+with automated failover. This means that users can deploy and start multiple instances of the Atlas Web Service on
+different physical hosts at the same time. One of these instances will be automatically selected as an 'active'
+instance to service user requests. The others will automatically be deemed 'passive'. If the 'active' instance
+becomes unavailable either because it is deliberately stopped, or due to unexpected failures, one of the other
+instances will automatically be elected as an 'active' instance and start to service user requests.
+
+An 'active' instance is the only instance that can respond to user requests correctly. It can create, delete, modify
+or respond to queries on metadata objects. A 'passive' instance will accept user requests, but will redirect them
+using an HTTP redirect to the currently known 'active' instance. Specifically, a passive instance will not itself
+respond to any queries on metadata objects. However, all instances (both active and passive) will respond to admin
+requests that return information about that instance.
+
+When configured in High Availability mode, users get the following operational benefits:
+
+   * *Uninterrupted service during maintenance intervals*: If an active instance of the Atlas Web Service needs to be brought down for maintenance, another instance would automatically become active and can service requests.
+   * *Uninterrupted service in the event of unexpected failures*: If an active instance of the Atlas Web Service fails due to software or hardware errors, another instance would automatically become active and can service requests.
+
+In the following sub-sections, we describe the steps required to set up High Availability for the Atlas Web Service.
+We also describe how the deployment and clients can be designed to take advantage of this capability.
+Finally, we describe a few details of the underlying implementation.
+
+---+++ Setting up the High Availability feature in Atlas
+
+The following pre-requisites must be met for setting up the High Availability feature.
+
+   * Ensure that you install Apache Zookeeper on a cluster of machines (a minimum of 3 servers is recommended for production).
+   * Select 2 or more physical machines to run the Atlas Web Service instances on. These machines define what we refer to as a 'server ensemble' for Atlas.
+
+To set up High Availability in Atlas, a few configuration options must be defined in the =atlas-application.properties=
+file. While the complete list of configuration items is defined in the [[Configuration][Configuration Page]], this
+section lists a few of the main options.
+
+   * High Availability is an optional feature in Atlas. Hence, it must be enabled by setting the configuration option =atlas.server.ha.enabled= to true.
+   * Next, define a list of identifiers, one for each physical machine you have selected for the Atlas Web Service instance. These identifiers can be simple strings like =id1=, =id2= etc. They should be unique and should not contain a comma.
+   * Define a comma separated list of these identifiers as the value of the option =atlas.server.ids=.
+   * For each physical machine, list the IP address/hostname and port as the value of the configuration =atlas.server.address.id=, where =id= refers to the identifier string for this physical machine.
+      * For example, if you have selected 2 machines with hostnames =host1.company.com= and =host2.company.com=, you can define the configuration options as below:
+      <verbatim>
+      atlas.server.ids=id1,id2
+      atlas.server.address.id1=host1.company.com:21000
+      atlas.server.address.id2=host2.company.com:21000
+      </verbatim>
+   * Define the Zookeeper quorum which will be used by the Atlas High Availability feature.
       <verbatim>
-      listen atlas
-        bind <proxy hostname>:<proxy port>
-        balance roundrobin
-        server inst1 <atlas server hostname>:<port> check
-        server inst2 <atlas backup server hostname>:<port> check backup
+      atlas.server.ha.zookeeper.connect=zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181
       </verbatim>
-   * The stores that hold Atlas data can be configured to be highly available as described below.
+   * You can review other configuration options that are defined for the High Availability feature, and set them up as desired in the =atlas-application.properties= file.
+   * For production environments, the components that Atlas depends on must also be set up in High Availability mode. This is described in detail in the following sections. Follow those instructions to set up and configure them.
+   * Install the Atlas software on the selected physical machines.
+   * Copy the =atlas-application.properties= file created using the steps above to the configuration directory of all the machines. (A consolidated sketch of such a file is shown after this list.)
+   * Start the dependent components.
+   * Start each instance of the Atlas Web Service.
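For reference, the following is a minimal sketch of the HA-related portion of =atlas-application.properties= for the two-host example above. It only combines the options already shown; security, ACL and retry options from the [[Configuration][Configuration Page]] would be added as needed for your environment.

<verbatim>
atlas.server.ha.enabled=true
atlas.server.ids=id1,id2
atlas.server.address.id1=host1.company.com:21000
atlas.server.address.id2=host2.company.com:21000
atlas.server.ha.zookeeper.connect=zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181
</verbatim>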
+
+To verify that High Availability is working, run the following script on each of the instances where the Atlas Web Service is installed.
+<verbatim>
+$ATLAS_HOME/bin/atlas_admin.py -status
+</verbatim>
+This script can print one of the values below as response:
+
+   * *ACTIVE*: This instance is active and can respond to user requests.
+   * *PASSIVE*: This instance is PASSIVE. It will redirect any user requests it receives to the current active instance.
+   * *BECOMING_ACTIVE*: This would be printed if the server is transitioning to become an ACTIVE instance. The server cannot service any metadata user requests in this state.
+   * *BECOMING_PASSIVE*: This would be printed if the server is transitioning to become a PASSIVE instance. The server cannot service any metadata user requests in this state.
+
+Under normal operating circumstances, only one of these instances should print the value *ACTIVE* as response to the script, and the others would print *PASSIVE*.
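The same check can be made over REST against each instance's admin status endpoint; the active instance returns a body of the form ={Status:ACTIVE}=. This assumes the endpoint is reachable from your client and that, in secure setups, appropriate credentials are supplied:

<verbatim>
curl http://host1.company.com:21000/api/atlas/admin/status
curl http://host2.company.com:21000/api/atlas/admin/status
</verbatim>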
+
+---+++ Configuring clients to use the High Availability feature
+
+The Atlas Web Service can be accessed in two ways:
+
+   * *Using the Atlas Web UI*: This is a browser based client that can be used to query the metadata stored in Atlas.
+   * *Using the Atlas REST API*: As Atlas exposes a RESTful API, one can use any standard REST client, including libraries in other applications. In fact, Atlas ships with a client called !AtlasClient that can be used as an example to build REST client access.
+
+In order to take advantage of the High Availability feature in the clients, there are two possible options.
+
+---++++ Using an intermediate proxy
+
+The simplest solution to enable highly available access to Atlas is to install and configure some intermediate proxy
+that has a capability to transparently switch services based on status. One such proxy solution is [[http://www.haproxy.org/][HAProxy]].
+
+Here is an example HAProxy configuration that can be used. Note this is provided for illustration only, and not as a
+recommended production configuration. For that, please refer to the HAProxy documentation for appropriate instructions.
+
+<verbatim>
+frontend atlas_fe
+  bind *:41000
+  default_backend atlas_be
+
+backend atlas_be
+  mode http
+  option httpchk get /api/atlas/admin/status
+  http-check expect string ACTIVE
+  balance roundrobin
+  server host1_21000 host1:21000 check
+  server host2_21000 host2:21000 check backup
+
+listen atlas
+  bind localhost:42000
+</verbatim>
+
+The above configuration binds HAProxy to listen on port 41000 for incoming client connections. It then routes
+the connections to either of the hosts host1 or host2 depending on an HTTP status check. The status check is
+done using an HTTP GET on the REST URL =/api/atlas/admin/status=, and is deemed successful only if the HTTP response
+contains the string ACTIVE.
+
+---++++ Using automatic detection of active instance
+
+If one does not want to set up and manage a separate proxy, then the other option to use the High Availability
+feature is to build a client application that is capable of detecting status and retrying operations. In such a
+setting, the client application can be launched with the URLs of all Atlas Web Service instances that form the
+ensemble. The client should then call the REST URL =/api/atlas/admin/status= on each of these to determine which is
+the active instance. The response from the Active instance would be of the form ={Status:ACTIVE}=. Also, when the
+client faces any exceptions in the course of an operation, it should again determine which of the remaining URLs
+is active and retry the operation.
+
+The !AtlasClient class that ships with Atlas can be used as an example client library that implements the logic
+for working with an ensemble and selecting the right Active server instance.
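For illustration only, here is a minimal sketch of such active-instance detection in plain Java. It is not Atlas's !AtlasClient implementation; the class name, timeouts and host names are hypothetical.

<verbatim>
// Illustration only: probe each Atlas base URL and pick the ACTIVE instance.
// This is not Atlas's AtlasClient; class name, timeouts and hosts are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ActiveInstanceFinder {

    static String findActiveInstance(String[] baseUrls) {
        for (String base : baseUrls) {
            try {
                HttpURLConnection conn = (HttpURLConnection)
                        new URL(base + "/api/atlas/admin/status").openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()))) {
                    String body = in.readLine();
                    // The active instance responds with a body of the form {Status:ACTIVE};
                    // exclude BECOMING_ACTIVE, which is a transitional state.
                    if (body != null && body.contains("ACTIVE") && !body.contains("BECOMING")) {
                        return base;
                    }
                }
            } catch (Exception e) {
                // Instance unreachable or not active; try the next one in the ensemble.
            }
        }
        throw new IllegalStateException("No ACTIVE Atlas instance found");
    }

    public static void main(String[] args) {
        String[] ensemble = {"http://host1.company.com:21000", "http://host2.company.com:21000"};
        System.out.println("Active instance: " + findActiveInstance(ensemble));
    }
}
</verbatim>

On failure of an in-flight operation, a client built along these lines would simply call the finder again and retry against the newly reported active instance.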
+
+Utilities in Atlas, like =quick_start.py= and =import-hive.sh=, can be configured to run with multiple server
+URLs. When launched in this mode, the !AtlasClient automatically selects and works with the current active instance.
+If a proxy is set up in between, then its address can be used when running quick_start.py or import-hive.sh.
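As an example of this mode, and assuming the utility accepts a comma separated list of base server URLs as its first argument (check the usage output of your Atlas version), an invocation could look like:

<verbatim>
bin/quick_start.py http://host1.company.com:21000,http://host2.company.com:21000
</verbatim>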
+
+---+++ Implementation Details of Atlas High Availability
+
+The Atlas High Availability work is tracked under the master JIRA [[https://issues.apache.org/jira/browse/ATLAS-510][ATLAS-510]].
+The JIRAs filed under it have detailed information about how the High Availability feature has been implemented.
+At a high level, the following points can be called out:
+
+   * The automatic selection of an Active instance, as well as automatic failover to a new Active instance, happen through a leader election algorithm.
+   * For leader election, we use the [[http://curator.apache.org/curator-recipes/leader-latch.html][Leader Latch Recipe]] of [[http://curator.apache.org][Apache Curator]]. (A minimal sketch of this recipe is shown after this list.)
+   * The Active instance is the only one which initializes, modifies or reads state in the backend stores to keep them consistent.
+   * Also, when an instance is elected as Active, it refreshes any cached information from the backend stores to get up to date.
+   * A servlet filter ensures that only the active instance services user requests. If a passive instance receives these requests, it automatically redirects them to the current active instance.
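For illustration only, the following is a minimal sketch of the Curator Leader Latch recipe itself, not of Atlas's internal code; the Zookeeper quorum string, znode path and participant id are placeholders taken from the examples above.

<verbatim>
// Illustration only: the Curator Leader Latch recipe, not Atlas's internal code.
// The Zookeeper quorum, znode path and participant id below are placeholders.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the same Zookeeper quorum configured for Atlas HA.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every participant creates a latch on a shared znode; Curator elects exactly
        // one leader and notifies participants whenever leadership changes.
        LeaderLatch latch = new LeaderLatch(client, "/apache_atlas/leader-demo", "id1");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("Elected leader: this instance would become ACTIVE.");
            }

            @Override
            public void notLeader() {
                System.out.println("Lost leadership: this instance would become PASSIVE.");
            }
        });
        latch.start();

        Thread.sleep(Long.MAX_VALUE); // keep participating until the process is stopped
    }
}
</verbatim>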
 
 ---++ Metadata Store
 
@@ -84,5 +208,4 @@ to configure Atlas to use Kafka in HA mode, do the following:
 
 ---++ Known Issues
 
-   * [[https://issues.apache.org/jira/browse/ATLAS-338][ATLAS-338]]: ATLAS-338: Metadata events generated from a Hive CLI (as opposed to Beeline or any client going !HiveServer2) would be lost if Atlas server is down.
-   * If the HBase region servers hosting the Atlas ‘titan’ HTable are down, Atlas would not be able to store or retrieve metadata from HBase until they are brought back online.
+   * If the HBase region servers hosting the Atlas ‘titan’ HTable are down, Atlas would not be able to store or retrieve metadata from HBase until they are brought back online.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-atlas/blob/b8f4ffb6/release-log.txt
----------------------------------------------------------------------
diff --git a/release-log.txt b/release-log.txt
index aff850f..a2fd4ea 100644
--- a/release-log.txt
+++ b/release-log.txt
@@ -13,6 +13,7 @@ ATLAS-409 Atlas will not import avro tables with schema read from a file (dosset
 ATLAS-379 Create sqoop and falcon metadata addons (venkatnrangan,bvellanki,sowmyaramesh via shwethags)
 
 ALL CHANGES:
+ATLAS-603 Document High Availability of Atlas (yhemanth via sumasai)
 ATLAS-498 Support Embedded HBase (tbeerbower via sumasai)
 ATLAS-527 Support lineage for load table, import, export (sumasai via shwethags)
 ATLAS-572 Handle secure instance of Zookeeper for leader election.(yhemanth via sumasai)
