Repository: incubator-atlas
Updated Branches:
  refs/heads/master 0003160b0 -> f3ac2c0f1
ATLAS-181 Integrate storm topology metadata into Atlas (svenkat,yhemanth via shwethags)


Project: http://git-wip-us.apache.org/repos/asf/incubator-atlas/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-atlas/commit/f3ac2c0f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-atlas/tree/f3ac2c0f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-atlas/diff/f3ac2c0f

Branch: refs/heads/master
Commit: f3ac2c0f1e2b9e7db2aa867b2a05a68dfee75f1b
Parents: 0003160
Author: Shwetha GS <[email protected]>
Authored: Wed Jan 20 13:12:49 2016 +0530
Committer: Shwetha GS <[email protected]>
Committed: Wed Jan 20 13:12:49 2016 +0530

----------------------------------------------------------------------
 docs/src/site/twiki/Architecture.twiki   |   1 +
 docs/src/site/twiki/StormAtlasHook.twiki | 114 ++++++++++++++++++++++++++
 docs/src/site/twiki/index.twiki          |   2 +-
 release-log.txt                          |   1 +
 4 files changed, 117 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-atlas/blob/f3ac2c0f/docs/src/site/twiki/Architecture.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Architecture.twiki b/docs/src/site/twiki/Architecture.twiki
index 7bfe4b6..d63ac62 100755
--- a/docs/src/site/twiki/Architecture.twiki
+++ b/docs/src/site/twiki/Architecture.twiki
@@ -25,6 +25,7 @@ Available bridges are:
    * [[Bridge-Hive][Hive Bridge]]
    * [[Bridge-Sqoop][Sqoop Bridge]]
    * [[Bridge-Falcon][Falcon Bridge]]
+   * [[StormAtlasHook][Storm Bridge]]
 
 ---++ Notification
 Notification is used for reliable entity registration from hooks and for entity/type change notifications. Atlas, by default, provides Kafka integration, but its possible to provide other implementations as well. Atlas service starts embedded Kafka server by default.


http://git-wip-us.apache.org/repos/asf/incubator-atlas/blob/f3ac2c0f/docs/src/site/twiki/StormAtlasHook.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/StormAtlasHook.twiki b/docs/src/site/twiki/StormAtlasHook.twiki
new file mode 100644
index 0000000..3e560db
--- /dev/null
+++ b/docs/src/site/twiki/StormAtlasHook.twiki
@@ -0,0 +1,114 @@
+---+ Storm Atlas Bridge
+
+---++ Introduction
+
+Apache Storm is a distributed real-time computation system. Storm makes it
+easy to reliably process unbounded streams of data, doing for real-time
+processing what Hadoop did for batch processing. The process is essentially
+a DAG of nodes, which is called a *topology*.
+
+Apache Atlas is a metadata repository that enables end-to-end data lineage,
+search and association of business classifications.
+
+The goal of this integration is to push the operational topology
+metadata along with the underlying data source(s), target(s), derivation
+processes and any available business context so Atlas can capture the
+lineage for this topology.
+
+There are two parts to this process, detailed below:
+   * A data model to represent the concepts in Storm
+   * The Storm Atlas Hook to update metadata in Atlas
+
+
+---++ Storm Data Model
+
+A data model is represented as Types in Atlas. It contains the descriptions
+of various nodes in the topology graph, such as spouts and bolts, and the
+corresponding producer and consumer types.
+
+The following types are added in Atlas:
+
+   * storm_topology - represents the coarse-grained topology. A storm_topology derives from an Atlas Process type and hence can be used to inform Atlas about lineage.
+   * The following data sets are added - kafka_topic, jms_topic, hbase_table, hdfs_data_set. These all derive from an Atlas Dataset type and hence form the end points of a lineage graph.
+   * storm_spout - a data producer with outputs, typically Kafka or JMS
+   * storm_bolt - a data consumer with inputs and outputs, typically Hive, HBase, HDFS, etc.
+
+The Storm Atlas hook auto-registers dependent models, like the Hive data model,
+if it finds that these are not known to the Atlas server.
+
+The data model for each of the types is described in
+the class definition at org.apache.atlas.storm.model.StormDataModel.
+
+---++ Storm Atlas Hook
+
+Atlas is notified when a new topology is registered successfully in
+Storm. Storm provides a hook, backtype.storm.ISubmitterHook, on the Storm client used to
+submit a Storm topology.
+
+The Storm Atlas hook intercepts this hook after topology submission, extracts the metadata from the
+topology and updates Atlas using the types defined above. Atlas implements the
+Storm client hook interface in org.apache.atlas.storm.hook.StormAtlasHook.
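
For orientation, the extension point that the hook plugs into can be sketched roughly as below. This is a hypothetical, minimal implementation, assuming the Storm 0.10.x backtype.storm.ISubmitterHook signature (a TopologyInfo, the raw config Map and the Thrift-generated StormTopology) and the get_name() accessor; it is not part of Atlas, and the real org.apache.atlas.storm.hook.StormAtlasHook performs its metadata extraction at this same point.

<verbatim>
import java.util.Map;

import backtype.storm.ISubmitterHook;
import backtype.storm.generated.StormTopology;
import backtype.storm.generated.TopologyInfo;

// Hypothetical no-op hook, shown only to illustrate the extension point
// that org.apache.atlas.storm.hook.StormAtlasHook implements.
public class LoggingSubmitterHook implements ISubmitterHook {
    @Override
    public void notify(TopologyInfo topologyInfo, Map stormConf, StormTopology topology) {
        // Called on the client after a topology has been submitted successfully;
        // the Atlas hook walks the StormTopology here and registers entities in Atlas.
        System.out.println("Submitted topology: " + topologyInfo.get_name());
    }
}
</verbatim>

Whichever class is configured via storm.topology.submission.notifier.plugin.class (or Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN, see the Configuration section below) is the one the Storm client instantiates and notifies.
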
+
+
+---++ Limitations
+
+The following limitations apply to the first version of the integration:
+
+   * Only new topology submissions are registered with Atlas; any lifecycle changes are not reflected in Atlas.
+   * The Atlas server needs to be online when a Storm topology is submitted for the metadata to be captured.
+   * The hook currently does not support capturing lineage for custom spouts and bolts.
+
+
+---++ Installation
+
+The Storm Atlas Hook needs to be manually installed in Storm on the client side. The hook
+artifacts are available at: $ATLAS_PACKAGE/hook/storm
+
+The Storm Atlas hook jars need to be copied to $STORM_HOME/extlib,
+where STORM_HOME is the Storm installation path.
+
+Restart all Storm daemons after you have installed the Atlas hook into Storm.
+
+
+---++ Configuration
+
+---+++ Storm Configuration
+
+The Storm Atlas Hook needs to be configured in the Storm client config
+in *$STORM_HOME/conf/storm.yaml* as:
+
+<verbatim>
+storm.topology.submission.notifier.plugin.class: "org.apache.atlas.storm.hook.StormAtlasHook"
+</verbatim>
+
+Also set a 'cluster name' that is used as a namespace for objects registered in Atlas.
+This name is used for namespacing the Storm topology, spouts and bolts.
+
+Other objects, like data sets, should ideally be identified with the cluster name of
+the components that generate them. For example, Hive tables and databases should be
+identified using the cluster name set in Hive. The Storm Atlas hook will pick this up
+if the Hive configuration is available in the Storm topology jar that is submitted on
+the client and the cluster name is defined there. The same applies to HBase
+data sets. If this configuration is not available, the cluster name set in the Storm
+configuration will be used.
+
+<verbatim>
+atlas.cluster.name: "cluster_name"
+</verbatim>
+
+In *$STORM_HOME/conf/storm_env.ini*, set an environment variable as follows:
+
+<verbatim>
+STORM_JAR_JVM_OPTS:"-Datlas.conf=$ATLAS_HOME/conf/"
+</verbatim>
+
+where ATLAS_HOME points to the Atlas installation directory.
+
+You could also set this up programmatically in the Storm Config as:
+
+<verbatim>
+    Config stormConf = new Config();
+    ...
+    stormConf.put(Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN,
+            org.apache.atlas.storm.hook.StormAtlasHook.class.getName());
+</verbatim>
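
Putting the programmatic pieces together, a complete client-side submission might look like the sketch below. This is only an illustrative sketch, not code from the Atlas or Storm distributions: the SubmitWithAtlasHook class, the topology wiring and the topology name are placeholders, the hook class is referenced by its fully qualified name as a string so the snippet does not need the Atlas hook jar on the compile classpath, and passing atlas.cluster.name through the Storm config is an assumption mirroring the storm.yaml entry above.

<verbatim>
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitWithAtlasHook {
    public static void main(String[] args) throws Exception {
        Config stormConf = new Config();
        // Register the Atlas hook so topology metadata is pushed on submission.
        stormConf.put(Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN,
                "org.apache.atlas.storm.hook.StormAtlasHook");
        // Assumption: the cluster name can also be supplied through the Storm config,
        // mirroring the atlas.cluster.name entry in storm.yaml.
        stormConf.put("atlas.cluster.name", "cluster_name");

        TopologyBuilder builder = new TopologyBuilder();
        // ... set spouts and bolts on the builder here ...

        StormSubmitter.submitTopology("my-topology", stormConf, builder.createTopology());
    }
}
</verbatim>
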

http://git-wip-us.apache.org/repos/asf/incubator-atlas/blob/f3ac2c0f/docs/src/site/twiki/index.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/index.twiki b/docs/src/site/twiki/index.twiki
index 6f7333c..8c57d06 100755
--- a/docs/src/site/twiki/index.twiki
+++ b/docs/src/site/twiki/index.twiki
@@ -49,9 +49,9 @@ allows integration with the whole enterprise data ecosystem.
    * [[Bridge-Hive][Hive Bridge]]
    * [[Bridge-Sqoop][Sqoop Bridge]]
    * [[Bridge-Falcon][Falcon Bridge]]
+   * [[StormAtlasHook][Storm Bridge]]
    * [[HighAvailability][Fault Tolerance And High Availability Options]]
-
 ---++ API Documentation
 
    * <a href="api/rest.html">REST API Documentation</a>


http://git-wip-us.apache.org/repos/asf/incubator-atlas/blob/f3ac2c0f/release-log.txt
----------------------------------------------------------------------
diff --git a/release-log.txt b/release-log.txt
index 5f3fd06..819300a 100644
--- a/release-log.txt
+++ b/release-log.txt
@@ -7,6 +7,7 @@ ATLAS-409 Atlas will not import avro tables with schema read from a file (dosset
 ATLAS-379 Create sqoop and falcon metadata addons (venkatnrangan,bvellanki,sowmyaramesh via shwethags)
 
 ALL CHANGES:
+ATLAS-181 Integrate storm topology metadata into Atlas (svenkat,yhemanth via shwethags)
 ATLAS-311 UI: Local storage for traits - caching [not cleared on refresh] To be cleared on time lapse for 1hr (Anilg via shwethags)
 ATLAS-106 Store createTimestamp and modified timestamp separately for an entity (dkantor via shwethags)
 ATLAS-433 Fix checkstyle issues for common and notification module (shwethags)
