[ https://issues.apache.org/jira/browse/ATLAS-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Ahn updated ATLAS-184:
-----------------------------
    Description:
Apache Sqoop Integration with Apache Atlas (incubating)

Introduction
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Apache Atlas is a metadata repository that enables end-to-end data lineage, search, and the association of business classifications.

Overview
The goal of this integration is, at a minimum, to push the Sqoop-generated query metadata along with the source provenance, target(s), and any available business context so that Atlas can capture the lineage for this topology.
There are 2 parts in this process, detailed below:
1. Data model to represent the concepts in Sqoop
2. Sqoop Bridge/Hook to update metadata in Atlas

Data Model
A data model is represented as a Type in Atlas. The Sqoop model can reuse, or be closely modeled after, the Hive data types that already exist. At the least, we need to create types for:
• Sqoop processes containing the SQL query text, start/end times, user, etc.
• Source provenance, fine-grained at the DB, table, column, etc. level, so that we have a 1-1 mapping between source and target assets
• Target (typically Hive, HBase, HDFS, etc.)
You can take a look at the data model code for Hive as a reference.

Pushing Metadata into Atlas
There are 3 parts to the bridge:
1. Sqoop Bridge
This does not apply to the Sqoop tool. However, it will apply if and when we migrate to Sqoop 2.
2. Post-execution Hook
Atlas needs to be notified when a new Sqoop ingest is executed successfully or when someone changes the definition of an existing Sqoop job.
You can refer to the hook code for Hive.
3. Column-level lineage
It would be good to have column-level lineage for data flowing from the source database/WH into Hive.
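To make the proposal above more concrete, here is a minimal sketch of how the three kinds of types (process, source, target) and a post-execution hook payload could fit together. Everything in it is an assumption for illustration: the class names, field names, the `qualified_name` format, and the payload keys are hypothetical and are not the Atlas type system, the Hive data model, or any Sqoop API. Column-level lineage is paired by position here, which only holds if the import keeps the column order.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple

# NOTE: all names below are hypothetical illustrations, not Atlas or Sqoop APIs.


@dataclass
class DataSet:
    """A source or target asset, described down to column level."""
    kind: str                      # e.g. "rdbms_table" or "hive_table" (assumed labels)
    database: str
    table: str
    columns: List[str] = field(default_factory=list)

    def qualified_name(self) -> str:
        return f"{self.kind}:{self.database}.{self.table}"


@dataclass
class SqoopProcess:
    """One Sqoop run: query text, user, timing, and its source and target."""
    query_text: str
    user: str
    start_time: int                # epoch milliseconds
    end_time: int
    source: DataSet
    target: DataSet


def column_lineage(process: SqoopProcess) -> List[Tuple[str, str]]:
    """Pair source and target columns by position (assumes a 1-1 mapping)."""
    src, tgt = process.source, process.target
    if len(src.columns) != len(tgt.columns):
        raise ValueError("source and target column counts differ")
    return [
        (f"{src.qualified_name()}.{s}", f"{tgt.qualified_name()}.{t}")
        for s, t in zip(src.columns, tgt.columns)
    ]


def build_hook_message(process: SqoopProcess, succeeded: bool) -> Optional[dict]:
    """Build the payload a post-execution hook might send; None if the job failed."""
    if not succeeded:
        return None
    return {
        "process": asdict(process),
        "inputs": [process.source.qualified_name()],
        "outputs": [process.target.qualified_name()],
        "columnLineage": column_lineage(process),
    }
```

A hook built along these lines would call `build_hook_message` once the import finishes and forward the result to Atlas by whatever transport the integration settles on; that transport is deliberately left out of the sketch.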
> Integrate Sqoop metadata into Atlas
> -----------------------------------
>
>                 Key: ATLAS-184
>                 URL: https://issues.apache.org/jira/browse/ATLAS-184
>             Project: Atlas
>          Issue Type: Improvement
>    Affects Versions: 0.6-incubating
>            Reporter: Venkatesh Seetharam
>             Fix For: 0.6-incubating

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)