Hi All,

Lineage is a big subject, so it’s hard to cover all angles. I’ll describe what 
we are doing (or attempting to do) at ING Bank.

For us it’s important to capture every step of what happens with the data. This 
is required from a regulatory, security and model performance perspective. We 
are a big company with a lot of legacy, and thus we deploy several technologies 
that capture this to a certain extent. The main technologies are Apache Atlas 
and IBM’s IGC. Federation happens (will happen) through Egeria.

To capture events you can push or pull. I see many companies opt for pull 
first, but eventually shift to a hybrid scenario: push to capture as quickly as 
possible, pull to close gaps and to further enrich. Storing lineage data is 
often done in a graph database. I do see companies sometimes reinvent the wheel 
here. A lot of thought has been put into Apache Atlas, but its documentation 
and UI are a bit lacking (hence our involvement with Amundsen).

My team started with Apache Atlas and with push, as Apache Atlas delivers many 
connectors out of the box (Hive, HDFS, Kylin etc.; there is a custom, somewhat 
immature one for Spark). Once you go outside of this environment, lineage is 
gone. As Airflow orchestrates the workflows, it can fill much of this gap. This 
is where Airflow’s inlets and outlets come in.

The inlets and outlets concept isn’t mature yet; it’s a way to track lineage 
inside Airflow but also, with the right backend, outside of it. Backends can be 
Apache Atlas, but also for example Lyft’s Amundsen. Using inlets/outlets might 
seem a bit awkward at first, but it can actually speed up development of DAGs 
and generalize some patterns. You could for example create a DAG that, 
independent of the table it is given, removes PII data based on the metadata 
that is associated with it. Using inlets and outlets also allows a different 
paradigm in Airflow (I think we discussed this as dataflow in the past), where 
you set dependencies on data rather than on tasks.
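
Roughly, such a generic task could look like the sketch below (reusing the dag 
object from the sketch above; lookup_columns and mask_columns are placeholders 
for whatever metadata store and masking step you have, they don’t exist as 
such):

from airflow.operators.python_operator import PythonOperator

def scrub_pii(table_name, **context):
    # placeholder: look up column metadata (e.g. in Atlas or Amundsen)
    # and return a list of dicts with "name" and "tags" per column
    columns = lookup_columns(table_name)
    pii_columns = [c["name"] for c in columns if "PII" in c["tags"]]
    # placeholder: the actual masking/removal step
    mask_columns(table_name, pii_columns)

scrub = PythonOperator(
    task_id="scrub_pii",
    python_callable=scrub_pii,
    provide_context=True,
    # the table to scrub is passed in at trigger time; the task itself
    # does not care which table it gets, the metadata drives the behaviour
    op_kwargs={"table_name": "{{ dag_run.conf['table'] }}"},
    dag=dag,
)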

I hope this helps. The inlets and outlets feature can really use some help and 
use cases driving it. The meta part of it can be a bit daunting at first, but I 
really believe that when we get this right it can ease a lot of development and 
will take Airflow to the next level of where data is going.

Cheers
Bolke


Sent from my iPad

> On 6 Jun 2019, at 15:45, Jason Rich <[email protected]> wrote:
> 
> Great questions. This makes two of us thinking about data lineage inside 
> and/or outside of Airflow. 
> 
> Thank you for the questions Germain. 
> 
> Cheers,
> Jason
> 
>> On Jun 6, 2019, at 9:19 AM, Germain TANGUY 
>> <[email protected]> wrote:
>> 
>> Hello everyone,
>> We were wondering if some of you are using a data lineage solution?
>> 
>> We are aware of the experimental inlets/outlets with Apache 
>> Atlas<https://airflow.readthedocs.io/en/stable/lineage.html>; does someone 
>> have feedback to share?
>> Does someone have experience with other solutions outside Airflow (as not 
>> all workflows are necessarily Airflow DAGs)?
>> 
>> In my current company, we have hundreds of DAGs that run every day, many of 
>> which depend on data built by another DAG (DAGs are often chained through 
>> sensors on partitions or files in buckets, not trigger_dag). When one of the 
>> DAGs fails, downstream DAGs will also start failing once their retries 
>> expire; similarly, when we discover a bug in the data, we want to mark that 
>> data as tainted, so the challenge resides in determining the impacted 
>> downstream DAGs (possibly having to convert periodicities) and then clearing 
>> them.
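>> 
>> (For reference, the chaining pattern typically looks something like the 
>> snippet below, inside the downstream DAG; the bucket, key and ids are made 
>> up.)
>> 
>> from airflow.sensors.s3_key_sensor import S3KeySensor
>> 
>> # wait for the file the upstream DAG writes for the same day, instead of
>> # being triggered explicitly by that DAG
>> wait_for_upstream = S3KeySensor(
>>     task_id="wait_for_upstream_export",
>>     bucket_name="datalake",
>>     bucket_key="exports/{{ ds }}/orders/_SUCCESS",
>>     poke_interval=300,
>>     timeout=6 * 60 * 60,
>>     dag=dag,  # the downstream DAG object
>> )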
>> 
>> Rebuilding from scratch is not ideal but we haven't found something that 
>> suits our needs so our idea is to implement the following :
>> 
>> 1.  Build a status table that describes the state of each artifact produced 
>> by our DAGs (valid, estimated, tainted, etc.); we think this can be done 
>> through Airflow's "on_success_callback" (see the sketch after this list).
>> 2.  Create a graph of our artifact class model and the pipelines producing 
>> their (time-based) instances, so that an Airflow sensor can easily know what 
>> the status of the parent artifacts is through an API. We would use this 
>> sensor before running the task that creates each artifact.
>> 3.  From this we can handle the failure use case: we can create an API that 
>> takes a DAG and an execution date as input and returns the list of tasks to 
>> clear and the DAGRuns to start downstream.
>> 4.  From this we can handle the tainted/backfill use case: we can build an 
>> on-the-fly @once DAG which will update the data status table to taint all 
>> the downstream data sources built from the corrupted one and then clear all 
>> DAGs dependent on the corrupted one (which then wait to be reprocessed...).
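>> 
>> To make points 1 and 2 a bit more concrete, the skeleton we have in mind 
>> looks roughly like this (the status API, its endpoints and the helper names 
>> are hypothetical, none of this exists yet):
>> 
>> import requests
>> 
>> from airflow.sensors.base_sensor_operator import BaseSensorOperator
>> from airflow.utils.decorators import apply_defaults
>> 
>> STATUS_API = "http://status-api.internal/artifacts"  # hypothetical service
>> 
>> def record_artifact_status(context):
>>     # point 1: on_success_callback marking the artifact produced by this
>>     # task instance as valid in the status table
>>     requests.post(STATUS_API, json={
>>         "dag_id": context["dag"].dag_id,
>>         "task_id": context["task"].task_id,
>>         "execution_date": context["execution_date"].isoformat(),
>>         "status": "valid",
>>     })
>> 
>> class ArtifactStatusSensor(BaseSensorOperator):
>>     # point 2: block until all parent artifacts of a given artifact are valid
>>     @apply_defaults
>>     def __init__(self, artifact_id, *args, **kwargs):
>>         super(ArtifactStatusSensor, self).__init__(*args, **kwargs)
>>         self.artifact_id = artifact_id
>> 
>>     def poke(self, context):
>>         resp = requests.get(
>>             "{}/{}/parents".format(STATUS_API, self.artifact_id),
>>             params={"execution_date": context["execution_date"].isoformat()},
>>         )
>>         return all(p["status"] == "valid" for p in resp.json())
>> 
>> Tasks that produce an artifact would get 
>> on_success_callback=record_artifact_status, and every task that consumes one 
>> would be preceded by an ArtifactStatusSensor.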
>> 
>> Any experience shared will be much appreciated.
>> 
>> Thank you!
>> 
>> Germain T.
> 
