Ok, I’ve put the most interesting pieces into a small Gist:
https://gist.github.com/ms32035/f85cfbaff132f0d0ec8c309558330b7d
The solution is based on SQLAlchemy’s declarative layer, which I found to
be closest to what you can find in a metadata repository in commercial ETL
tools. Luckily, most
Forgot to answer your question: for S3 it could look like:
s3_file = File("s3a://bucket/key")
inlets = {"datasets": [s3_file]}
Obviously, if you do something with the S3 file outside of Airflow, you need to
track its lineage yourself somehow.
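Expanded into a runnable sketch; the `File` class below is a hypothetical stand-in for whatever dataset class the lineage API provides, and only illustrates the shape of the inlets declaration:

```python
# Hypothetical sketch only: a stand-in File dataset class and the inlets
# dict from the example above. The real lineage API may differ.

class File:
    """A dataset identified by a URI, e.g. an object on S3."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"File({self.name!r})"


s3_file = File("s3a://bucket/key")

# A task would declare the datasets it reads as inlets
# (and the ones it writes as outlets):
inlets = {"datasets": [s3_file]}

print(inlets["datasets"][0])  # File('s3a://bucket/key')
```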
B.
Sent from my iPhone
> On 6 May 2018, at 11:05,
Hi Gerardo,
Any lineage tracking system depends on how much data you can give it. If you
do transfers outside of that system's 'view', the lineage information is lost.
Airflow can help in this area by tracking its internal lineage and providing
it to those lineage systems.
Hi Bolke,
Data lineage support sounds very interesting.
I'm not very familiar with Atlas, but at first sight it seems like a tool
specific to the Hadoop ecosystem. What would this look like if the files
(inlets or outlets) were stored on S3?
An example of a service that manages a similar use case is
Hi Marcin,
That would be awesome! The reason I chose to use DataSets is that they align
easily with Apache Atlas’ understanding of what a dataset is. I had no other
example apart from IBM’s InfoSphere, which I would really rather not get into.
So I am definitely open to changes.
Another thing
^this
On Sat, May 5, 2018, 15:37 Marcin Szymański wrote:
> Hi Bolke
>
> Great stuff. Pieces of this remind me of work I have done for one
> organization. However, in that case, instead of defining base classes like
> Dataset from scratch, I extended objects from SQLAlchemy,
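Marcin's idea of reusing SQLAlchemy's declarative layer as the dataset metadata could be sketched roughly like this (the table and column names are illustrative only, and the import path shown is the modern `sqlalchemy.orm.declarative_base`):

```python
# Illustrative sketch: instead of a hand-rolled Dataset class, reuse
# SQLAlchemy's declarative metadata as the dataset description. The
# table and column names here are made up for the example.

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Orders(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer = Column(String)


# The declarative metadata already describes the dataset (name, columns,
# types), much like a metadata repository in a commercial ETL tool:
print(Orders.__table__.name)                       # orders
print([c.name for c in Orders.__table__.columns])  # ['id', 'customer']
```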