Re: Lineage

2018-05-06 Thread Marcin Szymański
Ok, I’ve put the most interesting pieces into a small Gist: https://gist.github.com/ms32035/f85cfbaff132f0d0ec8c309558330b7d The solution is based on SQLAlchemy’s declarative layer, which I found to be closest to what you can find in a metadata repository in commercial ETL tools. Luckily, most
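As a rough illustration of the approach Marcin describes (a lineage metadata model built on SQLAlchemy's declarative layer), a minimal hypothetical sketch might look like the following; the class and table names are illustrative and are not taken from the linked Gist:

    # Hypothetical sketch: datasets and lineage edges as declarative models.
    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import relationship, sessionmaker

    Base = declarative_base()

    class Dataset(Base):
        """A logical dataset (table, file, topic) known to the repository."""
        __tablename__ = "dataset"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False, unique=True)
        location = Column(String)  # e.g. s3://bucket/key or schema.table

    class LineageEdge(Base):
        """A directed edge: a source dataset feeds a target dataset via a task."""
        __tablename__ = "lineage_edge"
        id = Column(Integer, primary_key=True)
        source_id = Column(Integer, ForeignKey("dataset.id"), nullable=False)
        target_id = Column(Integer, ForeignKey("dataset.id"), nullable=False)
        task_id = Column(String)  # Airflow task that produced the edge
        source = relationship("Dataset", foreign_keys=[source_id])
        target = relationship("Dataset", foreign_keys=[target_id])

    if __name__ == "__main__":
        engine = create_engine("sqlite:///:memory:")
        Base.metadata.create_all(engine)
        session = sessionmaker(bind=engine)()
        raw = Dataset(name="raw_events", location="s3://bucket/raw/")
        agg = Dataset(name="daily_agg", location="warehouse.daily_agg")
        session.add_all([raw, agg])
        session.flush()
        session.add(LineageEdge(source=raw, target=agg, task_id="aggregate_events"))
        session.commit()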

Re: Lineage

2018-05-06 Thread Bolke de Bruin
Forgot to answer your question. For S3 it could look like: s3_file = File("s3a://bucket/key") inlets = {"datasets": [s3_file,]} Obviously, if you do something with the S3 file outside of Airflow, you need to track lineage yourself somehow. B. Sent from my iPhone > On 6 May 2018, at 11:05,
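For context, a hedged sketch of how that declaration might sit inside a full DAG, assuming the experimental lineage API of the Airflow 1.10 era (airflow.lineage.datasets.File plus the inlets/outlets operator arguments; module paths and the dict-based format changed in later versions):

    # Sketch only: S3 inlets/outlets on an operator, Airflow 1.10-style lineage.
    from datetime import datetime

    from airflow import DAG
    from airflow.lineage.datasets import File
    from airflow.operators.bash_operator import BashOperator

    s3_file = File("s3a://bucket/key")
    out_file = File("s3a://bucket/processed/key")

    dag = DAG(
        dag_id="lineage_s3_example",
        start_date=datetime(2018, 5, 1),
        schedule_interval=None,
    )

    process = BashOperator(
        task_id="process_s3_file",
        bash_command="echo 'transform the object here'",
        # inlets/outlets are dicts keyed by "datasets" in this API version
        inlets={"datasets": [s3_file]},
        outlets={"datasets": [out_file]},
        dag=dag,
    )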

Re: Lineage

2018-05-06 Thread Bolke de Bruin
Hi Gerardo, Any lineage tracking system is dependent on how much data you can give it. So if you do transfers outside of the 'view' such a system has, then lineage information is gone. Airflow can help in this area by tracking its internal lineage and providing that to those lineage systems.
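For reference, handing Airflow's internal lineage over to a system like Atlas was done through a lineage backend; the configuration shown in the 1.10-era documentation looked roughly like the snippet below (host, port, and credentials are placeholders):

    [lineage]
    backend = airflow.lineage.backend.atlas.AtlasBackend

    [atlas]
    username = my_username
    password = my_password
    host = atlas-host.example.com
    port = 21000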

Re: Lineage

2018-05-06 Thread Gerardo Curiel
Hi Bolke, Data lineage support sounds very interesting. I'm not very familiar with Atlas, but at first sight it seems like a tool specific to the Hadoop ecosystem. What would this look like if the files (inlets or outlets) were stored on S3? An example of a service that manages a similar use case is

Re: Lineage

2018-05-06 Thread Bolke de Bruin
Hi Marcin, That would be awesome! The reason I chose to use DataSets is that it aligns easily with Apache Atlas’ understanding of what a dataset is. I had no other example apart from IBM’s InfoSphere, which I would really rather not get into. So I am definitely open to changes. Another thing
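To make the Atlas alignment concrete, a hedged, illustrative sketch of a DataSet base class shaped to map onto an Atlas entity (a type name plus attributes) could look like this; it is not the exact Airflow implementation:

    # Illustrative only: a DataSet that serializes into an Atlas-style entity.
    class DataSet(object):
        type_name = "dataSet"   # Atlas type this entity maps to
        attributes = []         # attribute names sent to Atlas

        def __init__(self, qualified_name=None, data=None):
            self.qualified_name = qualified_name
            self.data = data or {}

        def as_dict(self):
            # Shape roughly matching an Atlas entity payload.
            attrs = {name: self.data.get(name) for name in self.attributes}
            attrs["qualifiedName"] = self.qualified_name
            return {"typeName": self.type_name, "attributes": attrs}

    class File(DataSet):
        type_name = "fs_path"
        attributes = ["name", "path"]

        def __init__(self, name):
            super(File, self).__init__(qualified_name=name,
                                       data={"name": name, "path": name})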

Re: Lineage

2018-05-06 Thread Alex Tronchin-James 949-412-7220
^this On Sat, May 5, 2018, 15:37 Marcin Szymański wrote: > Hi Bolke > > Great stuff. Pieces of this remind me of work I have done for one > organization. However, in that case, instead of defining base classes like > Dataset from scratch, I extended objects from SQLAlchemy,