Re: Flow-based Airflow?

2017-02-06 Thread Dan Davydov
Whoops, looks like I replied to the wrong thread! Thanks Bolke. On Mon, Feb 6, 2017 at 1:42 PM, Bolke de Bruin wrote: > Dataflow or 1.8? > > Sent from my iPhone > > > On 6 Feb 2017, at 22:35, Dan Davydov > wrote: > > > > We have been running in

Re: Flow-based Airflow?

2017-02-06 Thread Bolke de Bruin
Dataflow or 1.8? Sent from my iPhone > On 6 Feb 2017, at 22:35, Dan Davydov wrote: > > We have been running in our staging and have found a couple of issues. I > will report back with them soon. > >> On Thu, Feb 2, 2017 at 2:23 PM, Jeremiah Lowin

Re: Flow-based Airflow?

2017-02-06 Thread Dan Davydov
We have been running in our staging and have found a couple of issues. I will report back with them soon. On Thu, Feb 2, 2017 at 2:23 PM, Jeremiah Lowin wrote: > Very good point -- however I'm hesitant to overcomplicate the base class. > At the moment users only have to

Re: Flow-based Airflow?

2017-02-02 Thread Jeremiah Lowin
Very good point -- however I'm hesitant to overcomplicate the base class. At the moment users only have to override "serialize()" and "deserialize()" to build any form of remote-backed dataflow, and I like the simplicity of that. However, if you look at my implementation of the GCSDataflow, the
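
A minimal sketch of the override pattern described in this message, assuming a hypothetical `Dataflow` base class with `serialize()` and `deserialize()` hooks. The class names, the local-file backend, and the JSON format are illustrative assumptions, not the actual implementation discussed in the thread.

```python
import json
import os
import tempfile


class Dataflow:
    """Hypothetical base class: subclasses override serialize() and deserialize()."""

    def serialize(self, data):
        # Default behaviour (assumed): hand the data through unchanged.
        return data

    def deserialize(self, stored):
        # Default behaviour (assumed): return whatever was stored.
        return stored


class LocalFileDataflow(Dataflow):
    """Hypothetical subclass: writes data to a file and passes only the path along."""

    def __init__(self, directory):
        self.directory = directory

    def serialize(self, data):
        # Write the payload to a new file and return a small reference to it.
        fd, path = tempfile.mkstemp(suffix=".json", dir=self.directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
        return path

    def deserialize(self, stored):
        # Resolve the reference back into the original payload.
        with open(stored) as handle:
            return json.load(handle)
```

The upstream task would call `serialize()` and pass on only the returned path; the downstream task would call `deserialize()` on that path. A remote-backed variant would presumably swap the file calls for an object-store client.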

Re: Flow-based Airflow?

2017-02-02 Thread Laura Lorenz
This is great! We work with a lot of external data in wildly non-standard formats so another enhancement here we'd use and support is passing customizable serializers to Dataflow subclasses. This would let the dataflows keyword arg for a task handle dependency management, the Dataflow class or

Re: Flow-based Airflow?

2017-02-01 Thread Jeremiah Lowin
Great point. I think the best solution is to solve this for all XComs by checking object size before adding it to the DB. I don't see a built-in way of handling it (though apparently MySQL is internally limited to 64 KB). I'll look into a PR that would enforce a similar limit for all databases. On
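
A rough sketch of the size check proposed in this message, assuming the value is pickled before storage and that the limit is the 64 KB figure mentioned above. The function name, the exception, and the constant are hypothetical and are not part of Airflow's API.

```python
import pickle

# Assumed limit, taken from the 64 KB figure mentioned in the message.
MAX_XCOM_BYTES = 64 * 1024


class XComTooLargeError(ValueError):
    """Hypothetical error raised when a value exceeds the size limit."""


def check_xcom_size(value, max_bytes=MAX_XCOM_BYTES):
    """Serialize value and refuse it if the payload exceeds max_bytes."""
    payload = pickle.dumps(value)
    if len(payload) > max_bytes:
        raise XComTooLargeError(
            "XCom payload is %d bytes; limit is %d" % (len(payload), max_bytes)
        )
    # Return the serialized payload so the caller can store it.
    return payload
```

Running the check before the database write means an oversized value fails in the task that produced it rather than bloating the metadata database.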

Re: Flow-based Airflow?

2017-02-01 Thread Maxime Beauchemin
I'm not sure about XCom being the default, it seems pretty dangerous. It just takes one person that is not fully aware of the size of the data, or one day with an outlier and that could put the Airflow db in jeopardy. I guess it's always been an aspect of XCom, and it could be good to have some

Re: Flow-based Airflow?

2017-02-01 Thread Jeremiah Lowin
Yesterday I began converting a complex script to a DAG. It turned out to be a perfect test case for the dataflow model: a big chunk of data moving through a series of modification steps. So I have built an extensible dataflow extension for Airflow on top of XCom and the existing dependency

Re: Flow-based Airflow?

2017-01-26 Thread Jeremiah Lowin
Arthur, That's an excellent point. I think it touches on some ambiguity in how we conceptualize these two types of workflows (task-driven and data-driven) and so I think we should develop a more concrete vocabulary for describing them. Here's my attempt. We are all comfortable with a

Re: Flow-based Airflow?

2017-01-25 Thread Arthur Wiedmer
From our own data warehouse, there are definitely cases where knowing that the data is there is not enough. While I agree that ideally the dependency in data should be explicit, the current dependency engine allows you to compress some of the data dependencies by using the task dependencies.

Re: Flow-based Airflow?

2017-01-25 Thread Maxime Beauchemin
Related: when Dan worked on making the dependency engine more modular, we were talking about allowing for composition of dependency rules, allowing people to use parentheses and logical AND and OR operators, and defining new dependency rules. A mock of what the API could look like: # in this

Re: Flow-based Airflow?

2017-01-25 Thread Jeremiah Lowin
At the simplest level, a data-dependency should just create an automatic task-dependency (since a task shouldn't run before required data is available). Therefore it should be possible to reason about dataflow using the existing dependency framework. Is there any reason that wouldn't hold for all
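
One way to picture the claim in this message is the sketch below, which derives task dependencies from declared inputs and outputs. The dictionary layout and the function name are assumptions made for illustration; nothing here is Airflow API.

```python
def derive_task_dependencies(tasks):
    """Map each task name to the set of tasks that produce its inputs.

    tasks: dict of task name -> {"inputs": iterable, "outputs": iterable}
    (a hypothetical structure used only for this sketch).
    """
    # Index every data object by the task that produces it.
    producers = {}
    for name, spec in tasks.items():
        for output in spec.get("outputs", ()):
            producers[output] = name

    # A task depends on whichever task produces each of its inputs.
    upstream = {name: set() for name in tasks}
    for name, spec in tasks.items():
        for needed in spec.get("inputs", ()):
            producer = producers.get(needed)
            if producer is not None and producer != name:
                upstream[name].add(producer)
    return upstream
```

Inputs that no task produces (external data, for example) create no task dependency in this sketch; they would need a separate check such as a sensor.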

Re: Flow-based Airflow?

2017-01-24 Thread Maxime Beauchemin
I'm happy working on a design doc. I don't think Sankeys are the way to go as they are typically used to show some metric (say number of users flowing through pages on a website), and even if we'd have something like row count throughout I don't think we'd want to make it that centric to the

Re: Flow-based Airflow?

2017-01-23 Thread Maxime Beauchemin
A few other thoughts related to this. Early on in the project, I had designed but never launched a feature called "data lineage annotations" allowing people to define a list of sources, and a list of targets related to each task for documentation purposes. My idea was to use a simple annotation

Re: Flow-based Airflow?

2017-01-23 Thread Maxime Beauchemin
Just commented on the blog post: I agree that workflow engines should expose a way to document data objects it reads from and writes to, so that it can be aware of the full graph of tasks and data objects and how it all relates. This metadata allows for clarity around

Re: Flow-based Airflow?

2017-01-23 Thread Van Klaveren, Brian N.
I can give some insight from the physics world as far as this goes. First off, I think the dataflow puck is moving to platforms like Apache Beam. The main reason people (in science) don't just use Beam would be because they don't control the clusters they execute on. This is almost always true

Re: Flow-based Airflow?

2017-01-23 Thread Bolke de Bruin
Oh, that’s interesting! I think the way Airflow uses tasks doesn’t entirely fit with the Flow model, e.g. in Luigi it is normal to derive from a Task. In Tasks you can just add the inlets (data dependency) you require for your particular dag. In Airflow we use templating more extensively and

Re: Flow-based Airflow?

2017-01-23 Thread Edwards, Jesse
Sent: Monday, January 23, 2017 10:41:47 AM To: dev@airflow.incubator.apache.org Subject: Re: Flow-based Airflow? this is a good discussion. Most of traditional ETL tools (SSIS, Informatica, DataStage etc.) have both - control flow (or task dependency) and data flow. Some tools like SSIS make a c

Re: Flow-based Airflow?

2017-01-23 Thread Boris Tyukin
This is a good discussion. Most traditional ETL tools (SSIS, Informatica, DataStage, etc.) have both control flow (or task dependency) and data flow. Some tools, like SSIS, make a clear distinction between them: you create a control flow that calls data flows as part of the overall control flow.

Re: Flow-based Airflow?

2017-01-23 Thread Glenn McClements
We’ve just started using Airflow as a platform to replace some older internally built systems, but one of the things we also looked at was a _newer_ internally built system which basically did the below. In fact it came as a surprise when I started looking around at open source systems like

Re: Flow-based Airflow?

2017-01-23 Thread Laura Lorenz
We were struggling with the same problem and came up with fileflow, which we wrote to deal with passing data down a DAG in Airflow. We co-opt Airflow's task dependency system to represent the data dependencies and let fileflow handle knowing where

Flow-based Airflow?

2017-01-23 Thread Bolke de Bruin
Hi All, I came across a write-up of some of the downsides in current workflow management systems like Airflow and Luigi (http://bionics.it/posts/workflows-dataflow-not-task-deps) where they argue dependencies should be between inputs and outputs of tasks rather than between tasks