We have been running this in our staging environment and have found a couple of issues. I will report back with them soon.
On Thu, Feb 2, 2017 at 2:23 PM, Jeremiah Lowin <jlo...@apache.org> wrote:

> Very good point -- however, I'm hesitant to overcomplicate the base
> class. At the moment users only have to override "serialize()" and
> "deserialize()" to build any form of remote-backed dataflow, and I
> like the simplicity of that.
>
> However, if you look at my implementation of the GCSDataflow, the
> constructor gets passed serializer and deserializer functions that
> are applied to the data before storage and after recovery. I think
> that sort of runtime-configurable serialization is in the spirit of
> what you're describing, and it should be straightforward to adapt it
> for more specific requirements.
>
> On Thu, Feb 2, 2017 at 12:37 PM Laura Lorenz <llor...@industrydive.com>
> wrote:
>
> > This is great!
> >
> > We work with a lot of external data in wildly non-standard formats,
> > so another enhancement here we'd use and support is passing
> > customizable serializers to Dataflow subclasses. This would let the
> > dataflows keyword arg for a task handle dependency management, the
> > Dataflow class or subclasses handle IO, and the Serializer
> > subclasses handle parsing.
> >
> > Happy to contribute here, perhaps by creating an S3Dataflow subclass
> > in the style of your Google Cloud Storage one for this PR.
> >
> > Laura
> >
> > On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin <jlo...@apache.org>
> > wrote:
> >
> > > Great point. I think the best solution is to solve this for all
> > > XComs by checking object size before adding it to the DB. I don't
> > > see a built-in way of handling it (though apparently MySQL is
> > > internally limited to 64kb). I'll look into a PR that would
> > > enforce a similar limit for all databases.
> > >
> > > On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > I'm not sure about XCom being the default; it seems pretty
> > > dangerous. It just takes one person who is not fully aware of the
> > > size of the data, or one day with an outlier, and that could put
> > > the Airflow db in jeopardy.
> > >
> > > I guess it's always been an aspect of XCom, and it could be good
> > > to have some explicit gatekeeping there regardless of this
> > > PR/feature. Perhaps the DB itself has protection against large
> > > blobs?
> > >
> > > Max
> > >
> > > On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin <jlo...@apache.org>
> > > wrote:
> > >
> > > > Yesterday I began converting a complex script to a DAG. It
> > > > turned out to be a perfect test case for the dataflow model: a
> > > > big chunk of data moving through a series of modification steps.
> > > >
> > > > So I have built an extensible dataflow extension for Airflow on
> > > > top of XCom and the existing dependency engine:
> > > > https://issues.apache.org/jira/browse/AIRFLOW-825
> > > > https://github.com/apache/incubator-airflow/pull/2046 (still
> > > > waiting for tests... it will be quite embarrassing if they
> > > > don't pass)
> > > >
> > > > The philosophy is simple: Dataflow objects represent the output
> > > > of upstream tasks. Downstream tasks add Dataflows with a
> > > > specific key. When the downstream task runs, the (optionally
> > > > indexed) upstream result is available in the downstream context
> > > > under context['dataflows'][key]. In addition, PythonOperators
> > > > receive the data as a keyword argument.
> > > >
> > > > The basic Dataflow serializes the data through XComs, but is
> > > > trivially extended to alternative storage via subclasses. I
> > > > have provided (in contrib) implementations of a local
> > > > filesystem-based Dataflow as well as a Google Cloud Storage
> > > > Dataflow.
> > > >
> > > > Laura, I hope you can have a look and see if this will bring
> > > > some of your requirements into Airflow as first-class citizens.
> > > >
> > > > Jeremiah
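For anyone who wants to experiment before the PR lands, here is a rough
sketch of the subclassing pattern described above, using an S3 backend in
the spirit of Laura's suggestion: only serialize() and deserialize() are
overridden, and the dataflows keyword arg wires the dependency. Everything
not quoted in the thread -- the module path, constructor signature, the
boto3 wiring, and the exact shape of the dataflows argument -- is an
assumption for illustration; PR 2046 is the source of truth.

    import json
    import uuid
    from datetime import datetime

    import boto3

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.contrib.dataflow import Dataflow  # assumed module path


    class S3Dataflow(Dataflow):
        """Stores upstream output in S3 so only a small pointer is XCom'd."""

        def __init__(self, task, bucket, prefix, **kwargs):
            # Constructor signature is hypothetical; the GCSDataflow in the
            # PR reportedly also accepts serializer/deserializer callables.
            super(S3Dataflow, self).__init__(task, **kwargs)
            self.bucket = bucket
            self.prefix = prefix

        def serialize(self, data):
            # Upload the payload and return a lightweight S3 key, so the
            # XCom table never stores the data itself.
            key = '%s/%s.json' % (self.prefix, uuid.uuid4())
            boto3.client('s3').put_object(
                Bucket=self.bucket, Key=key, Body=json.dumps(data))
            return key

        def deserialize(self, key):
            # Resolve the stored key back into the original data.
            body = boto3.client('s3').get_object(
                Bucket=self.bucket, Key=key)['Body']
            return json.loads(body.read())


    dag = DAG('dataflow_demo', start_date=datetime(2017, 2, 1))

    def fetch():
        return [{'id': 1}, {'id': 2}]

    def transform(records=None, **context):
        # `records` arrives as a keyword argument; the same object is
        # available as context['dataflows']['records'].
        return [r for r in records if r['id'] > 1]

    extract = PythonOperator(
        task_id='extract', python_callable=fetch, dag=dag)

    PythonOperator(
        task_id='transform',
        python_callable=transform,
        provide_context=True,
        # Registering the upstream dataflow under a key both sets the
        # dependency and delivers the deserialized result at runtime.
        dataflows={'records': S3Dataflow(extract, 'my-bucket', 'raw')},
        dag=dag)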
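And a minimal sketch of the XCom size gatekeeping Max and Jeremiah
discuss above. The 64kb figure is the MySQL blob limit mentioned in the
thread; the hook point (somewhere inside XCom.set, before the row is
written), the pickle serialization, and the exception type are all
assumptions, since the enforcement PR does not exist yet.

    import pickle

    # 64kb is MySQL's internal limit per the thread; applying the same
    # ceiling everywhere keeps DAGs portable across database backends.
    MAX_XCOM_SIZE = 64 * 1024

    def check_xcom_size(value):
        # Hypothetical gatekeeper: reject oversized values before they
        # ever reach the metadata database.
        blob = pickle.dumps(value)
        if len(blob) > MAX_XCOM_SIZE:
            raise ValueError(
                "XCom value is %d bytes (limit %d); consider a "
                "remote-backed Dataflow instead."
                % (len(blob), MAX_XCOM_SIZE))
        return blob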