We have been running this in our staging environment and have found a couple of issues. I will report back with them soon.
On Thu, Feb 2, 2017 at 2:23 PM, Jeremiah Lowin <jlo...@apache.org> wrote:

> Very good point -- however, I'm hesitant to overcomplicate the base
> class. At the moment users only have to override "serialize()" and
> "deserialize()" to build any form of remote-backed dataflow, and I
> like the simplicity of that.
>
> However, if you look at my implementation of the GCSDataflow, the
> constructor gets passed serializer and deserializer functions that
> are applied to the data before storage and after recovery. I think
> that sort of runtime-configurable serialization is in the spirit of
> what you're describing, and it should be straightforward to adapt it
> for more specific requirements.
>
> On Thu, Feb 2, 2017 at 12:37 PM Laura Lorenz <llor...@industrydive.com>
> wrote:
>
> > This is great!
> >
> > We work with a lot of external data in wildly non-standard formats,
> > so another enhancement here we'd use and support is passing
> > customizable serializers to Dataflow subclasses. This would let the
> > dataflows keyword arg for a task handle dependency management, the
> > Dataflow class or subclasses handle IO, and the Serializer
> > subclasses handle parsing.
> >
> > Happy to contribute here, perhaps by creating an S3Dataflow subclass
> > in the style of your Google Cloud Storage one for this PR.
> >
> > Laura
> >
> > On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin <jlo...@apache.org>
> > wrote:
> >
> > > Great point. I think the best solution is to solve this for all
> > > XComs by checking object size before adding it to the DB. I don't
> > > see a built-in way of handling it (though apparently MySQL is
> > > internally limited to 64kb). I'll look into a PR that would
> > > enforce a similar limit for all databases.
> > >
> > > On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > I'm not sure about XCom being the default; it seems pretty
> > > dangerous. It just takes one person who is not fully aware of the
> > > size of the data, or one day with an outlier, and that could put
> > > the Airflow db in jeopardy.
> > >
> > > I guess it's always been an aspect of XCom, and it could be good
> > > to have some explicit gatekeeping there regardless of this
> > > PR/feature. Perhaps the DB itself has protection against large
> > > blobs?
> > >
> > > Max
> > >
> > > On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin <jlo...@apache.org>
> > > wrote:
> > >
> > > > Yesterday I began converting a complex script to a DAG. It
> > > > turned out to be a perfect test case for the dataflow model: a
> > > > big chunk of data moving through a series of modification steps.
> > > >
> > > > So I have built an extensible dataflow extension for Airflow on
> > > > top of XCom and the existing dependency engine:
> > > > https://issues.apache.org/jira/browse/AIRFLOW-825
> > > > https://github.com/apache/incubator-airflow/pull/2046 (still
> > > > waiting for tests... it will be quite embarrassing if they
> > > > don't pass)
> > > >
> > > > The philosophy is simple: Dataflow objects represent the output
> > > > of upstream tasks. Downstream tasks add Dataflows with a
> > > > specific key. When the downstream task runs, the (optionally
> > > > indexed) upstream result is available in the downstream context
> > > > under context['dataflows'][key]. In addition, PythonOperators
> > > > receive the data as a keyword argument.
> > > >
> > > > The basic Dataflow serializes the data through XComs, but is
> > > > trivially extended to alternative storage via subclasses. I
> > > > have provided (in contrib) implementations of a local
> > > > filesystem-based Dataflow as well as a Google Cloud Storage
> > > > Dataflow.
> > > >
> > > > Laura, I hope you can have a look and see if this will bring
> > > > some of your requirements into Airflow as first-class citizens.
> > > >
> > > > Jeremiah
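For anyone who wants to experiment before the PR lands, here is a rough
sketch of the subclassing pattern described above, using an S3 backend in
the spirit of Laura's suggestion: only serialize() and deserialize() are
overridden, and the dataflows keyword arg wires the dependency. Everything
not quoted in the thread -- the module path, constructor signature, the
boto3 wiring, and the exact shape of the dataflows argument -- is an
assumption for illustration; PR 2046 is the source of truth.

    import json
    import uuid
    from datetime import datetime

    import boto3

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.contrib.dataflow import Dataflow  # assumed module path


    class S3Dataflow(Dataflow):
        """Stores upstream output in S3 so only a small pointer is XCom'd."""

        def __init__(self, task, bucket, prefix, **kwargs):
            # Constructor signature is hypothetical; the GCSDataflow in the
            # PR reportedly also accepts serializer/deserializer callables.
            super(S3Dataflow, self).__init__(task, **kwargs)
            self.bucket = bucket
            self.prefix = prefix

        def serialize(self, data):
            # Upload the payload and return a lightweight S3 key, so the
            # XCom table never stores the data itself.
            key = '%s/%s.json' % (self.prefix, uuid.uuid4())
            boto3.client('s3').put_object(
                Bucket=self.bucket, Key=key, Body=json.dumps(data))
            return key

        def deserialize(self, key):
            # Resolve the stored key back into the original data.
            body = boto3.client('s3').get_object(
                Bucket=self.bucket, Key=key)['Body']
            return json.loads(body.read())


    dag = DAG('dataflow_demo', start_date=datetime(2017, 2, 1))

    def fetch():
        return [{'id': 1}, {'id': 2}]

    def transform(records=None, **context):
        # `records` arrives as a keyword argument; the same object is
        # available as context['dataflows']['records'].
        return [r for r in records if r['id'] > 1]

    extract = PythonOperator(
        task_id='extract', python_callable=fetch, dag=dag)

    PythonOperator(
        task_id='transform',
        python_callable=transform,
        provide_context=True,
        # Registering the upstream dataflow under a key both sets the
        # dependency and delivers the deserialized result at runtime.
        dataflows={'records': S3Dataflow(extract, 'my-bucket', 'raw')},
        dag=dag)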
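And a minimal sketch of the XCom size gatekeeping Max and Jeremiah
discuss above. The 64kb figure is the MySQL blob limit mentioned in the
thread; the hook point (somewhere inside XCom.set, before the row is
written), the pickle serialization, and the exception type are all
assumptions, since the enforcement PR does not exist yet.

    import pickle

    # 64kb is MySQL's internal limit per the thread; applying the same
    # ceiling everywhere keeps DAGs portable across database backends.
    MAX_XCOM_SIZE = 64 * 1024

    def check_xcom_size(value):
        # Hypothetical gatekeeper: reject oversized values before they
        # ever reach the metadata database.
        blob = pickle.dumps(value)
        if len(blob) > MAX_XCOM_SIZE:
            raise ValueError(
                "XCom value is %d bytes (limit %d); consider a "
                "remote-backed Dataflow instead."
                % (len(blob), MAX_XCOM_SIZE))
        return blob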