Re: Airflow 1.8.0 Release Candidate 1

2017-02-02 Thread Jayesh Senjaliya
Thank you, Bolke, for all the effort you are putting in!!

I have deployed this RC now.

On Thu, Feb 2, 2017 at 3:02 PM, Jeremiah Lowin  wrote:

> Fantastic work on this Bolke, thank you!
>
> We've deployed the RC and will report if there are any issues...
>
> On Thu, Feb 2, 2017 at 4:32 PM Bolke de Bruin  wrote:
>
> > Now I am blushing :-)
> >
> > Sent from my iPhone
> >
> > > On 2 Feb 2017, at 22:05, Boris Tyukin  wrote:
> > >
> > > LOL awesome!
> > >
> > > On Thu, Feb 2, 2017 at 4:00 PM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > >> The Apache mailing list doesn't support images, so here's a link:
> > >>
> > >> http://i.imgur.com/DUkpjZu.png
> > >>
> > >> On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin 
> > >> wrote:
> > >>
> > >>> Bolke, you are our hero! I am sure you put in a lot of your time to
> > >>> make it happen
> > >>>
> > >>> On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin wrote:
> > >>>
> > >>>> Hi All,
> > >>>>
> > >>>> I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0 available at:
> > >>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
> > >>>> are available at
> > >>>> https://dist.apache.org/repos/dist/release/incubator/airflow/ . It is
> > >>>> tagged with a local version “apache.incubating” so it allows upgrading
> > >>>> from earlier releases. This should be considered of release quality, but
> > >>>> it is not yet officially vetted as a release.
> > >>>>
> > >>>> Issues fixed:
> > >>>> * Use static nvd3 and d3
> > >>>> * Python 3 incompatibilities
> > >>>> * CLI API trigger dag issue
> > >>>>
> > >>>> As the difference between beta 5 and the release candidate is relatively
> > >>>> small, I hope to start the VOTE for releasing 1.8.0 quite soon (2 days?).
> > >>>> If the vote passes, a vote also needs to happen on the IPMC mailing list.
> > >>>> As this is our first Apache release, I expect some comments and required
> > >>>> changes, and probably an RC 2.
> > >>>>
> > >>>> Furthermore, we now have a “v1-8-stable” branch. This has version
> > >>>> “1.8.0rc1” and will graduate to “1.8.0” when we release. The “v1-8-test”
> > >>>> branch now has version “1.8.1alpha0” and “master” has version
> > >>>> “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means that, per
> > >>>> release guidelines, patches must be accompanied by an ASSIGNED Jira and a
> > >>>> sign-off from a committer. Only then does the release manager apply the
> > >>>> patch to stable (in this case that would be me). The release manager then
> > >>>> closes the bug when the patches have landed in the appropriate branches.
> > >>>> For more information please see:
> > >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Release+Planning+and+Supported+Release+Lifetime
> > >>>>
> > >>>> If you have any questions or suggestions, don’t hesitate to ask!
> > >>>>
> > >>>> Cheers
> > >>>> Bolke
> > >>>
> > >>
> >
>


Re: Airflow Meetup in NYC @ Blue Apron

2017-02-02 Thread siddharth anand
Hope this went well. Feel free to share videos and slides. Also, it would
be great if we could create a NY Apache Airflow meetup page. Would you be
interested in setting one up? It would be easier to promote a meetup page
on social media than an email on this list.

-s

On Fri, Jan 20, 2017 at 10:37 AM, Joseph Napolitano <
joseph.napolit...@blueapron.com.invalid> wrote:

> Hi all!
>
> I want to officially announce a Meetup for Airflow in NYC!  I'm looking
> forward to meeting other community members to share knowledge and network.
>
> We may create an official Meetup page, but in the meantime please sign up
> here:
> https://docs.google.com/spreadsheets/d/1WmfgZeExSVdLf-u1uh3IleeHy8QTwaJ4BkkSkVM-X1E/edit?usp=sharing
>
> I have a confirmed date of February 1st @ 6:30 at Blue Apron's
> headquarters.
>
> In Summary:
> Date: Feb 1st
> Time: 6:30 - 9pm EST
> Location: 40 W 23rd St. New York, NY 10010
> https://www.google.com/maps/place/40+W+23rd+St,+New+York,+NY+10010/@40.7420885,-73.9938457,17z/data=!3m1!4b1!4m5!3m4!1s0x89c259a46471d2a1:0xc2517d92b1b68bba!8m2!3d40.7420845!4d-73.9916517?hl=en
>
> We're on the 5th floor.  You need to check in with security in the building
> lobby, and again when you reach the fifth floor to get a name tag.
>
> Food & drink will be provided!
>
> Let me know if you would like to present.  We'd love to hear about your
> architecture and war stories.  We will have a large projector and PA system
> set up.
>
> Sorry about the short notice, but it took a while to get approved over the
> holidays and new year.  If we can't generate enough interest we can
> certainly push it back a month.
>
> Thanks, and Bon Appétit!
>
> --
> *Joe Napolitano *| Sr. Data Engineer
> www.blueapron.com | 5 Crosby Street, New York, NY 10013
>


Re: Airflow Meetup @ Paypal (San Jose)

2017-02-02 Thread siddharth anand
Cool! I've tweeted it out using the ApacheAirflow account and also added it
to https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements

FYI,
I was mistaken about the drive between Strata and PayPal. I had used the
wrong venue. Strata is at the SJ Convention Center this year.  It's still
very close.. 5 miles (12 minutes - reverse commute).

https://goo.gl/maps/nwxmkYsNFKQ2

BTW, I heard Bolke may be attending ;-) Bolke, would you like to speak at
the Meetup?

Jakob (and other committers), will you be down here for Strata?

-s

On Thu, Feb 2, 2017 at 5:02 PM, Jayesh Senjaliya 
wrote:

> Sure,
> I have created an event on Meetup:
> https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/237412864/
>
> Thanks for helping on this Siddharth.
> Jayesh
>
>
>
> On Wed, Feb 1, 2017 at 7:50 PM, siddharth anand  wrote:
>
> > IMHO, I'd publish the meet-up. You still have 6 weeks to find a 3rd
> > speaker. If Bolke and Alex are traveling all the way for Strata, perhaps
> > one of them can speak :-)
> >
> > -s
> >
> > On Wed, Feb 1, 2017 at 1:48 PM, Russell Jurney wrote:
> >
> > > Maybe start a new thread with a title "Call for Speakers for Meetup on
> > > Mar 14"?
> > >
> > > On Wed, Feb 1, 2017 at 11:59 AM Jayesh Senjaliya 
> > > wrote:
> > >
> > > > Yes, we are still waiting for more speakers.
> > > >
> > > > can anybody from Airbnb present ?
> > > >
> > > > anybody else ?
> > > >
> > > >
> > > > - Jayesh
> > > >
> > > > On Tue, Jan 31, 2017 at 8:16 PM, siddharth anand 
> > > > wrote:
> > > >
> > > > > Jayesh,
> > > > > Looks good. No need to vote. Just publish a new event with details
> > > > > on the meet-up page:
> > > > > https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/
> > > > >
> > > > > Please add a short abstract as well for the talks and find a 3rd
> > > > > speaker. Please be sure to record the meet-up so that we can publish
> > > > > it. Once the meet-up event is up, please respond to this email! We
> > > > > can help promote it. I suggest picking a start time after the Strata
> > > > > talks end but not super late either.
> > > > >
> > > > > -s
> > > > >
> > > > > On Tue, Jan 31, 2017 at 9:19 AM, Jayesh Senjaliya <jhsonl...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > √ I have approval from Paypal to host an Airflow meetup. How about
> > > > > > March 14th? Please vote.
> > > > > >
> > > > > > √ We will have food and drinks.
> > > > > > Please let me know if anybody has any special requests; I will try
> > > > > > to accommodate :)
> > > > > >
> > > > > > For presentations:
> > > > > >  1) Disk recommission using Airflow with overall automation of
> > > > > >     "Hadoop Node and Disk Remediation" - Jayesh Senjaliya (Paypal)
> > > > > >  2) Predictive Analytics with Airflow and PySpark - (Russell Jurney)
> > > > > >
> > > > > >
> > > > > > Please send a request to present to this email thread if you are
> > > > > > interested in presenting.
> > > > > >
> > > > > > Thanks
> > > > > > Jayesh
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Jan 26, 2017 at 4:08 PM, Russell Jurney <russell.jur...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Cool!
> > > > > > >
> > > > > > > On Wed, Jan 25, 2017 at 11:23 PM Jayesh Senjaliya <jhsonl...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Russell,
> > > > > > > >
> > > > > > > > Yes, I will be presenting from the Paypal side.
> > > > > > > > Once I have official approval from Paypal, I will send out an email.
> > > > > > > > I am basically going by the steps that Siddharth outlined earlier
> > > > > > > > in the thread.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Jayesh
> > > > > > > >
> > > > > > > > On Wed, Jan 25, 2017 at 7:50 PM, Russell Jurney <russell.jur...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Is someone from Paypal likely to speak? Should we start a new
> > > > > > > > > thread with a call for another speaker? There was mention of three
> > > > > > > > > being needed.
> > > > > > > > >
> > > > > > > > > On Wed, Jan 25, 2017 at 5:33 PM Jayesh Senjaliya <jhsonl...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Yes, I am waiting for a response from facilities about it, most
> > > > > > > > > > likely by early next week.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Jayesh
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 25, 2017 at 4:52 PM, Russell Jurney <russell.jur...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Boris, would you be able to attend an evening 

Re: Airflow 1.8.0 Release Candidate 1

2017-02-02 Thread Jeremiah Lowin
Fantastic work on this Bolke, thank you!

We've deployed the RC and will report if there are any issues...

On Thu, Feb 2, 2017 at 4:32 PM Bolke de Bruin  wrote:

> Now I am blushing :-)
>
> Sent from my iPhone
>
> > On 2 Feb 2017, at 22:05, Boris Tyukin  wrote:
> >
> > LOL awesome!
> >
> > On Thu, Feb 2, 2017 at 4:00 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> The Apache mailing list doesn't support images, so here's a link:
> >>
> >> http://i.imgur.com/DUkpjZu.png
> >>
> >> On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin 
> >> wrote:
> >>
> >>> Bolke, you are our hero! I am sure you put in a lot of your time to
> >>> make it happen
> >>>
> >>> On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0 available at:
> >>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
> >>>> are available at
> >>>> https://dist.apache.org/repos/dist/release/incubator/airflow/ . It is
> >>>> tagged with a local version “apache.incubating” so it allows upgrading
> >>>> from earlier releases. This should be considered of release quality, but
> >>>> it is not yet officially vetted as a release.
> >>>>
> >>>> Issues fixed:
> >>>> * Use static nvd3 and d3
> >>>> * Python 3 incompatibilities
> >>>> * CLI API trigger dag issue
> >>>>
> >>>> As the difference between beta 5 and the release candidate is relatively
> >>>> small, I hope to start the VOTE for releasing 1.8.0 quite soon (2 days?).
> >>>> If the vote passes, a vote also needs to happen on the IPMC mailing list.
> >>>> As this is our first Apache release, I expect some comments and required
> >>>> changes, and probably an RC 2.
> >>>>
> >>>> Furthermore, we now have a “v1-8-stable” branch. This has version
> >>>> “1.8.0rc1” and will graduate to “1.8.0” when we release. The “v1-8-test”
> >>>> branch now has version “1.8.1alpha0” and “master” has version
> >>>> “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means that, per
> >>>> release guidelines, patches must be accompanied by an ASSIGNED Jira and a
> >>>> sign-off from a committer. Only then does the release manager apply the
> >>>> patch to stable (in this case that would be me). The release manager then
> >>>> closes the bug when the patches have landed in the appropriate branches.
> >>>> For more information please see:
> >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Release+Planning+and+Supported+Release+Lifetime
> >>>>
> >>>> If you have any questions or suggestions, don’t hesitate to ask!
> >>>>
> >>>> Cheers
> >>>> Bolke
> >>>
> >>
>


Re: Flow-based Airflow?

2017-02-02 Thread Jeremiah Lowin
Very good point -- however I'm hesitant to overcomplicate the base class.
At the moment users only have to override "serialize()" and "deserialize()"
to build any form of remote-backed dataflow, and I like the simplicity of
that.

However, if you look at my implementation of the GCSDataflow, the
constructor gets passed serializer and deserializer functions that are
applied to the data before storage and after recovery. I think that sort of
runtime-configurable serialization is in the spirit of what you're
describing and it should be straightforward to adapt it for more specific
requirements.
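
To make the idea concrete, here is a minimal sketch of runtime-configurable serialization. It does not reproduce the code in the PR: the class names `Dataflow` and `ConfigurableDataflow`, the `put`/`get` methods, and the in-memory dict standing in for the storage backend are all assumptions made for illustration. Only the `serialize()`/`deserialize()` override points and the idea of passing serializer functions to the constructor come from the discussion above.

```python
import json
import pickle


class Dataflow:
    """Hypothetical base class: subclasses override serialize()/deserialize()."""

    def __init__(self):
        # In-memory stand-in for the real backend (XCom, GCS, ...); an assumption.
        self._store = {}

    def serialize(self, data):
        # Default: store the object unchanged.
        return data

    def deserialize(self, stored):
        # Default: return the stored object unchanged.
        return stored

    def put(self, key, data):
        self._store[key] = self.serialize(data)

    def get(self, key):
        return self.deserialize(self._store[key])


class ConfigurableDataflow(Dataflow):
    """Takes serializer/deserializer callables at construction time."""

    def __init__(self, serializer=pickle.dumps, deserializer=pickle.loads):
        super().__init__()
        self._serializer = serializer
        self._deserializer = deserializer

    def serialize(self, data):
        # Applied to the data before storage.
        return self._serializer(data)

    def deserialize(self, stored):
        # Applied to the data after recovery.
        return self._deserializer(stored)


# Usage: swap in JSON without touching the class hierarchy.
flow = ConfigurableDataflow(serializer=json.dumps, deserializer=json.loads)
flow.put("rows", [{"id": 1}, {"id": 2}])
restored = flow.get("rows")
```

With this shape the base class stays small, and a format-specific parser can be supplied per task rather than per subclass.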

On Thu, Feb 2, 2017 at 12:37 PM Laura Lorenz 
wrote:

> This is great!
>
> We work with a lot of external data in wildly non-standard formats so
> another enhancement here we'd use and support is passing customizable
> serializers to Dataflow subclasses. This would let the dataflows keyword
> arg for a task handle dependency management, the Dataflow class or
> subclasses handle IO, and the Serializer subclasses handle parsing.
>
> Happy to contribute here, perhaps to create an S3Dataflow subclass in the
> style of your Google Cloud storage one for this PR.
>
> Laura
>
> On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin  wrote:
>
> > Great point. I think the best solution is to solve this for all XComs by
> > checking object size before adding it to the DB. I don't see a built-in
> > way of handling it (though apparently MySQL is internally limited to
> > 64 KB). I'll look into a PR that would enforce a similar limit for all
> > databases.
> >
> > On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <
> > maximebeauche...@gmail.com>
> > wrote:
> >
> > I'm not sure about XCom being the default; it seems pretty dangerous. It
> > just takes one person who is not fully aware of the size of the data, or
> > one day with an outlier, and that could put the Airflow db in jeopardy.
> >
> > I guess it's always been an aspect of XCom, and it could be good to have
> > some explicit gatekeeping there regardless of this PR/feature. Perhaps
> > the DB itself has protection against large blobs?
> >
> > Max
> >
> > On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin wrote:
> >
> > > Yesterday I began converting a complex script to a DAG. It turned out
> > > to be a perfect test case for the dataflow model: a big chunk of data
> > > moving through a series of modification steps.
> > >
> > > So I have built an extensible dataflow extension for Airflow on top of
> > > XCom and the existing dependency engine:
> > > https://issues.apache.org/jira/browse/AIRFLOW-825
> > > https://github.com/apache/incubator-airflow/pull/2046 (still waiting
> > > for tests... it will be quite embarrassing if they don't pass)
> > >
> > > The philosophy is simple:
> > > Dataflow objects represent the output of upstream tasks. Downstream
> > > tasks add Dataflows with a specific key. When the downstream task runs,
> > > the (optionally indexed) upstream result is available in the downstream
> > > context under context['dataflows'][key]. In addition, PythonOperators
> > > receive the data as a keyword argument.
> > >
> > > The basic Dataflow serializes the data through XComs, but is trivially
> > > extended to alternative storage via subclasses. I have provided (in
> > > contrib) implementations of a local filesystem-based Dataflow as well
> > > as a Google Cloud Storage dataflow.
> > >
> > > Laura, I hope you can have a look and see if this will bring some of
> > > your requirements into Airflow as first-class citizens.
> > >
> > > Jeremiah
> > >
> >
>
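
The size check proposed earlier in this thread (measure the object before it is written to the DB) could be sketched roughly as follows. This is not Airflow's implementation: the function name `check_xcom_size`, the exception class, and the use of `pickle` to measure the payload are assumptions, and the limit simply echoes the 64 KB figure mentioned for MySQL.

```python
import pickle

# Assumed limit, taken from the 64 KB figure mentioned in the thread.
MAX_XCOM_BYTES = 64 * 1024


class XComTooLargeError(ValueError):
    """Hypothetical error raised when a value exceeds the size limit."""


def check_xcom_size(value, limit=MAX_XCOM_BYTES):
    """Serialize the value and refuse it if the payload exceeds the limit.

    Returns the serialized bytes so the caller can store them without
    serializing a second time.
    """
    payload = pickle.dumps(value)
    if len(payload) > limit:
        raise XComTooLargeError(
            "XCom payload is {} bytes; limit is {} bytes".format(len(payload), limit)
        )
    return payload


# Usage: a small value passes, an oversized one is rejected before any DB write.
small_payload = check_xcom_size({"rows": 3})
try:
    check_xcom_size(b"x" * 100000)
    rejected = False
except XComTooLargeError:
    rejected = True
```

Enforcing the limit in one place before the write would give every database backend the same behaviour, which is the point raised in the thread.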


Re: DAG design for interacting with APIs

2017-02-02 Thread Steve Annessa
Thanks for the feedback Laura and Bolke. I think I'll try Laura's approach
and make the call to launch the task in the Sensor's init function.

-- Steve

On Thu, Feb 2, 2017 at 9:55 AM, Laura Lorenz 
wrote:

> We've gotten around this by implementing the external async job API call in
> the __init__ of the sensor and then poll as normal. If the polling fails,
> the next sensor instantiates a new external async job. Note this will also
> trigger new jobs if you hit the timeout.
>
> Here's the gist with our dag and sensor:
>
> https://gist.github.com/lauralorenz/bf47280b90067c71fe691bdf70b4145a
>
> On Thu, Feb 2, 2017 at 8:36 AM, Bolke de Bruin  wrote:
>
> > Hi Steve,
> >
> > At the moment we don’t have the possibility in Airflow to combine
> > multiple tasks in one unit of analysis, e.g. when a task fails, return to
> > the beginning of the set. We also don’t expose the functionality of
> > resetting a task state by API at the moment. You could mimic this
> > behaviour (warning: this really is a hack) so that if you get to a failed
> > state you clear the state of the earlier task in the database. I never
> > tried it and it certainly isn’t very clean, nor will it be supported in
> > any way.
> >
> > What you could do is split your dag in two: one that runs your
> > SingularityOperator and one that monitors it. If it fails, the monitor can
> > trigger a new dag_run for your first dag.
> >
> > - Bolke
> >
> > > On 1 Feb 2017, at 08:40, Steve Annessa wrote:
> > >
> > > I need help designing a DAG
> > >
> > > High level problem:
> > > I need a way to launch tasks through an API and manage their state; when
> > > they fail I need the ability to automatically retry.
> > >
> > > What's involved:
> > > We use Singularity (https://github.com/HubSpot/Singularity) to launch
> > > tasks on Mesos, which can be standalone containers or Spark jobs that
> > > run for hours.
> > >
> > > What I've done so far:
> > > I've written an Operator for interacting with the Singularity API and I
> > > can launch a task that Singularity manages. I then need to wait and poll
> > > the API for changes to the task state. The task can be in a few states
> > > but the most important are FINISHED and FAILED. So I wrote a Sensor that
> > > polls the API and watches for the task UID, which was passed through
> > > XCom from the SingularityOperator; on each poll it checks the various
> > > states. If everything passes, everything is great and the DAG moves
> > > along.
> > > The problem happens when the Singularity task fails: the
> > > SingularitySensor will fail, which is fine, but I don't know of a way to
> > > tell the previous SingularityOperator task to re-execute, so the DAG is
> > > stuck.
> > >
> > > Options I'm considering to resolve this problem:
> > > 1. Remove the Sensor and put the polling logic in the execute function
> > > for the SingularityOperator. That will mean the Operator task will last
> > > for the duration of the Singularity managed task, which can be 4+ hours,
> > > and the majority of the time will be spent polling the API. I'll also
> > > have to write my own poll logic, which isn't terrible, but I won't get
> > > to use the work already written in the BaseSensorOperator.
> > > 2. Find a way to call back to the previous task in the event of Sensor
> > > failure; I'd like the flow to go "execute_singularity_task ->
> > > check_singularity_task"; if "check_singularity_task" is in the FAILED
> > > state, clear both "execute_singularity_task" and
> > > "check_singularity_task" and rerun from "execute_singularity_task" on.
> > > 3. Ask you guys for a better design
> > >
> > > The end goal is to have the following:
> > > 0. The ability to launch and manage tasks through the Singularity API
> > > 1. The ability for retries on failure at any point in the DAG without
> > > human intervention
> > > 2. As simple a DAG as possible
> > >
> > > Here's a gist for the DAG:
> > > https://gist.github.com/sannessa/dea05f743a1250c1e5e8a8e10c49d7b5
> > >
> > > Here's a gist for the Operator:
> > > https://gist.github.com/sannessa/7652c97de3c99426663d9541b2abeba3
> > >
> > > Here's a gist for the Sensor:
> > > https://gist.github.com/sannessa/14a427ee55f90ec2dff60e038e93edb5
> > > (This is a crude implementation and doesn't handle all of the states. I
> > > figured before I invested more time in making this more robust and
> > > elegant I'd spend time figuring out if this was the correct tool for the
> > > job.)
> > >
> > > Thanks!
> > >
> > > -- Steve
> >
> >
>


Re: Airflow 1.8.0 Release Candidate 1

2017-02-02 Thread Bolke de Bruin
Now I am blushing :-)

Sent from my iPhone

> On 2 Feb 2017, at 22:05, Boris Tyukin  wrote:
> 
> LOL awesome!
> 
> On Thu, Feb 2, 2017 at 4:00 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> 
>> The Apache mailing list doesn't support images, so here's a link:
>> 
>> http://i.imgur.com/DUkpjZu.png
>> 
>> On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin 
>> wrote:
>> 
>>> Bolke, you are our hero! I am sure you put in a lot of your time to
>>> make it happen
>>>
>>> On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0 available at:
>>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
>>>> are available at
>>>> https://dist.apache.org/repos/dist/release/incubator/airflow/ . It is
>>>> tagged with a local version “apache.incubating” so it allows upgrading
>>>> from earlier releases. This should be considered of release quality, but
>>>> it is not yet officially vetted as a release.
>>>>
>>>> Issues fixed:
>>>> * Use static nvd3 and d3
>>>> * Python 3 incompatibilities
>>>> * CLI API trigger dag issue
>>>>
>>>> As the difference between beta 5 and the release candidate is relatively
>>>> small, I hope to start the VOTE for releasing 1.8.0 quite soon (2 days?).
>>>> If the vote passes, a vote also needs to happen on the IPMC mailing list.
>>>> As this is our first Apache release, I expect some comments and required
>>>> changes, and probably an RC 2.
>>>>
>>>> Furthermore, we now have a “v1-8-stable” branch. This has version
>>>> “1.8.0rc1” and will graduate to “1.8.0” when we release. The “v1-8-test”
>>>> branch now has version “1.8.1alpha0” and “master” has version
>>>> “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means that, per
>>>> release guidelines, patches must be accompanied by an ASSIGNED Jira and a
>>>> sign-off from a committer. Only then does the release manager apply the
>>>> patch to stable (in this case that would be me). The release manager then
>>>> closes the bug when the patches have landed in the appropriate branches.
>>>> For more information please see:
>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Release+Planning+and+Supported+Release+Lifetime
>>>>
>>>> If you have any questions or suggestions, don’t hesitate to ask!
>>>>
>>>> Cheers
>>>> Bolke
>>> 
>> 


Re: Airflow 1.8.0 Release Candidate 1

2017-02-02 Thread Boris Tyukin
LOL awesome!

On Thu, Feb 2, 2017 at 4:00 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> The Apache mailing list doesn't support images, so here's a link:
>
> http://i.imgur.com/DUkpjZu.png
>
> On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin 
> wrote:
>
> > Bolke, you are our hero! I am sure you put in a lot of your time to make
> > it happen
> >
> > On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin wrote:
> >
> > > Hi All,
> > >
> > > I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0 available at:
> > > https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
> > > are available at
> > > https://dist.apache.org/repos/dist/release/incubator/airflow/ . It is
> > > tagged with a local version “apache.incubating” so it allows upgrading
> > > from earlier releases. This should be considered of release quality, but
> > > it is not yet officially vetted as a release.
> > >
> > > Issues fixed:
> > > * Use static nvd3 and d3
> > > * Python 3 incompatibilities
> > > * CLI API trigger dag issue
> > >
> > > As the difference between beta 5 and the release candidate is relatively
> > > small, I hope to start the VOTE for releasing 1.8.0 quite soon (2 days?).
> > > If the vote passes, a vote also needs to happen on the IPMC mailing list.
> > > As this is our first Apache release, I expect some comments and required
> > > changes, and probably an RC 2.
> > >
> > > Furthermore, we now have a “v1-8-stable” branch. This has version
> > > “1.8.0rc1” and will graduate to “1.8.0” when we release. The “v1-8-test”
> > > branch now has version “1.8.1alpha0” and “master” has version
> > > “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means that, per
> > > release guidelines, patches must be accompanied by an ASSIGNED Jira and a
> > > sign-off from a committer. Only then does the release manager apply the
> > > patch to stable (in this case that would be me). The release manager then
> > > closes the bug when the patches have landed in the appropriate branches.
> > > For more information please see:
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Release+Planning+and+Supported+Release+Lifetime
> > >
> > > If you have any questions or suggestions, don’t hesitate to ask!
> > >
> > > Cheers
> > > Bolke
> >
>


Re: Airflow 1.8.0 Release Candidate 1

2017-02-02 Thread Maxime Beauchemin
The Apache mailing list doesn't support images, so here's a link:

http://i.imgur.com/DUkpjZu.png

On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin  wrote:

> Bolke, you are our hero! I am sure you put in a lot of your time to make it
> happen
>
> On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin wrote:
>
> > Hi All,
> >
> > I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0 available at:
> > https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
> > are available at
> > https://dist.apache.org/repos/dist/release/incubator/airflow/ . It is
> > tagged with a local version “apache.incubating” so it allows upgrading
> > from earlier releases. This should be considered of release quality, but
> > it is not yet officially vetted as a release.
> >
> > Issues fixed:
> > * Use static nvd3 and d3
> > * Python 3 incompatibilities
> > * CLI API trigger dag issue
> >
> > As the difference between beta 5 and the release candidate is relatively
> > small, I hope to start the VOTE for releasing 1.8.0 quite soon (2 days?).
> > If the vote passes, a vote also needs to happen on the IPMC mailing list.
> > As this is our first Apache release, I expect some comments and required
> > changes, and probably an RC 2.
> >
> > Furthermore, we now have a “v1-8-stable” branch. This has version
> > “1.8.0rc1” and will graduate to “1.8.0” when we release. The “v1-8-test”
> > branch now has version “1.8.1alpha0” and “master” has version
> > “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means that, per
> > release guidelines, patches must be accompanied by an ASSIGNED Jira and a
> > sign-off from a committer. Only then does the release manager apply the
> > patch to stable (in this case that would be me). The release manager then
> > closes the bug when the patches have landed in the appropriate branches.
> > For more information please see:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Release+Planning+and+Supported+Release+Lifetime
> >
> > If you have any questions or suggestions, don’t hesitate to ask!
> >
> > Cheers
> > Bolke
>


Re: Airflow 1.8.0 Release Candidate 1

2017-02-02 Thread Boris Tyukin
Bolke, you are our hero! I am sure you put in a lot of your time to make it
happen

On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin  wrote:

> Hi All,
>
> I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0 available at:
> https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
> are available at
> https://dist.apache.org/repos/dist/release/incubator/airflow/ . It is
> tagged with a local version “apache.incubating” so it allows upgrading
> from earlier releases. This should be considered of release quality, but
> it is not yet officially vetted as a release.
>
> Issues fixed:
> * Use static nvd3 and d3
> * Python 3 incompatibilities
> * CLI API trigger dag issue
>
> As the difference between beta 5 and the release candidate is relatively
> small, I hope to start the VOTE for releasing 1.8.0 quite soon (2 days?).
> If the vote passes, a vote also needs to happen on the IPMC mailing list.
> As this is our first Apache release, I expect some comments and required
> changes, and probably an RC 2.
>
> Furthermore, we now have a “v1-8-stable” branch. This has version
> “1.8.0rc1” and will graduate to “1.8.0” when we release. The “v1-8-test”
> branch now has version “1.8.1alpha0” and “master” has version
> “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means that, per
> release guidelines, patches must be accompanied by an ASSIGNED Jira and a
> sign-off from a committer. Only then does the release manager apply the
> patch to stable (in this case that would be me). The release manager then
> closes the bug when the patches have landed in the appropriate branches.
> For more information please see:
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Release+Planning+and+Supported+Release+Lifetime
>
> If you have any questions or suggestions, don’t hesitate to ask!
>
> Cheers
> Bolke


Re: DAG design for interacting with APIs

2017-02-02 Thread Laura Lorenz
We've gotten around this by implementing the external async job API call in
the __init__ of the sensor and then poll as normal. If the polling fails,
the next sensor instantiates a new external async job. Note this will also
trigger new jobs if you hit the timeout.

Here's the gist with our dag and sensor:

https://gist.github.com/lauralorenz/bf47280b90067c71fe691bdf70b4145a
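
For readers who cannot open the gist, the pattern described above can be sketched without Airflow at all. This is not the code from the gist: `FakeJobClient`, `AsyncJobSensor`, and `run_with_retries` are hypothetical names, and the retry loop only imitates what the scheduler does when it re-runs a failed or timed-out sensor task.

```python
class FakeJobClient:
    """Hypothetical stand-in for an external async job API."""

    def __init__(self, states):
        self._states = iter(states)  # scripted status responses, in order
        self.submitted = 0

    def submit(self):
        # Launch a new external job and return its id.
        self.submitted += 1
        return "job-{}".format(self.submitted)

    def status(self, job_id):
        # Return the next scripted state, ignoring job_id for simplicity.
        return next(self._states)


class AsyncJobSensor:
    """Submits the external job when constructed; poke() then polls it."""

    def __init__(self, client):
        self.client = client
        self.job_id = client.submit()  # the launch happens here, not in poke()

    def poke(self):
        state = self.client.status(self.job_id)
        if state == "FAILED":
            raise RuntimeError("{} failed".format(self.job_id))
        return state == "FINISHED"


def run_with_retries(client, retries=2, max_pokes=10):
    """Imitates task retries: each attempt builds a new sensor, hence a new job."""
    for _ in range(retries + 1):
        sensor = AsyncJobSensor(client)
        try:
            for _ in range(max_pokes):
                if sensor.poke():
                    return sensor.job_id
        except RuntimeError:
            continue  # a failed job: the next attempt submits a fresh one
        # Running out of pokes is the timeout case; it also falls through
        # to a new attempt, so it launches a new job as noted above.
    raise RuntimeError("job did not finish after retries")


# Usage: the first job fails on its second poll, the second job finishes.
client = FakeJobClient(["RUNNING", "FAILED", "RUNNING", "FINISHED"])
finished_job = run_with_retries(client)
```

The trade-off is the one already mentioned: a timeout is indistinguishable from a failure, so the job left running by a timed-out attempt is not cancelled in this sketch.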

On Thu, Feb 2, 2017 at 8:36 AM, Bolke de Bruin  wrote:

> Hi Steve,
>
> At the moment we don’t have the possibility in Airflow to combine multiple
> tasks in one unit of analysis, e.g. when a task fails, return to the
> beginning of the set. We also don’t expose the functionality of resetting a
> task state by API at the moment. You could mimic this behaviour (warning:
> this really is a hack) so that if you get to a failed state you clear the
> state of the earlier task in the database. I never tried it and it
> certainly isn’t very clean, nor will it be supported in any way.
>
> What you could do is split your dag in two: one that runs your
> SingularityOperator and one that monitors it. If it fails, the monitor can
> trigger a new dag_run for your first dag.
>
> - Bolke
>
> > On 1 Feb 2017, at 08:40, Steve Annessa  wrote:
> >
> > I need help designing a DAG
> >
> > High level problem:
> > I need a way to launch tasks through an API and manage their state, when
> > they fail I need the ability to automatically retry.
> >
> > What's involved:
> > We use Singularity (https://github.com/HubSpot/Singularity) to launch
> tasks
> > on Mesos which can be standalone containers or Spark jobs that run for
> > hours.
> >
> > What I've done so far:
> > I've written an Operator for interacting with the Singularity API and I can
> > launch a task that Singularity manages. I then need to wait and poll the
> > API for changes to the task state. The task can be in a few states but the
> > most important are FINISHED and FAILED. So I wrote a Sensor that polls the
> > API and watches for the task UID, that was passed through XCom from the
> > SingularityOperator, each poll it checks the various states. If everything
> > passes, everything is great and the DAG moves along.
> > The problem happens when the Singularity task fails, the SingularitySensor
> > will fail which is fine, but I don't know of a way to tell the previous
> > SingularityOperator task to re-execute, so the DAG is stuck.
> >
> > Options I'm considering to resolve this problem:
> > 1. Remove the Sensor and put the polling logic in the execute function for
> > the SingularityOperator. That will mean the Operator task will last for the
> > duration of the Singularity managed task which can be 4+ hours and the
> > majority of the time will be spent polling the API. I'll also have to write
> > my own poll logic, which isn't terrible but I won't get to use the work
> > already written in the BaseSensorOperator
> > 2. Find a way to call back to the previous task in the event of Sensor
> > failure; I'd like the flow to go "execute_singularity_task ->
> > check_singularity_task"; if "check_singularity_task" is in the FAILED
> > state, clear both "execute_singularity_task" and "check_singularity_task"
> > and rerun from "execute_singularity_task" on.
> > 3. Ask you guys for a better design
> >
> > The end goal is to have the following:
> > 0. The ability to launch and manage tasks through the Singularity API
> > 1. The ability for retries on failure at any point in the DAG without human
> > intervention
> > 2. As simple a DAG as possible
> >
> > Here's a gist for the DAG:
> > https://gist.github.com/sannessa/dea05f743a1250c1e5e8a8e10c49d7b5
> >
> > Here's a gist for the Operator:
> > https://gist.github.com/sannessa/7652c97de3c99426663d9541b2abeba3
> >
> > Here's a gist for the Sensor:
> > https://gist.github.com/sannessa/14a427ee55f90ec2dff60e038e93edb5
> > (This is a crude implementation and doesn't handle all of the states. I
> > figured before I invested more time in making this more robust and elegant
> > I'd spend time figuring out if this was the correct tool for the job.)
> >
> > Thanks!
> >
> > -- Steve
>
>


Re: Flow-based Airflow?

2017-02-02 Thread Laura Lorenz
This is great!

We work with a lot of external data in wildly non-standard formats so
another enhancement here we'd use and support is passing customizable
serializers to Dataflow subclasses. This would let the dataflows keyword
arg for a task handle dependency management, the Dataflow class or
subclasses handle IO, and the Serializer subclasses handle parsing.
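
The split of responsibilities proposed above (Dataflow classes handle IO, Serializer classes handle parsing) might look roughly like the sketch below. Every class and method name here is hypothetical and chosen for illustration only; none of it is the API from the PR, and the in-memory store merely stands in for XCom, a filesystem, or a bucket.

```python
import csv
import io
import json


class Serializer:
    """Parsing only: turn Python objects into bytes and back."""

    def dumps(self, obj):
        raise NotImplementedError

    def loads(self, raw):
        raise NotImplementedError


class JsonSerializer(Serializer):
    def dumps(self, obj):
        return json.dumps(obj).encode("utf-8")

    def loads(self, raw):
        return json.loads(raw.decode("utf-8"))


class CsvSerializer(Serializer):
    """Handles rows given as lists of strings."""

    def dumps(self, rows):
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        return buf.getvalue().encode("utf-8")

    def loads(self, raw):
        return list(csv.reader(io.StringIO(raw.decode("utf-8"))))


class InMemoryDataflow:
    """IO only: stores bytes under a key and delegates parsing."""

    def __init__(self, serializer):
        self.serializer = serializer
        self._store = {}  # stand-in for a real storage backend

    def set_data(self, key, obj):
        self._store[key] = self.serializer.dumps(obj)

    def get_data(self, key):
        return self.serializer.loads(self._store[key])
```

With this shape, supporting a new file format means writing one Serializer subclass, and supporting a new storage location means writing one Dataflow subclass; neither needs to know about the other.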

Happy to contribute here, perhaps to create an S3Dataflow subclass in the
style of your Google Cloud storage one for this PR.

Laura

On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin  wrote:

> Great point. I think the best solution is to solve this for all XComs by
> checking object size before adding it to the DB. I don't see a built-in way
> of handling it (though apparently MySQL is internally limited to 64 KB).
> I'll look into a PR that would enforce a similar limit for all databases.
>
> On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <
> maximebeauche...@gmail.com>
> wrote:
>
> I'm not sure about XCom being the default, it seems pretty dangerous. It
> just takes one person that is not fully aware of the size of the data, or
> one day with an outlier and that could put the Airflow db in jeopardy.
>
> I guess it's always been an aspect of XCom, and it could be good to have
> some explicit gatekeeping there regardless of this PR/feature. Perhaps the
> DB itself has protection against large blobs?
>
> Max
>
> On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin  wrote:
>
> > Yesterday I began converting a complex script to a DAG. It turned out to be
> > a perfect test case for the dataflow model: a big chunk of data moving
> > through a series of modification steps.
> >
> > So I have built an extensible dataflow extension for Airflow on top of XCom
> > and the existing dependency engine:
> > https://issues.apache.org/jira/browse/AIRFLOW-825
> > https://github.com/apache/incubator-airflow/pull/2046 (still waiting for
> > tests... it will be quite embarrassing if they don't pass)
> >
> > The philosophy is simple:
> > Dataflow objects represent the output of upstream tasks. Downstream tasks
> > add Dataflows with a specific key. When the downstream task runs, the
> > (optionally indexed) upstream result is available in the downstream context
> > under context['dataflows'][key]. In addition, PythonOperators receive the
> > data as a keyword argument.
> >
> > The basic Dataflow serializes the data through XComs, but is trivially
> > extended to alternative storage via subclasses. I have provided (in
> > contrib) implementations of a local filesystem-based Dataflow as well as a
> > Google Cloud Storage dataflow.
> >
> > Laura, I hope you can have a look and see if this will bring some of your
> > requirements into Airflow as first-class citizens.
> >
> > Jeremiah
> >
>
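
The size check discussed in this thread (measure an object before it goes into the database) could be sketched as follows. This is an illustration only, not the eventual PR: it assumes values are pickled before storage, the 64 KB default simply echoes the MySQL figure mentioned above, and `serialize_for_xcom` and `XComTooLarge` are hypothetical names.

```python
import pickle

# Assumed limit, echoing the MySQL figure quoted in the thread.
MAX_XCOM_BYTES = 64 * 1024


class XComTooLarge(ValueError):
    """Raised when a serialized value exceeds the configured limit."""


def serialize_for_xcom(value, max_bytes=MAX_XCOM_BYTES):
    """Pickle the value and refuse it if the payload is too large."""
    payload = pickle.dumps(value)
    if len(payload) > max_bytes:
        raise XComTooLarge(
            "serialized value is %d bytes; limit is %d" % (len(payload), max_bytes)
        )
    return payload
```

Checking the serialized size rather than the in-memory size means the limit applies to what the database would actually receive, whatever the backend.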


Re: DAG design for interacting with APIs

2017-02-02 Thread Bolke de Bruin
Hi Steve,

At the moment we don’t have the possibility in Airflow to combine multiple
tasks in one unit of analysis, e.g. when a task fails, return to the beginning of
the set. We also don’t expose the functionality of resetting a task state by
API at the moment. You could mimic this behaviour (warning: this really is a
hack) by clearing the state of the earlier task in the database when you get
to a failed state. I never tried it, and it certainly isn’t very clean, nor
will it be supported in any way.

What you could do is split your DAG in two: one that runs your
SingularityOperator and one that monitors it. If it fails, the monitor can
trigger a new dag_run for your first DAG.

- Bolke
 
> On 1 Feb 2017, at 08:40, Steve Annessa  wrote:
> 
> I need help designing a DAG
> 
> High level problem:
> I need a way to launch tasks through an API and manage their state, when
> they fail I need the ability to automatically retry.
> 
> What's involved:
> We use Singularity (https://github.com/HubSpot/Singularity) to launch tasks
> on Mesos which can be standalone containers or Spark jobs that run for
> hours.
> 
> What I've done so far:
> I've written an Operator for interacting with the Singularity API and I can
> launch a task that Singularity manages. I then need to wait and poll the
> API for changes to the task state. The task can be in a few states but the
> most important are FINISHED and FAILED. So I wrote a Sensor that polls the
> API and watches for the task UID, that was passed through XCom from the
> SingularityOperator, each poll it checks the various states. If everything
> passes, everything is great and the DAG moves along.
> The problem happens when the Singularity task fails, the SingularitySensor
> will fail which is fine, but I don't know of a way to tell the previous
> SingularityOperator task to re-execute, so the DAG is stuck.
> 
> Options I'm considering to resolve this problem:
> 1. Remove the Sensor and put the polling logic in the execute function for
> the SingularityOperator. That will mean the Operator task will last for the
> duration of the Singularity managed task which can be 4+ hours and the
> majority of the time will be spent polling the API. I'll also have to write
> my own poll logic, which isn't terrible but I won't get to use the work
> already written in the BaseSensorOperator
> 2. Find a way to call back to the previous task in the event of Sensor
> failure; I'd like the flow to go "execute_singularity_task ->
> check_singularity_task"; if "check_singularity_task" is in the FAILED
> state, clear both "execute_singularity_task" and "check_singularity_task"
> and rerun from "execute_singularity_task" on.
> 3. Ask you guys for a better design
> 
> The end goal is to have the following:
> 0. The ability to launch and manage tasks through the Singularity API
> 1. The ability for retries on failure at any point in the DAG without human
> intervention
> 2. As simple a DAG as possible
> 
> Here's a gist for the DAG:
> https://gist.github.com/sannessa/dea05f743a1250c1e5e8a8e10c49d7b5
> 
> Here's a gist for the Operator:
> https://gist.github.com/sannessa/7652c97de3c99426663d9541b2abeba3
> 
> Here's a gist for the Sensor:
> https://gist.github.com/sannessa/14a427ee55f90ec2dff60e038e93edb5
> (This is a crude implementation and doesn't handle all of the states. I
> figured before I invested more time in making this more robust and elegant
> I'd spend time figuring out if this was the correct tool for the job.)
> 
> Thanks!
> 
> -- Steve